Oracle Entity Resolution Contest at Polimi

This is our team repository for the Entity Resolution Contest hosted by Polimi and Oracle Labs; you can find in-depth details here.

Getting Started

This section walks through the steps needed to get the project up and running on your local machine.

Prerequisites

Python 3.7.4
Download the data from the Kaggle Competition page

How to run

First of all, clone the repo or download and unzip it. Then install the requirements:

foo@bar:~$ <PATH-TO-THE-REPO>pip install -r requirements.txt

After that, you need to create these folders:

You can then run the individual modules or, alternatively, use the run_all_model.py demo script, which runs all the files needed to reproduce our best results:

foo@bar:~$ <PATH-TO-THE-REPO>python run_all_model.py

This produces our submission that scored 0.55027.

Similarity Hybrid

To run the similarity hybrid model that computes the weighted sum of the similarities and creates the submission file, run:

foo@bar:~$ <PATH-TO-THE-REPO>python sim_hybrid.py

You can play with the different parameters and similarities. Moreover, you can find all the different kinds of similarities that we tried in the similarities folder.
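As a rough illustration of what a weighted sum of similarities looks like, here is a minimal sketch. It is not the code in sim_hybrid.py: the function names (weighted_hybrid, top_k), the similarity names ("name", "email"), the weights, and the dictionary layout are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: each similarity measure maps a
# (test_record_id, train_record_id) pair to a score.
Scores = Dict[Tuple[str, str], float]


def weighted_hybrid(similarities: Dict[str, Scores],
                    weights: Dict[str, float]) -> Scores:
    """Combine several similarity tables into one weighted sum per pair."""
    combined: Scores = {}
    for name, table in similarities.items():
        w = weights.get(name, 0.0)  # measures without a weight are ignored
        for pair, score in table.items():
            combined[pair] = combined.get(pair, 0.0) + w * score
    return combined


def top_k(combined: Scores, query_id: str, k: int = 10) -> List[str]:
    """Return the k best-scoring training records for one test record."""
    candidates = [(ref, s) for (q, ref), s in combined.items() if q == query_id]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return [ref for ref, _ in candidates[:k]]


# Toy data, invented for the example.
similarities = {
    "name": {("t1", "a"): 0.9, ("t1", "b"): 0.4},
    "email": {("t1", "a"): 0.2, ("t1", "b"): 0.8},
}
weights = {"name": 0.7, "email": 0.3}

combined = weighted_hybrid(similarities, weights)
print(top_k(combined, "t1"))  # ['a', 'b']
```

Changing the weights changes the ranking, which is the kind of parameter tuning mentioned above.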

Neural Network

The NN needs the LightGBM dataframe, which is computed by the create_expanded_dataset.py script. After that, run:

foo@bar:~$ <PATH-TO-THE-REPO>python NN.py

You can add or remove layers or play with the parameters. Running it outputs a scores_nn.csv file, from which you can extract the predictions with the sub_from_predictions.py script.
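To give an idea of how per-pair scores can be turned into ranked predictions, here is a minimal sketch. It does not reproduce sub_from_predictions.py, and the actual layout of scores_nn.csv may differ: the column names (queried_record_id, candidate_record_id, score), the function name, and the cutoff k are assumptions for illustration only.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, TextIO


def predictions_from_scores(csv_file: TextIO, k: int = 10) -> Dict[str, List[str]]:
    """Group scored pairs by queried record and keep the k best candidates."""
    per_query = defaultdict(list)
    for row in csv.DictReader(csv_file):
        per_query[row["queried_record_id"]].append(
            (float(row["score"]), row["candidate_record_id"])
        )
    return {
        query: [cand for _, cand in
                sorted(pairs, key=lambda p: p[0], reverse=True)[:k]]
        for query, pairs in per_query.items()
    }


# In-memory stand-in for a scores file, with invented values.
sample = io.StringIO(
    "queried_record_id,candidate_record_id,score\n"
    "q1,a,0.2\n"
    "q1,b,0.9\n"
    "q2,c,0.5\n"
)
predictions = predictions_from_scores(sample)
print(predictions)
```

With a real file you would pass an open file handle instead of the io.StringIO object.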

Notebooks

Classifier

The classifier introduced in Classificator.ipynb is used as an a posteriori method to detect whether a test record has a reference in the training set. It uses LightGBMClassifier and is based on the scores provided by the main model pipeline (both Hybrid and LightGBMRanker).

Online Learning

Some test records do not appear in the training set and correspond to duplicated users within the test set. In a real scenario, it could be useful to add the test records that have already been evaluated to the training set. This way, if a new record B needs to be evaluated and it refers to an entity A discovered in previous evaluations, the new record can be correctly traced back to entity A. A naive version is described in Online Learning.ipynb, while a faster version is in Online Learning Fast.ipynb.
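The idea can be sketched in a few lines. This is not the notebooks' implementation: the string similarity (difflib's SequenceMatcher), the threshold of 0.8, the function names, and the "new-..." entity ids are placeholders chosen for this example, and the real pipeline uses its own models to score records.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def similarity(a: str, b: str) -> float:
    """Placeholder similarity; stands in for the real scoring model."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve_online(reference: Dict[str, List[str]],
                   test_records: List[Tuple[str, str]],
                   threshold: float = 0.8) -> Dict[str, str]:
    """Resolve test records one by one, growing the reference set as we go.

    reference maps entity_id -> known record texts and is updated in place.
    """
    assignments: Dict[str, str] = {}
    for record_id, text in test_records:
        best_entity, best_score = None, 0.0
        for entity_id, known in reference.items():
            score = max((similarity(text, k) for k in known), default=0.0)
            if score > best_score:
                best_entity, best_score = entity_id, score
        if best_entity is None or best_score < threshold:
            # No good match: treat the record as a newly discovered entity.
            best_entity = f"new-{record_id}"
            reference[best_entity] = []
        # Add the evaluated record so later records can match against it.
        reference[best_entity].append(text)
        assignments[record_id] = best_entity
    return assignments


# Toy data: the second test record duplicates the first one,
# which has no counterpart in the initial reference set.
reference = {"A": ["Mario Rossi"]}
test_records = [("r1", "Luca Bianchi"), ("r2", "Luca Bianchii")]
print(resolve_online(reference, test_records))
```

Here r2 is linked to the entity created for r1, which would not be possible if only the original training records were used as references.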
