Finding Sub-types of Diseases

Some diseases have sub-types that respond differently to treatment, so finding a cure for such diseases requires identifying these sub-types. In this work we take a method that was presented theoretically in previous work and implement it, using machine learning tools, to find sub-types of diseases. The aim of the study is to exploit the differences between separate populations, using an additional observed signal that is correlated with the unobserved target variable (the sub-type), to better recover the underlying structure of a disease.

The Algorithm

In general, the algorithm builds a clustering tree under the assumption that the clusters are disjoint. Each node in the tree is a classifier trained to separate two populations. We start with data on patients who have a known disease and divide them into two samples according to prior knowledge about the risk factors of the disease. We then reweight the patients so that the two samples have the same cumulative weight, and train the classifier. This first classifier becomes the root of the tree and splits the patients into two sets. In the next step we take the patients in each set separately, reweight them again, and train another classifier to separate the two samples within that set. We continue in the same fashion until all the patients associated with a leaf come from the same cluster, in which case no classifier can split them any better than random.

The Dataset - Coronavirus

With the outbreak of the Corona epidemic in Israel, the Ministry of Health began publishing a database that includes characteristics of people who undergo Corona tests. The dataset used for this project is available as a CSV file at the link below. The data must be pre-processed before the algorithm is run; the steps are described in the file 'Preprocessing_Corona.py'.

Data collection and sharing were supported by the Israeli Ministry of Health. You can learn more about the database at https://data.gov.il/dataset/covid-19/resource/d337959a-020a-4ed3-84f7-fca182292308. The database is updated twice a week, so note that the data used for this project covers the records between March 3 and April 11, 2020. Only the positive subjects are used, i.e. only those infected with Corona: 9937 patients in total.

'Classifier_Tree_Corona.py'

This is the main code including all the functions necessary for creating the clustering tree.

'build_tree' - The main function -

This function uses the other functions written in the code to build the clustering tree.

Input -

features - Pandas dataset with all the necessary features
ground_truth - The class of each record in the features dataset; it should have the same number of rows as features. The algorithm is suited to binary classification, i.e. 2 classes labelled [-1, 1]
alpha - The allowed deviation from an error of 0.5, the error of a classifier that does no better than random
min_size_leaf - The minimum number of records per leaf
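
One way to read alpha and min_size_leaf together is as a stopping rule for a node. The sketch below is an assumption about how the two parameters interact, not code taken from 'Classifier_Tree_Corona.py'; the function name should_stop and its arguments are hypothetical.

```python
def should_stop(weighted_error, left_size, right_size, alpha, min_size_leaf):
    """Return True if the node should become a leaf (assumed rule, see text)."""
    # A classifier whose weighted error is within alpha of 0.5 is treated as
    # no better than random on the two reweighted populations.
    near_random = abs(weighted_error - 0.5) <= alpha
    # A split that would create a child smaller than min_size_leaf is rejected.
    too_small = min(left_size, right_size) < min_size_leaf
    return near_random or too_small


# Illustrative values only, not results from the Corona data.
print(should_stop(0.48, 400, 350, alpha=0.05, min_size_leaf=50))
print(should_stop(0.30, 400, 350, alpha=0.05, min_size_leaf=50))
print(should_stop(0.30, 400, 20, alpha=0.05, min_size_leaf=50))
```

Under this reading, a larger alpha stops the tree earlier and yields fewer, larger clusters.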

Output -

left_list - Each index represents a node in the clustering tree, and the value at that index is the number of its left child. If the node is a leaf, the value is -1
right_list - Each index represents a node in the clustering tree, and the value at that index is the number of its right child. If the node is a leaf, the value is -1
model_list - Details of the classifier trained at each of the internal nodes, stored at the index of the node it belongs to
records_per_leaf - Dictionary in which the keys are the leaf numbers and the values are the original database indexes of the patients that belong to each leaf
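
Because the clusters are assumed to be disjoint, records_per_leaf can be inverted into one cluster label per patient, for example to colour a plot. This is a minimal sketch: leaf_labels is a hypothetical helper and the indexes in the dictionary are made up for illustration.

```python
def leaf_labels(records_per_leaf):
    """Map each original record index to the leaf (cluster) that contains it."""
    return {
        record: leaf
        for leaf, records in records_per_leaf.items()
        for record in records
    }


# Made-up example: four leaves holding six patient indexes.
records_per_leaf = {1: [10, 42], 4: [7], 5: [3, 99], 6: [15]}
cluster_of = leaf_labels(records_per_leaf)
print(cluster_of[42], cluster_of[7], cluster_of[99])
```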

'get_class_weights' -

Calculates the weight of each class, which is used to reweight the records in the classifier
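
The reweighting idea can be sketched as follows. This is not the repository's implementation: class_weights_sketch is a hypothetical name, and normalising each class to a total weight of 0.5 is an assumption (the text only requires the two samples to have the same cumulative weight).

```python
from collections import Counter


def class_weights_sketch(labels):
    """Per-class weight such that every class sums to 0.5 (assumed scale)."""
    counts = Counter(labels)
    return {cls: 0.5 / n for cls, n in counts.items()}


# Three records in one population and one record in the other.
labels = [-1, -1, -1, 1]
weights = class_weights_sketch(labels)
sample_weights = [weights[y] for y in labels]

# After reweighting, both populations carry the same cumulative weight.
totals = {
    cls: sum(w for w, y in zip(sample_weights, labels) if y == cls)
    for cls in weights
}
print(totals)
```

With equal cumulative weights, a classifier that ignores the features has a weighted error of 0.5 whatever the original class sizes are.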

'set_SVM_model' -

Our model uses an SVM. This function defines the classifier on which the clustering tree is built; it can be changed as needed.
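
As an illustration of a swappable classifier factory, the sketch below builds a weighted SVM with scikit-learn. It is an assumption, not the body of 'set_SVM_model': the name make_svm_sketch, the linear kernel, C=1.0 and the toy data are all illustrative choices.

```python
from sklearn.svm import SVC


def make_svm_sketch(class_weight="balanced"):
    """Return an untrained SVM; replace the body to try another classifier."""
    # "balanced" asks scikit-learn to weight classes inversely to their
    # frequency; a dict such as {-1: w_neg, 1: w_pos} is also accepted.
    return SVC(kernel="linear", C=1.0, class_weight=class_weight)


# Toy, well-separated data with labels in {-1, 1}.
X = [[0, 0], [0, 1], [1, 0], [4, 4], [4, 5], [5, 4]]
y = [-1, -1, -1, 1, 1, 1]

model = make_svm_sketch()
model.fit(X, y)
predictions = model.predict([[0.5, 0.5], [4.5, 4.5]])
print(predictions)
```

Keeping the classifier behind a single factory function means the rest of the tree-building code does not need to change when a different model is tried.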

'get_split' -

Splits a node into two child nodes

'split' -

Creates the child splits for a node, or makes the node terminal (a leaf)

'Classifier_Tree_Corona-FULL_EXAMPLE.py'

This is a complete example of creating the clustering tree for the Corona dataset. We used an age threshold of 60 to divide the patients into two populations. The code includes the pre-processing of the data as well as a visualization of the clusters after using t-SNE for dimensionality reduction.

For example, the tree format obtained at the end is:

left_list = [1, -1, 3, 5, -1, -1, -1]

right_list = [2, -1, 4, 6, -1, -1, -1]
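
To read this format, the two lists can be walked together from the root. The sketch below assumes that node 0 is the root and that every internal node has both children; leaf_depths is a hypothetical helper, not a function from the repository.

```python
def leaf_depths(left_list, right_list, root=0):
    """Return {leaf number: depth} for a tree stored as two child lists."""
    depths = {}
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if left_list[node] == -1 and right_list[node] == -1:
            depths[node] = depth  # both children missing: this is a leaf
        else:
            stack.append((right_list[node], depth + 1))
            stack.append((left_list[node], depth + 1))
    return depths


left_list = [1, -1, 3, 5, -1, -1, -1]
right_list = [2, -1, 4, 6, -1, -1, -1]

depths = leaf_depths(left_list, right_list)
print(sorted(depths.items()))
# [(1, 1), (4, 2), (5, 3), (6, 3)]
```

In this example, nodes 0, 2 and 3 are internal nodes holding classifiers, and nodes 1, 4, 5 and 6 are the leaves, i.e. the clusters found.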

