Some diseases may have sub-types that respond differently to treatment, so finding a cure for such diseases requires identifying these sub-types. In this work we take a method that was presented theoretically in previous work and implement it with machine learning tools in order to find sub-types of diseases. The aim of the current study is to exploit the differences between separate populations, using an additional observed signal that is correlated with the unobserved target variable (the sub-type), to better recover the underlying structure of a disease.
In general, the algorithm builds a clustering tree under the assumption that the clusters are disjoint. Each node in the tree is a classifier trained to separate two populations. We start with data on patients who have a known disease and create two samples from it according to prior knowledge about the risk factors of the disease. We then reweight the patients so that the two samples have the same cumulative weight, and train the classifier. The first classifier becomes the root of the tree and splits the patients into two sets. In the next step we take the patients in each set separately, reweight them again, and train another classifier to separate the two samples. We continue in the same fashion until all the patients associated with a leaf are from the same cluster, in which case no classifier can split the cluster any better than random.
With the outbreak of the Corona epidemic in Israel, the Ministry of Health began publishing a database that includes characteristics of people who undergo Corona tests. The dataset used for this project is available as CSV at the attached link. The data must be pre-processed before the algorithm is run; more information and the individual steps are available in the file 'Preprocessing_Corona.py'.
Data collection and sharing were supported by the Israeli Ministry of Health. You can learn more about the database at: https://data.gov.il/dataset/covid-19/resource/d337959a-020a-4ed3-84f7-fca182292308. The database is updated twice a week; the data used for this project covers the records between March 3 and April 11, 2020. For our work we use only the positive subjects, i.e. only those infected with Corona: 9937 patients in total.
This is the main code including all the functions necessary for creating the clustering tree.
This function uses the other functions written in the code to build the clustering tree.
- features - Pandas dataset with all the necessary features
- ground_truth - The class of each record in the features dataset; it should have the same number of rows. This algorithm is suited to binary classification, i.e. two classes labeled [-1, 1]
- alpha - The deviation we allow from a classification error of 0.5
- min_size_leaf - The minimum number of records per leaf
- left_list - Each index represents a node in the clustering tree, and the value at that index is the number of its left child. If the node is a leaf, the corresponding value is -1
- right_list - Each index represents a node in the clustering tree, and the value at that index is the number of its right child. If the node is a leaf, the corresponding value is -1
- model_list - Details of the classifier trained at each of the internal nodes, according to the index of the list in which the classifier appears
- records_per_leaf - Dictionary in which the keys are the leaf numbers and the values are the original database indexes of the patients that belong to each leaf
Calculates the weight of each class, which is used to reweight the records in the classifier
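The reweighting step can be illustrated with a minimal sketch. The function name `balanced_sample_weights` and the formula (each class receives the same total weight, inversely proportional to its size) are assumptions made for illustration, not necessarily what the repository code does.

```python
from collections import Counter

def balanced_sample_weights(labels):
    """Return one weight per record so that every class has equal total weight.

    Hypothetical helper: the weighting formula is an assumed, common choice.
    """
    counts = Counter(labels)
    n_total = len(labels)
    n_classes = len(counts)
    # Each class ends up with a cumulative weight of n_total / n_classes.
    per_class = {c: n_total / (n_classes * k) for c, k in counts.items()}
    return [per_class[y] for y in labels]

labels = [-1, -1, -1, 1]          # three records in one sample, one in the other
weights = balanced_sample_weights(labels)
weight_neg = sum(w for w, y in zip(weights, labels) if y == -1)
weight_pos = sum(w for w, y in zip(weights, labels) if y == 1)
```

After reweighting, `weight_neg` and `weight_pos` are equal, so neither sample dominates the classifier simply because it contains more patients.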
For our model we use an SVM. This function defines the classifier according to which the clustering tree is built, and it can be changed as needed
Split to two child nodes
Create child splits for a node or make terminal
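As a rough sketch of how the "make terminal" decision could combine `alpha` and `min_size_leaf`: the function `is_terminal` below and its exact rule are assumptions inferred from the parameter descriptions, not a copy of the repository code. It treats a node as a leaf when it holds too few records or when the classifier's weighted error is within `alpha` of the chance level of 0.5.

```python
def is_terminal(n_records, weighted_error, alpha, min_size_leaf):
    """Decide whether a node should become a leaf (assumed rule, for illustration)."""
    # Too few records to justify another split.
    if n_records < min_size_leaf:
        return True
    # The classifier is no better than random, up to the tolerance alpha.
    return weighted_error >= 0.5 - alpha

near_chance = is_terminal(n_records=100, weighted_error=0.48, alpha=0.05, min_size_leaf=10)
good_split = is_terminal(n_records=100, weighted_error=0.20, alpha=0.05, min_size_leaf=10)
too_small = is_terminal(n_records=5, weighted_error=0.20, alpha=0.05, min_size_leaf=10)
```

Under this assumed rule the first and third nodes become leaves, while the second is split further.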
This is a complete example of creating the clustering tree for the Corona dataset. We used an age threshold of 60 to divide the patients into two populations. The code includes pre-processing of the data as well as a visual display of the clusters after using t-SNE for dimension reduction.
For example, the tree format obtained at the end is:
left_list = [1, -1, 3, 5, -1, -1, -1]
right_list = [2, -1, 4, 6, -1, -1, -1]
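To show how these two lists can be read, here is a small sketch that walks the tree and records the depth of every leaf. The helper `leaf_depths` is hypothetical, and it assumes that index 0 is the root, which is consistent with the example lists.

```python
left_list = [1, -1, 3, 5, -1, -1, -1]
right_list = [2, -1, 4, 6, -1, -1, -1]

def leaf_depths(left_list, right_list, root=0):
    """Map each leaf index to its depth, walking down from the root."""
    depths = {}
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if left_list[node] == -1 and right_list[node] == -1:
            # A value of -1 in both lists marks a leaf.
            depths[node] = depth
        else:
            stack.append((left_list[node], depth + 1))
            stack.append((right_list[node], depth + 1))
    return depths

depths = leaf_depths(left_list, right_list)
leaves = sorted(depths)
```

In this example nodes 0, 2 and 3 are internal nodes, each holding a classifier, and nodes 1, 4, 5 and 6 are leaves, i.e. the final clusters.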