Experiments
On this page we focus on the execution of the main BIGBIOCL algorithm, i.e. the iterative random forest with feature deletion. The algorithm is implemented by the Java class Classifier_Iterat_RandomForest
and is documented in the Java (Spark) Applications wiki page.
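The iterative scheme can be sketched as follows. This is a toy stand-in for illustration, not the actual Spark implementation: the training step is a hypothetical stub, and the stopping conditions are assumptions. What it shows is the core idea: each iteration extracts a set of features and deletes them from the candidate set before the next iteration.

```java
import java.util.*;

// Toy sketch of the iterative random forest with feature deletion:
// train, collect the features the model used, delete them from the
// candidate set, and repeat until nothing new is extracted or a
// maximum number of iterations is reached.
public class IterativeFeatureDeletion {

    // Hypothetical stand-in for one Random Forest training round:
    // here it simply picks the k lexicographically smallest remaining
    // features instead of training a real model.
    static List<String> trainAndExtract(Set<String> candidates, int k) {
        List<String> sorted = new ArrayList<>(candidates);
        Collections.sort(sorted);
        return sorted.subList(0, Math.min(k, sorted.size()));
    }

    // Runs the iterative loop; returns one feature list per iteration.
    static List<List<String>> run(Set<String> candidates, int perIter, int maxIter) {
        List<List<String>> perIteration = new ArrayList<>();
        for (int i = 0; i < maxIter && !candidates.isEmpty(); i++) {
            List<String> extracted = trainAndExtract(candidates, perIter);
            if (extracted.isEmpty()) break;
            perIteration.add(extracted);
            candidates.removeAll(extracted);   // feature deletion step
        }
        return perIteration;
    }

    public static void main(String[] args) {
        Set<String> features = new TreeSet<>(
                Arrays.asList("cg01", "cg02", "cg03", "cg04", "cg05"));
        // prints [[cg01, cg02], [cg03, cg04], [cg05]]
        System.out.println(run(features, 2, 10));
    }
}
```

In the real application each per-iteration feature list corresponds to one of the numbered iteration sub-directories described below.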
The preparation process to run this kind of experiment usually consists of the following steps:
- Prepare the input CSV file as explained in the Java (Spark) Applications wiki page. In our experiments we filtered out control cases, so only the "tumoral" and "normal" categories are available.
- Create the mapping file that links CpG sites to genes
- Create the mapping file that links features named after the Spark MLlib convention to the input feature names (you can use the Java application it.cnr.camur.data_preparation.CSVExtractHeader)
- Run your experiment!
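The two mapping files can then be composed to go from a Spark MLlib feature label to a CpG site and from there to a gene. The sketch below illustrates this lookup chain; the exact file formats and the "feature 0" label format are assumptions for illustration, not the project's specification.

```java
import java.util.*;

// Hypothetical sketch: resolve a Spark MLlib feature label to its CpG
// identifier (via the header mapping) and then to a gene name (via the
// CpG-to-gene mapping).
public class FeatureMapping {

    // Returns "cpg,gene" for the given Spark feature label.
    static String resolve(String sparkLabel,
                          Map<String, String> labelToCpg,
                          Map<String, String> cpgToGene) {
        String cpg = labelToCpg.get(sparkLabel);
        String gene = (cpg == null) ? null : cpgToGene.get(cpg);
        return cpg + "," + gene;
    }

    public static void main(String[] args) {
        // Toy contents standing in for the two mapping files.
        Map<String, String> labelToCpg = new HashMap<>();
        labelToCpg.put("feature 0", "cg00000029");
        Map<String, String> cpgToGene = new HashMap<>();
        cpgToGene.put("cg00000029", "RBL2");
        // prints cg00000029,RBL2
        System.out.println(resolve("feature 0", labelToCpg, cpgToGene));
    }
}
```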
In this project we provide a set of results of our experiments. Each directory refers to a specific experiment, and its encoded name gives an idea of what the experiment is for; for instance, BRCA_IRF_18
refers to experiment number 18, an application of the Iterative Random Forest to the breast cancer dataset running Spark in local mode, while BRCA_IRF_18_HADOOP
refers to the same experiment on a Hadoop cluster. Additional details about the settings of each experiment are provided in its directory.
Inside the root directory of an experiment there are:
- A set of sub-directories. Each sub-directory represents a specific iteration and is named after the iteration number, starting from 0.
- allFeatures.csv: this is the main result. It includes the genes extracted across all iterations of the experiment. The file is a CSV containing the Spark label of the feature, the corresponding CpG identifier, and the corresponding gene.
- settings.txt: file with configuration parameters for the experiment.
- statistics.txt: application log, with statistics for each iteration.
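Given the column layout described above (Spark label, CpG identifier, gene), allFeatures.csv can be post-processed with a few lines of Java, for instance to group the extracted CpG sites by gene. The separator, the absence of a header row, and the toy row values below are assumptions for illustration.

```java
import java.util.*;

// Sketch: read allFeatures.csv rows (Spark label, CpG id, gene) and
// build a map from each gene to the CpG sites extracted for it.
public class AllFeaturesReader {

    static Map<String, List<String>> cpgsByGene(List<String> csvLines) {
        Map<String, List<String>> byGene = new LinkedHashMap<>();
        for (String line : csvLines) {
            String[] cols = line.split(",");
            if (cols.length < 3) continue;            // skip malformed rows
            String cpg = cols[1].trim();              // 2nd column: CpG id
            String gene = cols[2].trim();             // 3rd column: gene
            byGene.computeIfAbsent(gene, g -> new ArrayList<>()).add(cpg);
        }
        return byGene;
    }

    public static void main(String[] args) {
        // Toy rows, not real experiment results.
        List<String> lines = Arrays.asList(
                "feature 12,cg00000029,RBL2",
                "feature 40,cg00000165,GENEA",
                "feature 7,cg00000236,RBL2");
        // prints {RBL2=[cg00000029, cg00000236], GENEA=[cg00000165]}
        System.out.println(cpgsByGene(lines));
    }
}
```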
Inside the directory of an iteration there are:
- forest.txt: the set of trees extracted by the execution of the Random Forest algorithm. Features are labeled following the Spark naming convention.
- features.txt: the features extracted in this iteration, labeled following the Spark naming convention.
- (not always available) a directory containing the Spark model (which can be loaded using Spark APIs)