
Experiments


In this page we focus on the execution of the main BIGBIOCL algorithm, i.e. the iterative random forest with feature deletion. The algorithm is implemented by the Java class Classifier_Iterat_RandomForest and is documented in the Java (Spark) Applications wiki page.
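
The following is a minimal sketch of how such an iterative loop could look with the RDD-based Spark MLlib API. It is not the actual Classifier_Iterat_RandomForest implementation: the helper methods dropFeatures and featuresUsedByForest are placeholders for the project's own utilities, and the forest parameters are illustrative.

    // Sketch of an iterative random forest with feature deletion (not the
    // project's actual implementation). At each iteration a forest is trained,
    // the features it uses are recorded, and those features are removed from
    // the dataset before the next iteration.
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.mllib.regression.LabeledPoint;
    import org.apache.spark.mllib.tree.RandomForest;
    import org.apache.spark.mllib.tree.model.RandomForestModel;

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Set;

    public class IterativeRandomForestSketch {

        public static Set<Integer> run(JavaRDD<LabeledPoint> data, int iterations) {
            Set<Integer> selected = new HashSet<>();   // indices of all features selected so far
            Set<Integer> excluded = new HashSet<>();   // features deleted in previous iterations
            for (int i = 0; i < iterations; i++) {
                JavaRDD<LabeledPoint> reduced = dropFeatures(data, excluded);  // placeholder
                RandomForestModel model = RandomForest.trainClassifier(
                        reduced, 2, new HashMap<Integer, Integer>(), // 2 classes: normal / tumoral
                        100, "auto", "gini", 5, 32, 12345);          // numTrees, strategy, impurity, depth, bins, seed
                Set<Integer> used = featuresUsedByForest(model);     // placeholder: parse model.toDebugString()
                selected.addAll(used);
                excluded.addAll(used);                 // delete them before the next iteration
            }
            return selected;
        }

        // Placeholders: in the real project these tasks are handled by its own utility classes.
        private static JavaRDD<LabeledPoint> dropFeatures(JavaRDD<LabeledPoint> d, Set<Integer> f) { return d; }
        private static Set<Integer> featuresUsedByForest(RandomForestModel m) { return new HashSet<>(); }
    }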

The input to our experiments is TCGA (The Cancer Genome Atlas) data, available on the Genomic Data Commons data-sharing platform (https://gdc.nci.nih.gov/).

The preparation process to run this kind of experiment is usually the following:

  1. Prepare the input CSV file as explained in the Java (Spark) Applications wiki page. In our experiments we filtered out control cases, so only the "tumoral" and "normal" categories are available.
  2. Create the mapping file that links CpG sites to genes.
  3. Create the mapping file that links features named following the Spark MLlib convention to the input feature names (you can use the Java application com.github.fcproj.bigbiocl.data_preparation.CSVExtractHeader; a minimal sketch of the idea follows this list).
  4. Run your experiment!
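
As a rough illustration of step 3, the sketch below reads the header of the input CSV and writes a mapping from Spark MLlib feature indices to the original column names. It is not the actual CSVExtractHeader class; the file paths, the assumption that the category label is the first column, and the output format are illustrative only.

    // Sketch: map "feature 0", "feature 1", ... to the original CSV column names.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.PrintWriter;

    public class ExtractHeaderSketch {
        public static void main(String[] args) throws Exception {
            String inputCsv = args[0];      // e.g. the methylation matrix prepared in step 1
            String mappingCsv = args[1];    // output: featureIndex,originalColumnName
            try (BufferedReader in = new BufferedReader(new FileReader(inputCsv));
                 PrintWriter out = new PrintWriter(mappingCsv)) {
                String[] header = in.readLine().split(",");
                // Column 0 is assumed to hold the category label, so feature i maps to column i+1.
                for (int i = 1; i < header.length; i++) {
                    out.println("feature " + (i - 1) + "," + header[i]);
                }
            }
        }
    }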

In this project we provide a set of results of our experiments. Each directory refers to a specific experiment, and its encoded name gives an idea of what the experiment is for; for instance, BRCA_IRF_18 refers to experiment number 18, applying the Iterative Random Forest to the Breast Cancer dataset with Spark running in local mode, while BRCA_IRF_18_HADOOP refers to the same experiment on a Hadoop cluster. Additional details about the settings of an experiment are provided in its directory.

Inside the root directory of an experiment there are:

  • A set of sub-directories. Each sub-directory represents a specific iteration and is named after the iteration number, starting from 0.
  • allFeatures.csv: this is the main result. It includes the genes extracted by all iterations of the experiment. The file is a CSV containing the Spark label for the feature, the corresponding CpG identifier, and the corresponding gene (a minimal post-processing sketch follows this list).
  • settings.txt: file with configuration parameters for the experiment.
  • statistics.txt: application log, with statistics for each iteration.
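
As an example of how allFeatures.csv can be consumed, the sketch below collects the distinct genes selected across all iterations. The column order follows the description above; the comma delimiter and the absence of a header row are assumptions.

    // Sketch: list the distinct genes appearing in the third column of allFeatures.csv.
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Set;
    import java.util.TreeSet;
    import java.util.stream.Stream;

    public class DistinctGenesSketch {
        public static void main(String[] args) throws Exception {
            Set<String> genes = new TreeSet<>();
            try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {  // path to allFeatures.csv
                lines.map(l -> l.split(","))
                     .filter(cols -> cols.length >= 3)
                     .forEach(cols -> genes.add(cols[2].trim()));           // third column: gene name
            }
            System.out.println(genes.size() + " distinct genes");
            genes.forEach(System.out::println);
        }
    }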

Inside the directory of an iteration there are:

  • forest.txt: the set of trees extracted by the execution of the Random Forest algorithm. Features are labeled following the Spark naming convention. Category 0 means normal, 1 means tumoral. If your input dataset uses different categories, it is important to update the Java method com.github.fcproj.bigbiocl.util.LabeledPointManager.prepareLabeledPoints to encode the categories with 0 or 1 (a minimal sketch of this encoding follows this list).
  • features.txt: the features extracted in this iteration, labeled following the Spark naming convention.
  • (not always available) a directory containing the Spark model, which can be loaded using the Spark APIs.
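
A minimal sketch of such an encoding is shown below. It is not the project's LabeledPointManager.prepareLabeledPoints method; the category strings and the dense feature array are assumptions for illustration.

    // Sketch: encode input categories as the 0/1 labels that appear in forest.txt.
    import org.apache.spark.mllib.linalg.Vectors;
    import org.apache.spark.mllib.regression.LabeledPoint;

    public class LabelEncodingSketch {

        public static LabeledPoint encode(String category, double[] features) {
            double label;
            if ("normal".equalsIgnoreCase(category)) {
                label = 0.0;                 // category 0 in forest.txt
            } else if ("tumoral".equalsIgnoreCase(category)) {
                label = 1.0;                 // category 1 in forest.txt
            } else {
                throw new IllegalArgumentException("Unexpected category: " + category);
            }
            return new LabeledPoint(label, Vectors.dense(features));
        }
    }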