
Java (Spark) Applications

Fabrizio Celli edited this page Jun 17, 2017 · 9 revisions

com.github.fcproj.bigbiocl.experiments

Applications in this package rely on Apache Spark MLlib to run machine learning (distributed) algorithms. Please, read the wiki Spark MLlib naming convention to understand how MLlib renames input features and which mapping files are needed to run the code.

Classifier_Iterat_RandomForest

One of the principal components of BIGBIOCL. It implements a Random Forest classifier that adopts the Camur approach: an iterative algorithm with feature deletion. At each iteration it removes all features extracted in previous iterations, storing sub-iterations in a folder named after the iteration number. Each sub-iteration folder contains the list of features extracted in the current iteration (using the Spark naming convention) and the F-measure. In addition, the computed Spark model is saved for each iteration: it can be loaded in a Spark client application (e.g. with the RandomForestModel.load method). The algorithm stops when a minimum F-measure or a maximum number of iterations is reached. The main output is a semicolon-separated CSV of features extracted from all iterations: it includes features named after the Spark convention, the original feature names (CpG dinucleotides), and the associated genes. Global statistics are also provided.
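
The iterative control flow can be sketched in plain Java, with a stand-in trainAndEvaluate function replacing the actual Spark MLlib Random Forest run (class and method names here are illustrative, not part of BIGBIOCL):

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

// Sketch of the Camur-style loop in Classifier_Iterat_RandomForest: train a model,
// record the features it extracted, delete them from the dataset, and repeat until
// the F-measure falls below minFMeasure or maxNumberOfIterations is reached.
public class IterativeFeatureDeletion {

    // Outcome of one training run: the features the model used and its F-measure.
    public static class IterationResult {
        public final Set<String> extractedFeatures;
        public final double fMeasure;
        public IterationResult(Set<String> extractedFeatures, double fMeasure) {
            this.extractedFeatures = extractedFeatures;
            this.fMeasure = fMeasure;
        }
    }

    // trainAndEvaluate stands in for a full Spark MLlib Random Forest run; it receives
    // the set of features excluded so far and returns the new extraction result.
    public static Set<String> run(Function<Set<String>, IterationResult> trainAndEvaluate,
                                  double minFMeasure, int maxIterations) {
        Set<String> excluded = new HashSet<>();     // features deleted in previous iterations
        Set<String> allExtracted = new HashSet<>(); // union over all iterations (the final CSV)
        for (int iteration = 1; iteration <= maxIterations; iteration++) {
            IterationResult result = trainAndEvaluate.apply(excluded);
            if (result.fMeasure < minFMeasure) {
                break; // stopping condition: model quality dropped below the threshold
            }
            allExtracted.addAll(result.extractedFeatures);
            excluded.addAll(result.extractedFeatures); // feature deletion for the next round
        }
        return allExtracted;
    }
}
```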

The application fully supports execution on a YARN cluster. Input parameters are:

  • maxNumberOfIterations: stopping condition in case minFMeasure is not reached
  • minFMeasure: main stopping condition
  • maxDepth: random forest parameter
  • maxBins: random forest parameter
  • number of trees: random forest parameter
  • dataPath: the input CSV file
  • output directory path: directory that contains the result
  • mapping file features2cpg: a comma-separated CSV that maps features named after the Spark naming convention to the column labels of the original CSV (this file can be created with the CSVExtractHeader application). An example is the file BRCA_mapping.csv.
  • mapping file cpg2genes: a comma-separated CSV that maps CpGs to genes (this file can be created with the MappingGenesCPG application, which requires a BED file). An example is the file cpg2genes.csv.

Special attention must be paid to paths. On a YARN cluster, if you do not specify the path prefix file:// or hdfs://, the default depends on the parameter:

  • dataPath: an HDFS path by default. To use a local file, specify the file:// prefix (if the input file is large, it is better to load it from HDFS)
  • output directory path: a local directory on the server submitting the spark-submit job. This cannot be changed.
  • features2cpg: by default, a local path on the server submitting the spark-submit job; hdfs:// paths are supported.
  • cpg2genes: by default, a local path on the server submitting the spark-submit job; hdfs:// paths are supported.
    If you rely on the default, the input file must be loaded into HDFS (e.g. hdfs dfs -put /camur/brca.csv /user/me/input)

The input dataset must be a CSV file (comma separated):

  • the header is skipped: the algorithm starts from the second row
  • the first column is skipped (in our data, it contains the code of the tissue)
  • the last column is the category. The application supports normal (encoded as 0) and tumoral (encoded as 1). If different categories are needed, the Java class it.cnr.util.LabeledPointManager, method prepareLabeledPoints, must be updated to encode them as 0 or 1.
  • features are Double values in any range. If a value is not available, it must be encoded with the question mark ?.
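
As an illustration of these rules, a minimal row parser could look as follows (the class name and the choice to impute missing values as 0.0 are assumptions made only for this sketch; the real encoding lives in it.cnr.util.LabeledPointManager#prepareLabeledPoints):

```java
// Sketch of the row format described above: the first column is skipped, the last
// column is the category (tumoral -> 1.0, anything else -> 0.0), and "?" marks a
// missing value. Imputing missing values as 0.0 is an assumption of this sketch.
public class RowParser {

    public static class Row {
        public final double label;
        public final double[] features;
        public Row(double label, double[] features) {
            this.label = label;
            this.features = features;
        }
    }

    public static Row parse(String csvLine) {
        String[] cols = csvLine.split(",");
        double label = cols[cols.length - 1].equals("tumoral") ? 1.0 : 0.0;
        double[] features = new double[cols.length - 2]; // drop first (id) and last (category)
        for (int i = 1; i < cols.length - 1; i++) {
            features[i - 1] = cols[i].equals("?") ? 0.0 : Double.parseDouble(cols[i]);
        }
        return new Row(label, features);
    }
}
```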

The output is described in the experiments wiki page.

Some Examples
Local Mode

spark-submit --class "com.github.fcproj.bigbiocl.experiments.Classifier_Iterat_RandomForest" --conf "spark.driver.memory=18g" --master local[7] ./bigbiocl-1.0.0.jar 1000 0.98 5 16 5 ./DNAMeth_MERGED_NOCONTROL_brca.csv ./EXPERIMENT ./features2cpg.csv ./cpg2genes.csv

YARN: all defaults

spark-submit --class "com.github.fcproj.bigbiocl.experiments.Classifier_Iterat_RandomForest" --master yarn --deploy-mode cluster --driver-memory 10g --executor-memory 10g --executor-cores 7 --queue default /camur/bigbiocl-1.0.0.jar 100 0.97 5 16 5 /user/me/input/kirp.csv /camur/TEST_HADOOP /camur/KIRP_mapping.csv /camur/cpg2genes.csv

YARN: input file in local FS

spark-submit --class "com.github.fcproj.bigbiocl.experiments.Classifier_Iterat_RandomForest" --master yarn --deploy-mode cluster --driver-memory 10g --executor-memory 10g --executor-cores 7 --queue default /camur/bigbiocl-1.0.0.jar 100 0.97 5 16 5 file:///camur/kirp.csv /camur/TEST_HADOOP /camur/KIRP_mapping.csv /camur/cpg2genes.csv

YARN: all input files in HDFS

spark-submit --class "com.github.fcproj.bigbiocl.experiments.Classifier_Iterat_RandomForest" --master yarn --deploy-mode cluster --driver-memory 10g --executor-memory 10g --executor-cores 7 --queue default /camur/bigbiocl-1.0.0.jar 100 0.97 5 16 5 hdfs:///user/me/input/kirp.csv /camur/TEST_HADOOP hdfs:///user/me/input/KIRP_mapping.csv hdfs:///user/me/input/cpg2genes.csv

Classifier_RandomForest

This application applies the MLlib random forest to an input dataset with the same format as the one described in the Classifier_Iterat_RandomForest section.

Input parameters are:

  • maxDepth: random forest parameter
  • maxBins: random forest parameter
  • number of trees: random forest parameter
  • dataPath: the input CSV file
  • output directory path
  • (optional) path to a file with the list of features to ignore: this is useful to exclude some features from the computation. Features must be named after the MLlib naming convention, one feature per line (e.g. a line is feature 34)

In the output directory, at the end of the execution, there are:

  • features.txt: a list of features, named after the MLlib naming convention, extracted from the model
  • statistics: a log with the execution time
  • forest.txt: the debug string of the extracted trees
  • a directory containing the MLlib model, which can be loaded in an MLlib application

Some Examples
Local Mode

spark-submit --class "com.github.fcproj.bigbiocl.experiments.Classifier_RandomForest" --conf "spark.driver.memory=30g" --master local[10] $HOME/camur/bigbiocl-1.0.0.jar 5 16 10 $HOME/camur/brca.csv $HOME/camur/BRCA_RF1

Classifier_DecisionTree

Some Examples
Submit Decision Tree job in Local Mode

spark-submit --class "com.github.fcproj.bigbiocl.experiments.Classifier_DecisionTree" --conf "spark.driver.memory=18g" --master local[7] /home/fabrizio/Experiments/JAR/bigbiocl-1.0.0.jar 5 16 512 /home/fabrizio/Experiments/DNAMeth/DNAMeth_MERGED_NOCONTROL_brca.csv /home/fabrizio/Experiments/DNAMeth/EXPERIMENTS/11

FeatureSelection_ChiSquared_Zero_One

Some Examples
Chi Feature Selection: [0,1] features, as CpG islands

spark-submit --class "com.github.fcproj.bigbiocl.experiments.FeatureSelection_ChiSquared_Zero_One" --conf "spark.driver.memory=12g" --master local[7] /home/fabrizio/Experiments/JAR/bigbiocl-1.0.0.jar 5 16 false /home/fabrizio/Experiments/DNAMeth/DNAMeth_MERGED_NOCONTROL_brca.csv /home/fabrizio/Experiments/DNAMeth/EXPERIMENTS/20 /home/fabrizio/Experiments/DNAMeth/features2cpg.csv /home/fabrizio/Experiments/DNAMeth/cpg2genes.csv

com.github.fcproj.bigbiocl.support

LoadModel

Load an MLlib decision tree or random forest model and write its debug string to a file. It requires Spark to be already installed on the machine: it uses Spark in local mode with 1 core. Input parameters are:

  • the directory containing the Spark model (decision tree or random forest)
  • the mode: tree or forest
  • the full path of the output file

TranslateRFModel

Given the MLlib Random Forest model debug string and a CSV mapping file "Spark_Feature_Number, Experiment_Feature_Name", this application extracts the list of features (e.g. CpG dinucleotides in the case of a DNA-methylation dataset) from the debug string. This is needed because MLlib renames features as "feature N", where N is an integer from 0 to the number of features minus one.
The application requires 3 command-line arguments:

  • a CSV file mapping Spark feature numbering to the real labels of input features (e.g. "feature 0,cg13869341"). An example is the file BRCA_mapping.csv.
  • a text file with the MLlib Random Forest model debug string
  • the output path

The output is composed of two text files:

  • a list of features named after the MLlib naming convention
  • a list of features using the original feature labels (e.g. CpG dinucleotides)
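
The translation step can be sketched in plain Java as follows (hypothetical class, assuming the debug string marks features as "feature N" as described above):

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of TranslateRFModel's core step: pull every distinct "feature N" token out
// of a Random Forest debug string, then resolve it through the features2cpg mapping.
public class FeatureTranslator {

    private static final Pattern FEATURE = Pattern.compile("feature (\\d+)");

    public static Set<String> extractFeatures(String debugString) {
        Set<String> found = new LinkedHashSet<>(); // keep first-seen order, no duplicates
        Matcher m = FEATURE.matcher(debugString);
        while (m.find()) {
            found.add("feature " + m.group(1));
        }
        return found;
    }

    // mapping: "feature N" -> original label (e.g. "feature 0" -> "cg13869341")
    public static Set<String> translate(Set<String> sparkFeatures, Map<String, String> mapping) {
        Set<String> out = new LinkedHashSet<>();
        for (String feature : sparkFeatures) {
            out.add(mapping.getOrDefault(feature, feature)); // keep Spark name if unmapped
        }
        return out;
    }
}
```
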
Some Examples
Translate Spark model to tree string

java -Xmx4096m -cp "/home/fabrizio/Experiments/JAR/bigbiocl-1.0.0.jar:/usr/local/spark-2.1.0-bin-hadoop2.7/jars/*" com.github.fcproj.bigbiocl.support.TranslateRFModel /home/fabrizio/Experiments/DNAMeth/features2cpg.csv /home/fabrizio/Experiments/DNAMeth/EXPERIMENTS/10/forest.txt /home/fabrizio/Experiments/DNAMeth/EXPERIMENTS/10/features

TranslateListFeatures2genes

Produce a CSV file mapping "MLlib feature number, original feature name, gene name". The input is:

  • CSV file (comma separated) mapping original feature names (e.g. CpG dinucleotides) to gene names. An example is the file cpg2genes.csv.
  • CSV file (comma separated) mapping MLlib feature numbers to original feature names. An example is the file BRCA_mapping.csv.
  • TXT file containing the list of MLlib features names, one per line (e.g. obtained using the LoadModel application)
  • separator for the output CSV file (e.g. ; or ,)
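
The two-step lookup can be sketched as follows (hypothetical class; how the real application handles entries missing from a mapping is an assumption here):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of the join performed by TranslateListFeatures2genes: for each MLlib feature
// name, look up the original feature label (a CpG) and then its gene, emitting one
// output row per input feature. Unmapped entries become "unknown" in this sketch.
public class FeatureGeneJoiner {

    public static List<String> join(List<String> mllibFeatures,
                                    Map<String, String> feature2cpg,
                                    Map<String, String> cpg2gene,
                                    String separator) {
        List<String> rows = new ArrayList<>();
        for (String feature : mllibFeatures) {
            String cpg = feature2cpg.getOrDefault(feature, "unknown");
            String gene = cpg2gene.getOrDefault(cpg, "unknown");
            rows.add(feature + separator + cpg + separator + gene);
        }
        return rows;
    }
}
```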

com.github.fcproj.bigbiocl.data_preparation

ExtractNFeatures

Extract the first N features from a CSV file (comma separated). The first and the last column are also kept. The header is maintained.
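
The per-row column selection can be sketched as follows (hypothetical class name; the same function applies unchanged to the header row, which is how the header is maintained):

```java
// Sketch of the column selection done by ExtractNFeatures: keep the first column,
// the first N feature columns, and the last (category) column of each CSV row.
public class ExtractNFeaturesSketch {

    public static String selectColumns(String csvLine, int n) {
        String[] cols = csvLine.split(",");
        StringBuilder out = new StringBuilder(cols[0]);       // first column (sample id)
        for (int i = 1; i <= n && i < cols.length - 1; i++) { // first N feature columns
            out.append(',').append(cols[i]);
        }
        out.append(',').append(cols[cols.length - 1]);        // last column (category)
        return out.toString();
    }
}
```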

Some Examples
Extract first N columns, plus first and last

java -Xmx4096m -cp "/home/fabrizio/Experiments/JAR/bigbiocl-1.0.0.jar" com.github.fcproj.bigbiocl.data_preparation.ExtractNFeatures 10 /home/fabrizio/Experiments/DNAMeth/DNAMeth_MERGED_NOCONTROL_brca.csv /home/fabrizio/Experiments/DNAMeth/DNAMeth_first10.csv

CSVExtractHeader

It does two things: (1) it produces a CSV file mapping Spark MLlib feature labels to the original feature labels, and (2) it extracts to a text file the first column of features (i.e. the second column of the input CSV file). Spark is required.
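
The header-to-mapping step of point (1) can be sketched as follows (hypothetical class, relying on the dataset convention above that the first and last header fields are not features):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of building the features2cpg mapping from the CSV header: the first column
// (sample id) and the last column (category) are not features, so "feature 0"
// corresponds to the second header field, "feature 1" to the third, and so on.
public class HeaderMapper {

    public static List<String> buildMapping(String headerLine) {
        String[] cols = headerLine.split(",");
        List<String> mapping = new ArrayList<>();
        for (int i = 1; i < cols.length - 1; i++) {
            mapping.add("feature " + (i - 1) + "," + cols[i]); // e.g. "feature 0,cg13869341"
        }
        return mapping;
    }
}
```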

MappingGenesCPG

Produce a CSV (comma separated) mapping CpGs to Genes.