Fabrizio Celli edited this page Jun 16, 2017 · 10 revisions

BIGBIOCL supports a study that applies big data solutions and machine learning techniques to the analysis of large DNA-methylation datasets for various diseases, such as breast cancer. Starting from a large DNA-methylation dataset whose rows are labeled 'normal' or 'tumor', our goal is to extract a set of candidate genes that may play a role in the disease. Candidate genes are extracted from the CpG locations that appear in the computed classification models. This package can be applied to similar research, or to other datasets compatible with the application's requirements.

What is the research about?

The algorithm proposed here is inspired by CAMUR and adapted so that it can be applied to large input datasets. To achieve this, we used Apache Spark (in Local mode and on top of a Hadoop cluster) and MLlib (Spark's machine learning library). We carried out experiments with both Decision Trees and Random Forests; Random Forests showed better results in terms of F-measure and performance. We then adopted a modified version of the CAMUR iterative algorithm with feature deletion, which let us extract many models and many candidate genes. The results can be given to biologists to check whether the extracted genes may be drivers of the related cancer.
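The iterative idea with feature deletion can be illustrated with a minimal, Spark-free sketch. It is not the code in this repository: the `Trainer` interface, the `Model` class, the `toyTrainer` stand-in (which replaces a real Random Forest), and the stopping rule (an F-measure threshold plus an iteration cap) are all assumptions made for illustration. Each round trains a model on the features still available, records the features the model used, and removes them before the next round.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class IterativeFeatureDeletion {

    /** Hypothetical abstraction over "train a classifier using only these features". */
    public interface Trainer {
        Model train(Set<Integer> allowedFeatures);
    }

    /** One extracted model: the feature indices it used and its F-measure. */
    public static final class Model {
        public final Set<Integer> usedFeatures;
        public final double fMeasure;

        public Model(Set<Integer> usedFeatures, double fMeasure) {
            this.usedFeatures = usedFeatures;
            this.fMeasure = fMeasure;
        }
    }

    /**
     * Repeatedly trains a model, then deletes the features that model used.
     * Stops when quality drops below minFMeasure (assumed stopping rule),
     * when no features remain, or after maxIterations rounds.
     */
    public static List<Model> extractModels(Trainer trainer, int numFeatures,
                                            double minFMeasure, int maxIterations) {
        Set<Integer> remaining = new TreeSet<>();
        for (int i = 0; i < numFeatures; i++) {
            remaining.add(i);
        }
        List<Model> models = new ArrayList<>();
        for (int iter = 0; iter < maxIterations && !remaining.isEmpty(); iter++) {
            Model model = trainer.train(remaining);
            if (model.fMeasure < minFMeasure || model.usedFeatures.isEmpty()) {
                break;
            }
            models.add(model);
            remaining.removeAll(model.usedFeatures); // feature deletion step
        }
        return models;
    }

    /** Union of all features that appear in at least one extracted model. */
    public static Set<Integer> candidateFeatures(List<Model> models) {
        Set<Integer> all = new TreeSet<>();
        for (Model m : models) {
            all.addAll(m.usedFeatures);
        }
        return all;
    }

    /**
     * Toy stand-in for a real classifier: picks the two highest-scoring
     * allowed features and reports their mean score as the "F-measure".
     */
    public static Trainer toyTrainer(double[] scores) {
        return allowed -> {
            List<Integer> ranked = new ArrayList<>(allowed);
            ranked.sort((a, b) -> Double.compare(scores[b], scores[a]));
            List<Integer> picked = ranked.subList(0, Math.min(2, ranked.size()));
            double sum = 0.0;
            for (int f : picked) {
                sum += scores[f];
            }
            double fMeasure = picked.isEmpty() ? 0.0 : sum / picked.size();
            return new Model(new TreeSet<>(picked), fMeasure);
        };
    }

    public static void main(String[] args) {
        double[] scores = {0.99, 0.97, 0.95, 0.93, 0.60, 0.55};
        List<Model> models = extractModels(toyTrainer(scores), scores.length, 0.9, 10);
        // prints: models=2 candidates=[0, 1, 2, 3]
        System.out.println("models=" + models.size()
                + " candidates=" + candidateFeatures(models));
    }
}
```

In the setting described above, the feature indices would correspond to CpG locations, and the union returned by `candidateFeatures` would be the starting point for looking up candidate genes.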

What is this repository for?

This repository contains Java code that can be executed on Apache Spark (in Local mode or on YARN) for the analysis of DNA-methylation data (or similar datasets). It includes several applications for computing Decision Trees or Random Forests, as well as for applying feature selection.

How do I get set up?

Once you have built the JAR file (e.g. "camur-0.0.1-SNAPSHOT.jar"), you can run some of the Java applications with the "java" command, or you can submit the JAR to a Spark environment with the "spark-submit" command. To run machine learning jobs, Spark must already be installed on your machine; you can then run Spark in Local mode or on top of a YARN cluster. Java 8 is required.

Output of our experiments

See the wiki page

Description of Java Packages and Examples

See the wiki page