Creating Deep Learning Models for Classification of Sequencing Artifacts in Long-Read Whole Genome Sequencing (WGS) Data

The goal of this project is to build deep learning classification models that distinguish artifactual variant calls from genuine variants in long-read sequencing data. The work is divided into three major steps, each corresponding to one of the three notebooks here:

Processing raw FAST5 files into variant calls, which includes:
- downloading raw FAST5 data from the Nanopore Whole Genome Sequencing Consortium GitHub repository,
- base calling to get FASTQ files,
- mapping to the human reference genome to get BAM files,
- variant calling to get VCF files, and
- intersecting with NIST Genome-in-a-Bottle gold-standard benchmarking data to determine which variant calls are correct and which are artifacts.
Preprocessing the VCFs into a format usable for deep learning modeling, which includes:
- extracting the sequence context of every variant in the VCF files from the reference genome,
- creating a pandas features-labels dataframe, with the sequence context as the feature and artifact/not-artifact as the label,
- splitting the data into train-validation and test sets, and
- encoding the sequences and labels into NumPy arrays for input into the deep learning models.
Fitting and evaluating various deep learning models, including:
- multilayer perceptrons (MLPs)
- convolutional neural networks (CNNs)
- recurrent neural networks (RNNs), e.g. LSTMs
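
The labeling and sequence-context steps of the preprocessing stage can be illustrated with a minimal sketch. This is not the notebooks' actual code: the function names `extract_context` and `build_rows`, the toy reference, the flank size, the exact-match comparison against a truth set, and the label convention (1 = artifact, 0 = genuine) are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical variant key: (chromosome, 1-based position, ref allele, alt allele)
Variant = Tuple[str, int, str, str]


def extract_context(reference: Dict[str, str], chrom: str, pos: int, flank: int = 5) -> str:
    """Return the reference bases within `flank` of a 1-based position.

    Positions that fall off either end of the contig are padded with 'N'
    so that every context has length 2 * flank + 1.
    """
    seq = reference[chrom]
    center = pos - 1  # convert 1-based VCF position to 0-based index
    start, end = center - flank, center + flank + 1
    left_pad = "N" * max(0, -start)
    right_pad = "N" * max(0, end - len(seq))
    return left_pad + seq[max(0, start):min(len(seq), end)] + right_pad


def build_rows(
    calls: List[Variant],
    truth: Set[Variant],
    reference: Dict[str, str],
    flank: int = 5,
) -> List[dict]:
    """Pair each call's sequence context with a label.

    Assumed convention: label 0 if the call exactly matches a benchmark
    variant, 1 (artifact) otherwise. Exact matching is a simplification.
    """
    rows = []
    for chrom, pos, ref, alt in calls:
        rows.append({
            "context": extract_context(reference, chrom, pos, flank),
            "label": 0 if (chrom, pos, ref, alt) in truth else 1,
        })
    return rows


# Toy data standing in for the reference genome, the called VCF, and the benchmark set
reference = {"chr1": "ACGTACGTACGTACGTACGT"}
calls = [("chr1", 10, "C", "T"), ("chr1", 2, "C", "G")]
truth = {("chr1", 10, "C", "T")}

rows = build_rows(calls, truth, reference, flank=3)
for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed directly to `pandas.DataFrame(rows)` to obtain a features-labels dataframe of the kind described above.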

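The splitting and encoding steps can be sketched in the same spirit. The example below uses only the Python standard library so that it runs anywhere; one-hot encoding is an assumed choice of encoding, and the helper names `one_hot` and `split_indices`, the 20% test fraction, and the toy contexts are hypothetical.

```python
import random
from typing import List, Tuple

BASES = "ACGT"  # column order of the one-hot encoding (an assumption)


def one_hot(seq: str) -> List[List[int]]:
    """Encode a DNA string as a len(seq) x 4 matrix.

    Any base outside A/C/G/T (for example 'N') becomes an all-zero row.
    """
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def split_indices(n: int, test_fraction: float = 0.2, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle the indices 0..n-1 reproducibly and split them into (train_val, test)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]


# Toy contexts and labels (1 = artifact, 0 = genuine; assumed convention)
contexts = ["GTACGTA", "NNACGTA", "ACGTACG", "TTTTAAA", "CCCGGGA"]
labels = [0, 1, 0, 1, 0]

train_idx, test_idx = split_indices(len(contexts), test_fraction=0.2, seed=0)

X_train = [one_hot(contexts[i]) for i in train_idx]
y_train = [labels[i] for i in train_idx]
X_test = [one_hot(contexts[i]) for i in test_idx]
y_test = [labels[i] for i in test_idx]

print(len(X_train), "train-validation examples,", len(X_test), "test examples")
```

Nested lists of this shape can be converted with `numpy.asarray` into arrays of shape (examples, sequence length, 4), which is a layout that CNNs and RNNs commonly accept; an MLP would typically need the last two dimensions flattened.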