DREAM (DNA cis-Regulatory Elements with controllable Activity design platforM) is an efficient, scalable and explainable computational framework to design CREs from scratch.
- Python 3.10.13
- TensorFlow 2.13
- scikit-learn
- h5py
- pysam
- deap 1.4.1
git clone https://github.com/cisGrammar/DREAM.git
cd DREAM
conda install --file requirements.txt
The structure of DREAM's enhancer activity prediction framework (SENet):
The script encode_sequences.py encodes DNA sequences into one-hot format and saves them as an HDF5 dataset.
Command Line Arguments
- --REFERENCE: Path to the reference genome in FASTA format (default: '/data/reference/melanogaster/dm3.fa').
- --output_h5: Path to save the output HDF5 file (default: 'alldata.h5').
- --traindata_df: Path to the training dataset CSV file (default: 'train_dataset.csv').
- --valdata_df: Path to the validation dataset CSV file (default: 'val_dataset.csv').
- --testdata_df: Path to the test dataset CSV file (default: 'test_dataset.csv').
To encode DNA sequences and prepare datasets, use the following command:
python dataset/dna_data.py --REFERENCE '/data/reference/melanogaster/dm3.fa' --output_h5 'alldata.h5' --traindata_df 'train_dataset.csv' --valdata_df 'val_dataset.csv' --testdata_df 'test_dataset.csv'
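To make the encoding step concrete, here is a minimal sketch of one-hot encoding in plain Python. The channel order A, C, G, T and the all-zero row for ambiguous bases such as N are assumptions made for illustration; the actual script may use a different convention, and it additionally reads sequences from the reference genome and writes arrays to HDF5, which this sketch leaves out.

```python
# Illustrative only: the channel order (A, C, G, T) and the all-zero row for
# ambiguous bases are assumptions, not taken from the DREAM code.
BASES = "ACGT"


def one_hot_encode(sequence):
    """Return a len(sequence) x 4 matrix as nested lists of 0/1 integers."""
    index = {base: i for i, base in enumerate(BASES)}
    encoded = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        if base in index:  # N and other ambiguity codes stay all-zero
            row[index[base]] = 1
        encoded.append(row)
    return encoded


if __name__ == "__main__":
    for row in one_hot_encode("ACGTN"):
        print(row)
```

With the default --sequence_length of 249 used later in the design step, each encoded sequence would be a 249 x 4 matrix under this convention.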
The script train.py is a TensorFlow script for training an SE-ResNet model on an enhancer activity dataset (e.g., a STARR-seq or MPRA dataset).
It supports the following command line arguments:
- --cuda_devices: Specifies the CUDA_VISIBLE_DEVICES setting (default: '0').
- --data_path: Path to the HDF5 data file containing training, validation, and test datasets (default: 'alldata.h5').
- --batch_size: Batch size for training (default: 512).
- --epochs: Number of epochs for training (default: 100).
- --patience: Patience for early stopping (default: 10).
- --checkpoint_path: Prefix for the model checkpoint files (default: 'checkpoint').
To train the SE-ResNet model, use the following command:
python train.py --cuda_devices '0' --data_path 'path/to/your/data.h5' --batch_size 512 --epochs 100 --patience 10 --checkpoint_path 'your_checkpoint_prefix'
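The interaction between --epochs and --patience can be illustrated with a small, framework-free sketch. It assumes that early stopping monitors the validation loss and stops after `patience` consecutive epochs without improvement; which metric train.py actually monitors is not stated here, and the function name is hypothetical.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs. If that never happens, the number of
    available epochs is returned.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)


# Hypothetical validation losses, one per epoch.
losses = [1.0, 0.8, 0.9, 0.85, 0.81, 0.7]
stop_at = early_stopping_epoch(losses, patience=3)
```

In this example the loss last improves at epoch 2, so with a patience of 3 training stops at epoch 5 and never reaches the better value at epoch 6. A larger patience trades longer training time for robustness to such temporary plateaus.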
The script GA/ga.py implements an evolutionary algorithm, built on the DEAP framework, that optimizes CRE activity based on the output of the CRE activity prediction module.
It supports the following command-line arguments:
- --sequence_length: Length of the DNA sequences to generate (default: 249).
- --nucleotide_frequency: Frequencies of nucleotides [A, C, G, T] (default: [0.25, 0.25, 0.25, 0.25]).
- --seed: Random seed for TensorFlow and NumPy (default: 12345).
- --cuda_devices: CUDA visible devices (default: "0").
- --best_model_checkpoint: Path to the best model checkpoint for fitness prediction (default: "checkpoint_keras/").
- --indpb: Probability of mutating each nucleotide in an individual (default: 0.025).
- --population_size: Initial population size (default: 100000).
- --NGEN: Number of generations to evolve (default: 90).
- --output_file: Path to save the evolution fitness dataframe CSV (default: "evolution_fits_df.csv").
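As a rough illustration of how such options are typically declared, the sketch below builds an argparse parser for a subset of the arguments listed above, using the documented defaults. It is a hypothetical reconstruction, not the parser from GA/ga.py; in particular, taking --nucleotide_frequency as four space-separated floats (`nargs=4`) is an assumption.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring part of the documented GA options."""
    parser = argparse.ArgumentParser(description="Sketch of a GA/ga.py-style CLI")
    parser.add_argument("--sequence_length", type=int, default=249)
    # Four floats in the order A, C, G, T (assumed to be passed space-separated).
    parser.add_argument("--nucleotide_frequency", type=float, nargs=4,
                        default=[0.25, 0.25, 0.25, 0.25],
                        metavar=("A", "C", "G", "T"))
    parser.add_argument("--seed", type=int, default=12345)
    parser.add_argument("--indpb", type=float, default=0.025)
    parser.add_argument("--population_size", type=int, default=100000)
    parser.add_argument("--NGEN", type=int, default=90)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is testable.
args = build_parser().parse_args(
    ["--nucleotide_frequency", "0.3", "0.2", "0.2", "0.3", "--NGEN", "10"]
)
```

Options that are not given on the command line, such as --population_size here, fall back to their defaults.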
Trained models are available for download via Baidu Netdisk. We recommend using these pre-trained models for enhancer/silencer element design and optimization. Download link: https://pan.baidu.com/s/1m4fEcxwm8GnmhtfKJeyD3g Access code: yywq
python GA/ga.py --sequence_length 249 --nucleotide_frequency 0.25 0.25 0.25 0.25 --seed 12345 --best_model_checkpoint "/path/to/best_model/" --indpb 0.025 --population_size 1000 --NGEN 10 --output_file "/path/to/output.csv"
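The evolutionary loop behind this command can be sketched in a few lines of plain Python. This is not the DEAP-based implementation in GA/ga.py: the fitness function is a placeholder (GC fraction) standing in for the trained model's predicted activity, the selection scheme (keep the fitter half, refill with mutated copies) is an assumption, and crossover is omitted. Only the roles of sequence length, nucleotide frequencies, indpb, population size, and number of generations follow the arguments described above.

```python
import random

BASES = "ACGT"


def random_sequence(length, frequencies, rng):
    """Draw a sequence whose bases follow the given [A, C, G, T] frequencies."""
    return "".join(rng.choices(BASES, weights=frequencies, k=length))


def mutate(sequence, indpb, rng):
    """Replace each base by a different base with probability indpb."""
    mutated = []
    for base in sequence:
        if rng.random() < indpb:
            base = rng.choice([b for b in BASES if b != base])
        mutated.append(base)
    return "".join(mutated)


def placeholder_fitness(sequence):
    """Stand-in for the model's predicted activity: GC fraction (assumption)."""
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def evolve(sequence_length=249, frequencies=(0.25, 0.25, 0.25, 0.25),
           indpb=0.025, population_size=200, ngen=10, seed=12345):
    rng = random.Random(seed)
    population = [random_sequence(sequence_length, frequencies, rng)
                  for _ in range(population_size)]
    best_per_generation = []
    for _ in range(ngen):
        # Keep the fitter half unchanged and refill with mutated copies of it.
        population.sort(key=placeholder_fitness, reverse=True)
        parents = population[:max(1, population_size // 2)]
        children = [mutate(parents[i % len(parents)], indpb, rng)
                    for i in range(population_size - len(parents))]
        population = parents + children
        best_per_generation.append(max(map(placeholder_fitness, population)))
    return population, best_per_generation


population, history = evolve()
```

Because the fitter half survives unchanged, the best fitness recorded in `history` never decreases from one generation to the next. In the real workflow, the placeholder would be replaced by predictions from the model loaded via --best_model_checkpoint.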