This repository contains the code for ToxBERT, a pretrained BERT-based model that incorporates biological and chemical data during the pretraining stage.
- Installation
- Data Preparation
- Model Architecture
- Training the Model
- Reproducing Results
- Making Predictions
- Results
- Citation
## Installation

- Clone the repository:

```bash
git clone https://github.com/aidd-msca/ToxBERT.git
cd ToxBERT
```
- Create and activate a conda environment:

```bash
conda create -y -q -n ToxBERT -c rdkit rdkit=2019.03.1.0 python=3.7.3
conda activate ToxBERT
```
On Aalto University's Triton cluster, use instead:

```bash
module load mamba  # specific to Aalto University (triton)
source activate ToxBERT
```
- Install dependencies:

```bash
pip install -e .
```
## Data Preparation

The model requires two types of data:

- Pretraining data: in vitro data with binary labels
- Fine-tuning data: preclinical and clinical data with binary labels
The input data should be in pickle (`.pkl`) format with the following structure:

```python
{
    'SMILES': [...],       # List of SMILES strings
    'property1': [...],    # Binary labels (0 or 1)
    'property2': [...],
    # ... additional properties
}
```
```python
import pandas as pd

# Load your data
data = pd.read_csv('your_data.csv')

# Process SMILES and properties
processed_data = {
    'SMILES': data['smiles'].values,
    'property1': data['prop1'].values.astype(int),
    'property2': data['prop2'].values.astype(int)
}

# Save as pickle
pd.to_pickle(processed_data, 'processed_data.pkl')
```
Place your prepared data files in the `data/` directory:

```
data/
├── train_set_invitro_1m_300k_ADME_filtered.pkl
├── test_set_invitro_1m_300k_ADME_filtered.pkl
└── pos_weights.csv
```
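How `pos_weights.csv` is produced is not documented here. Below is a minimal, hypothetical sketch that loads the prepared training pickle and derives per-task positive-class weights (the negative/positive ratio), which is one common way such a file is generated; the column handling and weighting scheme are assumptions, not the repository's exact recipe.

```python
# Hypothetical sketch: derive per-task positive-class weights from the
# prepared training pickle. Weighting scheme and column handling are assumptions.
import pandas as pd

train = pd.DataFrame(pd.read_pickle("data/train_set_invitro_1m_300k_ADME_filtered.pkl"))

label_cols = [c for c in train.columns if c != "SMILES"]
n_pos = (train[label_cols] == 1).sum()
n_neg = (train[label_cols] == 0).sum()

# Up-weight rare positive labels so imbalanced tasks still contribute to the loss
pos_weights = n_neg / n_pos
pos_weights.to_csv("data/pos_weights.csv", header=["pos_weight"])
```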
## Model Architecture

The model consists of three main components:

- BERT encoder for molecular representation
- Masked language modeling head for pre-training
- Task-specific heads for:
  - ADME property prediction
  - Physicochemical property prediction
Architecture details:
- BERT output dimension: 768
- Maximum sequence length: 128
- Hidden layer size: 2048
- Number of attention heads: 12
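A minimal, hypothetical PyTorch sketch of how these components fit together is shown below. It uses generic `nn.TransformerEncoder` layers rather than the repository's actual modules, and all class, layer, and head names (as well as the number of encoder layers) are illustrative assumptions.

```python
# Illustrative sketch only, not the repository's implementation.
import torch
import torch.nn as nn

class ToxBERTSketch(nn.Module):
    def __init__(self, vocab_size, n_adme_tasks, n_physchem_tasks,
                 bert_dim=768, hidden_dim=2048, n_heads=12):
        super().__init__()
        # 1) BERT-style encoder for molecular (SMILES) representations
        self.embedding = nn.Embedding(vocab_size, bert_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=bert_dim, nhead=n_heads,
            dim_feedforward=hidden_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)  # depth assumed
        # 2) Masked language modeling head used during pretraining
        self.mlm_head = nn.Linear(bert_dim, vocab_size)
        # 3) Task-specific heads
        self.adme_head = nn.Linear(bert_dim, n_adme_tasks)          # in-vitro / ADME labels
        self.physchem_head = nn.Linear(bert_dim, n_physchem_tasks)  # physicochemical properties

    def forward(self, token_ids):
        h = self.encoder(self.embedding(token_ids))  # (batch, seq_len, bert_dim)
        pooled = h[:, 0]                             # first-token summary vector
        return {
            "mlm_logits": self.mlm_head(h),
            "adme_logits": self.adme_head(pooled),
            "physchem_pred": self.physchem_head(pooled),
        }
```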
## Training the Model

- Configure your training parameters in `config/default_config.yaml`:

```yaml
project_name: "BERT_invitro_pretraining"
model_name: "with_masking_invitro_physchem_heads"
max_epochs: 50
batch_size: 264
lr: 1e-05
```
- Set up environment variables:

```bash
export MODEL_WEIGHTS_DIR="/path/to/weights"
export DATA_DIR="/path/to/data"
export WANDB_API_KEY="your_wandb_key"  # Optional, for logging
```
- Start training:

```bash
python scripts/train.py --config config/default_config.yaml
```
The pretraining data can be downloaded from: https://figshare.com/articles/dataset/Pretraining_data/28334303
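If you prefer to fetch the files from a script, the hypothetical sketch below uses figshare's public API; the article ID comes from the URL above, but the API usage, file names, and destination paths are assumptions and not part of this repository.

```python
# Hypothetical sketch: download all files of the figshare record via its public API.
import requests

article = requests.get("https://api.figshare.com/v2/articles/28334303").json()
for f in article.get("files", []):
    print("downloading", f["name"])
    with requests.get(f["download_url"], stream=True) as r:
        r.raise_for_status()
        with open("data/" + f["name"], "wb") as out:
            for chunk in r.iter_content(chunk_size=1 << 20):
                out.write(chunk)
```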
## Citation

If you use this code in your research, please cite:

```bibtex
@article{ToxBERT,
  title   = {ToxBERT - Modeling DILI by pretraining BERT on invitro data},
  author  = {Muhammad Arslan Masood and Samuel Kaski and Anamya Ajjolli Nagaraja and Katia Belaid and Natalie Mesens and Hugo Ceulemans and Dorota Herman and Markus Heinonen},
  journal = {under review},
  year    = {2025}
}
```
## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Contact

Muhammad Arslan Masood - [email protected]