ToxBERT - Modeling DILI by pretraining BERT on in vitro data

This repository contains the code for ToxBERT, a BERT-based model that incorporates both biological and chemical data during the pretraining stage.

Figure: ToxBERT architecture overview

Table of Contents

  • Installation
  • Data Preparation
  • Model Architecture
  • Training the Model
  • Pretraining using public data
  • Citation
  • License
  • Contact

Installation

  1. Clone the repository:

git clone https://github.com/aidd-msca/ToxBERT.git
cd ToxBERT

  2. Create and activate a conda environment:

conda create -y -q -n ToxBERT -c rdkit rdkit=2019.03.1.0 python=3.7.3
conda activate ToxBERT

If you are on Aalto University's Triton cluster, use mamba instead:

module load mamba  # specific to Aalto University (Triton)
source activate ToxBERT

  3. Install dependencies:

pip install -e .
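
As a quick sanity check of the environment, you can confirm that RDKit (pinned in the conda step) imports correctly; the PyTorch check below is an assumption based on the BERT-based model described later, not something stated in this README:

```python
# Quick environment check. RDKit is pinned in the conda step above;
# PyTorch is assumed to be pulled in by `pip install -e .` for the BERT model.
from rdkit import Chem
import torch

print(Chem.MolToSmiles(Chem.MolFromSmiles("CCO")))  # SMILES parsing round-trip
print(torch.__version__)
```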

Data Preparation

The model requires two types of data:

  1. Pretraining data: in vitro data with binary labels
  2. Fine-tuning data: preclinical and clinical data with binary labels

Data Format

The input data should be in pickle (.pkl) format with the following structure:

{
    'SMILES': [...],  # List of SMILES strings
    'property1': [...],  # Binary labels (0 or 1)
    'property2': [...],
    # ... additional properties
}

Example Data Preparation

import pandas as pd

# Load your data
data = pd.read_csv('your_data.csv')

# Process SMILES and properties
processed_data = {
    'SMILES': data['smiles'].values,
    'property1': data['prop1'].values.astype(int),
    'property2': data['prop2'].values.astype(int)
}

# Save as pickle
pd.to_pickle(processed_data, 'processed_data.pkl')
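
To double-check that the pickle matches the expected structure before training, it can be read back and validated (a minimal sketch using the file written above):

```python
import pandas as pd

# Read the pickle back and verify the expected structure.
loaded = pd.read_pickle('processed_data.pkl')

assert 'SMILES' in loaded, "a SMILES key is required"
for key, values in loaded.items():
    if key == 'SMILES':
        continue
    assert set(values) <= {0, 1}, f"{key} must contain only binary labels"

print(f"{len(loaded['SMILES'])} molecules, {len(loaded) - 1} binary properties")
```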

Place your prepared data files in the data/ directory:

data/
├── train_set_invitro_1m_300k_ADME_filtered.pkl
├── test_set_invitro_1m_300k_ADME_filtered.pkl
└── pos_weights.csv
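
The contents of pos_weights.csv are not spelled out here; a common convention is one positive-class weight per task (negatives divided by positives), which could be derived from the training split as in the sketch below. The column names are assumptions, so check what the training code actually expects:

```python
import pandas as pd

# Sketch: per-task positive-class weights (negatives / positives) computed from
# the training split. The CSV column names ('task', 'pos_weight') are assumed.
data = pd.read_pickle('data/train_set_invitro_1m_300k_ADME_filtered.pkl')

rows = []
for task, labels in data.items():
    if task == 'SMILES':
        continue
    labels = pd.Series(labels)
    pos = int((labels == 1).sum())
    neg = int((labels == 0).sum())
    rows.append({'task': task, 'pos_weight': neg / max(pos, 1)})

pd.DataFrame(rows).to_csv('data/pos_weights.csv', index=False)
```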

Model Architecture

The model consists of three main components:

  1. BERT encoder for molecular representation
  2. Masked language modeling head for pre-training
  3. Task-specific heads for:
    • ADME property prediction
    • Physicochemical property prediction

Architecture details:

  • BERT output dimension: 768
  • Maximum sequence length: 128
  • Hidden layer size: 2048
  • Number of attention heads: 12
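
For orientation, the sketch below wires these dimensions into placeholder PyTorch modules; the vocabulary size and per-head task counts are illustrative assumptions, and the actual encoder and heads are defined in the repository code:

```python
import torch.nn as nn

BERT_DIM = 768      # BERT output dimension
HIDDEN = 2048       # hidden layer size
VOCAB_SIZE = 512    # assumed placeholder for the SMILES token vocabulary

# Masked language modeling head: token logits from the encoder output.
mlm_head = nn.Linear(BERT_DIM, VOCAB_SIZE)

def task_head(num_tasks: int) -> nn.Sequential:
    """Feed-forward head producing one logit per binary property."""
    return nn.Sequential(
        nn.Linear(BERT_DIM, HIDDEN),
        nn.ReLU(),
        nn.Linear(HIDDEN, num_tasks),
    )

adme_head = task_head(num_tasks=10)       # task counts are illustrative
physchem_head = task_head(num_tasks=5)
```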

Training the Model

  1. Configure your training parameters in config/default_config.yaml:

project_name: "BERT_invitro_pretraining"
model_name: "with_masking_invitro_physchem_heads"
max_epochs: 50
batch_size: 264
lr: 1e-05

  2. Set up environment variables:

export MODEL_WEIGHTS_DIR="/path/to/weights"
export DATA_DIR="/path/to/data"
export WANDB_API_KEY="your_wandb_key"  # Optional, for logging

  3. Start training:

python scripts/train.py --config config/default_config.yaml
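
The exact keys read by scripts/train.py are not documented here, but a typical way to resolve the config file and the environment variables above would look like this sketch:

```python
import os
import yaml

# Sketch: load the YAML config and pick up the environment variables set above.
with open('config/default_config.yaml') as f:
    config = yaml.safe_load(f)

weights_dir = os.environ['MODEL_WEIGHTS_DIR']
data_dir = os.environ['DATA_DIR']
wandb_key = os.environ.get('WANDB_API_KEY')  # optional, only needed for W&B logging

print(config['project_name'], config['max_epochs'], config['lr'])
```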

Pretraining using public data

The pretraining data can be downloaded from:
https://figshare.com/articles/dataset/Pretraining_data/28334303
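
After downloading, place the files in data/ (assuming they use the names listed under Data Preparation); they can then be inspected with pandas:

```python
import pandas as pd

# Inspect the downloaded pretraining splits (paths follow the data/ layout above).
train = pd.read_pickle('data/train_set_invitro_1m_300k_ADME_filtered.pkl')
test = pd.read_pickle('data/test_set_invitro_1m_300k_ADME_filtered.pkl')

print(len(train['SMILES']), 'training molecules')
print(len(test['SMILES']), 'test molecules')
```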

Citation

If you use this code in your research, please cite:

@article{ToxBERT,
    title={ToxBERT - Modeling DILI by pretraining BERT on invitro data},
    author={Muhammad Arslan Masood and Samuel Kaski and Anamya Ajjolli Nagaraja and Katia Belaid and Natalie Mesens and Hugo Ceulemans and Dorota Herman and Markus Heinonen},
    journal={under review},
    year={2025}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

Muhammad Arslan Masood - [email protected]