Skip to content

Latest commit

 

History

History
95 lines (70 loc) · 4.46 KB

README.md

File metadata and controls

95 lines (70 loc) · 4.46 KB

ExEnDiff: An Experiment-Guided Diffusion model for protein conformational ensemble

PyTorch Lightning Config: Hydra Template
bioarXiv

This repository contains the official implementation of the paper "ExEnDiff: An Experiment-Guided Diffusion Model for Protein Conformational Ensembles." ExEnDiff integrates experimental measurements from NMR, SAXS, cryo-EM, and more to guide a pretrained diffusion-based protein ensemble sampler. This approach enhances the sampler's ability to produce conformational ensembles that align with experimental data while approximating the Boltzmann distribution. The current implementation is built on Str2str, a recently developed protein ensemble diffusion sampler, but can be adapted to other diffusion-based models.

The current model supports guiding the sampling process with end-to-end distance, radius of gyration, helix percent per residue. Additional, we provide preliminary models guiding with cryo-EM 2D density images:

  • End-to-end distance
  • Radius of gyration
  • Helix percent per residue with distance operator
  • RMSD w.r.t the folded structure
  • Cryo-EM 2D density images (preliminary)
  • Helix percent per residue with RMSD operator (in progress)
  • Beta percent per residue with RMSD operator (in progress)

Installation

git clone https://github.com/Harrydirk41/ExEnDiff.git
cd ExEnDiff

# Create conda environment.
conda env create -f environment.yml
conda activate xendiff

# Support import as a package.
pip install -e .

Inference

There are a few steps to follow before running the sampling:

  • Put the PDB file of the folded protein structure (either from PDB, Alphafold, ESMfold) into '/path/to/ExEnDiff/data/folded_structure'. The PDB file serves both as the condition structure for Str2str and the reference structure for RMSD operator.
  • Put experiment measurement samples into '/path/to/ExEnDiff/data/experiment_samples'. The samples can be saved as a direct 'protein_name/.npy' file or as a pickle file savingGaussian Mixture Model's parameters in 'path/to/ExEnDiff/data/GMM_save_list'. An example of the pickle file including end-to-end distance, radius of gyration, and RMSD is provided.
  • In 'configs/model/diffusion.yaml', change protein_name and multi_operator_choice to indicate which operator(s) to use for guiding the sampling process. Additionally, change conditional_noise and conditional_multi_noise to control the guiding strength.
  • For other customized operators, put the operator function in "/path/to/ExEnDiff/src/models/operator/operator.py", and indicate whether to use GMM pickle (0) or npy file (1) to load samples in conditional_GMM_use_prev_sample in 'configs/model/diffusion.yaml'.
  • To perform inference, a trained model checkpoint is required and can be specified using ckpt_path=/path/to/checkpoint. A pretrained PyTorch checkpoint can be accessed from Google Drive. Download and put it into data/ckpt.
python src/eval.py task_name=inference ckpt_path=/path/to/some/checkpoint
  • Optionally, to fix physical errors in the generate samples, you can run the following command:
python /path/to/ExEnDiff/scripts/run_pdb_fix.py /path/to/pdb_file --gpu_list=0,1 --temp_fix_dir=/path/to/ExEnDiff/data/temp_fix

Citation

@article{liu2024ExEnDiff,
  title={ExEnDiff: An Experiment-Guided Diffusion model for protein conformational ensemble generation},
  author={Liu, Yikai and Sahoo, Abhilash and Yu, Zongxin and Lin, Guang and Chen, Ming and Hanson, Sonya M},
  journal={bioRxiv},
  pages={2024--10},
  year={2024},
  publisher={Cold Spring Harbor Laboratory}
}