This repository contains the scripts to train a neural network model for Protein Structure Search based on protein-level embeddings.
Our model ingests protein residue-level embeddings computed with a structure-aware protein language model available in this repo.
The model was trained and tested on the datasets described in this publication.
The training and testing datasets were carefully curated from CATH and ECOD domain classifications, ensuring independence and relevance.
Ready-to-use residue-level embeddings for the training and testing domains in the datasets can be downloaded from Zenodo.
Domain classification identifiers for the different CATH and ECOD domain structures can be found in the resource folder.
The main script for training and testing is lightning_structure_embedding.py. Two parameters are required:
--class_path
Should point to the directory where the CATH and ECOD classification ID files are located. The files are expected to be named cath.tsv and ecod.tsv.
--embedding_path
Should point to the directory where the training and testing residue-level embeddings are stored. The script expects CATH domain embeddings to be stored in <embedding_path>/cath/embedding and ECOD domain embeddings in <embedding_path>/ecod/embedding.
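The layout described above can be checked before launching a run. The following is a minimal sketch, not part of the repository: missing_inputs and build_command are hypothetical helper names, and the command assumes the script is invoked directly with python from the directory that contains it. Only the file names, subdirectories, and the two flags come from the description above.

```python
from pathlib import Path


def missing_inputs(class_path, embedding_path):
    """Return the expected input paths that do not exist (hypothetical helper)."""
    class_dir = Path(class_path)
    embedding_dir = Path(embedding_path)
    expected = [
        class_dir / "cath.tsv",                 # CATH classification IDs
        class_dir / "ecod.tsv",                 # ECOD classification IDs
        embedding_dir / "cath" / "embedding",   # CATH residue-level embeddings
        embedding_dir / "ecod" / "embedding",   # ECOD residue-level embeddings
    ]
    return [str(path) for path in expected if not path.exists()]


def build_command(class_path, embedding_path):
    """Assemble an argv list for the main script with its two required flags.

    Assumption: the script is run with `python` from the repository root.
    """
    return [
        "python",
        "lightning_structure_embedding.py",
        "--class_path", str(class_path),
        "--embedding_path", str(embedding_path),
    ]
```

If missing_inputs returns an empty list, the list from build_command can be passed to subprocess.run(..., check=True) to start training and testing.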