Antibody-specific Epitope Prediction (AsEP) Dataset. This dataset is used in the manuscript AsEP: Benchmarking Deep Learning Methods for Antibody-specific Epitope Prediction (submitted to NeurIPS 2024 Datasets and Benchmarks).
The raw dataset can be downloaded from Zenodo.
We provide a 3D viewer for antibody-antigen interface visualization. Check it out at AsEP
To use the Python interface, install the `asep` package. It provides the following functionality:
- Dataset interface (see below)
- Code for loading pre-constructed graphs for antibody-antigen complexes in the dataset
- Code for constructing the neural network proposed in our manuscript, which uses Protein Language Models (PLMs) for node embeddings and Graph Neural Networks (GNNs) for graph representation
- Training and evaluation scripts
We provide a devcontainer configuration for Visual Studio Code in the `.devcontainer` directory.
We recommend using the devcontainer for development.
See the Visual Studio Code documentation for details on using devcontainers.
# enable conda (init zsh if you are using zsh, or init bash etc.)
conda init zsh
# you can also use `make` to prepare the conda environment
make setup-gpu-env
# if you don't have a GPU, then run
# make setup-cpu-env
# install other dependencies
make install-dependencies
This requires `make`; run `sudo apt install make` to install it.
The commands above will do the following:
- create a conda environment named `walle`
- install the required packages
- install the `asep` package in editable mode
Apart from the Zenodo link, we also provide console scripts to download the dataset. You can download the dataset by running the following command:
download-asep /path/to/directory AsEP
`/path/to/directory` is the directory where you want to save the dataset. `AsEP` is the name of the dataset; by default, it is `AsEP`.
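If you prefer to trigger the download from Python, the command above can be wrapped with `subprocess`. The following is a minimal sketch, not part of the `asep` package: the helper names `build_download_command` and `download_dataset` are hypothetical, and the only thing assumed about `download-asep` is the two positional arguments shown above.

```python
import shutil
import subprocess


def build_download_command(target_dir: str, name: str = "AsEP") -> list[str]:
    """Build the argument list for the `download-asep` console script."""
    return ["download-asep", target_dir, name]


def download_dataset(target_dir: str, name: str = "AsEP") -> None:
    """Run `download-asep` if it is on PATH (hypothetical wrapper)."""
    if shutil.which("download-asep") is None:
        raise RuntimeError("download-asep not found; install the asep package first")
    # check=True raises CalledProcessError if the script exits with an error
    subprocess.run(build_download_command(target_dir, name), check=True)


command = build_download_command("/path/to/directory")
print(command)  # ['download-asep', '/path/to/directory', 'AsEP']
```

Calling `download_dataset("/path/to/directory")` would then run the same command as the shell example.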
The antibody-antigen complexes are provided as 2D graph pairs. We provide two types of node features, one-hot encoding and pre-calculated embeddings with AntiBERTy and ESM2.
from asep.data.asepv1_dataset import AsEPv1Dataset, EmbeddingConfig
# one-hot encoding
config = EmbeddingConfig(node_feat_type="one-hot")
asepv1_dataset = AsEPv1Dataset(
    root="/path/to/asep/download/folder",  # replace with the path to the parent folder of the downloaded AsEP
    name="AsEP",
    embedding_config=config,
)
# pre-calculated embeddings with AntiBERTy (via igfold) and ESM2
config = EmbeddingConfig(
    node_feat_type='pre_cal',
    ab={"embedding_model": "igfold"},  # change this to "esm2" for ESM2 embeddings
    ag={"embedding_model": "esm2"},
)
asepv1_dataset = AsEPv1Dataset(
    root="/path/to/asep/download/folder",  # replace with the path to the parent folder of the downloaded AsEP
    name="AsEP",
    embedding_config=config,
)
# get i-th graph pair and node labels
i = 0
graph_pair = asepv1_dataset[i]
node_labels_b = graph_pair.y_b # antibody graph node labels (1 => interface nodes)
node_labels_g = graph_pair.y_g # antigen graph node labels (1 => interface nodes)
# bipartite graph edge indices between the antibody and antigen graphs, shape (2, E):
# row 0 holds antibody node indices, row 1 holds antigen node indices
edge_index_bg = graph_pair.edge_index_bg
The graph pair object `graph_pair` is a `PairData` object (inherited from `torch_geometric.data.Data`), which contains the following attributes:
- `x_b`, `x_g`: node features of the antibody and antigen graphs, respectively
  - if `one-hot`, then `x_b` and `x_g` are one-hot encodings of the amino acid residues, with shape `(N, 20)`
  - if `pre_cal`, then `x_b` and `x_g` are embedded with AntiBERTy and ESM2 `esm2_t12_35M_UR50D`, with shapes `(N, 512)` and `(N, 480)`, respectively
- `edge_index_b`, `edge_index_g`: edge indices of the antibody and antigen graphs, respectively, with shape `(2, E)`
- `edge_index_bg`: bipartite graph edge indices between the antibody and antigen graphs, with shape `(2, E)`
- `y_b`, `y_g`: node labels for the antibody and antigen graphs, respectively, with shape `(N,)`
  - `1` indicates interface residues
  - `0` indicates non-interface residues
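To make the attribute layout concrete, here is a toy graph pair written with plain Python lists instead of tensors. It is only a sketch: the amino acid ordering in `AMINO_ACIDS`, the helper names, and the rule that a node is labelled `1` when it appears in at least one bipartite edge are assumptions for illustration, not something confirmed by the `asep` package.

```python
# Assumed alphabet order; the ordering used by asep may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence: str) -> list[list[int]]:
    """Return an (N, 20) one-hot matrix as nested lists."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[index[residue]] = 1
        rows.append(row)
    return rows


def labels_from_edges(num_nodes: int, node_indices: list[int]) -> list[int]:
    """Mark a node with 1 if it occurs in any bipartite edge (assumed rule)."""
    touched = set(node_indices)
    return [1 if i in touched else 0 for i in range(num_nodes)]


# Toy antibody (4 residues) and antigen (5 residues)
x_b = one_hot("GYTF")
x_g = one_hot("KLMNP")

# Shape (2, E): row 0 = antibody node indices, row 1 = antigen node indices
edge_index_bg = [
    [0, 0, 2],
    [1, 3, 3],
]

y_b = labels_from_edges(len(x_b), edge_index_bg[0])
y_g = labels_from_edges(len(x_g), edge_index_bg[1])

print(len(x_b), len(x_b[0]))  # 4 20
print(y_b)  # [1, 0, 1, 0]
print(y_g)  # [0, 1, 0, 1, 0]
```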
# split_method either "epitope_ratio" or "epitope_group"
split_idx = asepv1_dataset.get_idx_split(split_method="epitope_ratio")
train_set = asepv1_dataset[split_idx['train']]
valid_set = asepv1_dataset[split_idx['valid']]
test_set = asepv1_dataset[split_idx['test']]
print(f"{len(asepv1_dataset)=}") # number of graph pairs
print(f"{len(train_set)=}") # number of training graph pairs
print(f"{len(valid_set)=}") # validation
print(f"{len(test_set)=}") # testing
# len(asepv1_dataset)=1723
# len(train_set)=1383
# len(valid_set)=170
# len(test_set)=170
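A quick sanity check on a split is that the three index sets are disjoint and together cover the dataset (1383 + 170 + 170 = 1723 above). The sketch below shows that check on plain Python lists; `check_split` is a hypothetical helper, and it assumes the indices have first been converted to lists of integers (for example with `.tolist()` if `get_idx_split` returns tensors, which is not confirmed here).

```python
def check_split(split_idx: dict[str, list[int]], dataset_size: int) -> bool:
    """Return True if the splits are pairwise disjoint and cover all indices."""
    seen: set[int] = set()
    total = 0
    for indices in split_idx.values():
        seen.update(indices)
        total += len(indices)
    # If any index is repeated, the set is smaller than the summed lengths
    no_overlap = total == len(seen)
    full_cover = seen == set(range(dataset_size))
    return no_overlap and full_cover


# Toy split with the same sizes as reported above
toy_split = {
    "train": list(range(0, 1383)),
    "valid": list(range(1383, 1553)),
    "test": list(range(1553, 1723)),
}
print(check_split(toy_split, dataset_size=1723))  # True
```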
We provide an evaluator to assess a model's performance on the AsEPv1 dataset. It expects the following inputs:
- `y_pred` (`torch.Tensor`): predicted node labels, with shape `(N,)`
- `y_true` (`torch.Tensor`): ground-truth node labels, with shape `(N,)`
import torch
from asep.data.asepv1_dataset import AsEPv1Evaluator
evaluator = AsEPv1Evaluator()
# example with random predictions and random binary labels
torch.manual_seed(0)
y_pred = torch.rand(1000)
y_true = torch.randint(0, 2, (1000,))
input_dict = {'y_pred': y_pred, 'y_true': y_true}
result_dict = evaluator.eval(input_dict)
print(result_dict) # got {'auc-prc': tensor(0.5565)}
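For intuition about what an area under the precision-recall curve measures, the sketch below computes average precision, a common estimator of that area, in pure Python. This is an illustration only: the evaluator's actual implementation is not shown in this README and may differ (for example in how ties or interpolation are handled), and `average_precision` is a hypothetical helper.

```python
def average_precision(y_pred: list[float], y_true: list[int]) -> float:
    """Average precision: mean of precision@k over the ranks k of true positives."""
    # Rank nodes from highest to lowest predicted score
    order = sorted(range(len(y_pred)), key=lambda i: y_pred[i], reverse=True)
    num_positive = sum(y_true)
    if num_positive == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_positive


scores = [0.9, 0.8, 0.7, 0.1]
labels = [1, 0, 1, 0]
print(round(average_precision(scores, labels), 4))  # 0.8333
```

Here the two positives sit at ranks 1 and 3, so the result is (1/1 + 2/3) / 2.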
Benchmark performance of several deep learning models on the AsEP dataset under two settings: epitope ratio and epitope group.
Methods | Publication | Code/Repository | Antibody-specific | Structure | PLM | Graph |
---|---|---|---|---|---|---|
WALLE | Under review | Here | ✅ | ✅ | ✅ | ✅ |
EpiPred | Publication | Code | ✅ | ✅ | ✕ | ✅ |
ESMFold | Publication | GitHub | ✅ | ✕ | ✅ | ✕ |
MaSIF-site | Publication | GitHub | ✕ | ✅ | ✕ | ✅ |
ESMBind | Publication | HuggingFace | ✕ | ✕ | ✅ | ✕ |

Results for the epitope ratio setting:

Algorithm | MCC | Precision | Recall | AUCROC | F1 |
---|---|---|---|---|---|
WALLE | 0.210 (0.020) | 0.235 (0.018) | 0.422 (0.028) | 0.635 (0.013) | 0.258 (0.018) |
EpiPred | 0.029 (0.018) | 0.122 (0.014) | 0.180 (0.019) | — | 0.142 (0.016) |
ESMFold | 0.028 (0.010) | 0.137 (0.019) | 0.043 (0.006) | — | 0.060 (0.008) |
ESMBind | 0.016 (0.008) | 0.106 (0.012) | 0.121 (0.014) | 0.506 (0.004) | 0.090 (0.009) |
MaSIF-site | 0.037 (0.012) | 0.125 (0.015) | 0.183 (0.017) | — | 0.114 (0.011) |
Values in parentheses are standard errors.

Results for the epitope group setting:

Algorithm | MCC | Precision | Recall | AUCROC | F1 |
---|---|---|---|---|---|
WALLE | 0.077 (0.015) | 0.143 (0.017) | 0.266 (0.025) | 0.544 (0.010) | 0.145 (0.014) |
EpiPred | -0.006 (0.015) | 0.089 (0.011) | 0.158 (0.019) | — | 0.112 (0.014) |
ESMFold | 0.018 (0.010) | 0.113 (0.019) | 0.034 (0.007) | — | 0.046 (0.009) |
ESMBind | 0.002 (0.008) | 0.082 (0.011) | 0.076 (0.011) | 0.500 (0.004) | 0.064 (0.008) |
MaSIF-site | 0.046 (0.014) | 0.164 (0.020) | 0.174 (0.015) | — | 0.128 (0.012) |
Values in parentheses are standard errors.
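
The threshold-based metrics in the tables follow from a confusion matrix of predicted versus true interface labels. The sketch below uses the standard definitions of precision, recall, F1, and MCC, plus the standard error of the mean across complexes. It is a hedged illustration: `binary_metrics` and `mean_and_stderr` are hypothetical helpers, and aggregating per-complex scores into a mean and standard error is an assumption about how the reported numbers were summarized.

```python
import math
import statistics


def binary_metrics(y_pred: list[int], y_true: list[int]) -> dict[str, float]:
    """Compute MCC, precision, recall, and F1 from binary node labels."""
    tp = sum(p == 1 and t == 1 for p, t in zip(y_pred, y_true))
    tn = sum(p == 0 and t == 0 for p, t in zip(y_pred, y_true))
    fp = sum(p == 1 and t == 0 for p, t in zip(y_pred, y_true))
    fn = sum(p == 0 and t == 1 for p, t in zip(y_pred, y_true))

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"mcc": mcc, "precision": precision, "recall": recall, "f1": f1}


def mean_and_stderr(values: list[float]) -> tuple[float, float]:
    """Mean and standard error of the mean (needs at least two values)."""
    mean = statistics.mean(values)
    stderr = statistics.stdev(values) / math.sqrt(len(values))
    return mean, stderr


# One toy complex: 6 nodes, 2 true interface nodes, 2 predicted
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0]
metrics = binary_metrics(y_pred, y_true)
print(metrics)  # {'mcc': 0.25, 'precision': 0.5, 'recall': 0.5, 'f1': 0.5}

# Aggregating a per-complex metric over two toy complexes
mean, stderr = mean_and_stderr([0.2, 0.4])
print(round(mean, 3), round(stderr, 3))  # 0.3 0.1
```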