Finding peptide targets that bind to HLAs and trigger an immune response is challenging, and peptide-HLA (pHLA) binding prediction is one of the crucial steps in the development of personalized peptide vaccines. Machine Learning (ML) pHLA binding prediction tools are trained on vast amounts of pHLA binding data. ML predictions are effective in guiding the search for therapeutic peptide targets. However, the use of datasets with imbalanced allele content raises concerns about biased performance toward certain geographic populations. We examine the bias of two ML-based pan-allele pHLA binding affinity predictors. We aim to draw attention to the potential therapeutic consequences of this bias, and we challenge the use of the term "pan-allele" to describe models trained with currently available public datasets.
In this repo you can find all the scripts used to perform the given analysis.
Check out the following subsections of this repo:
- Software requirements gives an overview of the software needed to run the analysis and provides installation instructions.
- Interactive dashboards lets readers explore the data presented in the paper interactively.
- Datasets and methods details the content of this repo and outlines the steps performed with the notebooks.
- Additional Resources provides background on related databases and literature.
The simplest way to run the analysis is through our Binder connection. Be patient, as Binder might take some time to load. Then follow the instructions in the Datasets and methods section and the readme files in the subfolders to perform the analysis.
Most of the analysis is performed in Python and wrapped in Jupyter notebooks. You will need:
- Python (version 3.7 or later)
- Packages listed in requirements.txt, most notably:
  - plotly
  - pandas
  - scipy
  - numpy
- Jupyter
- IEDB population coverage tool
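Before opening the notebooks, it can help to confirm that the interpreter and the main packages are in place. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and the package list only mirrors the ones named above (requirements.txt remains the authoritative list).

```python
import sys
from importlib.util import find_spec

# Packages named in this README as most notable; requirements.txt is the full list.
REQUIRED_PACKAGES = ["plotly", "pandas", "scipy", "numpy"]
MIN_PYTHON = (3, 7)


def missing_packages(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if find_spec(name) is None]


def python_is_recent_enough(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    if not python_is_recent_enough():
        print("Python 3.7 or later is required")
    missing = missing_packages(REQUIRED_PACKAGES)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

Any package reported as missing can then be installed from requirements.txt in the usual way.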
Analysis related to the algorithmic bias is performed in R. To run this analysis you will need RStudio. Clone the content of this repository and follow the instructions in the Datasets and methods section.
Using Binder, Jupyter, and Voila, we constructed interactive dashboards with the data we analyzed.
1. Take a look at HLA allele frequencies across different populations (from the AFND database).
- See allele frequencies per population: select one or more populations from the dropdown using the CTRL+SHIFT keys
- See a hierarchical clustering of populations based on their HLA allele frequencies (populations clustered together have a similar HLA allele content)
- See a UMAP embedding of the populations based on their allele frequencies: populations close to each other have similar allele profiles; select populations to display their alleles.
2. Take a look at the allele content of training datasets (NetMHCpan4.1 and MHCFlurry2.0)
3. Inspect the literature origin of the IEDB curated data
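To illustrate what clustering populations by their HLA allele frequencies means, here is a small sketch using scipy. The population names and frequency values are made up, and the linkage method (average) and distance metric (Euclidean) are assumptions; the dashboard may use different settings.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical allele frequencies: rows are populations, columns are alleles.
populations = ["Pop A", "Pop B", "Pop C", "Pop D"]
frequencies = np.array([
    [0.30, 0.10, 0.05],
    [0.28, 0.12, 0.06],
    [0.02, 0.40, 0.20],
    [0.03, 0.38, 0.22],
])

# Average-linkage clustering on Euclidean distances between frequency profiles
# (both choices are assumptions made for this example).
tree = linkage(frequencies, method="average", metric="euclidean")

# Cut the tree into two flat clusters and report the assignment per population.
labels = fcluster(tree, t=2, criterion="maxclust")
clusters = dict(zip(populations, labels.tolist()))
print(clusters)
```

Populations with similar frequency profiles (here Pop A with Pop B, and Pop C with Pop D) end up in the same cluster.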
1.1 Allele Frequency Net Database (AFND) scraping
To obtain HLA allele frequencies across different populations, we query AFND for the most recent values. More details here.
Code for running this is in the AFND_population_frequencies folder. This code allows you to:
- Download the AFND HLA allele frequencies per locus for all available populations
- Combine the AFND frequencies across different loci
- Visualize and analyze the AFND data
The result of this step is a population allele frequency map, AFND_data_locus_all.csv, which is later used as the source of ground-truth allele frequencies when calculating the population coverage.
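The step of combining per-locus frequencies into a single table could look roughly like the sketch below. The column names (population, allele, allele_freq), the example rows, and the combine_loci helper are all hypothetical; the actual layout of AFND_data_locus_all.csv is defined by the notebooks in the folder.

```python
import pandas as pd

# Hypothetical per-locus tables; real column names and values may differ.
locus_a = pd.DataFrame({
    "population": ["Pop A", "Pop B"],
    "allele": ["A*02:01", "A*02:01"],
    "allele_freq": [0.25, 0.10],
})
locus_b = pd.DataFrame({
    "population": ["Pop A", "Pop B"],
    "allele": ["B*07:02", "B*15:13"],
    "allele_freq": [0.12, 0.03],
})


def combine_loci(tables):
    """Stack per-locus tables and tag each row with its locus (text before '*')."""
    combined = pd.concat(tables, ignore_index=True)
    combined["locus"] = combined["allele"].str.partition("*")[0]
    return combined


all_loci = combine_loci([locus_a, locus_b])
print(all_loci)
```

Keeping one row per population and allele makes it straightforward to filter by locus or population later on.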
1.2 Classifying geographic populations by their level of income
Code for running this is in the WorldBank_Income_levels folder. We download the country income levels from the World Bank (current classification by income, in XLSX format) and process them.
The result of this step is population_income_map.csv, which maps the AFND populations to countries and their respective income levels.
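As a rough illustration of the mapping, a left join between a population-to-country table and a country-to-income table is one way to build such a map. The column names and rows below are hypothetical and do not reflect the actual contents of population_income_map.csv.

```python
import pandas as pd

# Hypothetical inputs: column names and values are illustrative only.
populations = pd.DataFrame({
    "population": ["Pop A", "Pop B", "Pop C"],
    "country": ["Country X", "Country Y", "Country Z"],
})
income = pd.DataFrame({
    "country": ["Country X", "Country Y"],
    "income_group": ["High income", "Low income"],
})

# A left join keeps every population, even when its country has no income entry.
income_map = populations.merge(income, on="country", how="left")

# Populations whose country could not be matched need manual review.
unmatched = income_map.loc[income_map["income_group"].isna(), "population"].tolist()
print(income_map)
print("Unmatched populations:", unmatched)
```

Listing the unmatched populations explicitly helps catch country names that are spelled differently in the two sources.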
Here we calculate the population coverage of different training datasets (NetMHCpan4.1 and MHCFlurry2.0, binding affinity and mass spectrometry data). The related scripts are in the Datasets_population_coverage folder.
2.1 Download and prepare the datasets
Find the detailed instructions and notebooks for downloading the datasets inside the ./datasets/ folder.
- Prepare the NetMHCpan4.1 datasets
- Prepare the MHCFlurry2.0 datasets
- Visualize the dataset content
2.2 Calculate the population coverage
Use the ./CalculateCoverage_IEDB.ipynb notebook to calculate the population coverage of the datasets. This notebook will guide you through the following steps:
- Downloading and setting up the IEDB population coverage tool
- Running the population coverage for a single population and a single dataset
- Running the population coverage for all datasets and all populations
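For intuition about what population coverage measures, the sketch below computes the fraction of individuals expected to carry at least one allele present in a dataset. It is a simplified illustration, not the IEDB tool's implementation: it assumes Hardy-Weinberg proportions within each locus and independence between loci, and the function names and frequencies are hypothetical.

```python
def locus_coverage(covered_allele_freqs):
    """Fraction of diploid individuals carrying at least one covered allele at a locus."""
    covered = sum(covered_allele_freqs)
    if not 0.0 <= covered <= 1.0 + 1e-9:
        raise ValueError("summed allele frequency must lie in [0, 1]")
    covered = min(covered, 1.0)
    # Under Hardy-Weinberg, both chromosomes miss the covered set with
    # probability (1 - covered) ** 2.
    return 1.0 - (1.0 - covered) ** 2


def population_coverage(freqs_by_locus):
    """Combine loci assuming independence: covered if covered at any locus."""
    uncovered = 1.0
    for freqs in freqs_by_locus.values():
        uncovered *= 1.0 - locus_coverage(freqs)
    return 1.0 - uncovered


# Hypothetical frequencies of dataset alleles in one population.
example = {"A": [0.2, 0.1], "B": [0.5]}
coverage = population_coverage(example)
print(f"{coverage:.4f}")
```

A population whose common alleles are absent from a training dataset gets a low value here, which is the kind of gap the coverage analysis is meant to expose.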
2.3 Visualize the population coverage of the datasets
Run the ./Visualize_results.ipynb notebook to see how different datasets cover the given populations and how this coverage relates to different income levels and ancestries.
To examine the algorithmic bias we use a recently published independent dataset collected by Pyke et al. This dataset contains MS data for some rare alleles that were not sampled before (e.g., A*02:52, B*15:13). More details here.
The code for running this analysis is in the Algorithmic_bias_analysis folder. It contains the following scripts:
- Prepare_dataset.R: Filter and prepare the dataset from Pyke et al. used in the analysis. Namely, peptides with any chemical modifications are removed, and negatives are sampled from the human proteome.
- Motif_generation.R: Generate motifs for calculating FOOP scores (see paper for more details).
- Data_analysis.R: Calculate PPV/FOOP scores and generate plots.
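As a point of reference for the PPV metric, the sketch below follows one convention often used when benchmarking presentation predictors: rank all peptides by predicted score and measure the fraction of true positives among the top n, where n is the number of positives. This is an assumption about the definition, written in Python for illustration only; the repository's analysis is done in R, and the paper should be consulted for the exact PPV and FOOP definitions. The function name and example values are hypothetical.

```python
def ppv_at_n(scores, labels):
    """PPV among the n top-scoring peptides, where n is the number of positives.

    Assumes a higher score means a stronger predicted presentation; negate the
    scores first for outputs where lower is better (e.g. rank or nM affinity).
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n = sum(labels)
    if n == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:n])
    return hits / n


# Hypothetical predictions for six peptides (1 = observed binder, 0 = sampled negative).
scores = [0.9, 0.8, 0.7, 0.4, 0.2, 0.1]
labels = [1, 0, 1, 1, 0, 0]
ppv = ppv_at_n(scores, labels)
print(round(ppv, 3))
```

In this example two of the three top-ranked peptides are true positives, so the PPV is 2/3.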
- AFND (Allele Frequency Net Database): curated allele frequencies across regions and ethnicities, which this analysis relies on. A couple of resources to access this data:
  - immunotation R package
  - hladownload Python package
- IPD-IMGT (Immuno Polymorphism Database, international ImMunoGeneTics information system): contains raw sample data, useful if more detail is needed.
- IEDB (Immune Epitope Database): contains more raw data (on the binding affinity and mass spectrometry side) and the population coverage module.
- Works that mention or address the possible bias in therapeutics:
  - Bui et al., 2005: Predicting population coverage of T-cell epitope-based diagnostics and vaccines
A disproportionate amount of MHC polymorphism occurs in positions constituting the peptide-binding region, and as a result, MHC molecules exhibit a widely varying binding specificity. In the design of peptide-based vaccines and diagnostics, the issue of population coverage in relation to MHC polymorphism is further complicated by the fact that different HLA types are expressed at dramatically different frequencies in different ethnicities. Thus, without careful consideration, a vaccine or diagnostic with ethnically biased population coverage could result.
  - Oyarzun et al., 2015: A bioinformatics tool for epitope-based vaccine design that accounts for human ethnic diversity: Application to emerging infectious diseases
Predivac-2.0 is a novel approach in epitope-based vaccine design, particularly suited to be applied to virus-related emerging infectious diseases, because the geographic distributions of the viruses are well defined and ethnic populations in need of vaccination can be determined (“ethnicity-oriented approach”). Predivac-2.0 is accessible through the website http://predivac.biosci.uq.edu.au/.
  - Sarkizova et al.: HLAthena
  - Pyke et al., 2021: Precision Neoantigen Discovery Using Large-scale Immunopeptidomes and Composite Modeling of MHC Peptide Presentation