This is the repository for the NovoBridge pipeline, as described in:
Hugo B. C. Kleikamp, Mario Pronk, Claudia Tugui, Leonor Guedes da Silva, Ben Abbas, Yue Mei Lin, Mark C.M. van Loosdrecht and Martin Pabst*. Database-independent de novo metaproteomics of complex microbial communities. Cell Systems (2021). doi:10.1016/j.cels.2021.04.003
The pipeline was established and tested with shotgun (meta)proteomics data obtained from Q Exactive Orbitrap Mass Spectrometers, using either PEAKS or DeepNovo generated de novo sequence lists. The generation of accurate de novo peptide sequence lists depends on high quality peptide sequencing spectra.
NovoBridge has been tested only in an Anaconda Spyder environment!
Novobridge is an automated pipeline for fast processing, and integrated annotation and visualization of de novo proteomics data.
The current version of the pipeline is included as a single script, `Novobridge.py`, which can be run from any Python interpreter.
The pipeline uses the Unipept API methods `pept2lca` and `pept2funct` to annotate taxonomy and function, and uses the KEGG database to match the functional annotations to pathways based on EC numbers.
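
To illustrate the annotation step, here is a minimal sketch of how a batch of peptides could be sent to Unipept. It is not taken from `Novobridge.py`: the base URL, the helper names `build_unipept_query` and `query_unipept`, and the choice of the `equate_il` and `extra` options are assumptions that should be checked against the Unipept API documentation.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL of the public Unipept API (verify against the Unipept docs).
UNIPEPT_API = "https://api.unipept.ugent.be/api/v1"


def build_unipept_query(peptides, equate_il=True, extra=True):
    """Encode peptides as repeated 'input[]' form fields plus two options."""
    fields = [("input[]", pep) for pep in peptides]
    fields.append(("equate_il", str(equate_il).lower()))
    fields.append(("extra", str(extra).lower()))
    return urllib.parse.urlencode(fields)


def query_unipept(endpoint, peptides, timeout=60):
    """POST a batch of peptides to an endpoint such as 'pept2lca'; return parsed JSON."""
    url = f"{UNIPEPT_API}/{endpoint}.json"
    data = build_unipept_query(peptides).encode("utf-8")
    with urllib.request.urlopen(url, data=data, timeout=timeout) as response:
        return json.load(response)
```

A call such as `query_unipept("pept2lca", ["AIPQLEVARPADAYETAEAYR"])` would need network access; large peptide lists would normally be split into batches before submission.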
The Novobridge pipeline consists of 3 main parts:
- Unipept submission
- Taxonomic analysis
- Functional analysis
In Part 1: input files are read, parsed, filtered and submitted to Unipept for taxonomic and functional annotations.
In Part 2: Unipept taxonomic annotations are quantified, and visual outputs are generated.
In Part 3: Unipept functional annotations are matched to KEGG orthologies and quantified.
- Novobridge is designed as a single "tunable" python script.
- Novobridge does not offer command line options, but parameters can be altered in the script `Novobridge.py`.
- The script will automatically loop through all files present in the folder `input_peaks`, located in the same folder as `Novobridge.py`.
- The input path can be altered by changing the variable `pathin` in `Novobridge.py`.
- To run the script, simply open it in your interpreter of choice and run.
- Outputs will be generated in the folders `output_unipept`, `output_composition`, and `output_function`.
- It is recommended to use de novo sequence lists obtained from high resolution mass spectrometers. The pipeline was established and tested with data from QE Orbitrap mass spectrometers.
- NovoBridge accepts input files in `.txt`, `.tsv`, `.csv`, `.xls`, or `.xlsx` format.
- The only required input to run Novobridge is a single column of peptides with the header `Peptide`.
- When filtering steps are required, Novobridge is designed to work with the output formats of the de novo sequencing software PEAKS and DeepNovo.
- Apart from the input files, there are two utility files: `keg.pkl`, which is required for functional annotation, and `Krona_template.xlsm`, which is required for Krona plot visualization. Both files can be created with the script `download_utilities.py`.
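
The input handling described in the list above can be sketched as follows. This is an illustration rather than the code in `Novobridge.py`: the helper names `find_input_files` and `read_peptides` are hypothetical, `.txt` files are assumed to be tab-delimited, and spreadsheet formats (`.xls`, `.xlsx`) are only detected here, since reading them would need an additional library such as pandas.

```python
import csv
from pathlib import Path

# Assumed delimiters for the text formats; .txt is treated as tab-delimited.
TEXT_DELIMITERS = {".txt": "\t", ".tsv": "\t", ".csv": ","}
SUPPORTED = set(TEXT_DELIMITERS) | {".xls", ".xlsx"}


def find_input_files(pathin):
    """Return the supported input files in the folder, sorted by name."""
    return sorted(p for p in Path(pathin).iterdir()
                  if p.is_file() and p.suffix.lower() in SUPPORTED)


def read_peptides(path):
    """Read the 'Peptide' column from a delimited text file."""
    delimiter = TEXT_DELIMITERS[Path(path).suffix.lower()]
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        if reader.fieldnames is None or "Peptide" not in reader.fieldnames:
            raise ValueError(f"{path} has no 'Peptide' column")
        # Skip empty cells so blank rows do not end up as peptides.
        return [row["Peptide"] for row in reader if row["Peptide"]]
```

Looping over `find_input_files("input_peaks")` and calling `read_peptides` on each text file mirrors the behaviour described above, where every file in the input folder is processed in turn.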
The outputs generated by the pipeline are distributed over 3 folders. For each file in input_peaks the following outputs are generated:
- `output_unipept`: the input file, annotated by Unipept for each separate peptide.
- `output_composition`: quantified taxonomic distributions, Krona plots and stacked bar charts.
- `output_function`: quantified KEGG pathways.
By default, each input dataset generates one of each output for the original peptides, and one of each for a randomized dataset of scrambled peptides, labeled `Rand_`.
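
As a rough illustration of what a scrambled control dataset could look like, the sketch below shuffles the residues within each peptide. How `Novobridge.py` scrambles peptides is not described here, so the per-peptide shuffle and the function names `scramble_peptide` and `make_randomized_dataset` are assumptions.

```python
import random


def scramble_peptide(peptide, rng):
    """Return the peptide with its residues in random order."""
    residues = list(peptide)
    rng.shuffle(residues)
    return "".join(residues)


def make_randomized_dataset(peptides, seed=0):
    """Scramble every peptide; length and amino acid composition are preserved."""
    rng = random.Random(seed)  # seeded so the control dataset is reproducible
    return [scramble_peptide(pep, rng) for pep in peptides]
```

Because composition and length are kept, annotations found for such a control give an impression of how many matches arise by chance.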
Parameters can be freely changed within the script `Novobridge.py`.
There are several parameters that can be changed to include more stringent filtering for de novo peptides, and to change quantification methods.
Part 1: Unipept submission
Filter parameters
Parameter | Default value | Description |
---|---|---|
ALC_cutoff | 40 | numeric, minimum required ALC score (Peaks score) |
Score_cutoff | -0.1 | numeric, minimum required score cutoff (DeepNovo score) |
ppm_cutoff | 20 | numeric, maximum allowed ppm |
length_cutoff | 7 | numeric, minimum required peptide length |
Area_cutoff | 0 | numeric, minimum required peak area |
Intensity_cutoff | 0 | numeric, minimum required intensity |
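
The way these cutoffs could act on a single peptide record is sketched below. The dictionary keys (`ALC`, `Score`, `ppm`, `Area`, `Intensity`) are hypothetical stand-ins for the column names of a PEAKS or DeepNovo export, and comparing the absolute ppm error against the maximum is an assumption; a cutoff is only applied when its column is present.

```python
# Part 1 defaults from the table above; keys are illustrative column names.
CUTOFFS = {"ALC": 40, "Score": -0.1, "ppm": 20, "length": 7, "Area": 0, "Intensity": 0}


def passes_filters(row, cutoffs=CUTOFFS):
    """Check one peptide record against every cutoff whose column is present."""
    if len(row["Peptide"]) < cutoffs["length"]:
        return False
    # ppm is a mass error, so its absolute value is compared against a maximum.
    if "ppm" in row and abs(row["ppm"]) > cutoffs["ppm"]:
        return False
    # The remaining columns are minimum thresholds.
    for column in ("ALC", "Score", "Area", "Intensity"):
        if column in row and row[column] < cutoffs[column]:
            return False
    return True
```

For example, `passes_filters({"Peptide": "AAAAAAAK", "ALC": 85, "ppm": -3.2})` passes, while a peptide shorter than 7 residues is rejected regardless of its score.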
Part 2: Taxonomic analysis
Filter parameters (also applied to Part 3: functional analysis)
Parameter | Default value | Description |
---|---|---|
comp_ALC_cutoff | 70 | numeric, minimum required ALC score (Peaks score) |
comp_Score_cutoff | -0.1 | numeric, minimum required score cutoff (DeepNovo score) |
comp_ppm_cutoff | 15 | numeric, maximum allowed ppm |
comp_length_cutoff | 7 | numeric, minimum required peptide length |
comp_Area_cutoff | 0 | numeric, minimum required peak area |
comp_Intensity_cutoff | 0 | numeric, minimum required intensity |
cutbranch | 3 | numeric, minimum number of unique peptides per taxonomic branch in denoising |
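
The `cutbranch` parameter can be pictured with the simplified sketch below, which drops taxa supported by fewer than `cutbranch` unique peptides. It works on a single rank with (peptide, taxon) pairs; the function name `denoise` and this data layout are assumptions, and the actual pipeline may evaluate whole taxonomic branches differently.

```python
from collections import defaultdict


def denoise(annotations, cutbranch=3):
    """Drop taxa supported by fewer than `cutbranch` unique peptides.

    `annotations` is a list of (peptide, taxon) pairs for one rank.
    """
    peptides_per_taxon = defaultdict(set)
    for peptide, taxon in annotations:
        peptides_per_taxon[taxon].add(peptide)  # sets count each peptide once
    kept = {taxon for taxon, peps in peptides_per_taxon.items()
            if len(peps) >= cutbranch}
    return [(pep, taxon) for pep, taxon in annotations if taxon in kept]
```

Repeated observations of the same peptide do not help a taxon pass the threshold, but they are retained in the output once the taxon is kept, so spectral counts are unaffected.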
Quantification parameters
Parameter | Default value | Description |
---|---|---|
comp_ranks | ["superkingdom","phylum","class","order","family","genus"] | list, which ncbi-taxonomic ranks to annotate and quantify |
tax_count_targets | ["Spectral_counts","Area","Intensity"] | list or string, on which value should the quantification be done |
tax_count_methods | ["average","total","topx"] | list or string, how the quantification should be done |
tax_topx | 5 | integer, the number of top hits selected, in case of topx quantification |
normalize | False | boolean, normalize quantification to total for that rank |
Part 3: Functional analysis
Quantification targets
Parameter | Default value | Description |
---|---|---|
Pathways | 09100 Metabolism, 09120 Genetic Information Processing, 09130 Environmental Information Processing, 09140 Cellular Processes | list, which KEGG pathways to annotate |
cats | cat1,cat2,cat3,cat4 | list, on which levels of pathways to quantify |
Quantification parameters
Parameter | Default value | Description |
---|---|---|
fun_count_targets | ["Spectral_counts","Area","Intensity"] | list or string, on which value should the quantification be done |
fun_count_methods | ["average","total","topx"] | list or string, how the quantification should be done |
fun_topx | 5 | integer, the number of top hits selected, in case of topx quantification |
normalize | False | boolean, normalize quantification to total for that rank |
By default, taxa and KEGG pathways are quantified with 3 different methods and 3 different targets.
The targets determine whether to count by `Spectral_counts` of peptides, by `Area`, or by `Intensity`, if they are available.
The user can also supply custom columns as targets to count by, provided the parameters `tax_count_targets` or `fun_count_targets` are changed.
If the target is spectral counting, the only way of quantification is a sum of total spectra. However, when quantification is done on Area, Intensity or a custom target, different quantification methods are available: `average`, which averages all amounts belonging to a pathway or taxon; `total`, which sums all amounts; and `topx`, which sums the largest amounts, where the number of amounts included is set by `tax_topx` or `fun_topx`.
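
The three quantification methods can be sketched as below. The function name `quantify` and the record layout, a list of (group, values) pairs where the group is a taxon or pathway, are hypothetical and chosen only to make the arithmetic concrete.

```python
from collections import defaultdict


def quantify(records, target, method="total", topx=5):
    """Aggregate `target` values per group with 'average', 'total' or 'topx'."""
    grouped = defaultdict(list)
    for group, values in records:
        grouped[group].append(values[target])
    result = {}
    for group, amounts in grouped.items():
        if method == "total":
            result[group] = sum(amounts)
        elif method == "average":
            result[group] = sum(amounts) / len(amounts)
        elif method == "topx":
            # Sum only the `topx` largest amounts in the group.
            result[group] = sum(sorted(amounts, reverse=True)[:topx])
        else:
            raise ValueError(f"unknown method: {method}")
    return result
```

With areas 10, 30 and 20 for one taxon, `total` gives 60, `average` gives 20.0, and `topx` with `topx=2` gives 50.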
As an example: if only spectral counts are desired as outputs, the parameter configuration could be changed to: `tax_count_targets="Spectral_counts"`, `tax_count_methods=""`, `fun_count_targets="Spectral_counts"`, `fun_count_methods=""`.
The pipeline is licensed under the standard MIT license.
If you would like to use this pipeline in your research, please cite the following papers:
- Hugo B. C. Kleikamp, Mario Pronk, Claudia Tugui, Leonor Guedes da Silva, Ben Abbas, Yue Mei Lin, Mark C.M. van Loosdrecht and Martin Pabst*. Quantitative profiling of microbial communities by de novo metaproteomics. bioRxiv (2020) (accepted in Cell Systems).
- Robbert Gurdeep Singh, Alessandro Tanca, Antonio Palomba, Felix Van der Jeugt, Pieter Verschaffelt, Sergio Uzzau, Lennart Martens, Peter Dawyndt, and Bart Mesuere (2019). Unipept 4.0: Functional Analysis of Metaproteome Data. J. Proteome Res., 18, 606–615.
- Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 28(1), 27–30.
- Hugo Kleikamp (Developer): [email protected]
- Martin Pabst: [email protected]
https://github.com/unipept
https://github.com/marbl/Krona
https://github.com/nh2tran/DeepNovo