diff --git a/index.html b/index.html
index 054d7c0..3f3cfc2 100644
--- a/index.html
+++ b/index.html
@@ -198,9 +198,9 @@ <h2 id="local-fileweb-browsing">Local file/web browsing</h2>
 <p>in replace of <code>~/.config/microsoft-edge</code>, where TMPDIR is a directory name.</p>
 <p>One could browse files as well as mirrors of two web sites.</p>
 <ol>
-<li>SRCF. The mirror is within the following subdirectory: <code>/srcf</code>.</li>
 <li>Web site. This is from <code>/site</code> as above.</li>
-<li>Colocalisation. See /json/coloc.html.</li>
+<li>SRCF. The mirror is within the following subdirectory: <code>/srcf</code>.</li>
+<li>Colocalisation. See /json/coloc.html. Note that many blank plots, such as A1AG1-*, are due to misspecified chromosomal positions; the plots would otherwise be visible.</li>
 <li>Isotopes associated with &gt;1 proteins, /dup/json/dup.htm</li>
 </ol>
 <p>To facilitate navigation, an <code>index.html</code> is created in place, so <code>python3 -m http.server 8000 &amp;</code> is started from <code>/rds/project/rds-zuZwCZMsS0w/Caprion_proteomics/analysis</code>.</p>
@@ -298,5 +298,5 @@ <h4 class="modal-title" id="keyboardModalLabel">Keyboard Shortcuts</h4>
 
 <!--
 MkDocs version : 1.5.3
-Build Date UTC : 2024-12-30 17:30:25.149327+00:00
+Build Date UTC : 2024-12-30 17:51:53.694158+00:00
 -->
diff --git a/search/search_index.json b/search/search_index.json
index 8aadb5d..2dc7737 100644
--- a/search/search_index.json
+++ b/search/search_index.json
@@ -1 +1 @@
-{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"Caprion data analysis Welcome! This repository/site is dedicated to protein/peptide quantitative trait analysis using the Caprion platform, which is organised chonologically/logistically into the following sections. Pilot studies autoencoder gwas2 Pilot studies Analysis Protein analysis Peptide analysis Miscellaneous analysis Additional information Caprion panel Notes Local file/web browsing A web-style navigation is furnised via a port number, e.g., 8000, cd /rds/project/rds-zuZwCZMsS0w/Caprion_proteomics/analysis module load ceuadmin/edge export pn=8000 if lsof -i :${pn}; then echo \"Port ${pn} is already in use' try another one.\" else python3 -m http.server ${pn} & server_pid=$! edge http://localhost:${pn} & fi where the port number can be released with kill $server_pid (can be checked with ps ). In case it does now show, use edge --user-data-dir=${TMPDIR} http://localhost:${pn} & in replace of ~/.config/microsoft-edge , where TMPDIR is a directory name. One could browse files as well as mirrors of two web sites. SRCF. The mirror is within the following subdirectory: /srcf . Web site. This is from /site as above. Colocalisation. See /json/coloc.html. Isotopes associated with >1 proteins, /dup/json/dup.htm To facilitate navigation, an index.html is created in place, so python3 -m http.server 8000 & is started from /rds/project/rds-zuZwCZMsS0w/Caprion_proteomics/analysis . Non-CSD3 browser(s) This approach seems less problematic with user-data-dir mentioned above. We can again set up tunneling from CSD3 with python3 -m http.server 8000 & hostname Once succeeded, we establish the connection elsewhere. ssh -4 -L 8080:127.0.0.1:8000 -fN jhz22@${hostname}.hpc.cam.ac.uk where hostname from CSD3 and ${hostname} have to be the same. 
We can then browse http://127.0.0.1:8080 .","title":""},{"location":"#caprion-data-analysis","text":"","title":"Caprion data analysis"},{"location":"#welcome","text":"This repository/site is dedicated to protein/peptide quantitative trait analysis using the Caprion platform, which is organised chonologically/logistically into the following sections.","title":"Welcome!"},{"location":"#pilot-studies","text":"autoencoder gwas2 Pilot studies","title":"Pilot studies"},{"location":"#analysis","text":"Protein analysis Peptide analysis Miscellaneous analysis","title":"Analysis"},{"location":"#additional-information","text":"Caprion panel Notes","title":"Additional information"},{"location":"#local-fileweb-browsing","text":"A web-style navigation is furnised via a port number, e.g., 8000, cd /rds/project/rds-zuZwCZMsS0w/Caprion_proteomics/analysis module load ceuadmin/edge export pn=8000 if lsof -i :${pn}; then echo \"Port ${pn} is already in use' try another one.\" else python3 -m http.server ${pn} & server_pid=$! edge http://localhost:${pn} & fi where the port number can be released with kill $server_pid (can be checked with ps ). In case it does now show, use edge --user-data-dir=${TMPDIR} http://localhost:${pn} & in replace of ~/.config/microsoft-edge , where TMPDIR is a directory name. One could browse files as well as mirrors of two web sites. SRCF. The mirror is within the following subdirectory: /srcf . Web site. This is from /site as above. Colocalisation. See /json/coloc.html. Isotopes associated with >1 proteins, /dup/json/dup.htm To facilitate navigation, an index.html is created in place, so python3 -m http.server 8000 & is started from /rds/project/rds-zuZwCZMsS0w/Caprion_proteomics/analysis .","title":"Local file/web browsing"},{"location":"#non-csd3-browsers","text":"This approach seems less problematic with user-data-dir mentioned above. 
We can again set up tunneling from CSD3 with python3 -m http.server 8000 & hostname Once succeeded, we establish the connection elsewhere. ssh -4 -L 8080:127.0.0.1:8000 -fN jhz22@${hostname}.hpc.cam.ac.uk where hostname from CSD3 and ${hostname} have to be the same. We can then browse http://127.0.0.1:8080 .","title":"Non-CSD3 browser(s)"},{"location":"Notes/","text":"Notes ( Sections I -- III are due to Claude ) I. Meta-data Isotope.Group.ID is a unique identifier for a group of isotopes that belong to the same peptide or molecule. In mass spectrometry, isotopes are atoms of the same element that have the same number of protons but differ in the number of neutrons. This ID helps to group together isotopes that arise from the same peptide, allowing for easier identification and analysis. Protein contains the name or identifier of the protein that the peptide (or molecule) is derived from. This information is typically obtained by searching the MS data against a protein database. Modified.Peptide.Sequence is the amino acid sequence of the peptide, including any post-translational modifications (PTMs) that have been identified. PTMs are chemical modifications that occur after protein synthesis, such as phosphorylation, ubiquitination, or methylation. The sequence is usually represented in a standard format, such as using lowercase letters for modified residues. Monoisotopic.m/z is the monoisotopic mass-to-charge ratio (m/z) of the peptide or molecule. The monoisotopic mass is the mass of the most abundant isotope of each element in the molecule, which is typically the lightest isotope (e.g., 12C, 1H, 14N, 16O, etc.). This value is used as a reference point for identifying the peptide or molecule. Max.Isotope.Time.Centroid is the time centroid (or apex) of the most intense isotope in the isotope group. In liquid chromatography-mass spectrometry (LC-MS), peptides are separated based on their retention time (the time it takes for the peptide to elute from the column). 
The time centroid is the time point at which the peptide signal is most intense, which can be used to quantify the peptide abundance. Charge is the charge state of the peptide or molecule. In mass spectrometry, peptides can be ionized to different charge states (e.g., +1, +2, +3, etc.), which affects their mass-to-charge ratio (m/z). The charge state is an important parameter for identifying peptides and molecules. They are invaluable for analyzing and interpreting MS data, including peptide identification, quantification, and characterization of post-translational modifications. II. MS1/MS2 In mass spectrometry-based proteomics, the typical workflow for identifying peptides and proteins involves using tandem mass spectrometry (MS/MS or MS2). In this process, precursor ions (peptides) are selected in the first stage of mass spectrometry (MS1) and then fragmented to produce a series of smaller ions in the second stage (MS2). The resulting fragment ions (product ions) are analyzed to infer the sequence of the peptide and, by extension, identify the proteins from which they were derived. However, it is possible to infer peptides and proteins using only MS1 data through a process known as \"MS1-only\" or \"untargeted\" analysis. This approach can be particularly useful in the following scenarios: Label-based quantification : Techniques like SILAC (Stable Isotope Labeling by Amino acids in Cell culture) or chemical labeling (e.g., TMT, iTRAQ) rely on MS1 data for quantification. The mass shift introduced by labels allows for the direct comparison of peptide abundances based on their MS1 ion intensities. Label-free quantification : Proteins can be quantified by comparing the intensities of their corresponding peptide ions in MS1 across different samples. This requires accurate mass and retention time alignment and often uses algorithms to detect and quantify features (peptide ions) consistently across multiple runs. 
Accurate Mass and Time tags (AMT) : This approach relies on a previously established library of peptide identifications, where each peptide is characterized by its accurate mass and normalized retention time. In subsequent analyses, peptides can be inferred by matching the observed accurate mass and retention time to the library without the need for MS2 fragmentation. Data-independent acquisition (DIA) : In some DIA workflows, proteins can be inferred from MS1 data when coupled with complex data analysis strategies and spectral libraries. It is important to note that while DIA collects MS1 spectra, it also involves the simultaneous fragmentation of all ions in a given mass range, and thus MS2-level data is typically available and used for identification. It is important to note that MS1-only approaches may have limitations in terms of identification specificity and sensitivity compared to traditional MS2-based methods. MS1-based protein inference is generally less confident because it lacks sequence-specific information that can only be obtained from fragment ions in MS2. For this reason, MS1-based methods are often complemented by MS2 data or rely on extensive peptide libraries and sophisticated computational algorithms to increase the confidence of peptide and protein identification. III. OpenMS/crux/MaxQuant/FragPipe OpenMS, Crux, MaxQuant, and FragPipe are all prominent software platforms for analyzing proteomics data, each with its own strengths and weaknesses. Here's a comparison: OpenMS: Focus: Provides a flexible and open-source framework for developing and executing various mass spectrometry data analysis workflows. Strengths: Highly modular and customizable: Offers a vast collection of algorithms and tools that can be combined and customized to create tailored workflows. Open-source and extensible: Encourages community contributions and allows for the development of new tools and algorithms. 
Supports various data formats and instruments: Compatible with a wide range of data formats and mass spectrometry platforms. Strong support for metabolomics data: While primarily used for proteomics, it also offers tools for analyzing metabolomics data. Limitations: Steeper learning curve: Requires programming knowledge and familiarity with command-line interfaces. Less user-friendly: Lacks a comprehensive graphical user interface (GUI), making it less accessible for beginners. Limited pre-built workflows: While highly customizable, it requires more manual effort to set up standard workflows compared to MaxQuant or FragPipe. Crux: Focus: A command-line toolkit designed for peptide identification, protein quantification, and statistical validation of proteomics data. Strengths: Open-source and well-documented: Provides clear documentation and allows for community contributions. Fast and efficient: Known for its computational efficiency and speed. Strong statistical validation: Offers rigorous statistical methods for validating peptide and protein identifications. Supports various search engines: Compatible with multiple search engines, including Comet and Tide. Limitations: Command-line interface only: Requires familiarity with command-line operations. Less user-friendly: Lacks a GUI, making it less accessible for beginners. Limited pre-built workflows: Requires more manual effort to set up complete analysis pipelines. MaxQuant: Focus: Primarily known for its robust and sensitive peptide and protein identification and quantification using its proprietary Andromeda search engine. Strengths: User-friendly interface: Provides a GUI for easier data processing and analysis. Robust and sensitive identification and quantification: Offers high-quality results for standard DDA-based proteomics experiments. Strong support for label-free quantification (LFQ) and match between runs (MBR). 
Extensive post-translational modification (PTM) analysis: Offers comprehensive support for identifying and quantifying various PTMs. Limitations: Less flexible for specialized workflows: Primarily designed for standard bottom-up proteomics experiments. Limited support for DIA data: While it can handle DIA data, it's not its primary strength. Closed-source: The core algorithms are not open-source, limiting community contributions and customization. FragPipe: Focus: Offers a more modular and flexible platform with various tools for different proteomics workflows, including both DDA and DIA. Strengths: Versatile and modular: Includes a suite of tools for various tasks, including peptide identification, quantification, and statistical analysis. Extensive support for DIA data: Features DIA-Umpire, a dedicated tool for analyzing DIA data using various algorithms. Highly accurate and sensitive quantification: Employs IonQuant for precise quantification using extracted ion chromatograms. Open-source and actively developed: Encourages community contributions and continuous improvement. Limitations: Steeper learning curve: Primarily operates through a command-line interface, requiring more technical expertise. Less user-friendly interface: Lacks a comprehensive GUI, making it less intuitive for beginners. 
Here's a table summarizing the key differences: Feature OpenMS Crux MaxQuant FragPipe Primary Focus Flexible Framework Peptide ID & Quantification Peptide & Protein ID/Quant Modular Platform Open Source Yes Yes No (core algorithms) Yes User Interface Primarily CLI CLI GUI Primarily CLI Learning Curve Steep Moderate Easier Steep Flexibility Highly Flexible Moderate Less Flexible More Flexible DIA Support Limited Limited Limited Extensive (DIA-Umpire) Quantification Methods Various Various LFQ, iBAQ IonQuant PTM Analysis Supported Supported Extensive PTM-Shepherd Community Support Strong Moderate Moderate Strong Choosing the Right Tool: OpenMS: Ideal for researchers with programming skills who need a highly customizable and extensible platform for developing specialized workflows. Crux: Suitable for researchers comfortable with command-line interfaces and seeking fast and efficient tools for peptide identification, protein quantification, and statistical validation. MaxQuant: Best for researchers looking for a user-friendly platform with robust performance for standard DDA-based proteomics experiments, especially those focusing on label-free quantification. FragPipe: Ideal for researchers seeking a highly flexible and customizable platform for various workflows, including DIA analysis, and who are comfortable with command-line operations. Remember to consider your specific research goals, data type, and bioinformatics expertise when choosing the best tool for your needs. You might even explore combining different tools to leverage their unique strengths for different aspects of your analysis. Notes on CSD3 . crux/4.1 is functional (along with comet/2024.01.1 & kojak/2.0.0a22 ) on CSD3 but crux/4.2 is not. FragPipe/22.0 does offer a comprehensive GUI. Moreover, MetaMorpheus/1.0.5 , FlashLFQ/1.2.6 and MS Amanda/3.0.21.532 are also available from CSD3. IV. 
Galaxy tutorials Web: https://usegalaxy.org/ (https://training.galaxyproject.org/training-material/) docker run -p 8080:80 quay.io/galaxy/introduction-training Visit http://localhost:8080 . Login as admin with password password to access. See https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/protein-id-oms/tutorial.html V. PoGo Fast Mapping of Peptides to Genomic Coordinates for Proteogenomic Analyses, https://www.sanger.ac.uk/tool/pogo/ , GitHub, https://github.com/cschlaffner/PoGo . It uses transcript translations and reference gene annotations to identify the genomic loci of peptides and post-translational modifications. Multiple occurrences of peptides in the input data resulting in the same genomic loci will be collapsed as a single occurrence in the output. The input format is a tab delimited file with four columns with file extensions such as .pogo, .txt, and *.tsv. Column Column header Description 1 Sample Name of sample or experiment 2 Peptide Peptide sequence with PSI-MS nodification names in round brackets following the mpdified amino acid, e.g. 
PEPT(Phopsho)IDE for a phosphorylated threonine 3 PSMs Number of peptide-spectrum matches (PSMs) for the given peptide, including those redundantly identified (peptides can be \u201cseen\u201d more than once in a run) 4 Quant Quantitative value for the given peptide in the given sample An example is established as follows, wget -S ftp://ftp.sanger.ac.uk/pub/teams/17/software/PoGo/PoGo_Testprocedures.zip unzip PoGo_Testprocedures.zip cd PoGo_Testprocedures/Testfiles module load ceuadmin/PoGo for Peptides in Testfile_experimental Testfile_small do PoGo -fasta input/gencode.v25.pc_translations.fa -gtf input/gencode.v25.annotation.gtf -in input/${Peptides}.txt done # expanded version /usr/local/Cluster-Apps/ceuadmin/PoGo/1.0.0/bin/PoGo \\ -fasta /usr/local/Cluster-Apps/ceuadmin/PoGo/1.0.0/PoGo_Testprocedures/Testfiles/input/gencode.v25.pc_translations.fa \\ -gtf /usr/local/Cluster-Apps/ceuadmin/PoGo/1.0.0/PoGo_Testprocedures/Testfiles/input/gencode.v25.annotation.gtf \\ -in /usr/local/Cluster-Apps/ceuadmin/PoGo/1.0.0/PoGo_Testprocedures/Testfiles/input/Testfile_experimental.txt \\ -format ALL \\ -mm 0 Output files are also contained in the input/ directory. GENCODE annotation data are available from https://www.gencodegenes.org/human/ and https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/ . The Java GUI, https://github.com/cschlaffner/PoGoGUI , is run as follows, java -jar PoGoGUI-v1.0.0.jar which requires PoGo executable as well. The source is compiled with maven, https://maven.apache.org/ , e.g., module load maven-3.5.0-gcc-5.4.0-3sgaeze mvn install assuming that pom.xml is available, e.g., /usr/local/Cluster-Apps/ceuadmin/PoGo/1.0.0/PoGoGUI/PoGoGUI . VI. 
Proteoform Analysis ETH Zurich, U Toronto Team Develops Tool for Bottom-Up Proteomics Proteoform Analysis Jul 28, 2021 | Adam Bonislawski NEW YORK \u2013 A team led by researchers at ETH Zurich and the University of Toronto has developed a tool that allows for the detection of protein proteoforms in bottom-up proteomics data. Described in a paper published in June in Nature Communications, the tool, called COPF (COrrelation-based functional ProteoForm) uses peptide correlation analysis to detect differences in proteoform populations across different samples or conditions and could aid researchers as they seek to better understand the role different protein forms play in biology and disease. The human genome is thought to have around 20,000 protein coding genes, but many of these 20,000 proteins exist in the body in various forms, differentiated by, for instance, post-translational modifications or amino acid substitutions. These different forms are called proteoforms, and it is widely believed that biological processes are guided not just by the proteins present but by what proteoforms are present and in what proportions. Traditional bottom-up proteomics workflows have provided only limited insight into proteoform populations however, due to the fact that the presence of a particular protein is typically inferred by the detection of just a few of its peptides and that digesting proteins into peptides for mass spec analysis makes it near impossible to link a modified peptide back to a particular proteoform. Some proteomics researchers have addressed this issue by moving to top-down proteomics, which looks at intact proteins, allowing them to better distinguish between different proteoforms. Top-down proteomics is very technically challenging, however, and is not yet able to analyze proteins with the breadth and depth of bottom-up workflows. 
Recently, the development of more reproducible and higher-throughput bottom-up workflows, and particular workflows using data independent-acquisition (DIA) mass spectrometry, have allowed researchers like the Nature Communications authors to apply peptide correlation analysis to the study of proteoforms. Peptide correlation analysis looks at differences in peptide behavior within and across proteins in bottom-up data. Researchers have developed a number of approaches for turning peptide measurements into protein data, with most working under the assumption that peptides from the same protein will behave the same way. In practice, though, that isn't the case. On one hand, there are a number of technical reasons why two peptides from the same protein may not behave the same way. For instance, different digestion efficiencies could lead to some peptides being more abundant than others. Different ionization efficiencies could similarly make one peptide more likely than another to be detected by the mass spec. The presence of different proteoforms could also play a role. For instance, if a protein is present in both a full-length and truncated form, expression changes observed in the full-length form wouldn't be observable if the peptide being measured wasn't present in the truncated form. Not only would this throw off protein-level quantitation, but it would also mask relative changes in the two protein forms that could be biologically important. A major challenge to applying this insight has been determining which differences in peptide behavior reflect real technical or biological variation and which are just noise, noted Hannes R\u00f6st, research chair in mass spectrometry-based personalized medicine at the University of Toronto and an author on the Nature Communications study. \"In many cases [such variation] was noise,\" he said. 
\"When you look at traditional shotgun proteomics workflows and data analyses, really the power is not at the peptide-level quantification but at the protein level from the aggregation of multiple peptides. On the peptide level you see a lot of noise, and I think that has prevented us from using this observation that individual peptides could yield a lot of interested information because people really only looked at the protein-level data, because that is what they trusted.\" R\u00f6st said that the development of targeted protein quantitation approaches like multiple-reaction monitoring (MRM) has demonstrated that individual peptides can be measured with high accuracy, and the development of DIA mass spec approaches has enabled MRM-style peptide quantitation at the proteome scale. At the same time, improvements in mass spec technology have allowed researchers to collect the kind of large and reproducible datasets required for peptide correlation analysis, he said. \"These are types of experiments we wouldn't have imagined 10 years ago, because for correlation-based approaches to work, you need a relatively large number of samples, and you need low variance,\" he said. \"We are not detecting [proteoforms] that are not changing between different [conditions], we are only detecting those that change. And for this to work we need to have multiple replicates and we need to have different conditions and to be able to measure these peptides with high quantitative accuracy across these conditions.\" The COPF tool looks at the intensities of peptides coming from a particular protein across all the samples measured in an experiment and then calculates peptide correlations for all the pairs of peptides coming from that protein and uses hierarchical clustering to divide the peptides into two clusters. 
It then scores the likelihood that multiple proteoforms of a protein are present by comparing the level of peptide correlation between the clusters to the level of in-cluster variation. The tool does not identify the specific modifications or variations that distinguish the different proteoforms but rather the peptides that appear to differentiate between the forms of the protein in the different biological contexts investigated. Analyzing a DIA dataset that looked at five different tissue types across eight different mice, COPF identified 63 proteins that exhibited different proteoform groups, including proteins with known tissue-specific splice variants. The researchers also identified proteoforms created by proteolytic and autocatalytic cleavage and phosphorylation, indicating, they wrote, that the tool is \"agnostic to the different mechanisms by which proteoforms can be generated inside the cell.\" The development of COPF follows the publication last year of a study by researchers at Barts Cancer Institute and the University of Wisconsin-Madison detailing another peptide correlation analysis tool for identifying proteoforms in bottom-up data called PeCorA. Unlike COPF, which requires proteoforms to differ by two or more peptides, PeCorA can detect proteoforms based on single peptide differences. This makes it a potentially more sensitive tool but also less specific than COPF, R\u00f6st said. More generally, he said that he expected ongoing improvements in mass spec technology would further improve peptide correlation-based approaches like COPF and PeCorA by boosting peptide coverage. \"To kind of cover every possible protein isoform we would need to have complete coverage of every protein, and unfortunately we are currently quite far away from having peptide-level coverage of every protein,\" he said. 
\"I think that is currently one of the limitations where we are kind of hitting a wall.\" R\u00f6st added that his lab has begun acquiring data on Bruker's timsTOF Pro platform, \"and there we definitely see both an increase in protein coverage and also in the number of peptides we can measure.\" \"That's why I'm very optimistic that while this is just the first implementation of the method, the data we are producing at this moment is much more complete, and therefore I think it would be even more suitable to our approach than the data we used in the paper,\" he said.","title":"Notes"},{"location":"Notes/#notes","text":"( Sections I -- III are due to Claude )","title":"Notes"},{"location":"Notes/#i-meta-data","text":"Isotope.Group.ID is a unique identifier for a group of isotopes that belong to the same peptide or molecule. In mass spectrometry, isotopes are atoms of the same element that have the same number of protons but differ in the number of neutrons. This ID helps to group together isotopes that arise from the same peptide, allowing for easier identification and analysis. Protein contains the name or identifier of the protein that the peptide (or molecule) is derived from. This information is typically obtained by searching the MS data against a protein database. Modified.Peptide.Sequence is the amino acid sequence of the peptide, including any post-translational modifications (PTMs) that have been identified. PTMs are chemical modifications that occur after protein synthesis, such as phosphorylation, ubiquitination, or methylation. The sequence is usually represented in a standard format, such as using lowercase letters for modified residues. Monoisotopic.m/z is the monoisotopic mass-to-charge ratio (m/z) of the peptide or molecule. The monoisotopic mass is the mass of the most abundant isotope of each element in the molecule, which is typically the lightest isotope (e.g., 12C, 1H, 14N, 16O, etc.). 
This value is used as a reference point for identifying the peptide or molecule. Max.Isotope.Time.Centroid is the time centroid (or apex) of the most intense isotope in the isotope group. In liquid chromatography-mass spectrometry (LC-MS), peptides are separated based on their retention time (the time it takes for the peptide to elute from the column). The time centroid is the time point at which the peptide signal is most intense, which can be used to quantify the peptide abundance. Charge is the charge state of the peptide or molecule. In mass spectrometry, peptides can be ionized to different charge states (e.g., +1, +2, +3, etc.), which affects their mass-to-charge ratio (m/z). The charge state is an important parameter for identifying peptides and molecules. They are invaluable for analyzing and interpreting MS data, including peptide identification, quantification, and characterization of post-translational modifications.","title":"I. Meta-data"},{"location":"Notes/#ii-ms1ms2","text":"In mass spectrometry-based proteomics, the typical workflow for identifying peptides and proteins involves using tandem mass spectrometry (MS/MS or MS2). In this process, precursor ions (peptides) are selected in the first stage of mass spectrometry (MS1) and then fragmented to produce a series of smaller ions in the second stage (MS2). The resulting fragment ions (product ions) are analyzed to infer the sequence of the peptide and, by extension, identify the proteins from which they were derived. However, it is possible to infer peptides and proteins using only MS1 data through a process known as \"MS1-only\" or \"untargeted\" analysis. This approach can be particularly useful in the following scenarios: Label-based quantification : Techniques like SILAC (Stable Isotope Labeling by Amino acids in Cell culture) or chemical labeling (e.g., TMT, iTRAQ) rely on MS1 data for quantification. 
The mass shift introduced by labels allows for the direct comparison of peptide abundances based on their MS1 ion intensities. Label-free quantification : Proteins can be quantified by comparing the intensities of their corresponding peptide ions in MS1 across different samples. This requires accurate mass and retention time alignment and often uses algorithms to detect and quantify features (peptide ions) consistently across multiple runs. Accurate Mass and Time tags (AMT) : This approach relies on a previously established library of peptide identifications, where each peptide is characterized by its accurate mass and normalized retention time. In subsequent analyses, peptides can be inferred by matching the observed accurate mass and retention time to the library without the need for MS2 fragmentation. Data-independent acquisition (DIA) : In some DIA workflows, proteins can be inferred from MS1 data when coupled with complex data analysis strategies and spectral libraries. It is important to note that while DIA collects MS1 spectra, it also involves the simultaneous fragmentation of all ions in a given mass range, and thus MS2-level data is typically available and used for identification. It is important to note that MS1-only approaches may have limitations in terms of identification specificity and sensitivity compared to traditional MS2-based methods. MS1-based protein inference is generally less confident because it lacks sequence-specific information that can only be obtained from fragment ions in MS2. For this reason, MS1-based methods are often complemented by MS2 data or rely on extensive peptide libraries and sophisticated computational algorithms to increase the confidence of peptide and protein identification.","title":"II. MS1/MS2"},{"location":"Notes/#iii-openmscruxmaxquantfragpipe","text":"OpenMS, Crux, MaxQuant, and FragPipe are all prominent software platforms for analyzing proteomics data, each with its own strengths and weaknesses. 
Here's a comparison: OpenMS: Focus: Provides a flexible and open-source framework for developing and executing various mass spectrometry data analysis workflows. Strengths: Highly modular and customizable: Offers a vast collection of algorithms and tools that can be combined and customized to create tailored workflows. Open-source and extensible: Encourages community contributions and allows for the development of new tools and algorithms. Supports various data formats and instruments: Compatible with a wide range of data formats and mass spectrometry platforms. Strong support for metabolomics data: While primarily used for proteomics, it also offers tools for analyzing metabolomics data. Limitations: Steeper learning curve: Requires programming knowledge and familiarity with command-line interfaces. Less user-friendly: Lacks a comprehensive graphical user interface (GUI), making it less accessible for beginners. Limited pre-built workflows: While highly customizable, it requires more manual effort to set up standard workflows compared to MaxQuant or FragPipe. Crux: Focus: A command-line toolkit designed for peptide identification, protein quantification, and statistical validation of proteomics data. Strengths: Open-source and well-documented: Provides clear documentation and allows for community contributions. Fast and efficient: Known for its computational efficiency and speed. Strong statistical validation: Offers rigorous statistical methods for validating peptide and protein identifications. Supports various search engines: Compatible with multiple search engines, including Comet and Tide. Limitations: Command-line interface only: Requires familiarity with command-line operations. Less user-friendly: Lacks a GUI, making it less accessible for beginners. Limited pre-built workflows: Requires more manual effort to set up complete analysis pipelines. 
MaxQuant: Focus: Primarily known for its robust and sensitive peptide and protein identification and quantification using its proprietary Andromeda search engine. Strengths: User-friendly interface: Provides a GUI for easier data processing and analysis. Robust and sensitive identification and quantification: Offers high-quality results for standard DDA-based proteomics experiments. Strong support for label-free quantification (LFQ) and match between runs (MBR). Extensive post-translational modification (PTM) analysis: Offers comprehensive support for identifying and quantifying various PTMs. Limitations: Less flexible for specialized workflows: Primarily designed for standard bottom-up proteomics experiments. Limited support for DIA data: While it can handle DIA data, it's not its primary strength. Closed-source: The core algorithms are not open-source, limiting community contributions and customization. FragPipe: Focus: Offers a more modular and flexible platform with various tools for different proteomics workflows, including both DDA and DIA. Strengths: Versatile and modular: Includes a suite of tools for various tasks, including peptide identification, quantification, and statistical analysis. Extensive support for DIA data: Features DIA-Umpire, a dedicated tool for analyzing DIA data using various algorithms. Highly accurate and sensitive quantification: Employs IonQuant for precise quantification using extracted ion chromatograms. Open-source and actively developed: Encourages community contributions and continuous improvement. Limitations: Steeper learning curve: Primarily operates through a command-line interface, requiring more technical expertise. Less user-friendly interface: Lacks a comprehensive GUI, making it less intuitive for beginners. 
Here's a table summarizing the key differences: Feature OpenMS Crux MaxQuant FragPipe Primary Focus Flexible Framework Peptide ID & Quantification Peptide & Protein ID/Quant Modular Platform Open Source Yes Yes No (core algorithms) Yes User Interface Primarily CLI CLI GUI Primarily CLI Learning Curve Steep Moderate Easier Steep Flexibility Highly Flexible Moderate Less Flexible More Flexible DIA Support Limited Limited Limited Extensive (DIA-Umpire) Quantification Methods Various Various LFQ, iBAQ IonQuant PTM Analysis Supported Supported Extensive PTM-Shepherd Community Support Strong Moderate Moderate Strong Choosing the Right Tool: OpenMS: Ideal for researchers with programming skills who need a highly customizable and extensible platform for developing specialized workflows. Crux: Suitable for researchers comfortable with command-line interfaces and seeking fast and efficient tools for peptide identification, protein quantification, and statistical validation. MaxQuant: Best for researchers looking for a user-friendly platform with robust performance for standard DDA-based proteomics experiments, especially those focusing on label-free quantification. FragPipe: Ideal for researchers seeking a highly flexible and customizable platform for various workflows, including DIA analysis, and who are comfortable with command-line operations. Remember to consider your specific research goals, data type, and bioinformatics expertise when choosing the best tool for your needs. You might even explore combining different tools to leverage their unique strengths for different aspects of your analysis. Notes on CSD3 . crux/4.1 is functional (along with comet/2024.01.1 & kojak/2.0.0a22 ) on CSD3 but crux/4.2 is not. FragPipe/22.0 does offer a comprehensive GUI. Moreover, MetaMorpheus/1.0.5 , FlashLFQ/1.2.6 and MS Amanda/3.0.21.532 are also available from CSD3.","title":"III. 
OpenMS/crux/MaxQuant/FragPipe"},{"location":"Notes/#iv-galaxy-tutorials","text":"Web: https://usegalaxy.org/ (https://training.galaxyproject.org/training-material/) docker run -p 8080:80 quay.io/galaxy/introduction-training Visit http://localhost:8080 . Login as admin with password password to access. See https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/protein-id-oms/tutorial.html","title":"IV. Galaxy tutorials"},{"location":"Notes/#v-pogo","text":"Fast Mapping of Peptides to Genomic Coordinates for Proteogenomic Analyses, https://www.sanger.ac.uk/tool/pogo/ , GitHub, https://github.com/cschlaffner/PoGo . It uses transcript translations and reference gene annotations to identify the genomic loci of peptides and post-translational modifications. Multiple occurrences of peptides in the input data resulting in the same genomic loci will be collapsed as a single occurrence in the output. The input format is a tab delimited file with four columns with file extensions such as .pogo, .txt, and *.tsv. Column Column header Description 1 Sample Name of sample or experiment 2 Peptide Peptide sequence with PSI-MS modification names in round brackets following the modified amino acid, e.g.
PEPT(Phospho)IDE for a phosphorylated threonine 3 PSMs Number of peptide-spectrum matches (PSMs) for the given peptide, including those redundantly identified (peptides can be \u201cseen\u201d more than once in a run) 4 Quant Quantitative value for the given peptide in the given sample An example is established as follows, wget -S ftp://ftp.sanger.ac.uk/pub/teams/17/software/PoGo/PoGo_Testprocedures.zip unzip PoGo_Testprocedures.zip cd PoGo_Testprocedures/Testfiles module load ceuadmin/PoGo for Peptides in Testfile_experimental Testfile_small do PoGo -fasta input/gencode.v25.pc_translations.fa -gtf input/gencode.v25.annotation.gtf -in input/${Peptides}.txt done # expanded version /usr/local/Cluster-Apps/ceuadmin/PoGo/1.0.0/bin/PoGo \\ -fasta /usr/local/Cluster-Apps/ceuadmin/PoGo/1.0.0/PoGo_Testprocedures/Testfiles/input/gencode.v25.pc_translations.fa \\ -gtf /usr/local/Cluster-Apps/ceuadmin/PoGo/1.0.0/PoGo_Testprocedures/Testfiles/input/gencode.v25.annotation.gtf \\ -in /usr/local/Cluster-Apps/ceuadmin/PoGo/1.0.0/PoGo_Testprocedures/Testfiles/input/Testfile_experimental.txt \\ -format ALL \\ -mm 0 Output files are also contained in the input/ directory. GENCODE annotation data are available from https://www.gencodegenes.org/human/ and https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/ . The Java GUI, https://github.com/cschlaffner/PoGoGUI , is run as follows, java -jar PoGoGUI-v1.0.0.jar which requires the PoGo executable as well. The source is compiled with maven, https://maven.apache.org/ , e.g., module load maven-3.5.0-gcc-5.4.0-3sgaeze mvn install assuming that pom.xml is available, e.g., /usr/local/Cluster-Apps/ceuadmin/PoGo/1.0.0/PoGoGUI/PoGoGUI .","title":"V. PoGo"},{"location":"Notes/#vi-proteoform-analysis","text":"","title":"VI.
Proteoform Analysis"},{"location":"Notes/#eth-zurich-u-toronto-team-develops-tool-for-bottom-up-proteomics-proteoform-analysis","text":"Jul 28, 2021 | Adam Bonislawski NEW YORK \u2013 A team led by researchers at ETH Zurich and the University of Toronto has developed a tool that allows for the detection of protein proteoforms in bottom-up proteomics data. Described in a paper published in June in Nature Communications, the tool, called COPF (COrrelation-based functional ProteoForm) uses peptide correlation analysis to detect differences in proteoform populations across different samples or conditions and could aid researchers as they seek to better understand the role different protein forms play in biology and disease. The human genome is thought to have around 20,000 protein coding genes, but many of these 20,000 proteins exist in the body in various forms, differentiated by, for instance, post-translational modifications or amino acid substitutions. These different forms are called proteoforms, and it is widely believed that biological processes are guided not just by the proteins present but by what proteoforms are present and in what proportions. Traditional bottom-up proteomics workflows have provided only limited insight into proteoform populations however, due to the fact that the presence of a particular protein is typically inferred by the detection of just a few of its peptides and that digesting proteins into peptides for mass spec analysis makes it near impossible to link a modified peptide back to a particular proteoform. Some proteomics researchers have addressed this issue by moving to top-down proteomics, which looks at intact proteins, allowing them to better distinguish between different proteoforms. Top-down proteomics is very technically challenging, however, and is not yet able to analyze proteins with the breadth and depth of bottom-up workflows. 
Recently, the development of more reproducible and higher-throughput bottom-up workflows, and particular workflows using data independent-acquisition (DIA) mass spectrometry, have allowed researchers like the Nature Communications authors to apply peptide correlation analysis to the study of proteoforms. Peptide correlation analysis looks at differences in peptide behavior within and across proteins in bottom-up data. Researchers have developed a number of approaches for turning peptide measurements into protein data, with most working under the assumption that peptides from the same protein will behave the same way. In practice, though, that isn't the case. On one hand, there are a number of technical reasons why two peptides from the same protein may not behave the same way. For instance, different digestion efficiencies could lead to some peptides being more abundant than others. Different ionization efficiencies could similarly make one peptide more likely than another to be detected by the mass spec. The presence of different proteoforms could also play a role. For instance, if a protein is present in both a full-length and truncated form, expression changes observed in the full-length form wouldn't be observable if the peptide being measured wasn't present in the truncated form. Not only would this throw off protein-level quantitation, but it would also mask relative changes in the two protein forms that could be biologically important. A major challenge to applying this insight has been determining which differences in peptide behavior reflect real technical or biological variation and which are just noise, noted Hannes R\u00f6st, research chair in mass spectrometry-based personalized medicine at the University of Toronto and an author on the Nature Communications study. \"In many cases [such variation] was noise,\" he said. 
\"When you look at traditional shotgun proteomics workflows and data analyses, really the power is not at the peptide-level quantification but at the protein level from the aggregation of multiple peptides. On the peptide level you see a lot of noise, and I think that has prevented us from using this observation that individual peptides could yield a lot of interested information because people really only looked at the protein-level data, because that is what they trusted.\" R\u00f6st said that the development of targeted protein quantitation approaches like multiple-reaction monitoring (MRM) has demonstrated that individual peptides can be measured with high accuracy, and the development of DIA mass spec approaches has enabled MRM-style peptide quantitation at the proteome scale. At the same time, improvements in mass spec technology have allowed researchers to collect the kind of large and reproducible datasets required for peptide correlation analysis, he said. \"These are types of experiments we wouldn't have imagined 10 years ago, because for correlation-based approaches to work, you need a relatively large number of samples, and you need low variance,\" he said. \"We are not detecting [proteoforms] that are not changing between different [conditions], we are only detecting those that change. And for this to work we need to have multiple replicates and we need to have different conditions and to be able to measure these peptides with high quantitative accuracy across these conditions.\" The COPF tool looks at the intensities of peptides coming from a particular protein across all the samples measured in an experiment and then calculates peptide correlations for all the pairs of peptides coming from that protein and uses hierarchical clustering to divide the peptides into two clusters. 
It then scores the likelihood that multiple proteoforms of a protein are present by comparing the level of peptide correlation between the clusters to the level of in-cluster variation. The tool does not identify the specific modifications or variations that distinguish the different proteoforms but rather the peptides that appear to differentiate between the forms of the protein in the different biological contexts investigated. Analyzing a DIA dataset that looked at five different tissue types across eight different mice, COPF identified 63 proteins that exhibited different proteoform groups, including proteins with known tissue-specific splice variants. The researchers also identified proteoforms created by proteolytic and autocatalytic cleavage and phosphorylation, indicating, they wrote, that the tool is \"agnostic to the different mechanisms by which proteoforms can be generated inside the cell.\" The development of COPF follows the publication last year of a study by researchers at Barts Cancer Institute and the University of Wisconsin-Madison detailing another peptide correlation analysis tool for identifying proteoforms in bottom-up data called PeCorA. Unlike COPF, which requires proteoforms to differ by two or more peptides, PeCorA can detect proteoforms based on single peptide differences. This makes it a potentially more sensitive tool but also less specific than COPF, R\u00f6st said. More generally, he said that he expected ongoing improvements in mass spec technology would further improve peptide correlation-based approaches like COPF and PeCorA by boosting peptide coverage. \"To kind of cover every possible protein isoform we would need to have complete coverage of every protein, and unfortunately we are currently quite far away from having peptide-level coverage of every protein,\" he said. 
\"I think that is currently one of the limitations where we are kind of hitting a wall.\" R\u00f6st added that his lab has begun acquiring data on Bruker's timsTOF Pro platform, \"and there we definitely see both an increase in protein coverage and also in the number of peptides we can measure.\" \"That's why I'm very optimistic that while this is just the first implementation of the method, the data we are producing at this moment is much more complete, and therefore I think it would be even more suitable to our approach than the data we used in the paper,\" he said.","title":"ETH Zurich, U Toronto Team Develops Tool for Bottom-Up Proteomics Proteoform Analysis"},{"location":"misc/","text":"Miscellaneous analysis This section accommodates many largely independent tasks. Implementation might well be generic so that both proteins and peptides are covered. Programs and applications These are summarised in the following table, Program Description coloc.sb Coloc(alisation) analysis csq.sh Consequences of variants Caprion_deCODE_UKB_PPP.sh Caprion/deCODE/UKB-PPP replication eSet.sh ExpresssionSet implementations glmnet_pense.sh glmnet/pense modeling impute.sb imputation experiments json.sh JSON file generation peptideAssociationPlot.sh protein Manhattan-peptide signal plots dup-pgwas.sh pGWAS for duplicated proteins dup-extract.sh pQTL extractions dup-json.sh LocusZoom.js plots dup-plot.sh pQTL plots dup-tbl.R pQTL table pqtlGWAS.R pQTL-GWAS lookup tables.sh Supplementary-Tables.xlsx generator ToDo.sh various staged experiments NB: coloc.sb alternatively calls coloc.R . impute.sb employs impute_parallel() when N(isotope groups) > 500. Nevertheless, when coming to protein requantification this is an option to use the orginal intensity data. Legacy codes compare.sb . earlier contrast with deCODE/UKB-PPP. inf1.sh . snapshot from SCALLOP/INF meta-analysis. 
Created on 9/12/2024","title":"Miscellaneous analysis"},{"location":"misc/#miscellaneous-analysis","text":"This section accommodates many largely independent tasks. Implementation might well be generic so that both proteins and peptides are covered.","title":"Miscellaneous analysis"},{"location":"misc/#programs-and-applications","text":"These are summarised in the following table, Program Description coloc.sb Coloc(alisation) analysis csq.sh Consequences of variants Caprion_deCODE_UKB_PPP.sh Caprion/deCODE/UKB-PPP replication eSet.sh ExpressionSet implementations glmnet_pense.sh glmnet/pense modeling impute.sb imputation experiments json.sh JSON file generation peptideAssociationPlot.sh protein Manhattan-peptide signal plots dup-pgwas.sh pGWAS for duplicated proteins dup-extract.sh pQTL extractions dup-json.sh LocusZoom.js plots dup-plot.sh pQTL plots dup-tbl.R pQTL table pqtlGWAS.R pQTL-GWAS lookup tables.sh Supplementary-Tables.xlsx generator ToDo.sh various staged experiments NB: coloc.sb alternatively calls coloc.R . impute.sb employs impute_parallel() when N(isotope groups) > 500. Nevertheless, when coming to protein requantification this is an option to use the original intensity data.","title":"Programs and applications"},{"location":"misc/#legacy-codes","text":"compare.sb . earlier contrast with deCODE/UKB-PPP. inf1.sh . snapshot from SCALLOP/INF meta-analysis. Created on 9/12/2024","title":"Legacy codes"},{"location":"peptide_progs/","text":"Peptide analysis CSD3 directory /rds/project/jmmh2/rds-jmmh2-projects/Caprion_proteomics/analysis/ Scripts and results The project directory above contains scripts at peptide_progs/ and results at peptide/ , respectively. There is also a set of scripts called from bash which invoke SLURM jobs.
Script name Description Protein-specific error/output Association analysis 1_pgwas.sh Association analysis {protein}.e / {protein}.o 2_meta_analysis.sh Meta-analysis {protein}-METAL_{SLURM_job_id}_{phenotype_number}.e / {protein}-METAL_{SLURM_job_id}_{phenotype_number}.o Signal identification (see {protein}/sentinels/slurm ) setup.sh Environment variables 3.1_extract.sh Signal extraction _step1_{SLURM_job_id}_{phenotype_number}.e / _step1_{SLURM_job_id}_{phenotype_number}.o 3.2_collect.sh Signal collection/classification _step2_{protein}.e / _step2_{protein}.o 3.3_plot.sh Forest, Q-Q, Manhattan, LocusZoom, mean-by-genotype/dosage plots _step3_{SLURM_job_id}_{phenotype_number}.e / _step3_{SLURM_job_id}_{phenotype_number}.o utils.sh Various utilities graph TD; 1_pgwas.sh 2_meta_analysis.sh 1_pgwas.sh --> 2_meta_analysis.sh --> setup.sh setup.sh --> 3.1_extract.sh setup.sh --> 3.2_collect.sh setup.sh --> 3.3_plot.sh subgraph Group1[ ] direction LR 3.1_extract.sh --> 3.2_collect.sh --> 3.3_plot.sh end utils.sh Specific prerequisites for a Manhattan/peptide association plot are a call to vep_annotate function in 3.2_collect.sh for proteins. a call to bgz() (in utils.sh for protein) for indexed and compressed DR-filtered data. for step 3.2, ceuadmin/ensembl-vep/111-icelake now is the default since partition icelake-himem is used instead of cclake (CentOS 7) which has ceuadmin/ensembl-vep/104 . module ceuadmin/R/4.4.1-icelake now works as smoothly as the old ceuadmin/R at cclake Script name Description Protein-specific error/output Experimental codes mz.* file handling & MetaMorpheus, MSAmanda. mzML and results in */metamorpheus, msamonda crux.* search, R/multicomp+crux benchmark crux/ BoxCar.py/pyteomics.py BoxCar algorithm and its use graph TD; mz.* crux.* BoxCar.py/pyteomics.py The module mono-5.10.0.78-gcc-5.4.0-c6cq4hh is required for rawrr , to ${HOME}/.cache/R/rawrr/rawrrassembly (4/8/2024).
File Size eula.txt 163 rawrr.exe 28672 ThermoFisher.CommonCore.BackgroundSubtraction.dll 44544 ThermoFisher.CommonCore.Data.dll 406016 ThermoFisher.CommonCore.MassPrecisionEstimator.dll 11264 ThermoFisher.CommonCore.RawFileReader.dll 654336 Finally, ceuadmin/FragPipe/22.0 is available as a GUI for experiments on various workflows. Glossary The atomic mass unit (dalton) is equal to one-twelfth of the mass of a \\(^{12}C\\) atom ( \\(1.660 540 2 \\times 10^{-27}\\) kg). References Bittremieux W, Levitsky L, Pilz M, Sachsenberg T, Huber F, Wang M, Dorrestein PC. Unified and standardized mass spectrometry data processing in Python using spectrum_utils. J Proteome Res 22:625\u2013631 (2023), https://doi.org/10.1021/acs.jproteome.2c00632 , https://spectrum-utils.readthedocs.io/en/latest/ . Eidhammer I, Flikka K, Martens L, Mikalsen S-O. Computational Methods for Mass Spectrometry Proteomics. Wiley, 2007. ISBN: 978-0-470-51297-5 1. Peptides are the short stretches of amino acids that are obtained after the proteolytic cleavage of proteins. Peptides are usually around 10\u201315 amino acids long, and a single protein yields approximately 35 peptides on average. 2. The mass (m) of a molecule or atom is expressed in unified atomic mass units (u). 3. Isotopes are (chemical) elements that have the same atomic number (and therefore similar chemical properties), but different molecular mass (slightly different physical properties). 4. Monoisotopic mass is the exact mass of an ion or molecule calculated using the mass of the most abundant isotope of each element. 5. A posttranslational modification (PTM) can be defined as any alteration to the chemical structure of the protein effected by the cellular machinery after the formation of the protein. 6. The raw data spectrum contains signals from the peptides, as well as signals derived from different forms of noise. fragpipe.nesvilab.org, https://fragpipe.nesvilab.org/ Hasam S, Emery K, Noble WS, Keich U.
A Pipeline for Peptide Detection Using Multiple Decoys. Methods Mol Biol 2023;2426:25-34, doi: 10.1007/978-1-0716-1967-4_2 . The most common method of peptide and protein False Discovery Rate (FDR) calculation is by adding protein sequences that are not expected to be present in the sample. These are also called decoy protein sequences. This can be done by generating reverse sequences of the target protein entries and appending these protein entries to the protein database. Some search algorithms use premade target-decoy protein sequences while others can generate a target-decoy protein sequence database from a target protein sequence database before using them for peptide spectral matching. Kertesz-Farkas A, Nii Adoquaye Acquaye FL, Bhimani K, Eng JK, Fondrie WE, Grant C, Hoopmann MR, Lin A, Lu YY, Moritz RL, MacCoss MJ, Noble WS. The Crux Toolkit for Analysis of Bottom-Up Tandem Mass Spectrometry Proteomics Data. J Proteome Res 2023;22(2):561-569, https://doi.org/10.1021/acs.jproteome.2c00615 , https://crux.ms . Lazear MR. Sage: An Open-Source Tool for Fast Proteomics Searching and Quantification at Scale. J Proteome Res 2023 22 (11), 3652-3659, DOI: 10.1021/acs.jproteome.3c00486 . Levitsky LI, Klein JA, Ivanov MV, Gorshkov MV. Pyteomics 4.0: Five Years of Development of a Python Proteomics Framework. J Proteome Res. 2019;18(2):709-714. doi: 10.1021/acs.jproteome.8b00717 , https://github.com/levitsky/pyteomics . ms-utils.org, https://ms-utils.org/ . Rehfeldt TG, Gabriels R, Bouwmeester R, Gessulat S, Neely BA, Palmblad M, Perez-Riverol Y, Schmidt T, Vizca\u00edno JA, Deutsch EW. ProteomicsML: An Online Platform for Community-Curated Data sets and Tutorials for Machine Learning in Proteomics. J Proteome Res , 2023;22(2):632-636, https://doi.org/10.1021/acs.jproteome.2c00629 , https://proteomicsml.org/ . Sturm M, Bertsch A, Gr\u00f6pl C, Hildebrandt A, Hussong R, Lange E, Pfeifer N, Schulz-Trieglaff O, Zerck A, Reinert K, Kohlbacher O.
OpenMS - an open-source software framework for mass spectrometry. BMC Bioinformatics . 2008;9:163. doi: 10.1186/1471-2105-9-163 .","title":"Peptide analysis"},{"location":"peptide_progs/#peptide-analysis","text":"","title":"Peptide analysis"},{"location":"peptide_progs/#csd3-directory","text":"/rds/project/jmmh2/rds-jmmh2-projects/Caprion_proteomics/analysis/","title":"CSD3 directory"},{"location":"peptide_progs/#scripts-and-results","text":"The project directory above contains scripts at peptide_progs/ and results at peptide/ , respectively. There is also a set of scripts called from bash which invoke SLURM jobs. Script name Description Protein-specific error/output Association analysis 1_pgwas.sh Association analysis {protein}.e / {protein}.o 2_meta_analysis.sh Meta-analysis {protein}-METAL_{SLURM_job_id}_{phenotype_number}.e / {protein}-METAL_{SLURM_job_id}_{phenotype_number}.o Signal identification (see {protein}/sentinels/slurm ) setup.sh Environment variables 3.1_extract.sh Signal extraction _step1_{SLURM_job_id}_{phenotype_number}.e / _step1_{SLURM_job_id}_{phenotype_number}.o 3.2_collect.sh Signal collection/classification _step2_{protein}.e / _step2_{protein}.o 3.3_plot.sh Forest, Q-Q, Manhattan, LocusZoom, mean-by-genotype/dosage plots _step3_{SLURM_job_id}_{phenotype_number}.e / _step3_{SLURM_job_id}_{phenotype_number}.o utils.sh Various utilities graph TD; 1_pgwas.sh 2_meta_analysis.sh 1_pgwas.sh --> 2_meta_analysis.sh --> setup.sh setup.sh --> 3.1_extract.sh setup.sh --> 3.2_collect.sh setup.sh --> 3.3_plot.sh subgraph Group1[ ] direction LR 3.1_extract.sh --> 3.2_collect.sh --> 3.3_plot.sh end utils.sh Specific prerequisites for a Manhattan/peptide association plot are a call to vep_annotate function in 3.2_collect.sh for proteins. a call to bgz() (in utils.sh for protein) for indexed and compressed DR-filtered data.
for step 3.2, ceuadmin/ensembl-vep/111-icelake now is the default since partition icelake-himem is used instead of cclake (CentOS 7) which has ceuadmin/ensembl-vep/104 . module ceuadmin/R/4.4.1-icelake now works as smoothly as the old ceuadmin/R at cclake Script name Description Protein-specific error/output Experimental codes mz.* file handling & MetaMorpheus, MSAmanda. mzML and results in */metamorpheus, msamonda crux.* search, R/multicomp+crux benchmark crux/ BoxCar.py/pyteomics.py BoxCar algorithm and its use graph TD; mz.* crux.* BoxCar.py/pyteomics.py The module mono-5.10.0.78-gcc-5.4.0-c6cq4hh is required for rawrr , to ${HOME}/.cache/R/rawrr/rawrrassembly (4/8/2024). File Size eula.txt 163 rawrr.exe 28672 ThermoFisher.CommonCore.BackgroundSubtraction.dll 44544 ThermoFisher.CommonCore.Data.dll 406016 ThermoFisher.CommonCore.MassPrecisionEstimator.dll 11264 ThermoFisher.CommonCore.RawFileReader.dll 654336 Finally, ceuadmin/FragPipe/22.0 is available as a GUI for experiments on various workflows.","title":"Scripts and results"},{"location":"peptide_progs/#glossary","text":"The atomic mass unit (dalton) is equal to one-twelfth of the mass of a \\(^{12}C\\) atom ( \\(1.660 540 2 \\times 10^{-27}\\) kg).","title":"Glossary"},{"location":"peptide_progs/#references","text":"Bittremieux W, Levitsky L, Pilz M, Sachsenberg T, Huber F, Wang M, Dorrestein PC. Unified and standardized mass spectrometry data processing in Python using spectrum_utils. J Proteome Res 22:625\u2013631 (2023), https://doi.org/10.1021/acs.jproteome.2c00632 , https://spectrum-utils.readthedocs.io/en/latest/ . Eidhammer I, Flikka K, Martens L, Mikalsen S-O. Computational Methods for Mass Spectrometry Proteomics. Wiley, 2007. ISBN: 978-0-470-51297-5 1. Peptides are the short stretches of amino acids that are obtained after the proteolytic cleavage of proteins. Peptides are usually around 10\u201315 amino acids long, and a single protein yields approximately 35 peptides on average. 2.
The mass (m) of a molecule or atom is expressed in unified atomic mass units (u). 3. Isotopes are (chemical) elements that have the same atomic number (and therefore similar chemical properties), but different molecular mass (slightly different physical properties). 4. Monoisotopic mass is the exact mass of an ion or molecule calculated using the mass of the most abundant isotope of each element. 5. A posttranslational modification (PTM) can be defined as any alteration to the chemical structure of the protein effected by the cellular machinery after the formation of the protein. 6. The raw data spectrum contains signals from the peptides, as well as signals derived from different forms of noise. fragpipe.nesvilab.org, https://fragpipe.nesvilab.org/ Hasam S, Emery K, Noble WS, Keich U. A Pipeline for Peptide Detection Using Multiple Decoys. Methods Mol Biol 2023;2426:25-34, doi: 10.1007/978-1-0716-1967-4_2 . The most common method of peptide and protein False Discovery Rate (FDR) calculation is by adding protein sequences that are not expected to be present in the sample. These are also called decoy protein sequences. This can be done by generating reverse sequences of the target protein entries and appending these protein entries to the protein database. Some search algorithms use premade target-decoy protein sequences while others can generate a target-decoy protein sequence database from a target protein sequence database before using them for peptide spectral matching. Kertesz-Farkas A, Nii Adoquaye Acquaye FL, Bhimani K, Eng JK, Fondrie WE, Grant C, Hoopmann MR, Lin A, Lu YY, Moritz RL, MacCoss MJ, Noble WS. The Crux Toolkit for Analysis of Bottom-Up Tandem Mass Spectrometry Proteomics Data. J Proteome Res 2023;22(2):561-569, https://doi.org/10.1021/acs.jproteome.2c00615 , https://crux.ms . Lazear MR. Sage: An Open-Source Tool for Fast Proteomics Searching and Quantification at Scale. J Proteome Res 2023 22 (11), 3652-3659, DOI: 10.1021/acs.jproteome.3c00486 .
Levitsky LI, Klein JA, Ivanov MV, Gorshkov MV. Pyteomics 4.0: Five Years of Development of a Python Proteomics Framework. J Proteome Res. 2019;18(2):709-714. doi: 10.1021/acs.jproteome.8b00717 , https://github.com/levitsky/pyteomics . ms-utils.org, https://ms-utils.org/ . Rehfeldt TG, Gabriels R, Bouwmeester R, Gessulat S, Neely BA, Palmblad M, Perez-Riverol Y, Schmidt T, Vizca\u00edno JA, Deutsch EW. ProteomicsML: An Online Platform for Community-Curated Data sets and Tutorials for Machine Learning in Proteomics. J Proteome Res , 2023;22(2):632-636, https://doi.org/10.1021/acs.jproteome.2c00629 , https://proteomicsml.org/ . Sturm M, Bertsch A, Gr\u00f6pl C, Hildebrandt A, Hussong R, Lange E, Pfeifer N, Schulz-Trieglaff O, Zerck A, Reinert K, Kohlbacher O. OpenMS - an open-source software framework for mass spectrometry. BMC Bioinformatics . 2008;9:163. doi: 10.1186/1471-2105-9-163 .","title":"References"},{"location":"pilot/","text":"Pilot studies Site map Pilot (N=196) data/ contains genotype files in .bgen format bgen/ PLINK2 results according to .bgen files; summary outputs and sentinels/ directory are in the following directories 1e-5 1e-6 5e-8 Batch 2 (N=1,488) data2/ contains genotype files in .bgen format bgen2/ PLINK2 results according to .bgen files; summary outputs and sentinels/ directory are in the following directories 1e-5 5e-8 Comparison of pilot and batch 2 miamiplot Batch 3 data (N=807) data3/ .bgen data bgen3/ PLINK2 results 1e-5 5e-8 Coding There are apparent commonalities between batches from the list of programs and diagrams; many of which are activated as subroutines. Pilot caprion.R and caprion.ini are for data processing. Their derivatives are in the utils/ subdirectory: affymetrix.sh is for variant-specific association analysis. qctool.sb is used to extract available sample and genotypes. qctool.sh further extracts genotypes with MAF 0.01 only. plink2.sh non-SLURM version of association analysis. 
qqman.sh and qqman.R produce QQ and Manhattan plots. sentinels_nold.sh and merge.sh select sentinels. ps.sh and ps.R run through PhenoScanner. lookup.sh looks up for overlap with SomaLogic and Olink. caprion.ipynb is a Jupyter notebook with some preprocessing done by tensorqtl.sh . Batch 2 (prefix=utils/ when unspecified) graph TB tensoqtl.sh 2020.sh --> EPCR-PROC/ 2020.sh --> data2/affymetrix.id qctool.sb --> qctool.sh qctool.sh --> plink2.sh plink2.sh --> sentinels_nold.sh sentinels_nold.sh --> merge.sh Batch 3 (prefix=utils/) graph TB 2021.sh 2021.sh --> eSet.R 2021.sh --> 2021.R eSet.R --> 2021.R eSet.R --> UDP.R 2021.sh --> UDP.R UDP.R --> qctool.sb qctool.sb --> qctool.sh qctool.sh --> plink2.* 2021.sh --> plink2.* plink2.* --> sentinels_nold.sh+merge.sh Note that eSet.R actually covers data from pilot, batches 2 and 3. Documents ppr.md EPCR-PROC.md 2021.md Reference Klaus B, Reisenauer S (2018). An end to end workflow for differential gene expression using Affymetrix microarrays . https://bioinformatics.psb.ugent.be/webtools/Venn/","title":"Pilot studies"},{"location":"pilot/#pilot-studies","text":"","title":"Pilot studies"},{"location":"pilot/#site-map","text":"Pilot (N=196) data/ contains genotype files in .bgen format bgen/ PLINK2 results according to .bgen files; summary outputs and sentinels/ directory are in the following directories 1e-5 1e-6 5e-8 Batch 2 (N=1,488) data2/ contains genotype files in .bgen format bgen2/ PLINK2 results according to .bgen files; summary outputs and sentinels/ directory are in the following directories 1e-5 5e-8 Comparison of pilot and batch 2 miamiplot Batch 3 data (N=807) data3/ .bgen data bgen3/ PLINK2 results 1e-5 5e-8","title":"Site map"},{"location":"pilot/#coding","text":"There are apparent commonalities between batches from the list of programs and diagrams; many of which are activated as subroutines. Pilot caprion.R and caprion.ini are for data processing. 
Their derivatives are in the utils/ subdirectory: affymetrix.sh is for variant-specific association analysis. qctool.sb is used to extract available sample and genotypes. qctool.sh further extracts genotypes with MAF 0.01 only. plink2.sh non-SLURM version of association analysis. qqman.sh and qqman.R produce QQ and Manhattan plots. sentinels_nold.sh and merge.sh select sentinels. ps.sh and ps.R run through PhenoScanner. lookup.sh looks up for overlap with SomaLogic and Olink. caprion.ipynb is a Jupyter notebook with some preprocessing done by tensorqtl.sh . Batch 2 (prefix=utils/ when unspecified) graph TB tensoqtl.sh 2020.sh --> EPCR-PROC/ 2020.sh --> data2/affymetrix.id qctool.sb --> qctool.sh qctool.sh --> plink2.sh plink2.sh --> sentinels_nold.sh sentinels_nold.sh --> merge.sh Batch 3 (prefix=utils/) graph TB 2021.sh 2021.sh --> eSet.R 2021.sh --> 2021.R eSet.R --> 2021.R eSet.R --> UDP.R 2021.sh --> UDP.R UDP.R --> qctool.sb qctool.sb --> qctool.sh qctool.sh --> plink2.* 2021.sh --> plink2.* plink2.* --> sentinels_nold.sh+merge.sh Note that eSet.R actually covers data from pilot, batches 2 and 3.","title":"Coding"},{"location":"pilot/#documents","text":"ppr.md EPCR-PROC.md 2021.md","title":"Documents"},{"location":"pilot/#reference","text":"Klaus B, Reisenauer S (2018). An end to end workflow for differential gene expression using Affymetrix microarrays . https://bioinformatics.psb.ugent.be/webtools/Venn/","title":"Reference"},{"location":"pilot/autoencoder/","text":"Autoencoder As shown at R-bloggers , autoencoder is better at reconstructing the original data set than PCA when k is small, where k corresponds to the number of principal components in PCA or bottleneck dimension in AE, however the error converges as k increases. For very large data sets this difference will be larger and means a smaller data set could be used for the same error as PCA. When dealing with big data this is an important property`. 
The local adoption is ae_test.Rmd which produces ae_test.html and ae_test.pdf . Additional work will be on variatinoal autoencoder (VAE) and denoising counterpart as indicated in the references below. REFERENCES Bishop CM, Bishop H (2024), Deep learning: foundations and concepts, Springer International Publishing, DOI: 10.1007/978-3-031-45468-4 . Bludau I, Frank M, D\u00f6rig C. et al. Systematic detection of functional proteoform groups from bottom-up proteomic datasets. Nat Commun 12, 3810 (2021). https://doi.org/10.1038/s41467-021-24030-x . Hofert M, Prasad A, Zhu M (2019). Quasi-Monte Carlo for multivariate distributions viagenerative neural networks. https://arxiv.org/abs/1811.00683 , https://CRAN.R-project.org/package=gnn . Kingma DP, Welling M (2014). Auto-Encoding Variational Bayes. https://arxiv.org/abs/1312.6114 , https://keras.rstudio.com/articles/examples/variational_autoencoder.html . Ng A. Sparse autoencoder, CS294A Lecture notes, https://web.stanford.edu/class/archive/cs/cs294a/cs294a.1104/sparseAutoencoder.pdf . Sattarov T, Herurkar D, Hees J (2023). Explaining Anomalies using Denoising Autoencoders for Financial Tabular Data. Technical Report 2023-01. Deutsche Bundesban. Trivadis SK (2017). Variational autoencoders for anomaly detection. https://rpubs.com/zkajdan/308801 . URLs https://github.com/diazale/gt-dimred , https://github.com/lmcinnes/umap ( https://umap-learn.readthedocs.io/en/latest/ ) and https://keras.io/examples/timeseries/timeseries_anomaly_detection/ , https://www.mathworks.com/help/deeplearning/ug/anomaly-detection-using-autoencoder-and-wavelets.html , among others.","title":"Autoencoder"},{"location":"pilot/autoencoder/#autoencoder","text":"As shown at R-bloggers , autoencoder is better at reconstructing the original data set than PCA when k is small, where k corresponds to the number of principal components in PCA or bottleneck dimension in AE, however the error converges as k increases. 
For very large data sets this difference will be larger and means a smaller data set could be used for the same error as PCA. When dealing with big data this is an important property`. The local adoption is ae_test.Rmd which produces ae_test.html and ae_test.pdf . Additional work will be on variatinoal autoencoder (VAE) and denoising counterpart as indicated in the references below.","title":"Autoencoder"},{"location":"pilot/autoencoder/#references","text":"Bishop CM, Bishop H (2024), Deep learning: foundations and concepts, Springer International Publishing, DOI: 10.1007/978-3-031-45468-4 . Bludau I, Frank M, D\u00f6rig C. et al. Systematic detection of functional proteoform groups from bottom-up proteomic datasets. Nat Commun 12, 3810 (2021). https://doi.org/10.1038/s41467-021-24030-x . Hofert M, Prasad A, Zhu M (2019). Quasi-Monte Carlo for multivariate distributions viagenerative neural networks. https://arxiv.org/abs/1811.00683 , https://CRAN.R-project.org/package=gnn . Kingma DP, Welling M (2014). Auto-Encoding Variational Bayes. https://arxiv.org/abs/1312.6114 , https://keras.rstudio.com/articles/examples/variational_autoencoder.html . Ng A. Sparse autoencoder, CS294A Lecture notes, https://web.stanford.edu/class/archive/cs/cs294a/cs294a.1104/sparseAutoencoder.pdf . Sattarov T, Herurkar D, Hees J (2023). Explaining Anomalies using Denoising Autoencoders for Financial Tabular Data. Technical Report 2023-01. Deutsche Bundesban. Trivadis SK (2017). Variational autoencoders for anomaly detection. 
https://rpubs.com/zkajdan/308801 .","title":"REFERENCES"},{"location":"pilot/autoencoder/#urls","text":"https://github.com/diazale/gt-dimred , https://github.com/lmcinnes/umap ( https://umap-learn.readthedocs.io/en/latest/ ) and https://keras.io/examples/timeseries/timeseries_anomaly_detection/ , https://www.mathworks.com/help/deeplearning/ug/anomaly-detection-using-autoencoder-and-wavelets.html , among others.","title":"URLs"},{"location":"pilot/gwas2/","text":"gwas2 This is a promising alternative showing through RCN3/FCGRN with gwas2.sh . graph TB; gwas2.sh --> gwas.do gwas2.sh --> gwas2.do where gwas.do ( caprion.dat also contains _invn data) and gwas2.do ( gwas2_invn.do for _invn data) are for the pilot and batch 2 data, respectively. See gwas2 repository for additional information.","title":"gwas2"},{"location":"pilot/gwas2/#gwas2","text":"This is a promising alternative showing through RCN3/FCGRN with gwas2.sh . graph TB; gwas2.sh --> gwas.do gwas2.sh --> gwas2.do where gwas.do ( caprion.dat also contains _invn data) and gwas2.do ( gwas2_invn.do for _invn data) are for the pilot and batch 2 data, respectively. See gwas2 repository for additional information.","title":"gwas2"},{"location":"progs/","text":"Protein analysis Programs 1 Work was done in a named sequence 2 . 1_pca_projection.sh 2_ggm.R 3_wgcna.R 4_pca_clustering.R 5_pgwas.sh 6_meta_analysis.sh 7_merge.sh 8_hla.sh graph TB 1_pca_projection.sh 2_ggm.R 3_wgcna.R 4_pca_clustering.R --> 5_pgwas.sb --> 6_meta_analysis.sh --> 6_meta_analysis.sb --> 7_merge.sb --> 7_merge.sh --> 0_utils.sb --> 5_pgwas.sb 8_hla.sh Chromose X is handled together with autosomes, and the loop from 0_utils.sb to 5_pgwas.sb is to produce mean-by-genotype/QQ/Manhattan/LocusZoom plots -- the former also implements vep_annotate(), fp_data(), fp() which only requires --array=1 . Note also that HetISq() only works inside an interactive R session. 1. 
Data handling and PCA projection The pipeline follows HGI contributions nevertheless only serves for reassurance since the study samples were carefully selected. 2. GGM The results are ready to report. 3. WGCNA This can be finalised according to the Science paper. 4. PCA and clustering The groupings based on unfiltered and DR-filtered proteins can be made on three phases altogether and instead of a classification indicator the first three PCs are used. The PLINK2 has been used in the pilot studies, but now fastGWA using double transformations of the phenotypic data similar to SCALLOP-Seq analysis. Amazingly, a standard assignment statement inside sapply() would produce .pheno / .mpheno containing the raw data. The file also includes experiments on normalisation. 5. pGWAS 3 The bgen files were extracted from a list of all samples, the variant IDs of which were for all RSids to allow for multiallelic loci. The (sb)atch file is extended to produce Q-Q/Manhattan/LocusZoom plots and extreme p values are possible for all plots. Note that LocusZoom 1.4 does not contain 1000Genomes build 37 genotypes for chromosome X and therefore they are supplemented with local files in the required format, namely, locuszoom_1.4/data/1000G/genotypes/2014-10-14/EUR/chrX.[bed, bim, fan] . Now that for the Manhattan plot call for VEP is necessary from 0_utils.sb , which also produces mean by genotype/dosage plots. 6. Meta-analysis Internally, this follows from the SCALLOP/INF implementation, as designed analogous to a Makefile, i.e., 6_meta_analysis <task> where task = METAL_list, METAL_files, METAL_analysis, respectively in sequence. However, due to time limit on HPC, a call to .sb is made for meta-analysis. To extract significant variants one may resort to awk 'NR==1||$12<log(1e-6)/log(10)' 1433B-1.tbl , say. 7. Variant identification An iterative merging scheme is employed; the HLA region is simplified but will be specifically handled. 
Somewhat paradoxically, forest plots are also obtained here 4 . A SLURM job is executed, to be followed by collection of results. 8. HLA imputation 5 This is experimented on several software including HIBAG, CookHLA and SNP2HLA as desribed here . The whole cohort imputation requests resources exceeding the system limits, so a cardio SLURM job is used instead. The hped file from CookHLA (or converted from HIBAG) can be used by HATK for association analysis while the advantage of SNP2HLA is that binary ped files are ready for use as usual. Directories This is per Caprion project module load miniconda3/4.5.1 export csd3path=/rds/project/jmmh2/rds-jmmh2-projects/olink_proteomics/scallop/miniconda37 source ${csd3path}/bin/activate Name Description pgwas pGWAS METAL Meta-analysis HLA HLA imputation peptide_progs peptide analysis reports Reports Note that docs.sh copies pilot/utils directory of the pilot studies, so coding under that directory is preferable to avoid overwrite. To accommodate filteredd results, a suffix \"\" or \"_dr\" is applied when appropriate. \u21a9 workflow (experimental) module add ceuadmin/snakemake snakemake -s workflow/rules/cojo.smk -j1 snakemake -s workflow/rules/report.smk -j1 snakemake -s workflow/rules/cojo.smk -c --profile workflow and use --unlock when necessary. \u21a9 Protein GWAS GCTA/fastGWA employs MAF>=0.001 (~56%) and geno=0.1 so potentially we can have .bgen files as such to speed up. GCTA uses headerless phenotype files, generated by 5_pgwas.sh which is now unnecessary. \u21a9 Incomplete gamma function The .info files for proteins BROX and CT027 could not be obtained from METAL 2020-05-05 with the following error message, FATAL ERROR - a too large, ITMAX too small in gamma countinued fraction (gcf) An attempt was made to fix this and reported as a fixable issue to METAL GitHub respository ( https://github.com/statgen/METAL/issues/24 ). This has enabled Forest plots for the associate pQTLs. 
\u21a9 HLA A database of 3D structures of Major Histocompatibility Complex, https://www.histo.fyi/ Whole cohort imputation is feasible with a HIBAG reference panel, Locus A B C DPB1 DQA1 DQB1 DRB1 N 1857 2572 1866 1624 1740 1924 2436 SNPs 891 990 1041 689 948 979 891 while the reference panel is based on the 1000Genomes data (N=503) with SNP2HLA and CookHLA. It is of note that 1000G_REF.EUR.chr6.hg18.29mb-34mb.inT1DGC.markers in the 1000Genomes reference panel has 465 variants with HLA prefix and the partition is as follows, Locus A B C DPB1 DQA1 DQB1 DRB1 HLA_ 98 183 69 0 0 33 82 A recent update: PGG.HLA, https://pog.fudan.edu.cn/pggmhc/ , requires data submission. \u21a9","title":"Protein analysis"},{"location":"progs/#protein-analysis","text":"","title":"Protein analysis"},{"location":"progs/#programs1","text":"Work was done in a named sequence 2 . 1_pca_projection.sh 2_ggm.R 3_wgcna.R 4_pca_clustering.R 5_pgwas.sh 6_meta_analysis.sh 7_merge.sh 8_hla.sh graph TB 1_pca_projection.sh 2_ggm.R 3_wgcna.R 4_pca_clustering.R --> 5_pgwas.sb --> 6_meta_analysis.sh --> 6_meta_analysis.sb --> 7_merge.sb --> 7_merge.sh --> 0_utils.sb --> 5_pgwas.sb 8_hla.sh Chromose X is handled together with autosomes, and the loop from 0_utils.sb to 5_pgwas.sb is to produce mean-by-genotype/QQ/Manhattan/LocusZoom plots -- the former also implements vep_annotate(), fp_data(), fp() which only requires --array=1 . Note also that HetISq() only works inside an interactive R session.","title":"Programs1"},{"location":"progs/#1-data-handling-and-pca-projection","text":"The pipeline follows HGI contributions nevertheless only serves for reassurance since the study samples were carefully selected.","title":"1. Data handling and PCA projection"},{"location":"progs/#2-ggm","text":"The results are ready to report.","title":"2. GGM"},{"location":"progs/#3-wgcna","text":"This can be finalised according to the Science paper.","title":"3. 
WGCNA"},{"location":"progs/#4-pca-and-clustering","text":"The groupings based on unfiltered and DR-filtered proteins can be made on three phases altogether and instead of a classification indicator the first three PCs are used. The PLINK2 has been used in the pilot studies, but now fastGWA using double transformations of the phenotypic data similar to SCALLOP-Seq analysis. Amazingly, a standard assignment statement inside sapply() would produce .pheno / .mpheno containing the raw data. The file also includes experiments on normalisation.","title":"4. PCA and clustering"},{"location":"progs/#5-pgwas3","text":"The bgen files were extracted from a list of all samples, the variant IDs of which were for all RSids to allow for multiallelic loci. The (sb)atch file is extended to produce Q-Q/Manhattan/LocusZoom plots and extreme p values are possible for all plots. Note that LocusZoom 1.4 does not contain 1000Genomes build 37 genotypes for chromosome X and therefore they are supplemented with local files in the required format, namely, locuszoom_1.4/data/1000G/genotypes/2014-10-14/EUR/chrX.[bed, bim, fan] . Now that for the Manhattan plot call for VEP is necessary from 0_utils.sb , which also produces mean by genotype/dosage plots.","title":"5. pGWAS3"},{"location":"progs/#6-meta-analysis","text":"Internally, this follows from the SCALLOP/INF implementation, as designed analogous to a Makefile, i.e., 6_meta_analysis <task> where task = METAL_list, METAL_files, METAL_analysis, respectively in sequence. However, due to time limit on HPC, a call to .sb is made for meta-analysis. To extract significant variants one may resort to awk 'NR==1||$12<log(1e-6)/log(10)' 1433B-1.tbl , say.","title":"6. Meta-analysis"},{"location":"progs/#7-variant-identification","text":"An iterative merging scheme is employed; the HLA region is simplified but will be specifically handled. Somewhat paradoxically, forest plots are also obtained here 4 . 
A SLURM job is executed, to be followed by collection of results.","title":"7. Variant identification"},{"location":"progs/#8-hla-imputation5","text":"This is experimented on several software including HIBAG, CookHLA and SNP2HLA as desribed here . The whole cohort imputation requests resources exceeding the system limits, so a cardio SLURM job is used instead. The hped file from CookHLA (or converted from HIBAG) can be used by HATK for association analysis while the advantage of SNP2HLA is that binary ped files are ready for use as usual. Directories This is per Caprion project module load miniconda3/4.5.1 export csd3path=/rds/project/jmmh2/rds-jmmh2-projects/olink_proteomics/scallop/miniconda37 source ${csd3path}/bin/activate Name Description pgwas pGWAS METAL Meta-analysis HLA HLA imputation peptide_progs peptide analysis reports Reports Note that docs.sh copies pilot/utils directory of the pilot studies, so coding under that directory is preferable to avoid overwrite. To accommodate filteredd results, a suffix \"\" or \"_dr\" is applied when appropriate. \u21a9 workflow (experimental) module add ceuadmin/snakemake snakemake -s workflow/rules/cojo.smk -j1 snakemake -s workflow/rules/report.smk -j1 snakemake -s workflow/rules/cojo.smk -c --profile workflow and use --unlock when necessary. \u21a9 Protein GWAS GCTA/fastGWA employs MAF>=0.001 (~56%) and geno=0.1 so potentially we can have .bgen files as such to speed up. GCTA uses headerless phenotype files, generated by 5_pgwas.sh which is now unnecessary. \u21a9 Incomplete gamma function The .info files for proteins BROX and CT027 could not be obtained from METAL 2020-05-05 with the following error message, FATAL ERROR - a too large, ITMAX too small in gamma countinued fraction (gcf) An attempt was made to fix this and reported as a fixable issue to METAL GitHub respository ( https://github.com/statgen/METAL/issues/24 ). This has enabled Forest plots for the associate pQTLs. 
\u21a9 HLA A database of 3D structures of Major Histocompatibility Complex, https://www.histo.fyi/ Whole cohort imputation is feasible with a HIBAG reference panel, Locus A B C DPB1 DQA1 DQB1 DRB1 N 1857 2572 1866 1624 1740 1924 2436 SNPs 891 990 1041 689 948 979 891 while the reference panel is based on the 1000Genomes data (N=503) with SNP2HLA and CookHLA. It is of note that 1000G_REF.EUR.chr6.hg18.29mb-34mb.inT1DGC.markers in the 1000Genomes reference panel has 465 variants with HLA prefix and the partition is as follows, Locus A B C DPB1 DQA1 DQB1 DRB1 HLA_ 98 183 69 0 0 33 82 A recent update: PGG.HLA, https://pog.fudan.edu.cn/pggmhc/ , requires data submission. \u21a9","title":"8. HLA imputation5"}]}
\ No newline at end of file
+{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"Caprion data analysis Welcome! This repository/site is dedicated to protein/peptide quantitative trait analysis using the Caprion platform, which is organised chronologically/logistically into the following sections. Pilot studies autoencoder gwas2 Pilot studies Analysis Protein analysis Peptide analysis Miscellaneous analysis Additional information Caprion panel Notes Local file/web browsing A web-style navigation is furnished via a port number, e.g., 8000, cd /rds/project/rds-zuZwCZMsS0w/Caprion_proteomics/analysis module load ceuadmin/edge export pn=8000 if lsof -i :${pn}; then echo \"Port ${pn} is already in use' try another one.\" else python3 -m http.server ${pn} & server_pid=$! edge http://localhost:${pn} & fi where the port number can be released with kill $server_pid (can be checked with ps ). In case it does not show, use edge --user-data-dir=${TMPDIR} http://localhost:${pn} & in place of ~/.config/microsoft-edge , where TMPDIR is a directory name. One could browse files as well as mirrors of two web sites. Web site. This is from /site as above. SRCF. The mirror is within the following subdirectory: /srcf . Colocalisation. See /json/coloc.html. Note that many blanked plots such as A1AG1-* are due to misspecification of chromosomal positions which should otherwise be visible. Isotopes associated with >1 proteins, /dup/json/dup.htm To facilitate navigation, an index.html is created in place, so python3 -m http.server 8000 & is started from /rds/project/rds-zuZwCZMsS0w/Caprion_proteomics/analysis . Non-CSD3 browser(s) This approach seems less problematic with user-data-dir mentioned above. We can again set up tunneling from CSD3 with python3 -m http.server 8000 & hostname Once succeeded, we establish the connection elsewhere. 
ssh -4 -L 8080:127.0.0.1:8000 -fN jhz22@${hostname}.hpc.cam.ac.uk where hostname from CSD3 and ${hostname} have to be the same. We can then browse http://127.0.0.1:8080 .","title":""},{"location":"#caprion-data-analysis","text":"","title":"Caprion data analysis"},{"location":"#welcome","text":"This repository/site is dedicated to protein/peptide quantitative trait analysis using the Caprion platform, which is organised chronologically/logistically into the following sections.","title":"Welcome!"},{"location":"#pilot-studies","text":"autoencoder gwas2 Pilot studies","title":"Pilot studies"},{"location":"#analysis","text":"Protein analysis Peptide analysis Miscellaneous analysis","title":"Analysis"},{"location":"#additional-information","text":"Caprion panel Notes","title":"Additional information"},{"location":"#local-fileweb-browsing","text":"A web-style navigation is furnished via a port number, e.g., 8000, cd /rds/project/rds-zuZwCZMsS0w/Caprion_proteomics/analysis module load ceuadmin/edge export pn=8000 if lsof -i :${pn}; then echo \"Port ${pn} is already in use' try another one.\" else python3 -m http.server ${pn} & server_pid=$! edge http://localhost:${pn} & fi where the port number can be released with kill $server_pid (can be checked with ps ). In case it does not show, use edge --user-data-dir=${TMPDIR} http://localhost:${pn} & in place of ~/.config/microsoft-edge , where TMPDIR is a directory name. One could browse files as well as mirrors of two web sites. Web site. This is from /site as above. SRCF. The mirror is within the following subdirectory: /srcf . Colocalisation. See /json/coloc.html. Note that many blanked plots such as A1AG1-* are due to misspecification of chromosomal positions which should otherwise be visible. 
Isotopes associated with >1 proteins, /dup/json/dup.htm To facilitate navigation, an index.html is created in place, so python3 -m http.server 8000 & is started from /rds/project/rds-zuZwCZMsS0w/Caprion_proteomics/analysis .","title":"Local file/web browsing"},{"location":"#non-csd3-browsers","text":"This approach seems less problematic with user-data-dir mentioned above. We can again set up tunneling from CSD3 with python3 -m http.server 8000 & hostname Once succeeded, we establish the connection elsewhere. ssh -4 -L 8080:127.0.0.1:8000 -fN jhz22@${hostname}.hpc.cam.ac.uk where hostname from CSD3 and ${hostname} have to be the same. We can then browse http://127.0.0.1:8080 .","title":"Non-CSD3 browser(s)"},{"location":"Notes/","text":"Notes ( Sections I -- III are due to Claude ) I. Meta-data Isotope.Group.ID is a unique identifier for a group of isotopes that belong to the same peptide or molecule. In mass spectrometry, isotopes are atoms of the same element that have the same number of protons but differ in the number of neutrons. This ID helps to group together isotopes that arise from the same peptide, allowing for easier identification and analysis. Protein contains the name or identifier of the protein that the peptide (or molecule) is derived from. This information is typically obtained by searching the MS data against a protein database. Modified.Peptide.Sequence is the amino acid sequence of the peptide, including any post-translational modifications (PTMs) that have been identified. PTMs are chemical modifications that occur after protein synthesis, such as phosphorylation, ubiquitination, or methylation. The sequence is usually represented in a standard format, such as using lowercase letters for modified residues. Monoisotopic.m/z is the monoisotopic mass-to-charge ratio (m/z) of the peptide or molecule. 
The monoisotopic mass is the mass of the most abundant isotope of each element in the molecule, which is typically the lightest isotope (e.g., 12C, 1H, 14N, 16O, etc.). This value is used as a reference point for identifying the peptide or molecule. Max.Isotope.Time.Centroid is the time centroid (or apex) of the most intense isotope in the isotope group. In liquid chromatography-mass spectrometry (LC-MS), peptides are separated based on their retention time (the time it takes for the peptide to elute from the column). The time centroid is the time point at which the peptide signal is most intense, which can be used to quantify the peptide abundance. Charge is the charge state of the peptide or molecule. In mass spectrometry, peptides can be ionized to different charge states (e.g., +1, +2, +3, etc.), which affects their mass-to-charge ratio (m/z). The charge state is an important parameter for identifying peptides and molecules. They are invaluable for analyzing and interpreting MS data, including peptide identification, quantification, and characterization of post-translational modifications. II. MS1/MS2 In mass spectrometry-based proteomics, the typical workflow for identifying peptides and proteins involves using tandem mass spectrometry (MS/MS or MS2). In this process, precursor ions (peptides) are selected in the first stage of mass spectrometry (MS1) and then fragmented to produce a series of smaller ions in the second stage (MS2). The resulting fragment ions (product ions) are analyzed to infer the sequence of the peptide and, by extension, identify the proteins from which they were derived. However, it is possible to infer peptides and proteins using only MS1 data through a process known as \"MS1-only\" or \"untargeted\" analysis. 
This approach can be particularly useful in the following scenarios: Label-based quantification : Techniques like SILAC (Stable Isotope Labeling by Amino acids in Cell culture) or chemical labeling (e.g., TMT, iTRAQ) rely on MS1 data for quantification. The mass shift introduced by labels allows for the direct comparison of peptide abundances based on their MS1 ion intensities. Label-free quantification : Proteins can be quantified by comparing the intensities of their corresponding peptide ions in MS1 across different samples. This requires accurate mass and retention time alignment and often uses algorithms to detect and quantify features (peptide ions) consistently across multiple runs. Accurate Mass and Time tags (AMT) : This approach relies on a previously established library of peptide identifications, where each peptide is characterized by its accurate mass and normalized retention time. In subsequent analyses, peptides can be inferred by matching the observed accurate mass and retention time to the library without the need for MS2 fragmentation. Data-independent acquisition (DIA) : In some DIA workflows, proteins can be inferred from MS1 data when coupled with complex data analysis strategies and spectral libraries. It is important to note that while DIA collects MS1 spectra, it also involves the simultaneous fragmentation of all ions in a given mass range, and thus MS2-level data is typically available and used for identification. It is important to note that MS1-only approaches may have limitations in terms of identification specificity and sensitivity compared to traditional MS2-based methods. MS1-based protein inference is generally less confident because it lacks sequence-specific information that can only be obtained from fragment ions in MS2. 
For this reason, MS1-based methods are often complemented by MS2 data or rely on extensive peptide libraries and sophisticated computational algorithms to increase the confidence of peptide and protein identification. III. OpenMS/crux/MaxQuant/FragPipe OpenMS, Crux, MaxQuant, and FragPipe are all prominent software platforms for analyzing proteomics data, each with its own strengths and weaknesses. Here's a comparison: OpenMS: Focus: Provides a flexible and open-source framework for developing and executing various mass spectrometry data analysis workflows. Strengths: Highly modular and customizable: Offers a vast collection of algorithms and tools that can be combined and customized to create tailored workflows. Open-source and extensible: Encourages community contributions and allows for the development of new tools and algorithms. Supports various data formats and instruments: Compatible with a wide range of data formats and mass spectrometry platforms. Strong support for metabolomics data: While primarily used for proteomics, it also offers tools for analyzing metabolomics data. Limitations: Steeper learning curve: Requires programming knowledge and familiarity with command-line interfaces. Less user-friendly: Lacks a comprehensive graphical user interface (GUI), making it less accessible for beginners. Limited pre-built workflows: While highly customizable, it requires more manual effort to set up standard workflows compared to MaxQuant or FragPipe. Crux: Focus: A command-line toolkit designed for peptide identification, protein quantification, and statistical validation of proteomics data. Strengths: Open-source and well-documented: Provides clear documentation and allows for community contributions. Fast and efficient: Known for its computational efficiency and speed. Strong statistical validation: Offers rigorous statistical methods for validating peptide and protein identifications. 
Supports various search engines: Compatible with multiple search engines, including Comet and Tide. Limitations: Command-line interface only: Requires familiarity with command-line operations. Less user-friendly: Lacks a GUI, making it less accessible for beginners. Limited pre-built workflows: Requires more manual effort to set up complete analysis pipelines. MaxQuant: Focus: Primarily known for its robust and sensitive peptide and protein identification and quantification using its proprietary Andromeda search engine. Strengths: User-friendly interface: Provides a GUI for easier data processing and analysis. Robust and sensitive identification and quantification: Offers high-quality results for standard DDA-based proteomics experiments. Strong support for label-free quantification (LFQ) and match between runs (MBR). Extensive post-translational modification (PTM) analysis: Offers comprehensive support for identifying and quantifying various PTMs. Limitations: Less flexible for specialized workflows: Primarily designed for standard bottom-up proteomics experiments. Limited support for DIA data: While it can handle DIA data, it's not its primary strength. Closed-source: The core algorithms are not open-source, limiting community contributions and customization. FragPipe: Focus: Offers a more modular and flexible platform with various tools for different proteomics workflows, including both DDA and DIA. Strengths: Versatile and modular: Includes a suite of tools for various tasks, including peptide identification, quantification, and statistical analysis. Extensive support for DIA data: Features DIA-Umpire, a dedicated tool for analyzing DIA data using various algorithms. Highly accurate and sensitive quantification: Employs IonQuant for precise quantification using extracted ion chromatograms. Open-source and actively developed: Encourages community contributions and continuous improvement. 
Limitations: Steeper learning curve: Primarily operates through a command-line interface, requiring more technical expertise. Less user-friendly interface: Lacks a comprehensive GUI, making it less intuitive for beginners. Here's a table summarizing the key differences: Feature OpenMS Crux MaxQuant FragPipe Primary Focus Flexible Framework Peptide ID & Quantification Peptide & Protein ID/Quant Modular Platform Open Source Yes Yes No (core algorithms) Yes User Interface Primarily CLI CLI GUI Primarily CLI Learning Curve Steep Moderate Easier Steep Flexibility Highly Flexible Moderate Less Flexible More Flexible DIA Support Limited Limited Limited Extensive (DIA-Umpire) Quantification Methods Various Various LFQ, iBAQ IonQuant PTM Analysis Supported Supported Extensive PTM-Shepherd Community Support Strong Moderate Moderate Strong Choosing the Right Tool: OpenMS: Ideal for researchers with programming skills who need a highly customizable and extensible platform for developing specialized workflows. Crux: Suitable for researchers comfortable with command-line interfaces and seeking fast and efficient tools for peptide identification, protein quantification, and statistical validation. MaxQuant: Best for researchers looking for a user-friendly platform with robust performance for standard DDA-based proteomics experiments, especially those focusing on label-free quantification. FragPipe: Ideal for researchers seeking a highly flexible and customizable platform for various workflows, including DIA analysis, and who are comfortable with command-line operations. Remember to consider your specific research goals, data type, and bioinformatics expertise when choosing the best tool for your needs. You might even explore combining different tools to leverage their unique strengths for different aspects of your analysis. Notes on CSD3 . crux/4.1 is functional (along with comet/2024.01.1 & kojak/2.0.0a22 ) on CSD3 but crux/4.2 is not. 
FragPipe/22.0 does offer a comprehensive GUI. Moreover, MetaMorpheus/1.0.5 , FlashLFQ/1.2.6 and MS Amanda/3.0.21.532 are also available from CSD3. IV. Galaxy tutorials Web: https://usegalaxy.org/ (https://training.galaxyproject.org/training-material/) docker run -p 8080:80 quay.io/galaxy/introduction-training Visit http://localhost:8080 . Login as admin with password password to access. See https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/protein-id-oms/tutorial.html V. PoGo Fast Mapping of Peptides to Genomic Coordinates for Proteogenomic Analyses, https://www.sanger.ac.uk/tool/pogo/ , GitHub, https://github.com/cschlaffner/PoGo . It uses transcript translations and reference gene annotations to identify the genomic loci of peptides and post-translational modifications. Multiple occurrences of peptides in the input data resulting in the same genomic loci will be collapsed as a single occurrence in the output. The input format is a tab delimited file with four columns with file extensions such as .pogo, .txt, and *.tsv. Column Column header Description 1 Sample Name of sample or experiment 2 Peptide Peptide sequence with PSI-MS modification names in round brackets following the modified amino acid, e.g.
PEPT(Phospho)IDE for a phosphorylated threonine 3 PSMs Number of peptide-spectrum matches (PSMs) for the given peptide, including those redundantly identified (peptides can be \u201cseen\u201d more than once in a run) 4 Quant Quantitative value for the given peptide in the given sample An example is established as follows, wget -S ftp://ftp.sanger.ac.uk/pub/teams/17/software/PoGo/PoGo_Testprocedures.zip unzip PoGo_Testprocedures.zip cd PoGo_Testprocedures/Testfiles module load ceuadmin/PoGo for Peptides in Testfile_experimental Testfile_small do PoGo -fasta input/gencode.v25.pc_translations.fa -gtf input/gencode.v25.annotation.gtf -in input/${Peptides}.txt done # expanded version /usr/local/Cluster-Apps/ceuadmin/PoGo/1.0.0/bin/PoGo \\ -fasta /usr/local/Cluster-Apps/ceuadmin/PoGo/1.0.0/PoGo_Testprocedures/Testfiles/input/gencode.v25.pc_translations.fa \\ -gtf /usr/local/Cluster-Apps/ceuadmin/PoGo/1.0.0/PoGo_Testprocedures/Testfiles/input/gencode.v25.annotation.gtf \\ -in /usr/local/Cluster-Apps/ceuadmin/PoGo/1.0.0/PoGo_Testprocedures/Testfiles/input/Testfile_experimental.txt \\ -format ALL \\ -mm 0 Output files are also contained in the input/ directory. GENCODE annotation data are available from https://www.gencodegenes.org/human/ and https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/ . The Java GUI, https://github.com/cschlaffner/PoGoGUI , is run as follows, java -jar PoGoGUI-v1.0.0.jar which requires PoGo executable as well. The source is compiled with maven, https://maven.apache.org/ , e.g., module load maven-3.5.0-gcc-5.4.0-3sgaeze mvn install assuming that pom.xml is available, e.g., /usr/local/Cluster-Apps/ceuadmin/PoGo/1.0.0/PoGoGUI/PoGoGUI . VI.
Proteoform Analysis ETH Zurich, U Toronto Team Develops Tool for Bottom-Up Proteomics Proteoform Analysis Jul 28, 2021 | Adam Bonislawski NEW YORK \u2013 A team led by researchers at ETH Zurich and the University of Toronto has developed a tool that allows for the detection of protein proteoforms in bottom-up proteomics data. Described in a paper published in June in Nature Communications, the tool, called COPF (COrrelation-based functional ProteoForm) uses peptide correlation analysis to detect differences in proteoform populations across different samples or conditions and could aid researchers as they seek to better understand the role different protein forms play in biology and disease. The human genome is thought to have around 20,000 protein coding genes, but many of these 20,000 proteins exist in the body in various forms, differentiated by, for instance, post-translational modifications or amino acid substitutions. These different forms are called proteoforms, and it is widely believed that biological processes are guided not just by the proteins present but by what proteoforms are present and in what proportions. Traditional bottom-up proteomics workflows have provided only limited insight into proteoform populations however, due to the fact that the presence of a particular protein is typically inferred by the detection of just a few of its peptides and that digesting proteins into peptides for mass spec analysis makes it near impossible to link a modified peptide back to a particular proteoform. Some proteomics researchers have addressed this issue by moving to top-down proteomics, which looks at intact proteins, allowing them to better distinguish between different proteoforms. Top-down proteomics is very technically challenging, however, and is not yet able to analyze proteins with the breadth and depth of bottom-up workflows. 
Recently, the development of more reproducible and higher-throughput bottom-up workflows, and particular workflows using data independent-acquisition (DIA) mass spectrometry, have allowed researchers like the Nature Communications authors to apply peptide correlation analysis to the study of proteoforms. Peptide correlation analysis looks at differences in peptide behavior within and across proteins in bottom-up data. Researchers have developed a number of approaches for turning peptide measurements into protein data, with most working under the assumption that peptides from the same protein will behave the same way. In practice, though, that isn't the case. On one hand, there are a number of technical reasons why two peptides from the same protein may not behave the same way. For instance, different digestion efficiencies could lead to some peptides being more abundant than others. Different ionization efficiencies could similarly make one peptide more likely than another to be detected by the mass spec. The presence of different proteoforms could also play a role. For instance, if a protein is present in both a full-length and truncated form, expression changes observed in the full-length form wouldn't be observable if the peptide being measured wasn't present in the truncated form. Not only would this throw off protein-level quantitation, but it would also mask relative changes in the two protein forms that could be biologically important. A major challenge to applying this insight has been determining which differences in peptide behavior reflect real technical or biological variation and which are just noise, noted Hannes R\u00f6st, research chair in mass spectrometry-based personalized medicine at the University of Toronto and an author on the Nature Communications study. \"In many cases [such variation] was noise,\" he said. 
\"When you look at traditional shotgun proteomics workflows and data analyses, really the power is not at the peptide-level quantification but at the protein level from the aggregation of multiple peptides. On the peptide level you see a lot of noise, and I think that has prevented us from using this observation that individual peptides could yield a lot of interested information because people really only looked at the protein-level data, because that is what they trusted.\" R\u00f6st said that the development of targeted protein quantitation approaches like multiple-reaction monitoring (MRM) has demonstrated that individual peptides can be measured with high accuracy, and the development of DIA mass spec approaches has enabled MRM-style peptide quantitation at the proteome scale. At the same time, improvements in mass spec technology have allowed researchers to collect the kind of large and reproducible datasets required for peptide correlation analysis, he said. \"These are types of experiments we wouldn't have imagined 10 years ago, because for correlation-based approaches to work, you need a relatively large number of samples, and you need low variance,\" he said. \"We are not detecting [proteoforms] that are not changing between different [conditions], we are only detecting those that change. And for this to work we need to have multiple replicates and we need to have different conditions and to be able to measure these peptides with high quantitative accuracy across these conditions.\" The COPF tool looks at the intensities of peptides coming from a particular protein across all the samples measured in an experiment and then calculates peptide correlations for all the pairs of peptides coming from that protein and uses hierarchical clustering to divide the peptides into two clusters. 
It then scores the likelihood that multiple proteoforms of a protein are present by comparing the level of peptide correlation between the clusters to the level of in-cluster variation. The tool does not identify the specific modifications or variations that distinguish the different proteoforms but rather the peptides that appear to differentiate between the forms of the protein in the different biological contexts investigated. Analyzing a DIA dataset that looked at five different tissue types across eight different mice, COPF identified 63 proteins that exhibited different proteoform groups, including proteins with known tissue-specific splice variants. The researchers also identified proteoforms created by proteolytic and autocatalytic cleavage and phosphorylation, indicating, they wrote, that the tool is \"agnostic to the different mechanisms by which proteoforms can be generated inside the cell.\" The development of COPF follows the publication last year of a study by researchers at Barts Cancer Institute and the University of Wisconsin-Madison detailing another peptide correlation analysis tool for identifying proteoforms in bottom-up data called PeCorA. Unlike COPF, which requires proteoforms to differ by two or more peptides, PeCorA can detect proteoforms based on single peptide differences. This makes it a potentially more sensitive tool but also less specific than COPF, R\u00f6st said. More generally, he said that he expected ongoing improvements in mass spec technology would further improve peptide correlation-based approaches like COPF and PeCorA by boosting peptide coverage. \"To kind of cover every possible protein isoform we would need to have complete coverage of every protein, and unfortunately we are currently quite far away from having peptide-level coverage of every protein,\" he said. 
\"I think that is currently one of the limitations where we are kind of hitting a wall.\" R\u00f6st added that his lab has begun acquiring data on Bruker's timsTOF Pro platform, \"and there we definitely see both an increase in protein coverage and also in the number of peptides we can measure.\" \"That's why I'm very optimistic that while this is just the first implementation of the method, the data we are producing at this moment is much more complete, and therefore I think it would be even more suitable to our approach than the data we used in the paper,\" he said.","title":"Notes"},{"location":"Notes/#notes","text":"( Sections I -- III are due to Claude )","title":"Notes"},{"location":"Notes/#i-meta-data","text":"Isotope.Group.ID is a unique identifier for a group of isotopes that belong to the same peptide or molecule. In mass spectrometry, isotopes are atoms of the same element that have the same number of protons but differ in the number of neutrons. This ID helps to group together isotopes that arise from the same peptide, allowing for easier identification and analysis. Protein contains the name or identifier of the protein that the peptide (or molecule) is derived from. This information is typically obtained by searching the MS data against a protein database. Modified.Peptide.Sequence is the amino acid sequence of the peptide, including any post-translational modifications (PTMs) that have been identified. PTMs are chemical modifications that occur after protein synthesis, such as phosphorylation, ubiquitination, or methylation. The sequence is usually represented in a standard format, such as using lowercase letters for modified residues. Monoisotopic.m/z is the monoisotopic mass-to-charge ratio (m/z) of the peptide or molecule. The monoisotopic mass is the mass of the most abundant isotope of each element in the molecule, which is typically the lightest isotope (e.g., 12C, 1H, 14N, 16O, etc.). 
This value is used as a reference point for identifying the peptide or molecule. Max.Isotope.Time.Centroid is the time centroid (or apex) of the most intense isotope in the isotope group. In liquid chromatography-mass spectrometry (LC-MS), peptides are separated based on their retention time (the time it takes for the peptide to elute from the column). The time centroid is the time point at which the peptide signal is most intense, which can be used to quantify the peptide abundance. Charge is the charge state of the peptide or molecule. In mass spectrometry, peptides can be ionized to different charge states (e.g., +1, +2, +3, etc.), which affects their mass-to-charge ratio (m/z). The charge state is an important parameter for identifying peptides and molecules. They are invaluable for analyzing and interpreting MS data, including peptide identification, quantification, and characterization of post-translational modifications.","title":"I. Meta-data"},{"location":"Notes/#ii-ms1ms2","text":"In mass spectrometry-based proteomics, the typical workflow for identifying peptides and proteins involves using tandem mass spectrometry (MS/MS or MS2). In this process, precursor ions (peptides) are selected in the first stage of mass spectrometry (MS1) and then fragmented to produce a series of smaller ions in the second stage (MS2). The resulting fragment ions (product ions) are analyzed to infer the sequence of the peptide and, by extension, identify the proteins from which they were derived. However, it is possible to infer peptides and proteins using only MS1 data through a process known as \"MS1-only\" or \"untargeted\" analysis. This approach can be particularly useful in the following scenarios: Label-based quantification : Techniques like SILAC (Stable Isotope Labeling by Amino acids in Cell culture) or chemical labeling (e.g., TMT, iTRAQ) rely on MS1 data for quantification. 
The mass shift introduced by labels allows for the direct comparison of peptide abundances based on their MS1 ion intensities. Label-free quantification : Proteins can be quantified by comparing the intensities of their corresponding peptide ions in MS1 across different samples. This requires accurate mass and retention time alignment and often uses algorithms to detect and quantify features (peptide ions) consistently across multiple runs. Accurate Mass and Time tags (AMT) : This approach relies on a previously established library of peptide identifications, where each peptide is characterized by its accurate mass and normalized retention time. In subsequent analyses, peptides can be inferred by matching the observed accurate mass and retention time to the library without the need for MS2 fragmentation. Data-independent acquisition (DIA) : In some DIA workflows, proteins can be inferred from MS1 data when coupled with complex data analysis strategies and spectral libraries. It is important to note that while DIA collects MS1 spectra, it also involves the simultaneous fragmentation of all ions in a given mass range, and thus MS2-level data is typically available and used for identification. It is important to note that MS1-only approaches may have limitations in terms of identification specificity and sensitivity compared to traditional MS2-based methods. MS1-based protein inference is generally less confident because it lacks sequence-specific information that can only be obtained from fragment ions in MS2. For this reason, MS1-based methods are often complemented by MS2 data or rely on extensive peptide libraries and sophisticated computational algorithms to increase the confidence of peptide and protein identification.","title":"II. MS1/MS2"},{"location":"Notes/#iii-openmscruxmaxquantfragpipe","text":"OpenMS, Crux, MaxQuant, and FragPipe are all prominent software platforms for analyzing proteomics data, each with its own strengths and weaknesses. 
Here's a comparison: OpenMS: Focus: Provides a flexible and open-source framework for developing and executing various mass spectrometry data analysis workflows. Strengths: Highly modular and customizable: Offers a vast collection of algorithms and tools that can be combined and customized to create tailored workflows. Open-source and extensible: Encourages community contributions and allows for the development of new tools and algorithms. Supports various data formats and instruments: Compatible with a wide range of data formats and mass spectrometry platforms. Strong support for metabolomics data: While primarily used for proteomics, it also offers tools for analyzing metabolomics data. Limitations: Steeper learning curve: Requires programming knowledge and familiarity with command-line interfaces. Less user-friendly: Lacks a comprehensive graphical user interface (GUI), making it less accessible for beginners. Limited pre-built workflows: While highly customizable, it requires more manual effort to set up standard workflows compared to MaxQuant or FragPipe. Crux: Focus: A command-line toolkit designed for peptide identification, protein quantification, and statistical validation of proteomics data. Strengths: Open-source and well-documented: Provides clear documentation and allows for community contributions. Fast and efficient: Known for its computational efficiency and speed. Strong statistical validation: Offers rigorous statistical methods for validating peptide and protein identifications. Supports various search engines: Compatible with multiple search engines, including Comet and Tide. Limitations: Command-line interface only: Requires familiarity with command-line operations. Less user-friendly: Lacks a GUI, making it less accessible for beginners. Limited pre-built workflows: Requires more manual effort to set up complete analysis pipelines. 
MaxQuant: Focus: Primarily known for its robust and sensitive peptide and protein identification and quantification using its proprietary Andromeda search engine. Strengths: User-friendly interface: Provides a GUI for easier data processing and analysis. Robust and sensitive identification and quantification: Offers high-quality results for standard DDA-based proteomics experiments. Strong support for label-free quantification (LFQ) and match between runs (MBR). Extensive post-translational modification (PTM) analysis: Offers comprehensive support for identifying and quantifying various PTMs. Limitations: Less flexible for specialized workflows: Primarily designed for standard bottom-up proteomics experiments. Limited support for DIA data: While it can handle DIA data, it's not its primary strength. Closed-source: The core algorithms are not open-source, limiting community contributions and customization. FragPipe: Focus: Offers a more modular and flexible platform with various tools for different proteomics workflows, including both DDA and DIA. Strengths: Versatile and modular: Includes a suite of tools for various tasks, including peptide identification, quantification, and statistical analysis. Extensive support for DIA data: Features DIA-Umpire, a dedicated tool for analyzing DIA data using various algorithms. Highly accurate and sensitive quantification: Employs IonQuant for precise quantification using extracted ion chromatograms. Open-source and actively developed: Encourages community contributions and continuous improvement. Limitations: Steeper learning curve: Primarily operates through a command-line interface, requiring more technical expertise. Less user-friendly interface: Lacks a comprehensive GUI, making it less intuitive for beginners. 
Here's a table summarizing the key differences: Feature OpenMS Crux MaxQuant FragPipe Primary Focus Flexible Framework Peptide ID & Quantification Peptide & Protein ID/Quant Modular Platform Open Source Yes Yes No (core algorithms) Yes User Interface Primarily CLI CLI GUI Primarily CLI Learning Curve Steep Moderate Easier Steep Flexibility Highly Flexible Moderate Less Flexible More Flexible DIA Support Limited Limited Limited Extensive (DIA-Umpire) Quantification Methods Various Various LFQ, iBAQ IonQuant PTM Analysis Supported Supported Extensive PTM-Shepherd Community Support Strong Moderate Moderate Strong Choosing the Right Tool: OpenMS: Ideal for researchers with programming skills who need a highly customizable and extensible platform for developing specialized workflows. Crux: Suitable for researchers comfortable with command-line interfaces and seeking fast and efficient tools for peptide identification, protein quantification, and statistical validation. MaxQuant: Best for researchers looking for a user-friendly platform with robust performance for standard DDA-based proteomics experiments, especially those focusing on label-free quantification. FragPipe: Ideal for researchers seeking a highly flexible and customizable platform for various workflows, including DIA analysis, and who are comfortable with command-line operations. Remember to consider your specific research goals, data type, and bioinformatics expertise when choosing the best tool for your needs. You might even explore combining different tools to leverage their unique strengths for different aspects of your analysis. Notes on CSD3 . crux/4.1 is functional (along with comet/2024.01.1 & kojak/2.0.0a22 ) on CSD3 but crux/4.2 is not. FragPipe/22.0 does offer a comprehensive GUI. Moreover, MetaMorpheus/1.0.5 , FlashLFQ/1.2.6 and MS Amanda/3.0.21.532 are also available from CSD3.","title":"III. 
OpenMS/crux/MaxQuant/FragPipe"},{"location":"Notes/#iv-galaxy-tutorials","text":"Web: https://usegalaxy.org/ (https://training.galaxyproject.org/training-material/) docker run -p 8080:80 quay.io/galaxy/introduction-training Visit http://localhost:8080 . Login as admin with password password to access. See https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/protein-id-oms/tutorial.html","title":"IV. Galaxy tutorials"},{"location":"Notes/#v-pogo","text":"Fast Mapping of Peptides to Genomic Coordinates for Proteogenomic Analyses, https://www.sanger.ac.uk/tool/pogo/ , GitHub, https://github.com/cschlaffner/PoGo . It uses transcript translations and reference gene annotations to identify the genomic loci of peptides and post-translational modifications. Multiple occurrences of peptides in the input data resulting in the same genomic loci will be collapsed as a single occurrence in the output. The input format is a tab delimited file with four columns with file extensions such as .pogo, .txt, and *.tsv. Column Column header Description 1 Sample Name of sample or experiment 2 Peptide Peptide sequence with PSI-MS modification names in round brackets following the modified amino acid, e.g.
PEPT(Phospho)IDE for a phosphorylated threonine 3 PSMs Number of peptide-spectrum matches (PSMs) for the given peptide, including those redundantly identified (peptides can be \u201cseen\u201d more than once in a run) 4 Quant Quantitative value for the given peptide in the given sample An example is established as follows, wget -S ftp://ftp.sanger.ac.uk/pub/teams/17/software/PoGo/PoGo_Testprocedures.zip unzip PoGo_Testprocedures.zip cd PoGo_Testprocedures/Testfiles module load ceuadmin/PoGo for Peptides in Testfile_experimental Testfile_small do PoGo -fasta input/gencode.v25.pc_translations.fa -gtf input/gencode.v25.annotation.gtf -in input/${Peptides}.txt done # expanded version /usr/local/Cluster-Apps/ceuadmin/PoGo/1.0.0/bin/PoGo \\ -fasta /usr/local/Cluster-Apps/ceuadmin/PoGo/1.0.0/PoGo_Testprocedures/Testfiles/input/gencode.v25.pc_translations.fa \\ -gtf /usr/local/Cluster-Apps/ceuadmin/PoGo/1.0.0/PoGo_Testprocedures/Testfiles/input/gencode.v25.annotation.gtf \\ -in /usr/local/Cluster-Apps/ceuadmin/PoGo/1.0.0/PoGo_Testprocedures/Testfiles/input/Testfile_experimental.txt \\ -format ALL \\ -mm 0 Output files are also contained in the input/ directory. GENCODE annotation data are available from https://www.gencodegenes.org/human/ and https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/ . The Java GUI, https://github.com/cschlaffner/PoGoGUI , is run as follows, java -jar PoGoGUI-v1.0.0.jar which requires PoGo executable as well. The source is compiled with maven, https://maven.apache.org/ , e.g., module load maven-3.5.0-gcc-5.4.0-3sgaeze mvn install assuming that pom.xml is available, e.g., /usr/local/Cluster-Apps/ceuadmin/PoGo/1.0.0/PoGoGUI/PoGoGUI .","title":"V. PoGo"},{"location":"Notes/#vi-proteoform-analysis","text":"","title":"VI.
Proteoform Analysis"},{"location":"Notes/#eth-zurich-u-toronto-team-develops-tool-for-bottom-up-proteomics-proteoform-analysis","text":"Jul 28, 2021 | Adam Bonislawski NEW YORK \u2013 A team led by researchers at ETH Zurich and the University of Toronto has developed a tool that allows for the detection of protein proteoforms in bottom-up proteomics data. Described in a paper published in June in Nature Communications, the tool, called COPF (COrrelation-based functional ProteoForm) uses peptide correlation analysis to detect differences in proteoform populations across different samples or conditions and could aid researchers as they seek to better understand the role different protein forms play in biology and disease. The human genome is thought to have around 20,000 protein coding genes, but many of these 20,000 proteins exist in the body in various forms, differentiated by, for instance, post-translational modifications or amino acid substitutions. These different forms are called proteoforms, and it is widely believed that biological processes are guided not just by the proteins present but by what proteoforms are present and in what proportions. Traditional bottom-up proteomics workflows have provided only limited insight into proteoform populations however, due to the fact that the presence of a particular protein is typically inferred by the detection of just a few of its peptides and that digesting proteins into peptides for mass spec analysis makes it near impossible to link a modified peptide back to a particular proteoform. Some proteomics researchers have addressed this issue by moving to top-down proteomics, which looks at intact proteins, allowing them to better distinguish between different proteoforms. Top-down proteomics is very technically challenging, however, and is not yet able to analyze proteins with the breadth and depth of bottom-up workflows. 
Recently, the development of more reproducible and higher-throughput bottom-up workflows, and particular workflows using data independent-acquisition (DIA) mass spectrometry, have allowed researchers like the Nature Communications authors to apply peptide correlation analysis to the study of proteoforms. Peptide correlation analysis looks at differences in peptide behavior within and across proteins in bottom-up data. Researchers have developed a number of approaches for turning peptide measurements into protein data, with most working under the assumption that peptides from the same protein will behave the same way. In practice, though, that isn't the case. On one hand, there are a number of technical reasons why two peptides from the same protein may not behave the same way. For instance, different digestion efficiencies could lead to some peptides being more abundant than others. Different ionization efficiencies could similarly make one peptide more likely than another to be detected by the mass spec. The presence of different proteoforms could also play a role. For instance, if a protein is present in both a full-length and truncated form, expression changes observed in the full-length form wouldn't be observable if the peptide being measured wasn't present in the truncated form. Not only would this throw off protein-level quantitation, but it would also mask relative changes in the two protein forms that could be biologically important. A major challenge to applying this insight has been determining which differences in peptide behavior reflect real technical or biological variation and which are just noise, noted Hannes R\u00f6st, research chair in mass spectrometry-based personalized medicine at the University of Toronto and an author on the Nature Communications study. \"In many cases [such variation] was noise,\" he said. 
\"When you look at traditional shotgun proteomics workflows and data analyses, really the power is not at the peptide-level quantification but at the protein level from the aggregation of multiple peptides. On the peptide level you see a lot of noise, and I think that has prevented us from using this observation that individual peptides could yield a lot of interested information because people really only looked at the protein-level data, because that is what they trusted.\" R\u00f6st said that the development of targeted protein quantitation approaches like multiple-reaction monitoring (MRM) has demonstrated that individual peptides can be measured with high accuracy, and the development of DIA mass spec approaches has enabled MRM-style peptide quantitation at the proteome scale. At the same time, improvements in mass spec technology have allowed researchers to collect the kind of large and reproducible datasets required for peptide correlation analysis, he said. \"These are types of experiments we wouldn't have imagined 10 years ago, because for correlation-based approaches to work, you need a relatively large number of samples, and you need low variance,\" he said. \"We are not detecting [proteoforms] that are not changing between different [conditions], we are only detecting those that change. And for this to work we need to have multiple replicates and we need to have different conditions and to be able to measure these peptides with high quantitative accuracy across these conditions.\" The COPF tool looks at the intensities of peptides coming from a particular protein across all the samples measured in an experiment and then calculates peptide correlations for all the pairs of peptides coming from that protein and uses hierarchical clustering to divide the peptides into two clusters. 
It then scores the likelihood that multiple proteoforms of a protein are present by comparing the level of peptide correlation between the clusters to the level of in-cluster variation. The tool does not identify the specific modifications or variations that distinguish the different proteoforms but rather the peptides that appear to differentiate between the forms of the protein in the different biological contexts investigated. Analyzing a DIA dataset that looked at five different tissue types across eight different mice, COPF identified 63 proteins that exhibited different proteoform groups, including proteins with known tissue-specific splice variants. The researchers also identified proteoforms created by proteolytic and autocatalytic cleavage and phosphorylation, indicating, they wrote, that the tool is \"agnostic to the different mechanisms by which proteoforms can be generated inside the cell.\" The development of COPF follows the publication last year of a study by researchers at Barts Cancer Institute and the University of Wisconsin-Madison detailing another peptide correlation analysis tool for identifying proteoforms in bottom-up data called PeCorA. Unlike COPF, which requires proteoforms to differ by two or more peptides, PeCorA can detect proteoforms based on single peptide differences. This makes it a potentially more sensitive tool but also less specific than COPF, R\u00f6st said. More generally, he said that he expected ongoing improvements in mass spec technology would further improve peptide correlation-based approaches like COPF and PeCorA by boosting peptide coverage. \"To kind of cover every possible protein isoform we would need to have complete coverage of every protein, and unfortunately we are currently quite far away from having peptide-level coverage of every protein,\" he said. 
\"I think that is currently one of the limitations where we are kind of hitting a wall.\" R\u00f6st added that his lab has begun acquiring data on Bruker's timsTOF Pro platform, \"and there we definitely see both an increase in protein coverage and also in the number of peptides we can measure.\" \"That's why I'm very optimistic that while this is just the first implementation of the method, the data we are producing at this moment is much more complete, and therefore I think it would be even more suitable to our approach than the data we used in the paper,\" he said.","title":"ETH Zurich, U Toronto Team Develops Tool for Bottom-Up Proteomics Proteoform Analysis"},{"location":"misc/","text":"Miscellaneous analysis This section accommodates many largely independent tasks. Implementation might well be generic so that both proteins and peptides are covered. Programs and applications These are summarised in the following table, Program Description coloc.sb Coloc(alisation) analysis csq.sh Consequences of variants Caprion_deCODE_UKB_PPP.sh Caprion/deCODE/UKB-PPP replication eSet.sh ExpresssionSet implementations glmnet_pense.sh glmnet/pense modeling impute.sb imputation experiments json.sh JSON file generation peptideAssociationPlot.sh protein Manhattan-peptide signal plots dup-pgwas.sh pGWAS for duplicated proteins dup-extract.sh pQTL extractions dup-json.sh LocusZoom.js plots dup-plot.sh pQTL plots dup-tbl.R pQTL table pqtlGWAS.R pQTL-GWAS lookup tables.sh Supplementary-Tables.xlsx generator ToDo.sh various staged experiments NB: coloc.sb alternatively calls coloc.R . impute.sb employs impute_parallel() when N(isotope groups) > 500. Nevertheless, when coming to protein requantification this is an option to use the orginal intensity data. Legacy codes compare.sb . earlier contrast with deCODE/UKB-PPP. inf1.sh . snapshot from SCALLOP/INF meta-analysis. 
Created on 9/12/2024","title":"Miscellaneous analysis"},{"location":"misc/#miscellaneous-analysis","text":"This section accommodates many largely independent tasks. Implementation might well be generic so that both proteins and peptides are covered.","title":"Miscellaneous analysis"},{"location":"misc/#programs-and-applications","text":"These are summarised in the following table, Program Description coloc.sb Coloc(alisation) analysis csq.sh Consequences of variants Caprion_deCODE_UKB_PPP.sh Caprion/deCODE/UKB-PPP replication eSet.sh ExpressionSet implementations glmnet_pense.sh glmnet/pense modeling impute.sb imputation experiments json.sh JSON file generation peptideAssociationPlot.sh protein Manhattan-peptide signal plots dup-pgwas.sh pGWAS for duplicated proteins dup-extract.sh pQTL extractions dup-json.sh LocusZoom.js plots dup-plot.sh pQTL plots dup-tbl.R pQTL table pqtlGWAS.R pQTL-GWAS lookup tables.sh Supplementary-Tables.xlsx generator ToDo.sh various staged experiments NB: coloc.sb alternatively calls coloc.R . impute.sb employs impute_parallel() when N(isotope groups) > 500. Nevertheless, when coming to protein requantification this is an option to use the original intensity data.","title":"Programs and applications"},{"location":"misc/#legacy-codes","text":"compare.sb . earlier contrast with deCODE/UKB-PPP. inf1.sh . snapshot from SCALLOP/INF meta-analysis. Created on 9/12/2024","title":"Legacy codes"},{"location":"peptide_progs/","text":"Peptide analysis CSD3 directory /rds/project/jmmh2/rds-jmmh2-projects/Caprion_proteomics/analysis/ Scripts and results The project directory above contains scripts at peptide_progs/ and results at peptide/ , respectively. These are also a set of scripts called from bash which invokes SLURM jobs. 
Script name Description Protein-specific error/output Association analysis 1_pgwas.sh Association analysis {protein}.e / {protein}.o 2_meta_analysis.sh Meta-analysis {protein}-METAL_{SLURM_job_id}_{phenotype_number}.e / {protein}-METAL_{SLURM_job_id}_{phenotype_number}.o Signal identification (see {protein}/sentinels/slurm ) setup.sh Environment variables 3.1_extract.sh Signal extraction _step1_{SLURM_job_id}_{phenotype_number}.e / _step1_{SLURM_job_id}_{phenotype_number}.o 3.2_collect.sh Signal collection/classification _step2_{protein}.e / _step2_{protein}.o 3.3_plot.sh Forest, Q-Q, Manhattan, LocusZoom, mean-by-genotype/dosage plots _step3_{SLURM_job_id}_{phenotype_number}.e / _step3_{SLURM_job_id}_{phenotype_number}.o utils.sh Various utilities graph TD; 1_pgwas.sh 2_meta_analysis.sh 1_pgwas.sh --> 2_meta_analysis.sh --> setup.sh setup.sh --> 3.1_extract.sh setup.sh --> 3.2_collect.sh setup.sh --> 3.3_plot.sh subgraph Group1[ ] direction LR 3.1_extract.sh --> 3.2_collect.sh --> 3.3_plot.sh end utils.sh Specific prerequisites for a Manhattan/peptide association plot are a call to the vep_annotate function in 3.2_collect.sh for proteins. a call to bgz() (in utils.sh for protein) for indexed and compressed DR-filtered data. for step 3.2, ceuadmin/ensembl-vep/111-icelake now is the default since partition icelake-himem is used instead of cclake (CentOS 7) which has ceuadmin/ensembl-vep/104 . module ceuadmin/R/4.4.1-icelake now works as smoothly as the old ceuadmin/R at cclake Script name Description Protein-specific error/output Experimental codes mz.* file handling & MetaMorpheus, MSAmanda. mzML and results in */metamorpheus, msamonda crux.* search, R/multicomp+crux benchmark crux/ BoxCar.py/pyteomics.py BoxCar algorithm and its use graph TD; mz.* crux.* BoxCar.py/pyteomics.py The module mono-5.10.0.78-gcc-5.4.0-c6cq4hh is required for rawrr , to ${HOME}/.cache/R/rawrr/rawrrassembly (4/8/2024). 
File Size eula.txt 163 rawrr.exe 28672 ThermoFisher.CommonCore.BackgroundSubtraction.dll 44544 ThermoFisher.CommonCore.Data.dll 406016 ThermoFisher.CommonCore.MassPrecisionEstimator.dll 11264 ThermoFisher.CommonCore.RawFileReader.dll 654336 Finally, ceuadmin/FragPipe/22.0 is available as a GUI for experiments on various workflows. Glossary The atomic mass unit (dalton) is equal to one-twelfth of the mass of a \\(^{12}C\\) atom ( \\(1.660 540 2 \\times 10^{-27}\\) kg). References Bittremieux W, Levitsky L, Pilz M, Sachsenberg T, Huber F, Wang M, Dorrestein PC. Unified and standardized mass spectrometry data processing in Python using spectrum_utils. J Proteome Res 22:625\u2013631 (2023), https://doi.org/10.1021/acs.jproteome.2c00632 , https://spectrum-utils.readthedocs.io/en/latest/ . Eidhammer I, Flikka K, Martens L, Mikalsen S-O. Computational Methods for Mass Spectrometry Proteomics. Wiley, 2007. ISBN: 978-0-470-51297-5 1. Peptides are the short stretches of amino acids that are obtained after the proteolytic cleavage of proteins. Peptides are usually around 10\u201315 amino acids long, and a single protein yields approximately 35 peptides on average. 2. The mass (m) of a molecule or atom is expressed in unified atomic mass units (u). 3. Isotopes are (chemical) elements that have the same atomic number (and therefore similar chemical properties), but different molecular mass (slightly different physical properties). 4. Monoisotopic mass is the exact mass of an ion or molecule calculated using the mass of the most abundant isotope of each element. 5. A posttranslational modification (PTM) can be defined as any alteration to the chemical structure of the protein effected by the cellular machinery after the formation of the protein. 6. The raw data spectrum contains signals from the peptides, as well as signals derived from different forms of noise. fragpipe.nesvilab.org, https://fragpipe.nesvilab.org/ Hasam S, Emery K, Noble WS, Keich U. 
A Pipeline for Peptide Detection Using Multiple Decoys. Methods Mol Biol 2023;2426:25-34, doi: 10.1007/978-1-0716-1967-4_2 . The most common method of peptide and protein False Discovery Rate (FDR) calculation is by adding protein sequences that are not expected to be present in the sample. These are also called decoy protein sequences. This can be done by generating reverse sequences of the target protein entries and appending these protein entries to the protein database. Some search algoritmms use premade target-decoy protein sequences while others can generate a target-decoy protein sequence database from a target protein sequence database before using them for peptide spectral matching. Kertesz-Farkas A, Nii Adoquaye Acquaye FL, Bhimani K, Eng JK, Fondrie WE, Grant C, Hoopmann MR, Lin A, Lu YY, Moritz RL, MacCoss MJ, Noble WS. The Crux Toolkit for Analysis of Bottom-Up Tandem Mass Spectrometry Proteomics Data. J Proteome Res 2023;22(2):561-569, https://doi.org/10.1021/acs.jproteome.2c00615 , https://crux.ms . Lazear MR. Sage: An Open-Source Tool for Fast Proteomics Searching and Quantification at Scale. J Proteome Res 2023 22 (11), 3652-3659, DOI: 10.1021/acs.jproteome.3c00486 . Levitsky LI, Klein JA, Ivanov MV, Gorshkov MV. Pyteomics 4.0: Five Years of Development of a Python Proteomics Framework. J Proteome Res. 2019;18(2):709-714. doi: 10.1021/acs.jproteome.8b00717 , https://github.com/levitsky/pyteomics . ms-utils.org, https://ms-utils.org/ . Rehfeldt TG, Gabriels R, Bouwmeester R, Gessulat S, Neely BA, Palmblad M, Perez-Riverol Y, Schmidt T, Vizca\u00edno JA, Deutsch EW. ProteomicsML: An Online Platform for Community-Curated Data sets and Tutorials for Machine Learning in Proteomics. J Proteome Res , 2023;22(2):632-636, https://doi.org/10.1021/acs.jproteome.2c00629 , https://proteomicsml.org/ . Sturm M, Bertsch A, Gr\u00f6pl C, Hildebrandt A, Hussong R, Lange E, Pfeifer N, Schulz-Trieglaff O, Zerck A, Reinert K, Kohlbacher O. 
OpenMS - an open-source software framework for mass spectrometry. BMC Bioinformatics . 2008;9:163. doi: 10.1186/1471-2105-9-163 .","title":"Peptide analysis"},{"location":"peptide_progs/#peptide-analysis","text":"","title":"Peptide analysis"},{"location":"peptide_progs/#csd3-directory","text":"/rds/project/jmmh2/rds-jmmh2-projects/Caprion_proteomics/analysis/","title":"CSD3 directory"},{"location":"peptide_progs/#scripts-and-results","text":"The project directory above contains scripts at peptide_progs/ and results results at peptide/ , respectively. These are also a set of scripts called from bash which invokes SLURM jobs. Script name Description Protein-specific error/output Association analysis 1_pgwas.sh Association analysis {protein}.e / {protein}.o 2_meta_analysis.sh Meta-analysis {protein}-METAL_{SLURM_job_id}_{phenotype_number}.e / {protein}-METAL_{SLURM_job_id}_{phenotype_number}.o Signal identification (see {protein}/sentinels/slurm ) setup.sh Environmental variables 3.1_extract.sh Signal extraction _step1_{SLURM_job_id}_{phenotype_number}.e / _step1_{SLURM_job_id}_{phenotype_number}.o 3.2_collect.sh Signal collection/classification _step2_{protein}.e / _step2_{protein}.o 3.3_plot.sh Forest, Q-Q, Manhattan, LocusZoom, mean-by-genotype/dosage plots _step3_{SLURM_job_id}_{phenotype_number}.e / _step3_{SLURM_job_id}_{phenotype_number}.o utils.sh Various utitlties graph TD; 1_pgwas.sh 2_meta_analysis.sh 1_pgwas.sh --> 2_meta_analysis.sh --> setup.sh setup.sh --> 3.1_extract.sh setup.sh --> 3.2_collect.sh setup.sh --> 3.3_plot.sh subgraph Group1[ ] direction LR 3.1_extract.sh --> 3.2_collect.sh --> 3.3_plot.sh end utils.sh Specfic prerequistes for a Manhattan/peptide association plot are a call to vep_annotate functino in 3.2_collect.sh for proteins. a call to bgz() (in utils.sh for protein) for a indexed and compressed DR-filtered data. 
for step 3.2, ceuadmin/ensembl-vep/111-icelake now is the default since partition icelake-himem is used instead of cclake (CentOS 7) which has ceuadmin/ensembl-vep/104 . module ceuadmin/R/4.4.1-icelake now works as smoothly as the old ceuadmin/R at cclake Script name Description Protein-specific error/output Experimental codes mz.* file handling & MetaMorpheus, MSAmanda. mzML and results in */metamorpheus, msamonda crux.* search, R/multicomp+crux benchmark crux/ BoxCar.py/pyteomics.py BoxCar algorighm and its use graph TD; mz.* crux.* BoxCar.py/pyteomics.py The module mono-5.10.0.78-gcc-5.4.0-c6cq4hh is required for rawrr , to ${HOME}/.cache/R/rawrr/rawrrassembly (4/8/2024). File Size eula.txt 163 rawrr.exe 28672 ThermoFisher.CommonCore.BackgroundSubtraction.dll 44544 ThermoFisher.CommonCore.Data.dll 406016 ThermoFisher.CommonCore.MassPrecisionEstimator.dll 11264 ThermoFisher.CommonCore.RawFileReader.dll 654336 Finally, ceumadin/FragPipe/22.0 is available as a GUI for experiments on various worflows.","title":"Scripts and results"},{"location":"peptide_progs/#glossary","text":"The atomic mass unit (dalton) is equal to the mass of one-twelvth of the mass of a \\(^{12}C\\) atom ( \\(1.660 540 2 \\times 10^{-27}\\) g).","title":"Glossary"},{"location":"peptide_progs/#references","text":"Bittremieux W, Levitsky L, Pilz M, Sachsenberg T, Huber F, Wang M, Dorrestein PC. Unified and standardized mass spectrometry data processing in Python using spectrum_utils. J Proteome Res 22:625\u2013631 (2023), https://doi.org/10.1021/acs.jproteome.2c00632 , https://spectrum-utils.readthedocs.io/en/latest/ . Eidhammer I, Flikka K, Martens L, Mikalsen S-O. Computational Methods for Mass Spectrometry Proteomics. Wiley, 2007. ISBN: 978-0-470-51297-5 1. Peptides are the short stretches of amino acids that are obtained after the proteolytic cleavage of proteins. Peptides are usually around 10\u201315 amino acids long, and a single protein yields approximately 35 peptides on average. 2. 
The mass (m) of a molecule or atom is expressed in unified atomic mass units (u). 3. Isotopes are (chemical) elements that have the same atomic number (and therefore similar chemical properties), but different molecular mass (slightly different physical properties). 4. Monoisotopic mass is the exact mass of an ion or molecule calculated using the mass of the most abundant isotope of each element. 5. A posttranslational modification (PTM) can be defined as any alteration to the chemical structure of the protein effected by the cellular machinery after the formation of the protein. 6. The raw data spectrum contains signals from the peptides, as well as signals derived from different forms of noise. fragpipe.nesvilab.org, https://fragpipe.nesvilab.org/ Hasam S, Emery K, Noble WS, Keich U. A Pipeline for Peptide Detection Using Multiple Decoys. Methods Mol Biol 2023;2426:25-34, doi: 10.1007/978-1-0716-1967-4_2 . The most common method of peptide and protein False Discovery Rate (FDR) calculation is by adding protein sequences that are not expected to be present in the sample. These are also called decoy protein sequences. This can be done by generating reverse sequences of the target protein entries and appending these protein entries to the protein database. Some search algoritmms use premade target-decoy protein sequences while others can generate a target-decoy protein sequence database from a target protein sequence database before using them for peptide spectral matching. Kertesz-Farkas A, Nii Adoquaye Acquaye FL, Bhimani K, Eng JK, Fondrie WE, Grant C, Hoopmann MR, Lin A, Lu YY, Moritz RL, MacCoss MJ, Noble WS. The Crux Toolkit for Analysis of Bottom-Up Tandem Mass Spectrometry Proteomics Data. J Proteome Res 2023;22(2):561-569, https://doi.org/10.1021/acs.jproteome.2c00615 , https://crux.ms . Lazear MR. Sage: An Open-Source Tool for Fast Proteomics Searching and Quantification at Scale. J Proteome Res 2023 22 (11), 3652-3659, DOI: 10.1021/acs.jproteome.3c00486 . 
Levitsky LI, Klein JA, Ivanov MV, Gorshkov MV. Pyteomics 4.0: Five Years of Development of a Python Proteomics Framework. J Proteome Res. 2019;18(2):709-714. doi: 10.1021/acs.jproteome.8b00717 , https://github.com/levitsky/pyteomics . ms-utils.org, https://ms-utils.org/ . Rehfeldt TG, Gabriels R, Bouwmeester R, Gessulat S, Neely BA, Palmblad M, Perez-Riverol Y, Schmidt T, Vizca\u00edno JA, Deutsch EW. ProteomicsML: An Online Platform for Community-Curated Data sets and Tutorials for Machine Learning in Proteomics. J Proteome Res , 2023;22(2):632-636, https://doi.org/10.1021/acs.jproteome.2c00629 , https://proteomicsml.org/ . Sturm M, Bertsch A, Gr\u00f6pl C, Hildebrandt A, Hussong R, Lange E, Pfeifer N, Schulz-Trieglaff O, Zerck A, Reinert K, Kohlbacher O. OpenMS - an open-source software framework for mass spectrometry. BMC Bioinformatics . 2008;9:163. doi: 10.1186/1471-2105-9-163 .","title":"References"},{"location":"pilot/","text":"Pilot studies Site map Pilot (N=196) data/ contains genotype files in .bgen format bgen/ PLINK2 results according to .bgen files; summary outputs and sentinels/ directory are in the following directories 1e-5 1e-6 5e-8 Batch 2 (N=1,488) data2/ contains genotype files in .bgen format bgen2/ PLINK2 results according to .bgen files; summary outputs and sentinels/ directory are in the following directories 1e-5 5e-8 Comparison of pilot and batch 2 miamiplot Batch 3 data (N=807) data3/ .bgen data bgen3/ PLINK2 results 1e-5 5e-8 Coding There are apparent commonalities between batches from the list of programs and diagrams; many of which are activated as subroutines. Pilot caprion.R and caprion.ini are for data processing. Their derivatives are in the utils/ subdirectory: affymetrix.sh is for variant-specific association analysis. qctool.sb is used to extract available sample and genotypes. qctool.sh further extracts genotypes with MAF 0.01 only. plink2.sh non-SLURM version of association analysis. 
qqman.sh and qqman.R produce QQ and Manhattan plots. sentinels_nold.sh and merge.sh select sentinels. ps.sh and ps.R run through PhenoScanner. lookup.sh looks up for overlap with SomaLogic and Olink. caprion.ipynb is a Jupyter notebook with some preprocessing done by tensorqtl.sh . Batch 2 (prefix=utils/ when unspecified) graph TB tensoqtl.sh 2020.sh --> EPCR-PROC/ 2020.sh --> data2/affymetrix.id qctool.sb --> qctool.sh qctool.sh --> plink2.sh plink2.sh --> sentinels_nold.sh sentinels_nold.sh --> merge.sh Batch 3 (prefix=utils/) graph TB 2021.sh 2021.sh --> eSet.R 2021.sh --> 2021.R eSet.R --> 2021.R eSet.R --> UDP.R 2021.sh --> UDP.R UDP.R --> qctool.sb qctool.sb --> qctool.sh qctool.sh --> plink2.* 2021.sh --> plink2.* plink2.* --> sentinels_nold.sh+merge.sh Note that eSet.R actually covers data from pilot, batches 2 and 3. Documents ppr.md EPCR-PROC.md 2021.md Reference Klaus B, Reisenauer S (2018). An end to end workflow for differential gene expression using Affymetrix microarrays . https://bioinformatics.psb.ugent.be/webtools/Venn/","title":"Pilot studies"},{"location":"pilot/#pilot-studies","text":"","title":"Pilot studies"},{"location":"pilot/#site-map","text":"Pilot (N=196) data/ contains genotype files in .bgen format bgen/ PLINK2 results according to .bgen files; summary outputs and sentinels/ directory are in the following directories 1e-5 1e-6 5e-8 Batch 2 (N=1,488) data2/ contains genotype files in .bgen format bgen2/ PLINK2 results according to .bgen files; summary outputs and sentinels/ directory are in the following directories 1e-5 5e-8 Comparison of pilot and batch 2 miamiplot Batch 3 data (N=807) data3/ .bgen data bgen3/ PLINK2 results 1e-5 5e-8","title":"Site map"},{"location":"pilot/#coding","text":"There are apparent commonalities between batches from the list of programs and diagrams; many of which are activated as subroutines. Pilot caprion.R and caprion.ini are for data processing. 
Their derivatives are in the utils/ subdirectory: affymetrix.sh is for variant-specific association analysis. qctool.sb is used to extract available sample and genotypes. qctool.sh further extracts genotypes with MAF 0.01 only. plink2.sh non-SLURM version of association analysis. qqman.sh and qqman.R produce QQ and Manhattan plots. sentinels_nold.sh and merge.sh select sentinels. ps.sh and ps.R run through PhenoScanner. lookup.sh looks up for overlap with SomaLogic and Olink. caprion.ipynb is a Jupyter notebook with some preprocessing done by tensorqtl.sh . Batch 2 (prefix=utils/ when unspecified) graph TB tensoqtl.sh 2020.sh --> EPCR-PROC/ 2020.sh --> data2/affymetrix.id qctool.sb --> qctool.sh qctool.sh --> plink2.sh plink2.sh --> sentinels_nold.sh sentinels_nold.sh --> merge.sh Batch 3 (prefix=utils/) graph TB 2021.sh 2021.sh --> eSet.R 2021.sh --> 2021.R eSet.R --> 2021.R eSet.R --> UDP.R 2021.sh --> UDP.R UDP.R --> qctool.sb qctool.sb --> qctool.sh qctool.sh --> plink2.* 2021.sh --> plink2.* plink2.* --> sentinels_nold.sh+merge.sh Note that eSet.R actually covers data from pilot, batches 2 and 3.","title":"Coding"},{"location":"pilot/#documents","text":"ppr.md EPCR-PROC.md 2021.md","title":"Documents"},{"location":"pilot/#reference","text":"Klaus B, Reisenauer S (2018). An end to end workflow for differential gene expression using Affymetrix microarrays . https://bioinformatics.psb.ugent.be/webtools/Venn/","title":"Reference"},{"location":"pilot/autoencoder/","text":"Autoencoder As shown at R-bloggers , autoencoder is better at reconstructing the original data set than PCA when k is small, where k corresponds to the number of principal components in PCA or bottleneck dimension in AE, however the error converges as k increases. For very large data sets this difference will be larger and means a smaller data set could be used for the same error as PCA. When dealing with big data this is an important property`. 
The local adoption is ae_test.Rmd which produces ae_test.html and ae_test.pdf . Additional work will be on variational autoencoder (VAE) and its denoising counterpart as indicated in the references below. REFERENCES Bishop CM, Bishop H (2024), Deep learning: foundations and concepts, Springer International Publishing, DOI: 10.1007/978-3-031-45468-4 . Bludau I, Frank M, D\u00f6rig C. et al. Systematic detection of functional proteoform groups from bottom-up proteomic datasets. Nat Commun 12, 3810 (2021). https://doi.org/10.1038/s41467-021-24030-x . Hofert M, Prasad A, Zhu M (2019). Quasi-Monte Carlo for multivariate distributions via generative neural networks. https://arxiv.org/abs/1811.00683 , https://CRAN.R-project.org/package=gnn . Kingma DP, Welling M (2014). Auto-Encoding Variational Bayes. https://arxiv.org/abs/1312.6114 , https://keras.rstudio.com/articles/examples/variational_autoencoder.html . Ng A. Sparse autoencoder, CS294A Lecture notes, https://web.stanford.edu/class/archive/cs/cs294a/cs294a.1104/sparseAutoencoder.pdf . Sattarov T, Herurkar D, Hees J (2023). Explaining Anomalies using Denoising Autoencoders for Financial Tabular Data. Technical Report 2023-01. Deutsche Bundesbank. Trivadis SK (2017). Variational autoencoders for anomaly detection. https://rpubs.com/zkajdan/308801 . URLs https://github.com/diazale/gt-dimred , https://github.com/lmcinnes/umap ( https://umap-learn.readthedocs.io/en/latest/ ) and https://keras.io/examples/timeseries/timeseries_anomaly_detection/ , https://www.mathworks.com/help/deeplearning/ug/anomaly-detection-using-autoencoder-and-wavelets.html , among others.","title":"Autoencoder"},{"location":"pilot/autoencoder/#autoencoder","text":"As shown at R-bloggers , autoencoder is better at reconstructing the original data set than PCA when k is small, where k corresponds to the number of principal components in PCA or bottleneck dimension in AE; however, the error converges as k increases. 
For very large data sets this difference will be larger and means a smaller data set could be used for the same error as PCA. When dealing with big data this is an important property. The local adoption is ae_test.Rmd which produces ae_test.html and ae_test.pdf . Additional work will be on variational autoencoder (VAE) and its denoising counterpart as indicated in the references below.","title":"Autoencoder"},{"location":"pilot/autoencoder/#references","text":"Bishop CM, Bishop H (2024), Deep learning: foundations and concepts, Springer International Publishing, DOI: 10.1007/978-3-031-45468-4 . Bludau I, Frank M, D\u00f6rig C. et al. Systematic detection of functional proteoform groups from bottom-up proteomic datasets. Nat Commun 12, 3810 (2021). https://doi.org/10.1038/s41467-021-24030-x . Hofert M, Prasad A, Zhu M (2019). Quasi-Monte Carlo for multivariate distributions via generative neural networks. https://arxiv.org/abs/1811.00683 , https://CRAN.R-project.org/package=gnn . Kingma DP, Welling M (2014). Auto-Encoding Variational Bayes. https://arxiv.org/abs/1312.6114 , https://keras.rstudio.com/articles/examples/variational_autoencoder.html . Ng A. Sparse autoencoder, CS294A Lecture notes, https://web.stanford.edu/class/archive/cs/cs294a/cs294a.1104/sparseAutoencoder.pdf . Sattarov T, Herurkar D, Hees J (2023). Explaining Anomalies using Denoising Autoencoders for Financial Tabular Data. Technical Report 2023-01. Deutsche Bundesbank. Trivadis SK (2017). Variational autoencoders for anomaly detection. 
https://rpubs.com/zkajdan/308801 .","title":"REFERENCES"},{"location":"pilot/autoencoder/#urls","text":"https://github.com/diazale/gt-dimred , https://github.com/lmcinnes/umap ( https://umap-learn.readthedocs.io/en/latest/ ) and https://keras.io/examples/timeseries/timeseries_anomaly_detection/ , https://www.mathworks.com/help/deeplearning/ug/anomaly-detection-using-autoencoder-and-wavelets.html , among others.","title":"URLs"},{"location":"pilot/gwas2/","text":"gwas2 This is a promising alternative showing through RCN3/FCGRN with gwas2.sh . graph TB; gwas2.sh --> gwas.do gwas2.sh --> gwas2.do where gwas.do ( caprion.dat also contains _invn data) and gwas2.do ( gwas2_invn.do for _invn data) are for the pilot and batch 2 data, respectively. See gwas2 repository for additional information.","title":"gwas2"},{"location":"pilot/gwas2/#gwas2","text":"This is a promising alternative showing through RCN3/FCGRN with gwas2.sh . graph TB; gwas2.sh --> gwas.do gwas2.sh --> gwas2.do where gwas.do ( caprion.dat also contains _invn data) and gwas2.do ( gwas2_invn.do for _invn data) are for the pilot and batch 2 data, respectively. See gwas2 repository for additional information.","title":"gwas2"},{"location":"progs/","text":"Protein analysis Programs 1 Work was done in a named sequence 2 . 1_pca_projection.sh 2_ggm.R 3_wgcna.R 4_pca_clustering.R 5_pgwas.sh 6_meta_analysis.sh 7_merge.sh 8_hla.sh graph TB 1_pca_projection.sh 2_ggm.R 3_wgcna.R 4_pca_clustering.R --> 5_pgwas.sb --> 6_meta_analysis.sh --> 6_meta_analysis.sb --> 7_merge.sb --> 7_merge.sh --> 0_utils.sb --> 5_pgwas.sb 8_hla.sh Chromosome X is handled together with autosomes, and the loop from 0_utils.sb to 5_pgwas.sb is to produce mean-by-genotype/QQ/Manhattan/LocusZoom plots -- the former also implements vep_annotate(), fp_data(), fp() which only require --array=1 . Note also that HetISq() only works inside an interactive R session. 1. 
Data handling and PCA projection The pipeline follows HGI contributions nevertheless only serves for reassurance since the study samples were carefully selected. 2. GGM The results are ready to report. 3. WGCNA This can be finalised according to the Science paper. 4. PCA and clustering The groupings based on unfiltered and DR-filtered proteins can be made on three phases altogether and instead of a classification indicator the first three PCs are used. The PLINK2 has been used in the pilot studies, but now fastGWA using double transformations of the phenotypic data similar to SCALLOP-Seq analysis. Amazingly, a standard assignment statement inside sapply() would produce .pheno / .mpheno containing the raw data. The file also includes experiments on normalisation. 5. pGWAS 3 The bgen files were extracted from a list of all samples, the variant IDs of which were for all RSids to allow for multiallelic loci. The (sb)atch file is extended to produce Q-Q/Manhattan/LocusZoom plots and extreme p values are possible for all plots. Note that LocusZoom 1.4 does not contain 1000Genomes build 37 genotypes for chromosome X and therefore they are supplemented with local files in the required format, namely, locuszoom_1.4/data/1000G/genotypes/2014-10-14/EUR/chrX.[bed, bim, fan] . Now that for the Manhattan plot call for VEP is necessary from 0_utils.sb , which also produces mean by genotype/dosage plots. 6. Meta-analysis Internally, this follows from the SCALLOP/INF implementation, as designed analogous to a Makefile, i.e., 6_meta_analysis <task> where task = METAL_list, METAL_files, METAL_analysis, respectively in sequence. However, due to time limit on HPC, a call to .sb is made for meta-analysis. To extract significant variants one may resort to awk 'NR==1||$12<log(1e-6)/log(10)' 1433B-1.tbl , say. 7. Variant identification An iterative merging scheme is employed; the HLA region is simplified but will be specifically handled. 
Somewhat paradoxically, forest plots are also obtained here 4 . A SLURM job is executed, to be followed by collection of results. 8. HLA imputation 5 This is experimented on several software packages including HIBAG, CookHLA and SNP2HLA as described here . The whole cohort imputation requests resources exceeding the system limits, so a cardio SLURM job is used instead. The hped file from CookHLA (or converted from HIBAG) can be used by HATK for association analysis while the advantage of SNP2HLA is that binary ped files are ready for use as usual. Directories This is per Caprion project module load miniconda3/4.5.1 export csd3path=/rds/project/jmmh2/rds-jmmh2-projects/olink_proteomics/scallop/miniconda37 source ${csd3path}/bin/activate Name Description pgwas pGWAS METAL Meta-analysis HLA HLA imputation peptide_progs peptide analysis reports Reports Note that docs.sh copies pilot/utils directory of the pilot studies, so coding under that directory is preferable to avoid overwrite. To accommodate filtered results, a suffix \"\" or \"_dr\" is applied when appropriate. \u21a9 workflow (experimental) module add ceuadmin/snakemake snakemake -s workflow/rules/cojo.smk -j1 snakemake -s workflow/rules/report.smk -j1 snakemake -s workflow/rules/cojo.smk -c --profile workflow and use --unlock when necessary. \u21a9 Protein GWAS GCTA/fastGWA employs MAF>=0.001 (~56%) and geno=0.1 so potentially we can have .bgen files as such to speed up. GCTA uses headerless phenotype files, generated by 5_pgwas.sh which is now unnecessary. \u21a9 Incomplete gamma function The .info files for proteins BROX and CT027 could not be obtained from METAL 2020-05-05 with the following error message, FATAL ERROR - a too large, ITMAX too small in gamma countinued fraction (gcf) An attempt was made to fix this and reported as a fixable issue to METAL GitHub repository ( https://github.com/statgen/METAL/issues/24 ). This has enabled Forest plots for the associated pQTLs. 
\u21a9 HLA A database of 3D structures of Major Histocompatibility Complex, https://www.histo.fyi/ Whole cohort imputation is feasible with a HIBAG reference panel, Locus A B C DPB1 DQA1 DQB1 DRB1 N 1857 2572 1866 1624 1740 1924 2436 SNPs 891 990 1041 689 948 979 891 while the reference panel is based on the 1000Genomes data (N=503) with SNP2HLA and CookHLA. It is of note that 1000G_REF.EUR.chr6.hg18.29mb-34mb.inT1DGC.markers in the 1000Genomes reference panel has 465 variants with HLA prefix and the partition is as follows, Locus A B C DPB1 DQA1 DQB1 DRB1 HLA_ 98 183 69 0 0 33 82 A recent update: PGG.HLA, https://pog.fudan.edu.cn/pggmhc/ , requires data submission. \u21a9","title":"Protein analysis"},{"location":"progs/#protein-analysis","text":"","title":"Protein analysis"},{"location":"progs/#programs1","text":"Work was done in a named sequence 2 . 1_pca_projection.sh 2_ggm.R 3_wgcna.R 4_pca_clustering.R 5_pgwas.sh 6_meta_analysis.sh 7_merge.sh 8_hla.sh graph TB 1_pca_projection.sh 2_ggm.R 3_wgcna.R 4_pca_clustering.R --> 5_pgwas.sb --> 6_meta_analysis.sh --> 6_meta_analysis.sb --> 7_merge.sb --> 7_merge.sh --> 0_utils.sb --> 5_pgwas.sb 8_hla.sh Chromose X is handled together with autosomes, and the loop from 0_utils.sb to 5_pgwas.sb is to produce mean-by-genotype/QQ/Manhattan/LocusZoom plots -- the former also implements vep_annotate(), fp_data(), fp() which only requires --array=1 . Note also that HetISq() only works inside an interactive R session.","title":"Programs1"},{"location":"progs/#1-data-handling-and-pca-projection","text":"The pipeline follows HGI contributions nevertheless only serves for reassurance since the study samples were carefully selected.","title":"1. Data handling and PCA projection"},{"location":"progs/#2-ggm","text":"The results are ready to report.","title":"2. GGM"},{"location":"progs/#3-wgcna","text":"This can be finalised according to the Science paper.","title":"3. 
WGCNA"},{"location":"progs/#4-pca-and-clustering","text":"The groupings based on unfiltered and DR-filtered proteins can be made on three phases altogether and instead of a classification indicator the first three PCs are used. The PLINK2 has been used in the pilot studies, but now fastGWA using double transformations of the phenotypic data similar to SCALLOP-Seq analysis. Amazingly, a standard assignment statement inside sapply() would produce .pheno / .mpheno containing the raw data. The file also includes experiments on normalisation.","title":"4. PCA and clustering"},{"location":"progs/#5-pgwas3","text":"The bgen files were extracted from a list of all samples, the variant IDs of which were for all RSids to allow for multiallelic loci. The (sb)atch file is extended to produce Q-Q/Manhattan/LocusZoom plots and extreme p values are possible for all plots. Note that LocusZoom 1.4 does not contain 1000Genomes build 37 genotypes for chromosome X and therefore they are supplemented with local files in the required format, namely, locuszoom_1.4/data/1000G/genotypes/2014-10-14/EUR/chrX.[bed, bim, fan] . Now that for the Manhattan plot call for VEP is necessary from 0_utils.sb , which also produces mean by genotype/dosage plots.","title":"5. pGWAS3"},{"location":"progs/#6-meta-analysis","text":"Internally, this follows from the SCALLOP/INF implementation, as designed analogous to a Makefile, i.e., 6_meta_analysis <task> where task = METAL_list, METAL_files, METAL_analysis, respectively in sequence. However, due to time limit on HPC, a call to .sb is made for meta-analysis. To extract significant variants one may resort to awk 'NR==1||$12<log(1e-6)/log(10)' 1433B-1.tbl , say.","title":"6. Meta-analysis"},{"location":"progs/#7-variant-identification","text":"An iterative merging scheme is employed; the HLA region is simplified but will be specifically handled. Somewhat paradoxically, forest plots are also obtained here 4 . 
A SLURM job is executed, to be followed by collection of results.","title":"7. Variant identification"},{"location":"progs/#8-hla-imputation5","text":"This is experimented on several software including HIBAG, CookHLA and SNP2HLA as desribed here . The whole cohort imputation requests resources exceeding the system limits, so a cardio SLURM job is used instead. The hped file from CookHLA (or converted from HIBAG) can be used by HATK for association analysis while the advantage of SNP2HLA is that binary ped files are ready for use as usual. Directories This is per Caprion project module load miniconda3/4.5.1 export csd3path=/rds/project/jmmh2/rds-jmmh2-projects/olink_proteomics/scallop/miniconda37 source ${csd3path}/bin/activate Name Description pgwas pGWAS METAL Meta-analysis HLA HLA imputation peptide_progs peptide analysis reports Reports Note that docs.sh copies pilot/utils directory of the pilot studies, so coding under that directory is preferable to avoid overwrite. To accommodate filteredd results, a suffix \"\" or \"_dr\" is applied when appropriate. \u21a9 workflow (experimental) module add ceuadmin/snakemake snakemake -s workflow/rules/cojo.smk -j1 snakemake -s workflow/rules/report.smk -j1 snakemake -s workflow/rules/cojo.smk -c --profile workflow and use --unlock when necessary. \u21a9 Protein GWAS GCTA/fastGWA employs MAF>=0.001 (~56%) and geno=0.1 so potentially we can have .bgen files as such to speed up. GCTA uses headerless phenotype files, generated by 5_pgwas.sh which is now unnecessary. \u21a9 Incomplete gamma function The .info files for proteins BROX and CT027 could not be obtained from METAL 2020-05-05 with the following error message, FATAL ERROR - a too large, ITMAX too small in gamma countinued fraction (gcf) An attempt was made to fix this and reported as a fixable issue to METAL GitHub respository ( https://github.com/statgen/METAL/issues/24 ). This has enabled Forest plots for the associate pQTLs. 
\u21a9 HLA A database of 3D structures of Major Histocompatibility Complex, https://www.histo.fyi/ Whole cohort imputation is feasible with a HIBAG reference panel, Locus A B C DPB1 DQA1 DQB1 DRB1 N 1857 2572 1866 1624 1740 1924 2436 SNPs 891 990 1041 689 948 979 891 while the reference panel is based on the 1000Genomes data (N=503) with SNP2HLA and CookHLA. It is of note that 1000G_REF.EUR.chr6.hg18.29mb-34mb.inT1DGC.markers in the 1000Genomes reference panel has 465 variants with HLA prefix and the partition is as follows, Locus A B C DPB1 DQA1 DQB1 DRB1 HLA_ 98 183 69 0 0 33 82 A recent update: PGG.HLA, https://pog.fudan.edu.cn/pggmhc/ , requires data submission. \u21a9","title":"8. HLA imputation5"}]}
\ No newline at end of file
diff --git a/sitemap.xml.gz b/sitemap.xml.gz
index 140d02c..6164ab4 100644
Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ