Output from the workflow

QC results

Depth of coverage for the tumor and normal samples can be found in the mosdepth_tumor_summary and mosdepth_normal_summary folders. Each summary file lists the depth of coverage for each chromosome, followed by the overall coverage (the total row at the end of the file).
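
As a rough illustration of how the summary can be read, the sketch below parses a tab-separated table with the column layout mosdepth uses for its summary output (chrom, length, bases, mean, min, max). The helper name read_mosdepth_summary and the numbers in the example table are made up for illustration.

```python
import csv
import io


def read_mosdepth_summary(handle):
    """Return {chrom: mean_depth} from a mosdepth-style summary table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["chrom"]: float(row["mean"]) for row in reader}


# Tiny made-up table in the summary layout (values are illustrative only).
example = io.StringIO(
    "chrom\tlength\tbases\tmean\tmin\tmax\n"
    "chr1\t1000\t30000\t30.00\t0\t80\n"
    "chr2\t500\t10000\t20.00\t0\t60\n"
    "total\t1500\t40000\t26.67\t0\t80\n"
)
depths = read_mosdepth_summary(example)
print("overall mean depth:", depths["total"])
```

For a real file, pass an open file handle instead of the in-memory example.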

overall_(tumor|normal)_alignment_stats contains the alignment statistics for the tumor and normal samples, produced with the seqkit bam command. Per-alignment statistics are also written to the per_alignment_(tumor|normal)_stats folders.

Structural variants

The pipeline runs the structural variant caller Severus with default parameters. The Severus calls are then filtered against a germline structural variant VCF (from the Human Pangenome Reference Consortium) to remove false positives. The filtering is performed using svpack, which also provides simple annotation based on the Ensembl GFF (v101). The filtered VCF can be found in the Severus_filtered_vcf folder.

AnnotSV is used to further annotate the structural variants into a TSV file (AnnotatedSeverusSV). The TSV file format is described in the AnnotSV README. This provides more detailed annotation than svpack.

Lastly, to help with prioritizing variants relevant to cancer, the workflow also annotates the SVs with IntOGen Compendium of Cancer Genes (CCG) and produces a final set of structural variants in the Annotated*SV_intogen folder.

SNV/INDEL (Somatic and germline)

The somatic SNV/INDEL VCF can be found in the small_variant_vcf folder. Somatic SNV/INDELs are called with either ClairS (the default, because it is faster and trained on both Sequel II and Revio data) or DeepSomatic (trained on a Revio dataset). See the input JSON parameters for how to switch between the two callers or to switch ClairS to the Sequel IIe model.

DeepSomatic is currently computationally expensive (14-18 hours for 60X/30X tumor/normal) and requires a separate caller for germline SNV/INDELs. The workflow uses Clair3 to call germline variants in both tumor and normal in addition to the somatic variants. In addition, because DeepSomatic does not currently output the depth of coverage of the variants in the normal sample, its VCF cannot be used as an input to Purple for purity and ploidy estimation. If DeepSomatic is used, Purple runs without a somatic VCF, which may affect the estimation in some cases.

For both ClairS and DeepSomatic, the workflow can split the human genome into chunks (default 75 Mbp per chunk), call SNV/INDELs on the chunks in parallel, and then gather the output into a single VCF. This allows the caller to scale to large genomes and large datasets by making use of multiple HPC nodes. Germline variants are called with Clair3 regardless of the somatic variant caller used.
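
The chunking idea can be sketched as follows. This is only an illustration of splitting contigs into fixed-size windows, not the workflow's actual implementation; the function name chunk_contigs, the half-open 0-based coordinates, and the contig lengths in the example are assumptions.

```python
def chunk_contigs(contig_lengths, chunk_size=75_000_000):
    """Yield (contig, start, end) windows no longer than chunk_size.

    Coordinates are 0-based and half-open, which is an assumption of this sketch.
    """
    for contig, length in contig_lengths.items():
        for start in range(0, length, chunk_size):
            yield contig, start, min(start + chunk_size, length)


# Hypothetical contig lengths, chosen only to show the last, shorter chunk.
contigs = {"chrA": 160_000_000, "chrB": 50_000_000}
chunks = list(chunk_contigs(contigs))
for contig, start, end in chunks:
    print(f"{contig}:{start}-{end}")
```

Each window could then be handed to an independent caller job, with the per-chunk VCFs concatenated in chunk order afterwards.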

For annotation, we use Ensembl VEP to annotate the VCF file (small_variant_tsv_annotated and small_variant_vcf_annotated folders). As with SVs, the workflow also annotates the SNV/INDELs with the IntOGen Compendium of Cancer Genes (CCG) and produces a final set of SNV/INDELs in the small_variant_tsv_CCG folder.

Homologous-recombination deficiency prediction

SNVs, INDELs and SVs are supplied to CHORD for HRD prediction. The results can be found in the chord_hrd_prediction folder. We tested the prediction internally on two datasets that have HRD and found the results to be accurate. However, the results have not been validated extensively on more samples and should be used with caution, especially at low tumor purity and coverage. For example, at an effective tumor coverage of 15X (30X tumor coverage with 50% tumor purity, or 60X with 25% tumor purity), CHORD predicted HCC1395 to be HR-deficient.
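
The effective tumor coverage quoted above is simply the sequencing coverage multiplied by the tumor purity. A minimal sketch of that arithmetic (the function name is ours, not part of the workflow):

```python
def effective_tumor_coverage(tumor_coverage, tumor_purity):
    """Coverage contributed by tumor cells: sequencing coverage times purity."""
    if not 0.0 <= tumor_purity <= 1.0:
        raise ValueError("tumor_purity must be between 0 and 1")
    return tumor_coverage * tumor_purity


# The two scenarios mentioned in the text give the same effective coverage.
cov_a = effective_tumor_coverage(30, 0.50)
cov_b = effective_tumor_coverage(60, 0.25)
print(cov_a, cov_b)
```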

Differentially methylated region

CpG calls at each locus in the human genome are summarized using pb-CpG-tools. The BED file of CpG calls is then used to call DMRs using DSS. The pileup BED file is found in the pileup_(normal|tumor)_bed folder.

DMR_annotated contains six folders (0 to 5) for each patient. Each folder holds the differentially methylated regions annotated against a different genomic feature. Annotation is done using annotatr.

├── 0
│   └── COLO829_hg38_genes_1to5kb_dmrs_intogen.nCG50_summary.tsv.gz -> DMRs that are 1 to 5 kbp away from TSS of genes
├── 1
│   └── COLO829_hg38_genes_3UTRs_dmrs_intogen.nCG50_summary.tsv.gz ->  DMRs that are 3'UTR of genes
├── 2
│   └── COLO829_hg38_genes_5UTRs_dmrs_intogen.nCG50_summary.tsv.gz ->  DMRs that are 5'UTR of genes
├── 3
│   └── COLO829_hg38_genes_exons_dmrs_intogen.nCG50_summary.tsv.gz -> DMRs that are exons of genes
├── 4
│   └── COLO829_hg38_genes_introns_dmrs_intogen.nCG50_summary.tsv.gz -> DMRs that are introns of genes
└── 5
    └── COLO829_hg38_genes_promoters_dmrs_intogen.nCG50_summary.tsv.gz -> DMRs that are (known) promoters of genes

For each TSV file, the first 11 columns (areaStat being the last) are identical and are produced by DSS.

  • meanMethyl1 is the mean methylation level in tumor.
  • meanMethyl2 is the mean methylation level in normal.
  • length is the length of the DMR.
  • nCG is the number of CpG sites in the DMR. By default the workflow requires at least 50 CpG sites in any DMR region.
  • areaStat is the area statistic of the DMR. The larger the area statistic, the more significant the DMR is.

annot.X columns are produced by annotatr and all upper-case columns are extracted from the IntOGen Compendium of Cancer Genes TSV file.
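
As a sketch of how these columns might be used, the snippet below keeps DMRs with at least 50 CpG sites and ranks them by the magnitude of areaStat. It assumes the header names match the column names described above (nCG, areaStat) and that areaStat is signed, so the absolute value is used for ranking; the rows are made up for illustration.

```python
import csv
import io


def rank_dmrs(rows, min_ncg=50):
    """Keep DMRs with at least min_ncg CpG sites, largest |areaStat| first."""
    kept = [row for row in rows if int(row["nCG"]) >= min_ncg]
    return sorted(kept, key=lambda row: abs(float(row["areaStat"])), reverse=True)


# Made-up rows with a reduced set of columns; real files contain more columns.
example = io.StringIO(
    "chr\tstart\tend\tlength\tnCG\tmeanMethyl1\tmeanMethyl2\tareaStat\n"
    "chr1\t100\t900\t801\t60\t0.80\t0.20\t350.5\n"
    "chr1\t5000\t5400\t401\t20\t0.10\t0.70\t-900.0\n"
    "chr2\t700\t2100\t1401\t120\t0.15\t0.85\t-1200.2\n"
)
ranked = rank_dmrs(csv.DictReader(example, delimiter="\t"))
for row in ranked:
    print(row["chr"], row["start"], row["end"], row["areaStat"])
```

Because the output files are gzip-compressed, a real file can be opened with gzip.open(path, "rt") and passed to csv.DictReader in the same way.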

Somatic SNV/INDEL annotation

small_variant_vcf_annotated contains somatic variants annotated using Ensembl VEP. To obtain readable annotations, try using the split-vep tool from bcftools. The workflow by default produces a TSV using split-vep in the small_variant_tsv folder.
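
If a script needs the VEP annotations directly from the VCF, the CSQ INFO field can also be unpacked by hand. The sketch below relies on the VEP convention that the CSQ header line lists the sub-field names after "Format: ", separated by "|". The helper names and the example header and record are made up, and the example uses only four sub-fields for brevity.

```python
def csq_fields(header_lines):
    """Extract the CSQ sub-field names from VCF header lines."""
    for line in header_lines:
        if line.startswith("##INFO=<ID=CSQ,"):
            fmt = line.split("Format: ", 1)[1]
            return fmt.rstrip().rstrip('">').split("|")
    raise ValueError("no CSQ header line found")


def parse_csq(info, fields):
    """Return one dict per annotation found in an INFO column string."""
    for entry in info.split(";"):
        if entry.startswith("CSQ="):
            return [dict(zip(fields, ann.split("|"))) for ann in entry[4:].split(",")]
    return []


# Illustrative header and INFO string, not taken from real workflow output.
header = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations '
    'from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL">',
]
info = "DP=40;CSQ=T|missense_variant|MODERATE|BRAF,T|upstream_gene_variant|MODIFIER|NDUFB2"
fields = csq_fields(header)
annotations = parse_csq(info, fields)
for ann in annotations:
    print(ann["SYMBOL"], ann["Consequence"], ann["IMPACT"])
```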

Prioritization

The workflow implements simple prioritization based on the IntOGen Compendium of Cancer Genes (CCG). For DMRs, the prioritization is done by subsetting to DMRs that involve any gene in the CCG and contain at least 50 CpG sites. For SV/SNV/INDELs, similar prioritization is done by subsetting to the variants that involve any gene in the CCG.
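
Conceptually, the subsetting amounts to a membership test against the CCG gene list. The sketch below is a simplified illustration, not the workflow's code: the record layout, the gene_key name "SYMBOL", and the short gene set are all hypothetical.

```python
def subset_to_ccg(records, ccg_genes, gene_key="SYMBOL"):
    """Keep records whose gene symbol (under gene_key) is in the CCG gene set."""
    return [rec for rec in records if rec.get(gene_key) in ccg_genes]


# Illustrative gene set and records; the real CCG list is much longer.
ccg_genes = {"BRAF", "TP53"}
records = [
    {"variant": "v1", "SYMBOL": "BRAF"},
    {"variant": "v2", "SYMBOL": "NDUFB2"},
    {"variant": "v3", "SYMBOL": "TP53"},
    {"variant": "v4"},  # record without a gene annotation is dropped
]
prioritized = subset_to_ccg(records, ccg_genes)
print([rec["variant"] for rec in prioritized])
```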

Mutational signatures

Mutational signatures are determined using the MutationalPatterns R package. There are two folders in the output:

  • mutsig_SNV contains TSV files of the mutational signatures fitted to known COSMIC signatures. The TSVs are:
    • mut_sigs.tsv contains the contribution of the fitted signatures. The signature is fitted using the fit_to_signatures_strict function.
    • reconstructed_sigs.tsv contains the reconstructed trinucleotide context based on the fitted signatures.
    • type_occurences.tsv contains the frequency of each mutation type in the sample.
    • mut_sigs_bootstrapped.tsv contains the bootstrap results of the fitted signatures using the fit_to_signatures_bootstrapped function. The bootstrap results can be used to determine how stable the fitted signatures are.
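
One simple way to judge stability from bootstrap results is to count how often each signature receives a non-zero contribution. The sketch below assumes one row per bootstrap iteration and one column per signature, which may not match the exact layout of mut_sigs_bootstrapped.tsv; the signature names and values are illustrative.

```python
def signature_stability(bootstrap_rows, min_contribution=0.0):
    """Fraction of bootstrap iterations where each signature exceeds min_contribution."""
    if not bootstrap_rows:
        raise ValueError("no bootstrap iterations supplied")
    n = len(bootstrap_rows)
    return {
        name: sum(row[name] > min_contribution for row in bootstrap_rows) / n
        for name in bootstrap_rows[0]
    }


# Four made-up bootstrap iterations (assumed layout: columns are signatures).
bootstraps = [
    {"SBS1": 120.0, "SBS7a": 900.0, "SBS40": 0.0},
    {"SBS1": 110.0, "SBS7a": 850.0, "SBS40": 0.0},
    {"SBS1": 130.0, "SBS7a": 0.0, "SBS40": 15.0},
    {"SBS1": 125.0, "SBS7a": 920.0, "SBS40": 0.0},
]
stability = signature_stability(bootstraps)
print(stability)
```

A signature that appears in only a small fraction of iterations is less likely to be a robust part of the fit.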

Purity and ploidy estimation

The workflow currently uses the HMFtools suite (Amber, Cobalt and Purple) to estimate purity and ploidy. For Cobalt, because long-read read depth and B-allele frequency are noisy, we set the PCF gamma to 1000 to allow for better segmentation. See GitHub issue for discussion. In practice, this means that the CNV calls may miss smaller focal events, but those should be picked up by the SV caller.

The purity and ploidy estimates were found to be reasonably robust in our experience with HCC1395 and COLO829, but should be used with caution. Purity and ploidy estimates can be found in the *.purity.tsv file in Purple_outputs folder.
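
A minimal sketch for pulling the two estimates out of the purity file is shown below. It assumes a tab-separated file with a single data row and columns named purity and ploidy; the real file has additional columns, and the values here are made up.

```python
import csv
import io


def read_purity_ploidy(handle):
    """Return (purity, ploidy) from the first data row of a purity TSV."""
    row = next(csv.DictReader(handle, delimiter="\t"))
    return float(row["purity"]), float(row["ploidy"])


# Reduced, made-up example with only the two columns used here.
example = io.StringIO("purity\tploidy\n0.75\t3.10\n")
purity, ploidy = read_purity_ploidy(example)
print(f"purity={purity:.2f} ploidy={ploidy:.2f}")
```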

Copy number variation

CNVkit is used to segment copy numbers from the matched tumor/normal BAM files. To optimize for long reads, we set the bin size to 10 kbp, which we found to be optimal based on COLO829. The workflow also combines the purity and ploidy estimates from HMFtools with heterozygous SNVs from ClairS to estimate allele-specific major and minor copy numbers, which are written to the cnvkit_cns_with_major_minor_CN folder.
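
To give an intuition for how purity enters the allele-specific estimate, here is a textbook-style sketch under simplifying assumptions: the log2 ratio is taken relative to a diploid baseline, the normal contamination is diploid and heterozygous at the SNV, and the B-allele frequency is that of the less abundant allele. This is not the workflow's actual implementation, and the function names are ours.

```python
import math


def total_copy_number(log2_ratio, purity, normal_cn=2):
    """Tumor total copy number after removing the normal-cell contribution."""
    observed = normal_cn * 2 ** log2_ratio
    return (observed - (1 - purity) * normal_cn) / purity


def minor_copy_number(baf, total_cn, purity):
    """Tumor minor-allele copy number from the BAF of a heterozygous site."""
    mixture_cn = purity * total_cn + (1 - purity) * 2
    return (baf * mixture_cn - (1 - purity)) / purity


# Worked example: 50% purity, tumor at 4 copies with 1 minor copy.
# Mixture depth ratio is (0.5*4 + 0.5*2) / 2 = 1.5; mixture BAF is 1/3.
purity = 0.5
total = total_copy_number(math.log2(1.5), purity)
minor = minor_copy_number(1 / 3, total, purity)
major = total - minor
print(round(total, 3), round(major, 3), round(minor, 3))
```

Real estimates are usually rounded to integers and need extra handling for noisy BAF values and non-diploid baselines.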

A downside of CNVkit is that the recall mode requires integer ploidy, which can fail when subclonal CNVs are present. Purple also calls allele-specific copy numbers and is able to account for non-integer ploidy (subclonal CNVs). The CNV segments from Purple can be found in the Purple_outputs folder and have the suffix tumor.purple.cnv.somatic.tsv. We visualize the results from Purple in the final report.
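
As an illustration of what non-integer copy numbers look like in practice, the sketch below flags segments whose copy number is far from the nearest integer, which can hint at subclonal events. It assumes the segment file is tab-separated with chromosome, start, end and copyNumber columns; the tolerance of 0.25 is an arbitrary choice for this example and the rows are made up.

```python
import csv
import io


def non_integer_segments(handle, tolerance=0.25):
    """Return segments whose copy number is more than tolerance from an integer."""
    flagged = []
    for row in csv.DictReader(handle, delimiter="\t"):
        cn = float(row["copyNumber"])
        if abs(cn - round(cn)) > tolerance:
            flagged.append((row["chromosome"], int(row["start"]), int(row["end"]), cn))
    return flagged


# Made-up segments with a reduced set of columns.
example = io.StringIO(
    "chromosome\tstart\tend\tcopyNumber\n"
    "chr1\t1\t1000000\t2.02\n"
    "chr1\t1000001\t5000000\t3.48\n"
    "chr2\t1\t2000000\t1.00\n"
)
flagged = non_integer_segments(example)
print(flagged)
```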

Report html

The workflow produces an HTML report that summarizes the results from the workflow. The report can be found in the report folder. It also contains variants filtered with a set of criteria, documented in the HTML file, to help with the interpretation of the samples. However, the filtering criteria are not meant to be used as a hard rule and should be used with caution. They may not be optimal for all datasets, and we encourage users to analyse the pipeline results in more detail.