1. Analysis tracks of the pipeline

Verena Kutschera edited this page Mar 1, 2022 · 2 revisions

The GenErode pipeline can be run up to any step. Note that most steps depend on the output of earlier steps, so running a given step also requires the steps it builds on.
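As a rough illustration of what "run up to any point" implies, the sketch below resolves which steps are needed for a chosen end point. The step names and the dependency table are hypothetical and only loosely follow the outline on this page; they are not the pipeline's actual rule or configuration names.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical step names and dependencies (assumption for illustration only).
# Each key maps to the set of steps that must finish before it can run.
DEPENDS_ON = {
    "reference_indexing": set(),
    "fastq_processing": set(),
    "mapping": {"reference_indexing", "fastq_processing"},
    "bam_processing": {"mapping"},
    "genotyping": {"bam_processing"},
    "vcf_processing": {"genotyping"},
    "pca": {"vcf_processing"},
}


def steps_up_to(target, depends_on=DEPENDS_ON):
    """Return the target step plus all of its upstream steps, in a runnable order."""
    needed = set()
    stack = [target]
    # Walk backwards from the target and collect every upstream step.
    while stack:
        step = stack.pop()
        if step not in needed:
            needed.add(step)
            stack.extend(depends_on[step])
    # Order the collected steps so that dependencies come first.
    subgraph = {step: depends_on[step] for step in needed}
    return list(TopologicalSorter(subgraph).static_order())


if __name__ == "__main__":
    print(steps_up_to("genotyping"))
```

Stopping at a hypothetical "genotyping" step pulls in indexing, FASTQ processing, mapping and BAM processing, but nothing from the downstream tracks.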

1) Data processing track

Required for the BAM file track and the VCF file track.

  • Reference genome indexing
  • Repeat element identification from reference genome
  • FASTQ file processing
    • Adapter trimming (modern samples)
    • Adapter trimming and merging of overlapping paired-end reads (historical samples)
  • Optional data processing steps:
    • Mapping to mitochondrial genomes of the target species and of potential contaminating species (the output from this step is not used downstream in the pipeline)
  • Mapping of historical and/or modern samples to a reference genome
  • BAM file processing:
    • Merge samples from different lanes per PCR/index
    • Remove duplicates
    • Merge BAM files per sample
    • Realign indels
    • Calculate average genome-wide depth
  • Optional data processing steps:
    • Base quality rescaling (mapDamage2) for historical samples
    • Subsampling to target depth
  • Genotyping
  • Optional data processing steps:
    • CpG site identification (three different methods)
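The average genome-wide depth calculated during BAM file processing is what makes subsampling to a target depth possible. The sketch below shows the underlying arithmetic only, assuming that subsampling keeps a random fraction of reads; the function names are hypothetical and this is not necessarily how the pipeline implements the step.

```python
def average_depth(per_site_depths):
    """Mean depth over all sites, counting sites with zero coverage."""
    depths = list(per_site_depths)
    if not depths:
        raise ValueError("no sites provided")
    return sum(depths) / len(depths)


def subsampling_fraction(avg_depth, target_depth):
    """Fraction of reads to keep so the expected depth is about target_depth."""
    if target_depth <= 0:
        raise ValueError("target depth must be positive")
    if avg_depth <= target_depth:
        # The sample is already at or below the target: keep every read.
        return 1.0
    return target_depth / avg_depth


if __name__ == "__main__":
    depth = average_depth([10, 20, 30])
    print(depth, subsampling_fraction(depth, 5))
```

A sample whose average depth is already at or below the target is left unchanged in this sketch, since subsampling can only reduce depth.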

2) BAM file track

  • Optional data processing steps:
    • BED files with sex chromosomal or autosomal contigs
  • Downstream analyses:
    • mlRho (default filtering for quality, depth, repeat elements and optional filtering for sex chromosomal or autosomal contigs and CpG sites)

3) VCF file track

  • Optional data processing steps:
    • VCF file filtering for CpG sites
  • VCF file processing per sample:
    • Filtering for quality, depth, allelic imbalance
    • Remove indels and SNPs near indels
    • Remove repeat regions
  • VCF file merging and processing:
    • Merge VCF files from all samples
    • Filter to keep only biallelic SNPs
    • Remove sites with more than a certain fraction of missing genotypes across all samples
    • Extract samples from each dataset (historical samples, modern samples)
  • Optional downstream analyses:
    • PCA
    • Runs of homozygosity (ROH)
    • snpEff
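The two site-level filters applied to the merged VCF file (biallelic SNPs only, and a cap on missing genotypes) can be sketched on a simplified in-memory representation. The `Site` record, the function names and the `max_missing` threshold are assumptions made for illustration; the pipeline itself works on VCF files and the actual threshold is user-defined.

```python
from dataclasses import dataclass

BASES = {"A", "C", "G", "T"}


@dataclass
class Site:
    """Simplified stand-in for one row of a merged VCF file (hypothetical)."""
    ref: str
    alts: list
    genotypes: list  # one string per sample, e.g. "0/1" or "./." for missing


def is_biallelic_snp(site):
    """True if the site has exactly one alternate allele and both alleles are single bases."""
    return len(site.alts) == 1 and site.ref in BASES and site.alts[0] in BASES


def missing_fraction(site):
    """Fraction of samples whose genotype is missing at this site."""
    missing = sum(1 for gt in site.genotypes if "." in gt)
    return missing / len(site.genotypes)


def filter_sites(sites, max_missing=0.1):
    """Keep biallelic SNPs whose missing-genotype fraction does not exceed max_missing."""
    return [
        s for s in sites
        if is_biallelic_snp(s) and missing_fraction(s) <= max_missing
    ]
```

Applying the missingness filter after merging matters because the fraction is computed across all samples, historical and modern together, before the datasets are extracted again.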

4) GERP score track

  • GERP score calculation from reference genome and genomes of additional outgroup species
  • Calculation of relative mutational load per sample from processed VCF files