Skip to content

Latest commit

 

History

History
executable file
·
148 lines (99 loc) · 4.92 KB

README.md

File metadata and controls

executable file
·
148 lines (99 loc) · 4.92 KB

About

  • This is an alignment pipeline for use on a SLURM cluster. It uses BWA, SAMTools, Picard and GATK to carry paired end FASTQ files through to g.vcf
  • Basic flow is FASTQ -> BWA primary alignment -> Picard SortSAM -> SAMtools split -> Picard Mark Duplicates -> GATK Base Recalibration -> GATK PrintReads -> GATK DepthOfCoverage -> GATK Haplotype caller
  • The FASTQ to BWA step involves breaking the input FASTQ up into an arbitrary number of chunks and passing each of these to their own alignment sequence.
  • BWA uses the MEM algorith and passes the output straight to SAMTools to convert to BAM format as a space saving measure with minimal time impact.
  • After Picard SortSAM the blocks are then split into separate contigs based on the reference sequence used in alignment and all subsequent steps.
  • These contig blocks are then merged into full contigs prior to Picard Mark Duplicates running on each contig separately.
  • The contigs travel down through GATK Base Recalibration and into GATK PrintReads.
  • The GATK PrintReads function then splits into two separate paths: GATK DepthOfCoverage and SAMTools Cat.
  • The SAMTools cat path combines the separate contig BAMs into a single file.
  • GATK DepthOfCoverage provides basic Depth of Coverage information for all contigs. It also calculates overall coverage for all autosomal contigs and compares them to the Gender contig coverage to calculate correct X & Y coverage. This allows for correct sample ploidy settings in GATK HaplotypeCaller for X (X0), XY, XX, XXY XYY, XXX, XXXY, etc anueploidies in the event there are any sex chromosome specific disorders or if these disorders can affect expression it will be known.
  • For all but the sex and mitochondrial chromosomes, GATK HaplotypeCaller uses a ploidy of 2 and will start immediately after GATK PrintReads is completed.
  • The sex chromosomes will be start once all the GATK DepthOfCoverages have completed and are collated by the coverage comparison.
  • The Mitochondiral chromosomes are not passed to GATK HaplotypeCaller as their ploidy is in the hundreds.
  • The individual contigs from GATK HaplotypeCaller are then merged using GATK CatVariants once all contigs (barring MT) are completed.
  • Primary output contains a merge BAM file from GATK PrintReads and a merged genomic VCF from GATK HaplotypeCaller.
  • Secondary output contains the coverage map, command history and execution metrics.

Update history

2019-10-31

Fixed

  • sed quiet doesn't produce any output, resulting in no readgroups.

2019-09-12

Fixed

  • Illumina platinum readgroups have spaces in funny places.

2019-09-02

Fixed

  • Contig blocks sort order

Added

  • PCR Free option via local contigs

2018-11-05

Fixed

  • Legacy reference to MemPerCore
  • split dir referencing root.
  • File pathing

Added

  • Manually settable GATK_JAR path.

2018-10-10

Changed

  • Increased core use per contig from 7 to 8 as to not choke the cluster.
  • Decreased FASTQ_MAXREAD to 40m to decrease chunk size, increasing number of chunks, decreasing time to complete each chunk.
  • converted catreadindex script to bash argument format.

Fixed

  • Specifying the same input file throws an error.

2018-07-20

Changed

  • baserefs export order to suit external per usercfg file settings
  • script tarball named after script folder.
  • script folder now automatically determined via dirname
  • Tidied up usage output.

Added

  • Exclude version folders from tarball

Fixed

  • Reference dictionary location.

2018-06-11

Changed

  • Rearranged baserefs for better per user config.

2018-06-08

Fixed

  • Final output type (-t) not being respected.

Changed

  • MaxWallTime value set appropriately.
  • Baseref source from absolute to relative pathing.
  • Increased default usage output.

Added

  • Requeue to email list since it results in a failure overall.
  • Added mechanism for localized, per user settings via $HOME/fq2vcf.sh file

2018-03-14

Changed

  • CPUs Per Task from 8 to 7 to fill Nodes.
  • printTime functions to allow usage with stdin.

Removed

  • Annotation from HaplotypeCaller as generates excessive errors in combining phase.

Fixed

  • Spelling mistakes in coverage script.

2017-12-13

Added

  • Alternate Max-Walltime for Exome captures
  • Variable array task throttle setting.

Fixed

  • Allow REF to retain previous value instead of resetting to default.

2017-12-11

Fixed

  • Final out being blank!
  • HaplotypeCaller output missing.

Added

  • Pre-Haplotyping Index attempt if .bai missing can fail correctly.
  • Platform list to usage page.
  • Final output type specification.

Changed

  • Cleaned up log output for gender determination.
  • Relevant usage output on incorrect parameter settings.

2017-12-04

Changed

  • Switched from mem-per-cpu (--mem-per-cpu) to mem-per-node (--mem)
  • Gender determination alert notification contains more information.

Added

  • Automatic email detection can handle ID aliases

2017-11-29

-Initial commit