- This is an alignment pipeline for use on a SLURM cluster. It uses BWA, SAMTools, Picard and GATK to carry paired end FASTQ files through to g.vcf
- Basic flow is FASTQ -> BWA primary alignment -> Picard SortSAM -> Picard Mark Duplicates -> GATK Base Recalibration -> GATK PrintReads -> GATK DepthOfCoverage -> GATK Haplotype caller
- The FASTQ to BWA step involves breaking the input FASTQ up into an arbitrary number of chunks and passing each of these to their own alignment sequence.
- BWA uses the MEM algorith and passes the output straight to SAMTools to convert to BAM format as a space saving measure with minimal time impact.
- After Picard SortSAM the blocks are then split into separate contigs based on the reference sequence used in alignment and all subsequent steps.
- These contig blocks are then merged into full contigs prior to Picard Mark Duplicates running on each contig separately.
- The contigs travel down through GATK Base Recalibration and into GATK PrintReads.
- The GATK PrintReads function then splits into two separate paths: GATK DepthOfCoverage and SAMTools Cat.
- The SAMTools cat path combines the separate contig BAMs into a single file.
- GATK DepthOfCoverage provides basic Depth of Coverage information for all contigs. It also calculates overall coverage for all autosomal contigs and compares them to the Gender contig coverage to calculate correct X & Y coverage. This allows for correct sample ploidy settings in GATK HaplotypeCaller for X (X0), XY, XX, XXY XYY, XXX, XXXY, etc anueploidies in the event there are any sex chromosome specific disorders or if these disorders can affect expression it will be known.
- For all but the sex and mitochondrial chromosomes, GATK HaplotypeCaller uses a ploidy of 2 and will start immediately after GATK PrintReads is completed.
- The sex chromosomes will be start once all the GATK DepthOfCoverages have completed and are collated by the coverage comparison.
- The Mitochondiral chromosomes are not passed to GATK HaplotypeCaller as their ploidy is in the hundreds.
- The individual contigs from GATK HaplotypeCaller are then merged using GATK CatVariants once all contigs (barring MT) are completed.
- Primary output contains a merge BAM file from GATK PrintReads and a merged genomic VCF from GATK HaplotypeCaller.
- Secondary output contains the coverage map, command history and execution metrics.
- Switched default reference to GRCh37 decoy
- Email address to primary user. Need to make this dynamic.
- Removed HG38 specific handling for time being as it breaks decoy sequence.
- Depth of Coverage dying on pointless contigs.
- HaplotypeCaller annotations that produce extreme warning logs later on.
- Coverage calculation checking GL contigs.
- Default run-time to 5h, 59m for high priority queue.
- Separated FASTQ size data.
- Ramdisk allocation above baserefs call to affect ram allocation correctly.
- Command line options to spool_sample to streamline multi-run, gender analysis, etc.
- Update all scripts to accept dynamic command line options
- Support for multi-run jobs via stop-after-alignment marker.
- Array dependency corrilation not working correctly.
- Cleanup dependency list being incorrectly defined.
- Spool_Sample not resuming a job after fastq files have been purged.
- Cleanup script tarball arguments.
- Cleanup stores a copy of the script bundle.
- Cleanup script to correct tarball only the log files etc.
- Allow purging FastQ files once read-split has completed.
- Transfer puts files in correct place. Permissions issues will persist as Globus daemon 'owns' HCS files.
- Cleanup script trying to purge working folder as transfers may be active. Manual it is boys.
- Contig block merge as Picard MarkDuplicates takes multiple inputs and outputs correctly sorted BAM with no significant penalty.
- Removed array depedency correspondence from depCheck function as this doesn't work for cases of partial completion in array chains.
- Begun migration to HG38 with Alt-Aware alignment.
- Switched alignment phase to use RAM disk instead of temp storage to minimize network traffic while not risking running out of node temp space.
- Java memory allocation due to ram-disk usage requirements.
- storeMetrics function to try and output job statistics (cpu,mem,etc usage) Not much luck here as sstat seem kinda broken and sacct is slow to update.
- Minor code cleanup.
- Job position tracking.
- Initial Job auto-restart mechanism.
- Cleanup gzip command on folder structure.
- Extracted read-group from read files for alignment since readsplit doens't generate them any more.
- Job cleanup step to roll up output and remove left-overs.
- Transfer jobs will wait until transfer is complete before exiting.
- Cancelling extra alignment blocks causes merge step to be cancelled.
- Removed piped output from ContigMerge to MarkDuplicates as this isn't handled well.
- Merged ContigMerge and MarkDuplicate functions to cut task count in half: 1x84 vs 2x84
- Merged BQSR and PrintRead functions to cut task count in half: 1x84 vs 2x84
- Merged initial three steps into 1 to save BLOCKS x CONTIGS job submissions. ~30m Primary Alignment, ~30m SortSAM and 84x <10m Contig split takes <120m. Will padd to 240m as sometimes nodes are bogged.
- Migrated second phase Merge, MarkDup to CatVariants to main spool script as we don't have to collect contig split dependencies any more.
- Exit code not being passed correctly by cmdFail function.
- Test module for picard GatherVcfs as an alternative to CatVariants. Nul cahar issues abound!
- Workflow pdf
- Missing ! for file transfer failure detection.
- Missing error messages on job failure.
- Automatic re-submission of job when SIGTERM detected as job is about to be killed for exceeding walltime. Resubmitted job has double the wall-time. This will still result in a failure unless I can dynamically update the subsequent process's dependency.
- outFile function will exit cleanly if .done file exists. Will allow overwrite if output already exists but no done file found as likely interupted.
- ReadSplit awk system call missing a quote.
- Increase submissions delay to 6 seconds per submissions to hard block at 600/hour submissions limit of 6 seconds per submissions instead of 20/minutes's 3 seconds per job submission.
- SIGTERM trap incorporated into storeMetrics output as EXIT trap is called even on error exit.
- Reduced ContigSplit from 2 cores to 1.
- tieTaskDeps incorrectly unlocking child array elements repeatedly if current cycle found no match.
- ReadSplit not stopping on block spin-off failure.
- DepthOfCoverage runtime doubled from 30m to 60m as exceeding runtime allowance with ~4m to go.
- Set Gender haplotypecalling to correct array position. Cosmetic only as operational is defined via command-line.
- Rearranged ReadSplit logging output to more logical sequence.
- Merged Spool_Sample's file size detection for read pair to facilitate overall walltime multiplier based on input file size vs calibrated file size.
- Moved SIGTERM metric entry to end of line to maintain output consistency.
- Moved storeMetrics function to exit trap.
- Array elements that do not have a matching element in the previous array were left dependent on the entire previous array
- ExitCodes. 10: IO error. 15: Pipeline command error. 20: File move error.
- Gender contig definition in baserefs to allow seemless haplotypecaller array.
- Base Walltime multiplier to baserefs to allow per-contig walltimes in dispatch function.
- Signal trap for pre-kill event detection and logging of processes that are about to be killed.
- Runtime for ContigSplit increased to 30m as it seems to exceed 15m on some nodes.
- Walltime values converted to minutes for easy manipulation by dispatch function.
- Align & Sort purge taking purging one too many jobs.
- Non arrayed list of contigs. No point in having the same data twice.
- Printreads input validation missing a fi to close close the if.
- Check-blocks erroneous output.
- Metrics logging output format
- Moved file transfer script into main script body to prevent un-needed extra job submissions. The transfers are handed off to Globus so take virtually no time. This happens after the file has been moved to the storage folder so if transfer fails, you don't have to start from scratch.
- Changed in-script --array definition to more accurately reflect their general/ideal submission state. The array is still generated dynamically by the submission script.
- CheckBlock reading 1 block to far when restarted a from a point after ReadSplit had completed: blocks - 1 cuz they start at 0
- Submission delay mechanism so we don't submit a job within a certain amount of time of the previous.
- File transfer scripts from main check_block script.
- Increased ReadSplit block size and run-times by 50% to try and fit under NeSI submission rate limits.
- Global job category resource definitions. Second attempt.
- Delay on script re-start to reduce failures due to job submission rate limit.
- SortSam runtime reduced from 3 hours to 1 hour as highest seen runtime is 45 minutes.
- Replaced Align and Sort with array variant.
- Alignment array not collecting read-group header info for file-name pickup: cat block/readgroup.file in job script.
- Minimum meory being exceeded in java VM: specified memory is 1gb less than allocated.
- Sort array not copying .bai file to storage area after job completion.
- Moved files to their own repo folder for commandline commits etc.
- Moved BWA alignment command to variable then execute with eval. This allows seemless echoing to log on changes.
- Format of ReadSplit block number passed to check_block to allow array manipulation. Can't have five digit element counts can we?
- ReadSplit runtime reduced from 5 hours to 2.5 hours as highest seen runtime is 2 hours.
- PrimaryAlign runtime reduced 2 hours to 45 minutes as highest seen runtime is 20 minutes.
- Moved Alignment and Sort array dispatch to be above ReadSplit so we can pass ReadSplit the array job ids.
- Alignment and Sort arrays submitted at start point. 1000 elements each that are purged of excess jobs once ReadSplit completes.
- cmdFailed function doesn't function as expected. Reverted to if ! ${CMD}; then...
- HaplotypeCaller array elements tied to PrintReads array elements.
- Migrated file IO validation to baserefs.sh
- Large output files are first written to local node's tmp space, then moved to output folder on completion.
- ReadSplit launches as array job.
- BaseRecalibrator array job using singleton out and err file definitions.
- Job output not printing job/array ids correctly.
- File exist checking to each sbatch script with detailed output.
- Clean up sorted block output when all contig splitting has completed after any mergecontig runs
- DepthOfCoverage and HaplotypeCaller input and output incorrectly defined.
- DepthOfCoverage no being passed the capture platform correctly.
- Array element linker throwing errors when previous array is empty which occurs when all previous array was alread completed. Gotta quote those possibly empty inputs!
- HaplotypeCaller seeking incorrect input for non-arrayed Sex chromosome jobs
- Array element linker can accept non numerical values to compare. Never know if/when that'll come in handy.
- Late stage sample fingerprinting sequence. (haplotypecaller -> selectvariants)
- Readgroup variable from coverage mapping. Doesn't do anything.
- HaplotypeCaller array (autosomal and centromere contigs) to DepthOfCoverage array output and delayed start-time as no dependency tying required.
- MarkDuplicates, BaseRecalibrator, PrintReads, DepthOfCoverage and Haplotype caller (non MT or sex Chr) to array submission and tied array element dependency to matching element in previous array. Testing shows individual array element will wait only for their own dependency before starting.
- Separated CatReads from ReadIndex to allow job chaining.
- CatReads to sequential job chain for catting -> transfer reads & ReadIndex.
- ReadIndex to sequential job chain for indexing -> transfer index.
- CatVariants to sequential job chain for catting -> transfer variants & transfer index.
- Function comments, because everyone loves comments.
- tieTaskDeps function to tie a given task array's dependencies to the previous task array's matching element via SCONTROL UPDATE.
- Start-time delay for subsequent arrays so they cannot start until per-array-element dependencies are set correctly.
- Initial job dependencies for subsequent arrays as these will prevent the entire array from starting until the entire previous array has completed.
- MergeContig not collecting ContigSplit array dependencies.
- MergeContig trying to write to wrong location.
- Reformatted changelog/readme for github display.
- MergeContig converted to array submission method.
- Logging output method to sequencial instead of buffered to more accurately represent submission rate.
- File path simplification to enable better clean-up later on.
- About section to README.md
- Delay on each contig for markdup to haplotype segment as job submission rate exceeded limits.
- Skip message for X, Y and MT contigs in primary calling loop.
- ContigSplit function to be array job. Now submits 1 job array per block instead of 84. Array is dynamically created based on .done file existence.
- Log output header to 2 character combos. RS: Read Split, PA: Primary alignment, SS: SortSAM, etc...
- Log output to single line per post-merge contig.
- changelot.txt to README.md so changelog is visible.
- CatReads and CatReadsIndex jobs into 1. Increase walltime to accomodate both jobs. ~30m cat, 45m index.
- Contig Count to baserefs.sh
- ASP seeing a failed grep as a failed ssh connection. grep failure is ExitCode 1, ssh network failure exitcode is > 1
- Check for job submission failures. Pipeline will exit 1 in that event.
- Local spooler lauching sequencial jobs without input.
- X sub contigs not being linked correctly in their folders .../X/X:1-2699520.g.vxf.gz, etc.
- MT contig being included in HaplotypeCaller after removing input collection from primary loop.
- CatVarInputs to generate independent of main cycle so X&Y don't need to be resorted by CatVar.
- Global temp directory to be SLURM provided one so minimize chances of job fail due to temp directory failure.
- CatPrintReads walltime from 3 hours to 1. 6x to 2x 30m known run-time.
- CatPrintReadsIndex walltime from 1 hour to 1.5. 1.25x to 2x 45m known run-time.
- HaplotypeCaller walltime from 6 hours to 3 as parallelization with -nct has proven effective and cpus-per-task & mem-per-cpu from 4x8192 to 8x4096.
- CatVariants walltime to 3 hours as longest runtime is ~2 hours
- SortSam walltime from 3 hours to 1.5 as longest runtime is 30 minutes
- Mark Duplicates memory allocation from 32G to 16G as no different in runtime and walltime from 6 hours to 2 as longest runtime is just over 1 hour.
- scriptFailed function to all non-0 exit points.
- scontrol show job $SLURM_JOBID to scriptFailed function.
- Reads file size to job name to size-vs-speed filtering.
- Automatic sample progression. (ASP)
- scriptFailed function to collect basic node data when a job fails for any reason.
- Coverage & Gender Determination not collecting GL* PrintRead dependencies.
- HaplotypeCaller cpu-per-task & mem-per-cpu from 2x32 to 4x8 to boost parallelization but reduce overall memory requirement as this extra memory no longer affects run-time.
- SortSam mem-per-cpu from 32G to 16G as this has no effect on runtime.
- CatPrintReads & Index scripts to replace MergeReads. Picard MergeSamFiles takes 6 to 10 hours to complete. SAMTools Cat takes 30 minutes.
- Global temp directory definition to minimise network thrashing.
- HaplotypeCaller cpu-per-task from 1 to 2 to test if -nct parallelization has been resolved in this version of GATK.
- SLURM based parallelization calculation to GATK arguments.
- HaplotypeCaller mem-per-cpu from 16G to 32G to test if extra memory decreases run-time
- parallelization option to HaplotypeCaller function as this has been resolved in GATK and it will reduce runtime.
- duplicate metrics storage for sample.
- global module version definitions.
- check for previously finished jobs.
- SLURM based memory calculation to java arguments.
- FASTQ scan definitions.
- contig definition to for automation.
- base reference file to reduce data replication across multiple scripts.
- trimming process as BWA & GATK tools can handle quality based trimming and adapters.
- Cluster project.
- Continue to work out minimum requirement to obtain 1 hour max runtimme per segment within high partition.
- Build list of wall-time ratios for each job type and contig to set per-contig wall-times.
- Determine sample coverage based wall-times for dynamic allocation.
- Build multi-sample pipeline with identity comparison and merge function.
- Add fingerprinting function post bam collection?
- Add ability to re-try a job if it fails because of a node issues. Could trigger a chain of dependant jobs to take it as far as it can outwith the overall process. This will minimize delay on restarting the job later on.
- Migrate input file collection to slurm script. This will allow failed jobs to possibly be run in time for the catvar or catreads functions to collect them. Still wont be 100% effective as job execution depends on cluster availability. Should work well for smaller contigs.
- Re-write everything. Check-Blocks is a bloated mess.