This takes two paired end FASTQ files and searches for Duplex Sequencing amplicons, then matches valid Duplex Sequencing Duplex Consensus Sequences against CODIS sequences.
-
python3 streamFastqTransform.py -s1 seq1.fastq -s2 seq2.fastq -o sample_spark_input.txt
-
spark-submit dsSeqModel.py -i sample_spark_input.txt -o output_prefix -s sample_name
Creates three output files:
- output_prefix.txt - A list of all of the DCS and the counts of sequences that support it with the CODIS match
- output_prefix.vcf - A VCF format list of all DCS and counts.
- output_prefix_sum.txt - A list of all clans that support a CODIS allele (or a matching sequence in the default CODIS_surrounding.tsv or a similar format file containing other sequences supplied with the -c flag).
Example The repository provides example CODIS_spark_input.txt and CODIS_surrounding.tsv that can be used to run an example analysis with:
spark-submit dsSeqModel.py -i examples/CODIS_spark_input.txt -o CODIS_output -s CODIS -c examples/CODIS_surrounding.tsv
Note: D21S11 is too long for a 156 bp read, so it requires a match count (mc) setting of 2 to match CODIS allele and it will output an incorrect allele count because the read cannot span the full allele sequence.
Requires Pyspark >= 2.x, TRAL, and TRF
Install Spark https://spark.apache.org/downloads.html
Install TRAL http://elkeschaper.github.io/tral/
pip install TRAL
Install TRF http://tandem.bu.edu/trf/trf.html
-
use streamFastqTransform.py in parallel to upload FASTQ files to a HDFS
-
use dsSeqModel.py to analyze files
A Docker repository will be available soon.