Skip to content

Calling Duplex Consensus Reads

Nils Homer edited this page Feb 16, 2021 · 6 revisions

Overview

Calling consensus reads from duplex data is the process of taking reads that were generated in a way such as to allow post-hoc identification of which sequencing reads are derived from the paired strands of an original duplex or double-stranded source molecule of DNA. One such example is the process outlined by Kennedy et al which attaches UMIs to each end of a source molecule.

Mathematically the process is very similar to the one outlined for calling consensus reads from single-UMI data, though the mechanics are somewhat different.

The process outlined below is implemented in the CallDuplexConsensusReads program in fgbio, which is run after first grouping reads with GroupReadsByUmi --strategy=paired.

Process

The high level process starts with a group of reads identified as originating from the same double-stranded source molecule. The two strands of the original molecule are labeled, arbitrarily, as A and B and each read is known to have originated from either the A strand or the B strand. The process proceeds through the following steps:

  1. Reads are split into four sub-groups:
    • Strand A and read 1 (A1s)
    • Strand A and read 2 (A2s)
    • Strand B and read 1 (B1s)
    • Strand B and read 2 (B2s)
  2. Reads are unmapped and, if necessary, reverted to sequencing order
  3. Quality trimming, if enabled
  4. Remaining low-quality bases are masked (i.e. converted to Ns)
  5. Reads are further trimmed to the length of the insert if the insert is shorter than the read length
  6. Reads are filtered based on their Cigar (alignment structure) to ensure reads are always in phase (in two groups: A1s and B2s, and A2s and B1s).
  7. Four single-strand consensus reads are generated, one each for A1s, A2s, B1s, and B2s
  8. Two duplex consensus reads are generated by combining the A1 and B2 consensus reads, and the A2 and B1 consensus reads

Each step is further detailed below.

Splitting raw reads into groups

Reads are split into groups based on the strand of the original molecule (A or B) and whether they are sequencing read 1 or 2. The expectation is that read 1s derived from the A strand will match read 2s derived from the B strand and vice versa.

Unmapping

In order to generate consensus reads it is necessary to have all reads in the same orientation. To achieve this reads are reverted to sequencing order, even if they were aligned to the negative strand of the genome.

Quality trimming

An optional step, reads can be end-trimmed to remove low quality bases at the end of sequencing. This is highly recommended as removing low-quality bases will reduce the number of disagreements in the consensus and reduce the number of no-calls (Ns) produced in the consensus reads. Trimming is performed using the same running-sum algorithm as implemented in bwa.

Masking low quality bases

A parameter to the consensus caller is a minimum input base quality. Bases below the minimum quality threshold are converted to Ns in the input reads to ensure that they are not used elsewhere in the process. If quality trimming it disabled, the read is truncated at this step to remove any contiguous Ns at the end of the read.

Trimming to insert length

Reads longer than the insert length will necessarily read into the adapter sequence. For single-UMI data producing consensus calls of the adapter sequence is not a significant problem. However for duplex data, where we combine A1 and B2 reads, we generate consensus reads from raw reads that may read into different adapter sequences! Attempting to call consensus bases from these sequences will result in a high number of disagreements and many no-calls, which may cause consensus reads to be erroneously filtered out as low-quality at a later stage. Therefore reads are trimmed to their insert length prior to consensus calling.

Cigar Filtering

Since we do not generate a multiple-alignment of all raw reads to each other, any length errors (insertions or deletions) in the raw reads may result in reads being out of alignment with one another, and erroneous consensus calls being made. Consider the following reads:

1: ACGTGACTGACTGACTAGCTTTTTTTAGACTAGCTACTACT
2: ACGTGACTGACTGACTAGCTTTTTTTAGACTAGCTACTACT
3: ACGTGACTGACTGACTAGCTTTTTTTTAGACTAGCTACTAC

Without correction the third read, with an additional T incorporated into it's poly-T, will generate many disagreements with the other two reads. Since all reads are derived from the same source molecule any such differences must be errors.

To handle this we create groups of reads who's alignments are compatible with one another, and use only the reads from the largest group for generating the consensus. For example, given the following raw reads post-trimming (padding with - added for ease of reading):

1: ACATGGTACTTAGGTACGTACGGGG-TGATATG
2: ACATGGTACTTAGGTACGTACGGGG-TGATATG
3: ACATGGTACTTAGGTACGTACGGGG-TGATATG
4: ACATGGTACTTAGGTACGTACGGGGGTGATATG
5: ACATGGTACTTAGGTACGTACGGG--TGATATG
6: ACATGGTACTTAGGTACGTACG

three groups would be created: (1,2,3,6), (4,6) and (5,6), and the first group selected.

This process is performed independently on A1+B2 and B1+A2 reads.

Calling Single-Strand Consensus Reads

The penultimate step is to generate four single-strand consensus reads, one for each group of reads: A1, A2, B1, B2. This is done by feeding the raw reads for each group through the vanilla consensus calling process as documented here.

Calling Double-Stranded Consensus Reads

The final step is generation of the R1 and R2 duplex consensus reads. This is done very simply on a base by base basis by merging the appropriate A read with the appropriate B read as follows:

  1. If the bases agree between A and B reads, the resulting quality score is simply Q(A) + Q(B)
  2. If the bases disagree between A and B reads and the quality scores differ, the resulting base is the base with the higher quality score, and the resulting quality score is Q(higher) - Q(lower).
  3. If the bases disagree and have identical quality the base is arbitrarily the base from A and the quality score is 2 (the minimum Phred quality score).