Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GroupReadsByUmi may fail when marking duplicates including secondary/supplementary reads #964

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
49 changes: 45 additions & 4 deletions src/main/scala/com/fulcrumgenomics/bam/Bams.scala
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@
package com.fulcrumgenomics.bam

import com.fulcrumgenomics.FgBioDef._
import com.fulcrumgenomics.alignment.Cigar
import com.fulcrumgenomics.bam.api.SamOrder.Queryname
import com.fulcrumgenomics.bam.api._
import com.fulcrumgenomics.commons.collection.{BetterBufferedIterator, SelfClosingIterator}
Expand All @@ -41,6 +42,36 @@
import java.io.Closeable
import scala.math.{max, min}



case class Supplementary(refName: String, start: Int, positiveStrand: Boolean, cigar: Cigar, mapq: Int, nm: Int) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scaladocs needed later

def negativeStrand: Boolean = !positiveStrand
def refIndex(header: SAMFileHeader): Int = header.getSequence(refName).getSequenceIndex
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may invert this, and store refIndex instead, since when we have the SAM header when want refName (in toString)


def end: Int = start + cigar.lengthOnTarget - 1

Check warning on line 51 in src/main/scala/com/fulcrumgenomics/bam/Bams.scala

View check run for this annotation

Codecov / codecov/patch

src/main/scala/com/fulcrumgenomics/bam/Bams.scala#L51

Added line #L51 was not covered by tests
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question Is end inclusive or exclusive?

(And maybe add scaladoc to clarify)

I recently added an equivalent property to fgpyo and made it exclusive; if this is the same I think we don't want to subtract 1

https://github.com/fulcrumgenomics/fgpyo/blob/8738a1de868fc6c76a59ad68b29b4c537e660b97/fgpyo/sam/__init__.py#L557

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fgbio is generally 1-based inclusive, but I don't think we say that anywhere.

def unclippedStart: Int = {
SAMUtils.getUnclippedStart(start, cigar.toHtsjdkCigar)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably could just compute these directly without having to route to htsjdk

}

def unclippedEnd: Int = {
SAMUtils.getUnclippedEnd(end, cigar.toHtsjdkCigar)

Check warning on line 57 in src/main/scala/com/fulcrumgenomics/bam/Bams.scala

View check run for this annotation

Codecov / codecov/patch

src/main/scala/com/fulcrumgenomics/bam/Bams.scala#L57

Added line #L57 was not covered by tests
}
}

object Supplementary {
/** Returns a formatted alignment as per the SA tag: `(rname ,pos ,strand ,CIGAR ,mapQ ,NM ;)+` */
def toString(rec: SamRecord): String = {
val strand = if (rec.positiveStrand) '+' else '-'
f"${rec.refName},${rec.start},${strand},${rec.cigar},${rec.mapq},${rec.getOrElse(SAMTag.NM.name(),0)}"
}


def apply(sa: String): Supplementary = {
Comment on lines +61 to +69
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would prefer to have two apply methods, one for SamRecord and one for String, and a class toString method that converts an instance of Supplementary to String

val parts = sa.split(",")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably good to check we get 6 parts

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Is validation of the type/value of each part also necessary?

Supplementary(parts(0), parts(1).toInt, parts(2) == "+", Cigar(parts(3)), parts(4).toInt, parts(5).toInt)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Supplementary(parts(0), parts(1).toInt, parts(2) == "+", Cigar(parts(3)), parts(4).toInt, parts(5).toInt)
Supplementary(parts(0), parts(1).toInt - 1, parts(2) == "+", Cigar(parts(3)), parts(4).toInt, parts(5).toInt)

suggestion We need to subtract 1 if start is zero-based.

Without scaladoc I'm not sure, but I'm assuming it's zero based; and the SA tag is 1-based.

Pos is a 1-based coordinate.

https://samtools.github.io/hts-specs/SAMtags.pdf

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto: fgbio is generally 1-based inclusive

}
}

/**
* Class that represents all reads from a template within a BAM file.
*/
Expand Down Expand Up @@ -107,11 +138,21 @@

/** Fixes mate information and sets mate cigar on all primary and supplementary (but not secondary) records. */
def fixMateInfo(): Unit = {
for (primary <- r1; supp <- r2Supplementals) {
SamPairUtil.setMateInformationOnSupplementalAlignment(supp.asSam, primary.asSam, true)
// Set all mate info on BOTH secondary and supplementary records, not just supplementary records. We also need to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment on line 139 should be updated (or removed) to reflect this

// add the "pa" and "pm" tags with information about the primary alignments. Finally, we need the MQ tag!
val r1NonPrimary = r1Supplementals ++ r1Secondaries
val r2NonPrimary = r2Supplementals ++ r2Secondaries
for (primary <- r1; nonPrimary <- r2NonPrimary) {
SamPairUtil.setMateInformationOnSupplementalAlignment(nonPrimary.asSam, primary.asSam, true)
nonPrimary(SAMTag.MQ.name()) = primary.mapq
nonPrimary("mp") = Supplementary.toString(primary)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: store these tag definitions somewhere else

r2.foreach(r => nonPrimary("rp") = Supplementary.toString(r))
}
for (primary <- r2; supp <- r1Supplementals) {
SamPairUtil.setMateInformationOnSupplementalAlignment(supp.asSam, primary.asSam, true)
for (primary <- r2; nonPrimary <- r1NonPrimary) {
SamPairUtil.setMateInformationOnSupplementalAlignment(nonPrimary.asSam, primary.asSam, true)
nonPrimary(SAMTag.MQ.name()) = primary.mapq
nonPrimary("mp") = Supplementary.toString(primary)
r1.foreach(r => nonPrimary("rp") = Supplementary.toString(r))
Comment on lines +145 to +155
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question Would you find it more legible to extract these for loops into a helper so we don't repeat it twice?

}
for (first <- r1; second <- r2) {
SamPairUtil.setMateInfo(first.asSam, second.asSam, true)
Expand Down
65 changes: 47 additions & 18 deletions src/main/scala/com/fulcrumgenomics/bam/api/SamOrder.scala
Original file line number Diff line number Diff line change
Expand Up @@ -24,12 +24,15 @@

package com.fulcrumgenomics.bam.api

import com.fulcrumgenomics.bam.{Bams, Supplementary}
import com.fulcrumgenomics.umi.ConsensusTags
import htsjdk.samtools.SAMFileHeader.{GroupOrder, SortOrder}
import htsjdk.samtools.util.Murmur3
import htsjdk.samtools.{SAMFileHeader, SAMUtils}
import org.apache.commons.math3.genetics.RandomKey

import scala.reflect.runtime.universe.Template

Comment on lines +34 to +35
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
import scala.reflect.runtime.universe.Template


/** Trait for specifying BAM orderings. */
sealed trait SamOrder extends Product {
Expand Down Expand Up @@ -175,24 +178,50 @@ object SamOrder {
override val groupOrder: GroupOrder = GroupOrder.query
override val subSort: Option[String] = Some("template-coordinate")
override val sortkey: SamRecord => A = rec => {
val readChrom = if (rec.unmapped) Int.MaxValue else rec.refIndex
val mateChrom = if (rec.unpaired || rec.mateUnmapped) Int.MaxValue else rec.mateRefIndex
val readNeg = rec.negativeStrand
val mateNeg = if (rec.paired) rec.mateNegativeStrand else false
val readPos = if (rec.unmapped) Int.MaxValue else if (readNeg) rec.unclippedEnd else rec.unclippedStart
val matePos = if (rec.unpaired || rec.mateUnmapped) Int.MaxValue else if (mateNeg) SAMUtils.getMateUnclippedEnd(rec.asSam) else SAMUtils.getMateUnclippedStart(rec.asSam)
val lib = Option(rec.readGroup).flatMap(rg => Option(rg.getLibrary)).getOrElse("Unknown")
val mid = rec.get[String](ConsensusTags.MolecularId).map { m =>
val index: Int = m.lastIndexOf('/')
if (index >= 0) m.substring(0, index) else m
}.getOrElse("")

if (readChrom < mateChrom || (readChrom == mateChrom && readPos < matePos) ||
(readChrom == mateChrom && readPos == matePos && !readNeg)) {
TemplateCoordinateKey(readChrom, mateChrom, readPos, matePos, readNeg, mateNeg, lib, mid, rec.name, false)
}
else {
TemplateCoordinateKey(mateChrom, readChrom, matePos, readPos, mateNeg, readNeg, lib, mid, rec.name, true)
// For non-secondary/non-supplementary alignments, use the info in the record. For secondary and supplementary
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: how can we simplify these two branches, since they're very similar

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you additionally set mp on the primary alignments, not just the supplementaries, (and also take my suggestion to define an apply for SamRecord 🙂 ) you could do the following:

val primary = if (!rec.secondary && !rec.supplementary) Supplementary(rec) else Supplementary(rec[String]("rp"))
val mate = Supplementary(rec[String]("mp"))

// Just the second branch, using the info from `Supplementary` instead of `SamRecord`
...

// alignments, use the info in the pa/pm tags.
if (!rec.secondary && !rec.supplementary) {
val readChrom = if (rec.unmapped) Int.MaxValue else rec.refIndex
val mateChrom = if (rec.unpaired || rec.mateUnmapped) Int.MaxValue else rec.mateRefIndex
val readNeg = rec.negativeStrand
val mateNeg = if (rec.paired) rec.mateNegativeStrand else false
val readPos = if (rec.unmapped) Int.MaxValue else if (readNeg) rec.unclippedEnd else rec.unclippedStart
val matePos = if (rec.unpaired || rec.mateUnmapped) Int.MaxValue else if (mateNeg) SAMUtils.getMateUnclippedEnd(rec.asSam) else SAMUtils.getMateUnclippedStart(rec.asSam)
val lib = Option(rec.readGroup).flatMap(rg => Option(rg.getLibrary)).getOrElse("Unknown")
val mid = rec.get[String](ConsensusTags.MolecularId).map { m =>
val index: Int = m.lastIndexOf('/')
if (index >= 0) m.substring(0, index) else m
}.getOrElse("")

if (readChrom < mateChrom || (readChrom == mateChrom && readPos < matePos) ||
(readChrom == mateChrom && readPos == matePos && !readNeg)) {
TemplateCoordinateKey(readChrom, mateChrom, readPos, matePos, readNeg, mateNeg, lib, mid, rec.name, false)
}
else {
TemplateCoordinateKey(mateChrom, readChrom, matePos, readPos, mateNeg, readNeg, lib, mid, rec.name, true)
}
} else {
val primary = Supplementary(rec[String]("rp"))
val mate = Supplementary(rec[String]("mp"))
Comment on lines +204 to +205
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Todo, better error message or fallback

val readChrom = if (rec.unmapped) Int.MaxValue else primary.refIndex(rec.header)
val mateChrom = if (rec.unpaired || rec.mateUnmapped) Int.MaxValue else mate.refIndex(rec.header)
val readNeg = primary.negativeStrand
val mateNeg = if (rec.paired) mate.negativeStrand else false
val readPos = if (rec.unmapped) Int.MaxValue else if (readNeg) primary.unclippedEnd else primary.unclippedStart
val matePos = if (rec.unpaired || rec.mateUnmapped) Int.MaxValue else if (mateNeg) mate.unclippedEnd else mate.unclippedStart
val lib = Option(rec.readGroup).flatMap(rg => Option(rg.getLibrary)).getOrElse("Unknown")
val mid = rec.get[String](ConsensusTags.MolecularId).map { m =>
val index: Int = m.lastIndexOf('/')
if (index >= 0) m.substring(0, index) else m
}.getOrElse("")

if (readChrom < mateChrom || (readChrom == mateChrom && readPos < matePos) ||
(readChrom == mateChrom && readPos == matePos && !readNeg)) {
TemplateCoordinateKey(readChrom, mateChrom, readPos, matePos, readNeg, mateNeg, lib, mid, rec.name, false)
}
else {
TemplateCoordinateKey(mateChrom, readChrom, matePos, readPos, mateNeg, readNeg, lib, mid, rec.name, true)
}
}
}
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -719,7 +719,7 @@ class GroupReadsByUmi

// Then output the records in the right order (assigned tag, read name, r1, r2)
templatesByMi.keys.toSeq.sortBy(id => (id.length, id)).foreach(tag => {
templatesByMi(tag).sortBy(t => t.name).flatMap(t => t.primaryReads).foreach(rec => {
templatesByMi(tag).sortBy(t => t.name).flatMap(t => t.allReads).foreach(rec => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question Where are --include-supplementary and --include-secondary taken into consideration?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the initial filter of the BAM file (see the _includeSecondaryReads variabel)

out += rec
})
})
Expand Down
30 changes: 29 additions & 1 deletion src/test/scala/com/fulcrumgenomics/bam/api/SamOrderTest.scala
Original file line number Diff line number Diff line change
Expand Up @@ -25,11 +25,13 @@
package com.fulcrumgenomics.bam.api

import java.util.Random

import com.fulcrumgenomics.FgBioDef._
import com.fulcrumgenomics.bam.Bams
import com.fulcrumgenomics.bam.Bams.{MaxInMemory, templateIterator}
import com.fulcrumgenomics.commons.util.SimpleCounter
import com.fulcrumgenomics.testing.SamBuilder.{Minus, Plus}
import com.fulcrumgenomics.testing.{SamBuilder, UnitSpec}
import com.fulcrumgenomics.util.Io
import htsjdk.samtools.{SAMFileHeader, SAMRecordCoordinateComparator, SAMRecordQueryNameComparator}
import htsjdk.samtools.SAMFileHeader.{GroupOrder, SortOrder}
import htsjdk.samtools.util.Murmur3
Expand Down Expand Up @@ -217,4 +219,30 @@ class SamOrderTest extends UnitSpec {

actual should contain theSameElementsInOrderAs expected
}

it should "handle supplementary alignments" in {
val builder = new SamBuilder(readLength=100, sort=Some(SamOrder.Queryname))
val exp = ListBuffer[SamRecord]()
// primary read pairs for q1, that map to different contigs
exp ++= builder.addPair("q1", contig=1, contig2=Some(2), start1=66, start2=47, cigar1="60M40S", cigar2="55M45S", strand2=Plus)
// supplementary R2 (ignore R1), which maps to the same chromosome as the primary R1
val Seq(_, s2) = builder.addPair("q1", contig=1, contig2=Some(1), start1=66, start2=66, cigar1="60M40S", strand2=Minus)
s2.supplementary = true
s2.properlyPaired = true
exp += s2
// primary read pairs for q2, that map to different contigs, but earlier that q1
exp ++= builder.addPair("q2", contig = 1, contig2 = Some(2), start1 = 50, start2 = 30, cigar1 = "60M40S", cigar2 = "55M45S")

// Fix the mate information. Note: sorting here to get a template-iterator will write the records to disk first,
// so we cannot use the records in builder/exp.
val records = Bams.templateIterator(iterator=exp.iterator, header=builder.header, maxInMemory=MaxInMemory, tmpDir=Io.tmpDir).flatMap { template =>
template.fixMateInfo()
template.allReads
}.toList

val expected = List("q2/1", "q2/2", "q1/1", "q1/2", "q1/2:sup")
val actual = records.sortBy(r => SamOrder.TemplateCoordinate.sortkey(r)).map(_.id)

actual should contain theSameElementsInOrderAs expected
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -26,13 +26,14 @@
*/
package com.fulcrumgenomics.umi

import com.fulcrumgenomics.bam.Template
import com.fulcrumgenomics.bam.api.SamOrder
import com.fulcrumgenomics.bam.{Bams, Template}
import com.fulcrumgenomics.bam.api.{SamOrder, SamWriter}
import com.fulcrumgenomics.bam.api.SamOrder.TemplateCoordinate
import com.fulcrumgenomics.cmdline.FgBioMain.FailureException
import com.fulcrumgenomics.testing.SamBuilder.{Minus, Plus}
import com.fulcrumgenomics.testing.{SamBuilder, UnitSpec}
import com.fulcrumgenomics.umi.GroupReadsByUmi._
import com.fulcrumgenomics.util.Io
import org.scalatest.{OptionValues, PrivateMethodTester}

import java.nio.file.Files
Expand Down Expand Up @@ -338,6 +339,46 @@ class GroupReadsByUmiTest extends UnitSpec with OptionValues with PrivateMethodT
recs.filter(_.name.equals("a04")).forall(_.duplicate == true) shouldBe true
}

it should "mark duplicates on supplementary reads" in {
val builder = new SamBuilder(readLength = 100, sort = Some(SamOrder.Coordinate))
// primary read pairs for q1, that map to different contigs
builder.addPair("q1", contig = 1, contig2 = Some(2), start1 = 66, start2 = 47, cigar1 = "60M40S", cigar2 = "55M45S", strand2 = Plus, attrs = Map("RX" -> "ACT"))
// supplementary R2, which maps to the same chromosome as the primary R1
val Seq(s1, s2) = builder.addPair("q1", contig = 1, contig2 = Some(1), start1 = 66, start2 = 66, cigar1 = "60M40S", strand2 = Minus, attrs = Map("RX" -> "ACT"))
s1.supplementary = true
s2.supplementary = true
s2.properlyPaired = true
// primary read pairs for q2, that map to different contigs, but earlier that q1
builder.addPair("q2", contig = 1, contig2 = Some(2), start1 = 50, start2 = 30, cigar1 = "60M40S", cigar2 = "55M45S", attrs = Map("RX" -> "ACT"))

// primary read pairs for q3, that are duplicates of q2
builder.addPair("q3", contig = 1, contig2 = Some(2), start1 = 50, start2 = 30, cigar1 = "60M40S", cigar2 = "55M45S", attrs = Map("RX" -> "ACT"))

// Fix the mate information and write the input. Note: sorting here to get a template-iterator will write the
// records to disk first, so we cannot use the records in builder and therefore must write to the input file
// directly.
val in = Files.createTempFile("raw_reads", ".sam")
val writer = SamWriter(in, builder.header, sort = builder.sort)
Bams.templateIterator(iterator = builder.iterator, header = builder.header, maxInMemory = Bams.MaxInMemory, tmpDir = Io.tmpDir).foreach { template =>
template.fixMateInfo()
writer ++= template.allReads
}
writer.close()

val out = Files.createTempFile("umi_grouped.", ".sam")
val hist = Files.createTempFile("umi_grouped.", ".histogram.txt")
val gr = new GroupReadsByUmi(input = in, output = out, familySizeHistogram = Some(hist), strategy = Strategy.Adjacency, edits = 1, markDuplicates = true)

gr.markDuplicates shouldBe true
gr.execute()

val recs = readBamRecs(out)
recs.length shouldBe 8
recs.filter(_.name.equals("q1")).forall(_.duplicate == false) shouldBe true
recs.filter(_.name.equals("q2")).forall(_.duplicate == false) shouldBe true
recs.filter(_.name.equals("q3")).forall(_.duplicate == true) shouldBe true
}

it should "correctly group reads with the paired assigner when the two UMIs are the same" in {
val builder = new SamBuilder(readLength=100, sort=Some(SamOrder.Coordinate))
builder.addPair(name="a01", start1=100, start2=300, strand1=Plus, strand2=Minus, attrs=Map("RX" -> "ACT-ACT"))
Expand Down
Loading