Allow the PairedAssigner to use UMI pairs where one is absent #954

clintval · 2023-12-30T19:53:43Z

No description provided.

codecov · 2023-12-30T19:55:02Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (9c475fd) 95.62% compared to head (f806f57) 95.62%.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #954   +/-   ##
=======================================
  Coverage   95.62%   95.62%           
=======================================
  Files         126      126           
  Lines        7354     7355    +1     
  Branches      502      512   +10     
=======================================
+ Hits         7032     7033    +1     
  Misses        322      322

Flag	Coverage Δ
unittests	`95.62% <100.00%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

nh13 · 2023-12-30T22:00:34Z

Instead of empty UMIs, why not just use Ns?

clintval · 2023-12-30T22:18:38Z

@nh13 because an IUPAC N stands for either A, C, G, T and for a paired library preparation in which one end of the source molecule does not have a UMI, then adding an n-length UMI of just Ns would be disingenuous. It will also affect metrics about the UMIs (see DuplexUmiMetric). This PR seems relatively harmless to consider and the output data will be more accurate to the true biochemical situation.

clintval · 2023-12-30T22:25:38Z

But the real show-stopper is the logic scattered about that removes or ignores reads/templates with Ns in the UMI:

fgbio/src/main/scala/com/fulcrumgenomics/umi/GroupReadsByUmi.scala

Line 517 in 9d5a422

|4. Templates are filtered if any UMI sequence contains one or more `N` bases

fgbio/src/main/scala/com/fulcrumgenomics/umi/GroupReadsByUmi.scala

Line 654 in 9d5a422

    
           .filter(r => !r.get[String](rawTag).exists(_.contains('N'))                   || { filteredNsInUmi += 1; false })

Instead of regressing on how we choose to filter/handle UMIs, this PR seemed like a compelling alternative. (The other idea is to use a non-N IUPAC code placeholder UMI, but that would be potentially confusing).

nh13 · 2023-12-30T22:25:56Z

What do you meant by “none of those”? What other base would the missing UMI be if you could have observed it? Could you update the description to have a better explanation why the UMI is omitted, as clearly I’m missing something?

nh13 · 2023-12-30T22:28:57Z

Why not an opt-in option to allow Ns?

clintval · 2023-12-30T22:31:58Z

Why not an opt-in option to allow Ns?

Because I favored the approach where the data now most closely represents the actual biochemical situation (see first comment here #954 (comment)).

If you want an opt-in option to allow Ns, let me know and I can send up a new PR. If we went that route, then I would allow UMIs with all-Ns but then I would lose the feature to filter out UMIs with Ns for that sole UMI remaining (which I don't really want to do...). I personally favor the approach in this PR because the data is more representative of the actual biochemistry, and all metrics I could possibly generate on the UMIs are going to be more sensible for this type of biochemical situation.

nh13

I am a bit concerned that we are using a regular expression split versus an optimized single-character split. I don't think we need the former. Thoughts?

nh13 · 2024-01-03T00:09:52Z

src/main/scala/com/fulcrumgenomics/umi/CollectDuplexSeqMetrics.scala

@@ -530,7 +531,7 @@ class CollectDuplexSeqMetrics
    val uniqueTotal      = metrics.map(_.unique_observations).sum.toDouble

    metrics.foreach { m =>
-      val Array(umi1, umi2)                   = m.umi.split('-')
+      val Array(umi1, umi2)                   = UmiSeparatorPattern.split(m.umi, -1)


Do we have to worry about performance being reduced, since we're moving from an optimized single character split to something that's not? See: https://stackoverflow.com/questions/11001330/java-split-string-performances

Should we implement our own split that adds a trailing empty string if the string ends with the single-char delimiter

I wondered about this and figured it might impart a slowdown. I'm frustrated Java provides a .split() with the option to preserve empty strings but only for regex patterns and not for single character delimiters. Frustrating! Looks like there are few fast alternatives I'll have to compare and contrast. We do have access to Google's Splitter via an htsjdk dependency:

https://guava.dev/releases/snapshot/api/docs/com/google/common/base/Splitter.html#on-java.lang.String-

Turns out most JDKs have a fastpath!

We have to use a 1-length character string ("-") and index = -1:

String.java

public String[] split(String regex, int limit) { /* fastpath if the regex is a * (1) one-char String and this character is not one of the * RegEx's meta characters ".$|()[{^?*+\\", or * (2) two-char String and the first char is the backslash and * the second is not the ascii digit or ascii letter. */ char ch = 0; if (((regex.length() == 1 && ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) || (regex.length() == 2 && regex.charAt(0) == '\\' && (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 && ((ch-'a')|('z'-ch)) < 0 && ((ch-'A')|('Z'-ch)) < 0)) && (ch < Character.MIN_HIGH_SURROGATE || ch > Character.MAX_LOW_SURROGATE)) { int off = 0; int next = 0; boolean limited = limit > 0; ArrayList<String> list = new ArrayList<>(); while ((next = indexOf(ch, off)) != -1) { if (!limited || list.size() < limit - 1) { list.add(substring(off, next)); off = next + 1; } else { // last one //assert (list.size() == limit - 1); int last = length(); list.add(substring(off, last)); off = last; break; } } // If no match was found, return this if (off == 0) return new String[]{this}; // Add remaining segment if (!limited || list.size() < limit) list.add(substring(off, length())); // Construct result int resultSize = list.size(); if (limit == 0) { while (resultSize > 0 && list.get(resultSize - 1).isEmpty()) { resultSize--; } } String[] result = new String[resultSize]; return list.subList(0, resultSize).toArray(result); } return Pattern.compile(regex).split(this, limit); }

This feels good, thank-you for researching!

clintval · 2024-01-03T03:41:11Z

@nh13 back to you!

nh13

Thank-you!

Allow the PairedAssigner to use UMI pairs where one is absent

d510954

clintval requested a review from tfenne December 30, 2023 19:53

clintval assigned tfenne Dec 30, 2023

Let the DuplexConsensusCaller support UMI pairs with one absent

9d5a422

nh13 requested changes Jan 3, 2024

View reviewed changes

feat: fallback to fast string splitting

f806f57

clintval requested a review from nh13 January 3, 2024 03:40

clintval assigned nh13 and unassigned tfenne Jan 3, 2024

nh13 approved these changes Jan 3, 2024

View reviewed changes

clintval merged commit 97a9e06 into main Jan 3, 2024
6 checks passed

clintval deleted the cv_absent_umi_in_paired branch January 3, 2024 17:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow the PairedAssigner to use UMI pairs where one is absent #954

Allow the PairedAssigner to use UMI pairs where one is absent #954

clintval commented Dec 30, 2023

codecov bot commented Dec 30, 2023 •

edited

Loading

nh13 commented Dec 30, 2023

clintval commented Dec 30, 2023 •

edited

Loading

clintval commented Dec 30, 2023 •

edited

Loading

nh13 commented Dec 30, 2023

nh13 commented Dec 30, 2023

clintval commented Dec 30, 2023 •

edited

Loading

nh13 left a comment

nh13 Jan 3, 2024

clintval Jan 3, 2024

clintval Jan 3, 2024 •

edited

Loading

nh13 Jan 3, 2024

clintval commented Jan 3, 2024

nh13 left a comment

Allow the PairedAssigner to use UMI pairs where one is absent #954

Allow the PairedAssigner to use UMI pairs where one is absent #954

Conversation

clintval commented Dec 30, 2023

codecov bot commented Dec 30, 2023 • edited Loading

Codecov Report

nh13 commented Dec 30, 2023

clintval commented Dec 30, 2023 • edited Loading

clintval commented Dec 30, 2023 • edited Loading

nh13 commented Dec 30, 2023

nh13 commented Dec 30, 2023

clintval commented Dec 30, 2023 • edited Loading

nh13 left a comment

Choose a reason for hiding this comment

nh13 Jan 3, 2024

Choose a reason for hiding this comment

clintval Jan 3, 2024

Choose a reason for hiding this comment

clintval Jan 3, 2024 • edited Loading

Choose a reason for hiding this comment

nh13 Jan 3, 2024

Choose a reason for hiding this comment

clintval commented Jan 3, 2024

nh13 left a comment

Choose a reason for hiding this comment

codecov bot commented Dec 30, 2023 •

edited

Loading

clintval commented Dec 30, 2023 •

edited

Loading

clintval commented Dec 30, 2023 •

edited

Loading

clintval commented Dec 30, 2023 •

edited

Loading

clintval Jan 3, 2024 •

edited

Loading