Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow the PairedAssigner to use UMI pairs where one is absent #954

Merged
merged 3 commits into from
Jan 3, 2024

Conversation

clintval
Copy link
Member

No description provided.

Copy link

codecov bot commented Dec 30, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (9c475fd) 95.62% compared to head (f806f57) 95.62%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #954   +/-   ##
=======================================
  Coverage   95.62%   95.62%           
=======================================
  Files         126      126           
  Lines        7354     7355    +1     
  Branches      502      512   +10     
=======================================
+ Hits         7032     7033    +1     
  Misses        322      322           
Flag Coverage Δ
unittests 95.62% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@nh13
Copy link
Member

nh13 commented Dec 30, 2023

Instead of empty UMIs, why not just use Ns?

@clintval
Copy link
Member Author

clintval commented Dec 30, 2023

@nh13 because an IUPAC N stands for either A, C, G, T and for a paired library preparation in which one end of the source molecule does not have a UMI, then adding an n-length UMI of just Ns would be disingenuous. It will also affect metrics about the UMIs (see DuplexUmiMetric). This PR seems relatively harmless to consider and the output data will be more accurate to the true biochemical situation.

@clintval
Copy link
Member Author

clintval commented Dec 30, 2023

But the real show-stopper is the logic scattered about that removes or ignores reads/templates with Ns in the UMI:

|4. Templates are filtered if any UMI sequence contains one or more `N` bases

.filter(r => !r.get[String](rawTag).exists(_.contains('N')) || { filteredNsInUmi += 1; false })

Instead of regressing on how we choose to filter/handle UMIs, this PR seemed like a compelling alternative. (The other idea is to use a non-N IUPAC code placeholder UMI, but that would be potentially confusing).

@nh13
Copy link
Member

nh13 commented Dec 30, 2023

What do you meant by “none of those”? What other base would the missing UMI be if you could have observed it? Could you update the description to have a better explanation why the UMI is omitted, as clearly I’m missing something?

@nh13
Copy link
Member

nh13 commented Dec 30, 2023

Why not an opt-in option to allow Ns?

@clintval
Copy link
Member Author

clintval commented Dec 30, 2023

Why not an opt-in option to allow Ns?

Because I favored the approach where the data now most closely represents the actual biochemical situation (see first comment here #954 (comment)).

If you want an opt-in option to allow Ns, let me know and I can send up a new PR. If we went that route, then I would allow UMIs with all-Ns but then I would lose the feature to filter out UMIs with Ns for that sole UMI remaining (which I don't really want to do...). I personally favor the approach in this PR because the data is more representative of the actual biochemistry, and all metrics I could possibly generate on the UMIs are going to be more sensible for this type of biochemical situation.

Copy link
Member

@nh13 nh13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit concerned that we are using a regular expression split versus an optimized single-character split. I don't think we need the former. Thoughts?

@@ -530,7 +531,7 @@ class CollectDuplexSeqMetrics
val uniqueTotal = metrics.map(_.unique_observations).sum.toDouble

metrics.foreach { m =>
val Array(umi1, umi2) = m.umi.split('-')
val Array(umi1, umi2) = UmiSeparatorPattern.split(m.umi, -1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have to worry about performance being reduced, since we're moving from an optimized single character split to something that's not? See: https://stackoverflow.com/questions/11001330/java-split-string-performances

Should we implement our own split that adds a trailing empty string if the string ends with the single-char delimiter

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wondered about this and figured it might impart a slowdown. I'm frustrated Java provides a .split() with the option to preserve empty strings but only for regex patterns and not for single character delimiters. Frustrating! Looks like there are few fast alternatives I'll have to compare and contrast. We do have access to Google's Splitter via an htsjdk dependency:

https://guava.dev/releases/snapshot/api/docs/com/google/common/base/Splitter.html#on-java.lang.String-

Copy link
Member Author

@clintval clintval Jan 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out most JDKs have a fastpath!

We have to use a 1-length character string ("-") and index = -1:

String.java
public String[] split(String regex, int limit) {
    /* fastpath if the regex is a
     * (1) one-char String and this character is not one of the
     *     RegEx's meta characters ".$|()[{^?*+\\", or
     * (2) two-char String and the first char is the backslash and
     *     the second is not the ascii digit or ascii letter.
     */
    char ch = 0;
    if (((regex.length() == 1 &&
         ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
         (regex.length() == 2 &&
          regex.charAt(0) == '\\' &&
          (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
          ((ch-'a')|('z'-ch)) < 0 &&
          ((ch-'A')|('Z'-ch)) < 0)) &&
        (ch < Character.MIN_HIGH_SURROGATE ||
         ch > Character.MAX_LOW_SURROGATE))
    {
        int off = 0;
        int next = 0;
        boolean limited = limit > 0;
        ArrayList<String> list = new ArrayList<>();
        while ((next = indexOf(ch, off)) != -1) {
            if (!limited || list.size() < limit - 1) {
                list.add(substring(off, next));
                off = next + 1;
            } else {    // last one
                //assert (list.size() == limit - 1);
                int last = length();
                list.add(substring(off, last));
                off = last;
                break;
            }
        }
        // If no match was found, return this
        if (off == 0)
            return new String[]{this};

        // Add remaining segment
        if (!limited || list.size() < limit)
            list.add(substring(off, length()));

        // Construct result
        int resultSize = list.size();
        if (limit == 0) {
            while (resultSize > 0 && list.get(resultSize - 1).isEmpty()) {
                resultSize--;
            }
        }
        String[] result = new String[resultSize];
        return list.subList(0, resultSize).toArray(result);
    }
    return Pattern.compile(regex).split(this, limit);
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels good, thank-you for researching!

@clintval clintval requested a review from nh13 January 3, 2024 03:40
@clintval clintval assigned nh13 and unassigned tfenne Jan 3, 2024
@clintval
Copy link
Member Author

clintval commented Jan 3, 2024

@nh13 back to you!

Copy link
Member

@nh13 nh13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank-you!

@clintval clintval merged commit 97a9e06 into main Jan 3, 2024
6 checks passed
@clintval clintval deleted the cv_absent_umi_in_paired branch January 3, 2024 17:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants