-
-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow the PairedAssigner to use UMI pairs where one is absent #954
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #954 +/- ##
=======================================
Coverage 95.62% 95.62%
=======================================
Files 126 126
Lines 7354 7355 +1
Branches 502 512 +10
=======================================
+ Hits 7032 7033 +1
Misses 322 322
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Instead of empty UMIs, why not just use Ns? |
@nh13 because an IUPAC N stands for either A, C, G, T and for a paired library preparation in which one end of the source molecule does not have a UMI, then adding an n-length UMI of just Ns would be disingenuous. It will also affect metrics about the UMIs (see |
But the real show-stopper is the logic scattered about that removes or ignores reads/templates with Ns in the UMI:
Instead of regressing on how we choose to filter/handle UMIs, this PR seemed like a compelling alternative. (The other idea is to use a non-N IUPAC code placeholder UMI, but that would be potentially confusing). |
What do you meant by “none of those”? What other base would the missing UMI be if you could have observed it? Could you update the description to have a better explanation why the UMI is omitted, as clearly I’m missing something? |
Why not an opt-in option to allow Ns? |
Because I favored the approach where the data now most closely represents the actual biochemical situation (see first comment here #954 (comment)). If you want an opt-in option to allow Ns, let me know and I can send up a new PR. If we went that route, then I would allow UMIs with all-Ns but then I would lose the feature to filter out UMIs with Ns for that sole UMI remaining (which I don't really want to do...). I personally favor the approach in this PR because the data is more representative of the actual biochemistry, and all metrics I could possibly generate on the UMIs are going to be more sensible for this type of biochemical situation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit concerned that we are using a regular expression split versus an optimized single-character split. I don't think we need the former. Thoughts?
@@ -530,7 +531,7 @@ class CollectDuplexSeqMetrics | |||
val uniqueTotal = metrics.map(_.unique_observations).sum.toDouble | |||
|
|||
metrics.foreach { m => | |||
val Array(umi1, umi2) = m.umi.split('-') | |||
val Array(umi1, umi2) = UmiSeparatorPattern.split(m.umi, -1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have to worry about performance being reduced, since we're moving from an optimized single character split to something that's not? See: https://stackoverflow.com/questions/11001330/java-split-string-performances
Should we implement our own split
that adds a trailing empty string if the string ends with the single-char delimiter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wondered about this and figured it might impart a slowdown. I'm frustrated Java provides a .split()
with the option to preserve empty strings but only for regex patterns and not for single character delimiters. Frustrating! Looks like there are few fast alternatives I'll have to compare and contrast. We do have access to Google's Splitter
via an htsjdk dependency:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out most JDKs have a fastpath!
We have to use a 1-length character string ("-"
) and index = -1
:
String.java
public String[] split(String regex, int limit) {
/* fastpath if the regex is a
* (1) one-char String and this character is not one of the
* RegEx's meta characters ".$|()[{^?*+\\", or
* (2) two-char String and the first char is the backslash and
* the second is not the ascii digit or ascii letter.
*/
char ch = 0;
if (((regex.length() == 1 &&
".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
(regex.length() == 2 &&
regex.charAt(0) == '\\' &&
(((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
((ch-'a')|('z'-ch)) < 0 &&
((ch-'A')|('Z'-ch)) < 0)) &&
(ch < Character.MIN_HIGH_SURROGATE ||
ch > Character.MAX_LOW_SURROGATE))
{
int off = 0;
int next = 0;
boolean limited = limit > 0;
ArrayList<String> list = new ArrayList<>();
while ((next = indexOf(ch, off)) != -1) {
if (!limited || list.size() < limit - 1) {
list.add(substring(off, next));
off = next + 1;
} else { // last one
//assert (list.size() == limit - 1);
int last = length();
list.add(substring(off, last));
off = last;
break;
}
}
// If no match was found, return this
if (off == 0)
return new String[]{this};
// Add remaining segment
if (!limited || list.size() < limit)
list.add(substring(off, length()));
// Construct result
int resultSize = list.size();
if (limit == 0) {
while (resultSize > 0 && list.get(resultSize - 1).isEmpty()) {
resultSize--;
}
}
String[] result = new String[resultSize];
return list.subList(0, resultSize).toArray(result);
}
return Pattern.compile(regex).split(this, limit);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels good, thank-you for researching!
@nh13 back to you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank-you!
No description provided.