MarkDuplicates Missing Optical Duplicates #1441
Comments
Thanks for making this an official issue. If I understand correctly, the effect is localized to a few bands of coordinates where, as your example shows, one read is under the signed-short maximum value and the other is over it. This has been discussed offline for a while, but it was deemed too small an issue to warrant the increase in memory, and thus disk space/file handles, needed to mark duplicates. The effect is limited to the DT tag and, minimally, the resulting library-size estimate; otherwise the duplicate marking is done correctly. Do you have reason to believe that the value of fixing this bug is larger than I make it out to be?
Yes, that's right. In the data I was looking at there were two bands around the 32K and 56K coordinate locations where things went wrong.
My Java is rusty, but it looks like PhysicalLocationShort stores the X/Y coordinates as ints, with the set methods casting the numbers to shorts. So there would not be an increase in memory usage there. The tmp files read and write the coordinates as shorts, so converting them to ints would add 4 bytes per record. That seems a modest increase, though I don't know how it would affect performance in practice.
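For illustration, a minimal sketch of the narrowing pattern under discussion, with the fields shown as shorts to match what the tmp files store (names and fields are simplified from the real picard class):

```java
// Minimal sketch of the narrowing pattern; names and fields are
// simplified from picard's actual PhysicalLocationShort.
class PhysicalLocationShortSketch {
    private short x = -1, y = -1;

    // The setters accept an int but narrow it to a short, so any
    // coordinate above Short.MAX_VALUE (32767) silently wraps negative.
    public void setX(final int x) { this.x = (short) x; }
    public void setY(final int y) { this.y = (short) y; }

    // The getters widen back to int, but the sign has already been lost.
    public int getX() { return this.x; }
    public int getY() { return this.y; }
}
```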
Agreed.
Are you asking me if I think fixing it is worthwhile? If it were software I had written, then yes, I would fix it. Currently MarkDuplicates is silently mislabelling and miscounting optical duplicates and then doing calculations based on those wrong figures. Admittedly the resulting differences are small in comparison with the rest of the data, but the results are still wrong. I would want to maintain the integrity of the results of my own programs. This is starting to sound like a philosophy-of-software-development discussion rather than the simple bug report I had intended, so I will stop now. Thanks for the reply.
Thanks for your perspective. An alternative solution (that I just thought of) is to divide the coordinates by some factor of the radius (say, OPTICAL_DISTANCE_RADIUS/10) prior to writing them to disk, and use 10 as the new distance. This would get around the problem, since I think it only manifests on flowcells where the coordinates have a wide range, and for those flowcells one uses a radius of about 2500. A rough sketch of the idea is below. What do you (and others) think?
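A rough sketch of that scaling proposal (illustrative only, not the actual patch; the class and constant names here are made up):

```java
// Illustrative sketch of the proposed coordinate scaling; not the
// actual patch, and the names here are made up.
class ScaledCoordinates {
    // With OPTICAL_DUPLICATE_PIXEL_DISTANCE = 2500, the divisor is
    // radius / 10 = 250, so coordinates up to 250 * 32767 = 8,191,750
    // still fit in a short after scaling.
    static final int RADIUS = 2500;
    static final int SCALE = RADIUS / 10;

    // Applied before a coordinate is written to the tmp files.
    static short encode(final int coordinate) {
        return (short) (coordinate / SCALE);
    }

    // The distance check then uses 10 as the new effective radius,
    // e.g. encode(33023) = 132 and encode(32760) = 131 now compare
    // as close together instead of 65273 apart.
    static boolean closeEnough(final short a, final short b) {
        return Math.abs(a - b) <= 10;
    }
}
```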
@yfarjoun I think that sounds like a good idea. The other thing that could be done that might help (depending on how big the overflow is) would be to convert to unsigned shorts and back internally rather than signed shorts.
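As a sketch, the unsigned round-trip in Java would look something like this (Short.toUnsignedInt is standard since Java 8):

```java
public class UnsignedShortDemo {
    public static void main(String[] args) {
        short stored = (short) 33023;   // wraps to -32513 as a signed short
        int signed = stored;            // widening keeps the sign: -32513
        int unsigned = Short.toUnsignedInt(stored); // recovers 33023
        System.out.println(signed + " vs " + unsigned);
        // This extends the usable coordinate range from 32767 to 65535,
        // which covers the ~56K coordinates mentioned above, but it is
        // not a general fix.
    }
}
```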
I agree; I think what you said needs to be done in any case. I looked at the code a bit more. The easiest approach is to divide by a factor in the decoder and multiply back in the encoder, but the problem is that not all read-ends pass through the encoder/decoder, so they would be treated differently. The rest of the conversions happen rather implicitly in the various casts, but I'll dig a bit deeper.
Strawman PR here: #1442. Feel free to shoot it down. I don't like the fact that the codec makes changes to the data. I think the correct approach would be to have the PhysicalLocation classes themselves deal with the scaling, but I am unsure where the correct place to make that change would be, and the set of classes that implement PhysicalLocation is crazy large (I blame myself for that...)
Personally, I find it disappointing that such horrible flaws go unfixed for ages. A proper int type should be used right away; the code can then eventually be improved with more trickery.
Bug Report
Affected tool(s)
MarkDuplicates OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 TAGGING_POLICY=All
Affected version(s)
Latest test on 2.21.4-2-ga3021a7-SNAPSHOT. Older versions display the same problem.
Description
For HiSeq X runs, the coordinate part of the query name can be greater than 32K. The coordinates are stored using PhysicalLocationShort, which converts the read-in integer to a short. As you would expect, integers over 32767 become negative. This leads to duplicates that should be identified as physically close together instead being calculated as far apart.
From my example file, the numbers in the distance calculation should look like this:
abs(33023 - 32760) = 263
but instead become:
abs(-32513 - 32760) = 65273
which is rather larger than the optical duplicate distance of 2500.
(This is from the calculations in closeEnough in OpticalDuplicateFinder)
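A self-contained reproduction of the wraparound (closeEnough here is a simplification of the real OpticalDuplicateFinder method, which compares PhysicalLocation objects):

```java
public class OverflowDemo {
    // Simplified stand-in for OpticalDuplicateFinder.closeEnough.
    static boolean closeEnough(final int x1, final int x2, final int distance) {
        return Math.abs(x1 - x2) <= distance;
    }

    public static void main(String[] args) {
        final int a = 33023, b = 32760;

        // Correct: |33023 - 32760| = 263, well within 2500.
        System.out.println(closeEnough(a, b, 2500));                  // true

        // After the narrowing cast, 33023 becomes -32513, and the pair
        // no longer looks optical: |-32513 - 32760| = 65273.
        System.out.println(closeEnough((short) a, (short) b, 2500));  // false
    }
}
```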
Any duplicates with X/Y coordinates around the short-size boundary stand a chance of not being identified as optical duplicates. In the production file I was testing, that amounted to about 22,000 missed optical duplicates. The included example file is based on the real data.
Steps to reproduce
```
MarkDuplicates I=short_test.sam O=short_test_out.sam OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 M=short_test_stat.txt TAGGING_POLICY=All TAG_DUPLICATE_SET_MEMBERS=true
```
Input file (added .txt suffix to allow upload to GitHub):
short_test.sam.txt
Output file:
short_test_out.sam.txt
Expected behavior
The duplicates should be tagged DT:Z:SQ.
Actual behavior
The duplicates are tagged DT:Z:LB.