MarkDuplicates Missing Optical Duplicates #1441

Open
whitwham opened this issue Dec 11, 2019 · 7 comments

@whitwham

Bug Report

Affected tool(s)

MarkDuplicates OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 TAGGING_POLICY=All

Affected version(s)

Most recently tested on 2.21.4-2-ga3021a7-SNAPSHOT; older versions display the same problem.

Description

For HiSeq X runs the coordinate part of the query name can be greater than 32K. The coordinates are stored using PhysicalLocationShort, which casts the incoming int to a short. As you would expect, integers over 32767 become negative. This leads to duplicates that should be treated as physically close together instead being calculated as far apart.

From my example file, the numbers in the distance calculation should look like this:
abs(33023 - 32760) = 263
but instead become:
abs(-32513 - 32760) = 65273

That is a bit bigger than the optical duplicate distance of 2500.
(These are the calculations from closeEnough in OpticalDuplicateFinder.)
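
A minimal standalone sketch of the failure mode (illustrative only, not Picard's actual code):

```java
// Demonstrates how casting a flowcell coordinate to short corrupts the
// pixel-distance calculation used for optical-duplicate detection.
public class ShortOverflowDemo {
    public static void main(String[] args) {
        final int x1 = 33023;   // coordinate parsed from the query name
        final int x2 = 32760;

        // Distance computed on the original int coordinates:
        System.out.println(Math.abs(x1 - x2));       // 263 -> optical duplicate

        // Distance after the (short) cast, as in PhysicalLocationShort:
        final short sx1 = (short) x1;                 // wraps to -32513
        System.out.println(Math.abs(sx1 - x2));       // 65273 -> missed
    }
}
```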

Any duplicates with X/Y coordinates around the short size boundary stand a chance of not being identified as optical duplicates. In the production file I was testing on, that amounted to about 22000 missed optical duplicates. The included example file is based on the real data.

Steps to reproduce

MarkDuplicates I=short_test.sam O=short_test_out.sam OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 M=short_test_stat.txt TAGGING_POLICY=All TAG_DUPLICATE_SET_MEMBERS=true

Input file (added .txt suffix to allow upload to GitHub):
short_test.sam.txt
Output file:
short_test_out.sam.txt

Expected behavior

The duplicates should be tagged DT:Z:SQ.

Actual behavior

The duplicates are tagged DT:Z:LB.

@yfarjoun
Contributor

Thanks for making this an official issue.

If I understand correctly, the effect is localized to a few bands of coordinates where, as your example shows, one read is under the signed-short maximum value and the other is over it.

This has been discussed offline for a while, but it was deemed too small an issue to warrant the increase in memory, and thus disk space/file handles, needed to mark duplicates.

The effect will be on the DT tag and, minimally, on the resulting library-size estimate. Otherwise, the duplicate marking will be done correctly. Do you have reason to believe that the value of fixing this bug is larger than I make it out to be?

@whitwham
Author

> If I understand correctly, the effect is localized to a few bands of coordinates where, as your example shows, one read is under the signed-short maximum value and the other is over it.

Yes, that's right. In the data I was looking at there were two bands around the 32K and 56K coordinate locations where things went wrong.

> This has been discussed offline for a while, but it was deemed too small an issue to warrant the increase in memory, and thus disk space/file handles, needed to mark duplicates.

My Java is rusty, but it looks like PhysicalLocationShort is storing the X/Y coordinates as an int, with the set methods casting the numbers into shorts, so there would not be an increase in memory usage there. The tmp files read and write the coordinates as shorts, so converting them to ints would add 4 bytes per record. That seems a modest increase, though I don't know how it would affect performance in practice.
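
For reference, a hedged sketch of the pattern being described (field and method names approximate htsjdk's PhysicalLocationShort; check the actual source for the real layout):

```java
// The accessors take and return int, so callers never see the truncation;
// the (short) cast inside the setter is where the overflow happens.
abstract class PhysicalLocationShortSketch {
    private short x = -1, y = -1;

    public int getX() { return x; }                        // widened on read
    public void setX(final int x) { this.x = (short) x; }  // silent truncation

    public int getY() { return y; }
    public void setY(final int y) { this.y = (short) y; }
}
```

The 4-bytes-per-record estimate follows from the on-disk encoding: a short is 2 bytes and an int is 4, so widening both X and Y adds 2 bytes each.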

> The effect will be on the DT tag and, minimally, on the resulting library-size estimate. Otherwise, the duplicate marking will be done correctly.

Agreed.

> Do you have reason to believe that the value of fixing this bug is larger than I make it out to be?

Are you asking me if I think fixing it is worthwhile? If it were software I had written, then I would say yes, I would fix it. Currently MarkDuplicates is silently mislabelling and miscounting optical duplicates, and it is then doing calculations based on those wrong figures. Admittedly the resulting differences are small in comparison with the rest of the data, but the results are still wrong. I would want to maintain the integrity of the results of my own programs.

This is starting to sound like a philosophy-of-software-development discussion rather than the simple bug report I had intended, so I will stop now. Thanks for the reply.

@yfarjoun
Contributor

Thanks for your perspective.

An alternative solution (that I just thought of) is to divide the coordinates by some factor of the radius (say OPTICAL_DISTANCE_RADIUS/10) prior to writing them to disk, and use 10 as the new distance. This would get around the problem, since I think it only manifests on flowcells where the coordinates have a wide range, and for those flowcells one uses a radius of about 2500...
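
A minimal sketch of that idea (hypothetical names and numbers; this is not the eventual PR):

```java
// Scale coordinates down before the short cast so they always fit, then
// compare against the correspondingly scaled radius. With a scale of 250,
// even 65535 maps to 262, comfortably inside short range. Precision near
// the radius boundary is traded away, which a 2500px radius can tolerate.
class ScaledCoordinateSketch {
    static final int RADIUS = 2500;          // OPTICAL_DUPLICATE_PIXEL_DISTANCE
    static final int SCALE = RADIUS / 10;    // 250
    static final int SCALED_RADIUS = 10;     // the "new distance"

    static short encode(final int coord) {
        return (short) (coord / SCALE);      // 33023 -> 132, 32760 -> 131
    }

    static boolean closeEnough(final short x1, final short y1,
                               final short x2, final short y2) {
        return Math.abs(x1 - x2) <= SCALED_RADIUS
                && Math.abs(y1 - y2) <= SCALED_RADIUS;
    }
}
```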

What do you (and others) think?

cc: @tfenne @jamesemery @droazen

@tfenne
Collaborator

tfenne commented Dec 12, 2019

@yfarjoun I think that sounds like a good idea. The other thing that could be done that might help (depending on how big the overflow is) would be to convert to unsigned shorts and back internally, rather than signed shorts.
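
A sketch of that unsigned round trip (illustrative; this doubles the usable range to 65535 but still wraps beyond that):

```java
public class UnsignedShortDemo {
    public static void main(String[] args) {
        // Store the coordinate's low 16 bits; the bit pattern survives intact.
        final short stored = (short) 33023;               // reads back as -32513

        // Reinterpret as unsigned on the way out instead of sign-extending.
        final int restored = Short.toUnsignedInt(stored); // 33023 again
        // (equivalent to: stored & 0xFFFF)

        System.out.println(Math.abs(restored - 32760));   // 263, as expected
    }
}
```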

@yfarjoun
Contributor

I agree. I think that what you said needs to be done in any case...

I looked at the code a bit more. The easiest fix is to divide by a factor in the decoder and multiply back in the encoder... but the problem is that not all read-ends pass through the encoder/decoder, so they would be treated differently... The rest of the conversions happen rather implicitly in the various casts... but I'll dig a bit deeper.
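
For illustration, one way codec-side scaling could look (hypothetical methods; the real Picard codec is more involved, and as noted above, where to place the scaling is exactly what's at issue):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Sketch: scale at the serialization boundary so the value written as a
// short never overflows. The round trip is lossy (coordinates come back
// rounded down to a multiple of SCALE), acceptable for a 2500px radius.
class ScalingCodecSketch {
    static final int SCALE = 250;

    static void encodeX(final DataOutput out, final int x) throws IOException {
        out.writeShort((short) (x / SCALE));   // 33023 -> 132, fits easily
    }

    static int decodeX(final DataInput in) throws IOException {
        return in.readShort() * SCALE;         // 132 -> 33000
    }
}
```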

@yfarjoun
Contributor

Strawman PR here: #1442

Feel free to shoot it down.

I don't like the fact that the codec makes changes to the data. I think the correct approach would be to have the PhysicalLocation classes themselves deal with the scaling, but I am unsure where the correct place to make that change would be, and the set of classes that implement PhysicalLocation is crazy large (I blame myself for that...).

@mmokrejs

Personally, I find it disappointing that such horrible flaws go unfixed for ages. A proper int type should be used right away, and then the code can eventually be improved with more trickery.
