MarkDuplicates Missing Optical Duplicates #1441

Open
whitwham opened this issue Dec 11, 2019 · 7 comments

@whitwham

Bug Report

Affected tool(s)

MarkDuplicates OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 TAGGING_POLICY=All

Affected version(s)

Most recently tested on 2.21.4-2-ga3021a7-SNAPSHOT; older versions display the same problem.

Description

For HiSeq X runs the coordinate part of the query name can be greater than 32K. The coordinates are stored using PhysicalLocationShort, which casts the incoming int to a short. As you would expect, integers over 32767 become negative. This leads to duplicates that should be treated as physically close together instead being calculated as far apart.

From my example file, the numbers in the distance calculation should look like this:
abs(33023 - 32760) = 263
but instead become:
abs(-32513 - 32760) = 65273

That is a bit bigger than the optical duplicate distance of 2500.
(These are the calculations from closeEnough in OpticalDuplicateFinder.)
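
A minimal standalone sketch of the failure mode (illustrative only, not Picard's actual code):

```java
// Demonstrates how casting a flowcell coordinate to short corrupts the
// pixel-distance calculation used for optical-duplicate detection.
public class ShortOverflowDemo {
    public static void main(String[] args) {
        final int x1 = 33023;   // coordinate parsed from the query name
        final int x2 = 32760;

        // Distance computed on the original int coordinates:
        System.out.println(Math.abs(x1 - x2));       // 263 -> optical duplicate

        // Distance after the (short) cast, as in PhysicalLocationShort:
        final short sx1 = (short) x1;                 // wraps to -32513
        System.out.println(Math.abs(sx1 - x2));       // 65273 -> missed
    }
}
```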

Any duplicates with X/Y coordinates around the short size boundary stand a chance of not being identified as optical duplicates. In the production file I was testing on, that amounted to about 22000 missed optical duplicates. The included example file is based on the real data.

Steps to reproduce

MarkDuplicates I=short_test.sam O=short_test_out.sam OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 M=short_test_stat.txt TAGGING_POLICY=All TAG_DUPLICATE_SET_MEMBERS=true

Input file (added .txt suffix to allow upload to GitHub):
short_test.sam.txt
Output file:
short_test_out.sam.txt

Expected behavior

The duplicates should be tagged DT:Z:SQ.

Actual behavior

The duplicates are tagged DT:Z:LB.

@yfarjoun
Contributor

Thanks for making this an official issue.

If I understand correctly, the effect is localized to a few bands of coordinates where, as your example shows, one read is under the signed-short maximum value and the other is over it.

This has been discussed offline for a while, but it was deemed too small an issue to warrant the increase in memory, and thus disk space/file handles, needed to mark duplicates.

The effect will be on the DT tag and, minimally, on the resulting library-size estimate. Otherwise, the duplicate marking will be done correctly. Do you have reason to believe that the value of fixing this bug is larger than I make it out to be?

@whitwham
Author

> If I understand correctly, the effect is localized to a few bands of coordinates where, as your example shows, one read is under the signed-short maximum value and the other is over it.

Yes, that's right. In the data I was looking at there were two bands around the 32K and 56K coordinate locations where things went wrong.

> This has been discussed offline for a while, but it was deemed too small an issue to warrant the increase in memory, and thus disk space/file handles, needed to mark duplicates.

My Java is rusty, but it looks like PhysicalLocationShort is storing the X/Y coordinates as an int, with the set methods casting the numbers into shorts, so there would not be an increase in memory usage there. The tmp files read and write the coordinates as shorts, so converting them to ints would add 4 bytes per record. That seems a modest increase, though I don't know how it would affect performance in practice.
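
For reference, a hedged sketch of the pattern being described (field and method names approximate htsjdk's PhysicalLocationShort; check the actual source for the real layout):

```java
// The accessors take and return int, so callers never see the truncation;
// the (short) cast inside the setter is where the overflow happens.
abstract class PhysicalLocationShortSketch {
    private short x = -1, y = -1;

    public int getX() { return x; }                        // widened on read
    public void setX(final int x) { this.x = (short) x; }  // silent truncation

    public int getY() { return y; }
    public void setY(final int y) { this.y = (short) y; }
}
```

The 4-bytes-per-record estimate follows from the on-disk encoding: a short is 2 bytes and an int is 4, so widening both X and Y adds 2 bytes each.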

> The effect will be on the DT tag and, minimally, on the resulting library-size estimate. Otherwise, the duplicate marking will be done correctly.

Agreed.

> Do you have reason to believe that the value of fixing this bug is larger than I make it out to be?

Are you asking me if I think fixing it is worthwhile? If it were software I had written, then I would say yes, I would fix it. Currently MarkDuplicates is silently mislabelling and miscounting optical duplicates, and it is then doing calculations based on those wrong figures. Admittedly the resulting differences are small in comparison with the rest of the data, but the results are still wrong. I would want to maintain the integrity of the results of my own programs.

This is starting to sound like a philosophy-of-software-development discussion rather than the simple bug report I had intended, so I will stop now. Thanks for the reply.

@yfarjoun
Contributor

Thanks for your perspective.

An alternative solution (that I just thought of) is to divide the coordinates by some factor of the radius (say OPTICAL_DISTANCE_RADIUS/10) prior to writing them to disk, and use 10 as the new distance. This would get around the problem, since I think it only manifests on flowcells where the coordinates have a wide range, and for those flowcells one uses a radius of about 2500...
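
A minimal sketch of that idea (hypothetical names and numbers; this is not the eventual PR):

```java
// Scale coordinates down before the short cast so they always fit, then
// compare against the correspondingly scaled radius. With a scale of 250,
// even 65535 maps to 262, comfortably inside short range. Precision near
// the radius boundary is traded away, which a 2500px radius can tolerate.
class ScaledCoordinateSketch {
    static final int RADIUS = 2500;          // OPTICAL_DUPLICATE_PIXEL_DISTANCE
    static final int SCALE = RADIUS / 10;    // 250
    static final int SCALED_RADIUS = 10;     // the "new distance"

    static short encode(final int coord) {
        return (short) (coord / SCALE);      // 33023 -> 132, 32760 -> 131
    }

    static boolean closeEnough(final short x1, final short y1,
                               final short x2, final short y2) {
        return Math.abs(x1 - x2) <= SCALED_RADIUS
                && Math.abs(y1 - y2) <= SCALED_RADIUS;
    }
}
```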

What do you (and others) think?

cc: @tfenne @jamesemery @droazen

@tfenne
Collaborator

tfenne commented Dec 12, 2019

@yfarjoun I think that sounds like a good idea. The other thing that could be done that might help (depending on how big the overflow is) would be to convert to unsigned shorts and back internally, rather than signed shorts.
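
A sketch of that unsigned round trip (illustrative; this doubles the usable range to 65535 but still wraps beyond that):

```java
public class UnsignedShortDemo {
    public static void main(String[] args) {
        // Store the coordinate's low 16 bits; the bit pattern survives intact.
        final short stored = (short) 33023;               // reads back as -32513

        // Reinterpret as unsigned on the way out instead of sign-extending.
        final int restored = Short.toUnsignedInt(stored); // 33023 again
        // (equivalent to: stored & 0xFFFF)

        System.out.println(Math.abs(restored - 32760));   // 263, as expected
    }
}
```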

@yfarjoun
Contributor

I agree. I think that what you said needs to be done in any case...

I looked at the code a bit more. The easiest fix is to divide by a factor in the decoder and multiply back in the encoder... but the problem is that not all read-ends pass through the encoder/decoder, so they would be treated differently... The rest of the conversions happen rather implicitly in the various casts... but I'll dig a bit deeper.
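
For illustration, one way codec-side scaling could look (hypothetical methods; the real Picard codec is more involved, and as noted above, where to place the scaling is exactly what's at issue):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Sketch: scale at the serialization boundary so the value written as a
// short never overflows. The round trip is lossy (coordinates come back
// rounded down to a multiple of SCALE), acceptable for a 2500px radius.
class ScalingCodecSketch {
    static final int SCALE = 250;

    static void encodeX(final DataOutput out, final int x) throws IOException {
        out.writeShort((short) (x / SCALE));   // 33023 -> 132, fits easily
    }

    static int decodeX(final DataInput in) throws IOException {
        return in.readShort() * SCALE;         // 132 -> 33000
    }
}
```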

@yfarjoun
Contributor

Strawman PR here: #1442

Feel free to shoot it down.

I don't like the fact that the codec makes changes to the data. I think the correct approach would be to have the PhysicalLocation classes themselves deal with the scaling, but I am unsure where the correct place to make that change would be, and the set of classes that implement PhysicalLocation is crazy large (I blame myself for that...).

@mmokrejs

Personally, I find it disappointing that such horrible flaws go unfixed for ages. A proper int type should be used right away, and then the code can eventually be improved with more trickery.
