Partial fix of coordinate casting to short in MarkDuplicates #1442
base: master
Conversation
- fixed short overflow issue in MarkDuplicates by internally dividing
@jamesemery this will require a parallel change to GATK's MarkDuplicatesSpark to keep them in sync...right?
@yfarjoun Yes, we will. This shouldn't be too hard though, as the optical duplicate finder is already called into by Spark, so it would simply be a matter of making sure the scale factor is applied and porting the test. I'll make a ticket to track it, and when we bump our picard version we can push this change.
@jamesemery could you review this PR?
It looks good. It doesn't fix the overall overflow issue, it just mitigates it by a factor of DEFAULT_OPTICAL_DUPLICATE_DISTANCE, but the idea is sound. I do think we should be a little more careful about how and why we pick a particular scaling value, though.
@@ -483,14 +483,15 @@ private void buildSortedReadEndLists(final boolean useBarcodes) {
        log.info("Will retain up to " + maxInMemory + " data points before spilling to disk.");

        final ReadEndsForMarkDuplicatesCodec fragCodec, pairCodec, diskCodec;
        final double scale = OPTICAL_DUPLICATE_PIXEL_DISTANCE / (double) OpticalDuplicateFinder.DEFAULT_OPTICAL_DUPLICATE_DISTANCE;
In the typical use case this might actually make a difference, since we are compressing the pixel distance by 2500/100, or 25, and then int scaling, which will end up introducing approximately a 1% truncation in every case. That is probably fine given how unlikely it is to substantially change the optical duplicate finding. I do not, however, think this should be defined by OpticalDuplicateFinder.DEFAULT_OPTICAL_DUPLICATE_DISTANCE; we should use some other, fixed variable instead, since the error range shouldn't really have anything to do with the default pixel distance.
It would also seem prudent to ensure that we don't ever set a value below 1 here (i.e. we don't ever want to "inflate" the pixel coordinates if the user selects a pixel distance < 100).
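To make the truncation concrete, here is a small self-contained sketch (illustrative only, not Picard code; the constants and names are assumed) showing how dividing a raw pixel coordinate by the scale and truncating loses up to one scale unit of resolution:

```java
// Hypothetical demo of the truncation discussed above (not Picard's code).
public class ScaleTruncationDemo {
    public static void main(String[] args) {
        final double scale = 2500 / 100.0;          // e.g. 2500px distance vs a 100px reference
        final int rawX = 10_049;                    // an example raw pixel coordinate
        final int scaledX = (int) (rawX / scale);   // truncating division
        final double recovered = scaledX * scale;   // what the scaled value represents
        // The truncation error is bounded by one scale unit (here, up to 25 pixels),
        // which is ~1% of a 2500px duplicate distance.
        System.out.println(scaledX);                // 401
        System.out.println(rawX - (int) recovered); // 24
    }
}
```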
I was trying to think of a scaling factor that keeps the default behaviour the same....I guess I could add an additional argument, but I thought this admitted hack was good enough....if there are other ideas, I'm happy to consider them.
That seems reasonable. I do think it seems bizarre that if we were to change the default to, say, 2500 (which I think is the standard for most of our pipelines), we would be compressing by a factor of 25 less than we were before, which could cause confusion or unintended consequences. I think we should make a separate variable that pegs the value to 100 to make it a little clearer why we are doing it this way. I still think we should avoid ever scaling by a fractional value, however.
final double scale = Math.max(OPTICAL_DUPLICATE_PIXEL_DISTANCE/(double) OpticalDuplicateFinder.OPTICAL_DUPLICATE_PIXEL_SCALING_FACTOR, 1.0);
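A hedged sketch of the clamped scale computation (the reference-constant name is assumed here; Picard's actual field may differ), showing that the `Math.max(..., 1.0)` clamp compresses large pixel distances but never inflates coordinates when the configured distance falls below the 100px reference:

```java
// Illustrative only: demonstrates the clamp behavior, not Picard's exact code.
public class ClampedScaleDemo {
    // Assumed constant pegged to 100, per the review discussion.
    static final int PIXEL_SCALING_REFERENCE = 100;

    static double scaleFor(int opticalDuplicatePixelDistance) {
        return Math.max(opticalDuplicatePixelDistance / (double) PIXEL_SCALING_REFERENCE, 1.0);
    }

    public static void main(String[] args) {
        System.out.println(scaleFor(2500)); // 25.0 -> compresses coordinates
        System.out.println(scaleFor(50));   // 1.0  -> never inflates them
        // Coordinates are divided by the scale before the short cast, so
        // large raw values fit in a short:
        short x = (short) (100_000 / scaleFor(2500));
        System.out.println(x);              // 4000
    }
}
```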
@jamesemery another look?
I think the way you reimplemented the scaling is more generalized and addresses my comment about default values. I have only one comment about the math. I also wanted to ask whether we ever use raw ints in this method anymore (it is somewhat difficult to tell given the number of PhysicalLocation objects and tools that use them), in case we ever need to perform the operation on two ints. If that is the case, then we are losing some resolution here by forcing everything into short space. I leave that decision up to you though.
    }

    private boolean closeEnoughShort(final PhysicalLocation lhs, final PhysicalLocation rhs, final int distance) {
        return lhs != rhs &&
                Math.abs(lhs.getX() - rhs.getX()) <= distance &&
                Math.abs(lhs.getY() - rhs.getY()) <= distance;
        // Since the X and Y coordinates are constrained to short values, they can also become negative when
Hmm... this method should have been named something less overloaded, like "closeEnoughQuick", to make it clearer that it checks fewer things, not short-casted things, given that the underlying values could be ints or shorts. Now that you have made this change, though, perhaps the name is a correct description, as the method will now always check adjacency in short space even if the underlying PhysicalLocation still has integers.
I had to do some arithmetic to convince myself that this works. Assuming your goal is to make 32768 and 32767 adjacent without disturbing the 65536 and 65535 adjacency, I think this succeeded. However, I do not think it is necessary to perform the operation forwards and backwards: the absolute value operation (even accounting for overflow from the shorts) is not changed by flipping the order, by virtue of the fact that this calculation is performed in integer space whereas the underlying data overflowed in short space. Unless your scaling factor is in the region of Short.MAX_VALUE/2, you should never have an overflow here, so you should be able to get away without flipping the comparison.
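A small sketch of the arithmetic in this comment (illustrative, not Picard's code): once the short coordinates are widened to int, the absolute difference is symmetric, so only one direction needs checking, and the wraparound case can be recovered by subtracting from 2^16:

```java
// Demonstrates the overflow reasoning above with concrete values.
public class ShortDistanceDemo {
    public static void main(String[] args) {
        short a = (short) 32768; // overflows to -32768 in short space
        short b = (short) 32767; // stays 32767
        // Operands are promoted to int, so no int overflow occurs here,
        // and abs() is symmetric regardless of operand order.
        int forward  = Math.abs(a - b);
        int backward = Math.abs(b - a);
        System.out.println(forward);             // 65535
        System.out.println(forward == backward); // true
        // Wraparound-aware view: the two original values are actually 1 apart.
        int wrapped = (1 << 16) - forward;
        System.out.println(wrapped);             // 1
    }
}
```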
Fixed the short overflow issue in MarkDuplicates by internally dividing the values so that they are not so large, and by adding a smarter distance metric that is aware of the possibility of short overflow.
Included example reads from MarkDuplicates Missing Optical Duplicates #1441 as a test.
Description
Checklist (never delete this)
Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.
Content
Review
For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests