If R1 and R2 map to same position, treat them as proper pair #319

marcelm · 2023-06-28T10:10:57Z

The first change is to factor out an is_proper_nam_pair() function. This changes behavior slightly as the code wasn’t identical at the two call sites. The condition for the distance was a-b < 2000 in one case and a-b < mu+10*sigma in the other. The factored-out function uses the latter because taking the distribution into account seems to be the intended behavior.

The second change is to treat a pair of NAM as being a "proper pair" even if the query start positions are the same.

Both commits change accuracy slightly. The overall difference is as follows:

dataset	`aed4738`	`cd26078`	difference
drosophila-50	90.1908	90.19065	-0.0002
drosophila-100	92.3875	92.38865	+0.0012
drosophila-200	93.5184	93.52115	+0.0028
drosophila-300	95.3617	95.3612	-0.0005
maize-50	71.47185	71.472	+0.0002
maize-100	87.13235	87.13245	+0.0001
maize-200	92.921	92.92045	-0.0005
maize-300	96.7109	96.7086	-0.0023
CHM13-50	90.6348	90.63435	-0.0004
CHM13-100	93.2195	93.21995	+0.0004
CHM13-200	94.43375	94.43415	+0.0004
CHM13-300	95.62465	95.62665	+0.0020

The case that R1 and R2 map to the same location doesn’t occur in our test datasets as far as I can tell, so it’s expected that the numbers don’t improve. The actual benefit would be on real data.

Closes #317

This changes behavior slightly as the conditions were slightly different at the two call sites (a-b < mu+10*sigma vs a-b < 2000), but it seems to make more sense to have consistent behavior. Also, accuracy is nearly unchanged (+/-0.0001)

Closes #317

ksahlin · 2023-06-28T10:13:52Z

Very nice! Approved.

ksahlin · 2023-06-28T10:18:10Z

btw, the change in cd26078 highlights that we use mu +5sigma and mu +10sigma for proper pair at two different places. Should we unify? If sigma is low or poorly estimated (underestimated), it would be very bad. So we would win on having a lenient cutoff I think.

marcelm · 2023-06-28T11:17:48Z

Sure, I can make that more consistent. There’s also the is_proper_pair_function, which uses mu + 6*sigma:

strobealign/src/sam.cpp

Lines 311 to 321 in aed4738

    
           bool is_proper_pair(const Alignment& sam_aln1, const Alignment& sam_aln2, float mu, float sigma) { 
        
               const int dist = sam_aln2.ref_start - sam_aln1.ref_start; 
        
               const bool same_reference = sam_aln1.ref_id == sam_aln2.ref_id; 
        
               const bool both_aligned = same_reference && !sam_aln1.is_unaligned && !sam_aln2.is_unaligned; 
        
               const bool r1_r2 = !sam_aln1.is_rc && sam_aln2.is_rc && dist >= 0; // r1 ---> <---- r2 
        
               const bool r2_r1 = !sam_aln2.is_rc && sam_aln1.is_rc && dist <= 0; // r2 ---> <---- r1 
        
               const bool rel_orientation_good = r1_r2 || r2_r1; 
        
               const bool insert_good = std::abs(dist) <= mu + 6 * sigma; 
        
               return both_aligned && insert_good && rel_orientation_good; 
        
           }

But that one is only for setting the "properly paired" flag in SAM output, not for making any mapping decisions, so I guess it’s ok if it is a bit more restrictive.

marcelm · 2023-06-28T12:31:43Z

I added 47cb5c2 to make things more consistent. For completeness, these are the accuracy changes from that commit alone:

dataset	`cd26078`	`47cb5c2`	difference
drosophila-50	90.19065	90.189	-0.0016
drosophila-100	92.38865	92.38785	-0.0008
drosophila-200	93.52115	93.52085	-0.0003
drosophila-300	95.3612	95.36085	-0.0003
maize-50	71.472	71.4703	-0.0017
maize-100	87.13245	87.13175	-0.0007
maize-200	92.92045	92.92045	+0.0000
maize-300	96.7086	96.70835	-0.0002
CHM13-50	90.63435	90.635	+0.0007
CHM13-100	93.21995	93.21985	-0.0001
CHM13-200	94.43415	94.43405	-0.0001
CHM13-300	95.62665	95.62665	+0.0000

ksahlin · 2023-06-28T13:22:50Z

But that one is only for setting the "properly paired" flag in SAM output, not for making any mapping decisions, so I guess it’s ok if it is a bit more restrictive.

Ah I see. yes agreed. Most important is that pairs within 10sigma is scored higher than e.g. reads on different chromosomes.

Related: do you think it would make sense / be worth to score read pairs on the same chromosome (but far away) higher than reads on separate chromosomes? Currently I think there is only the binary 'close and proper' or 'not close and proper' classification.

marcelm · 2023-06-30T09:02:05Z

I think we’re on the same page about this PR, so merging it now.

marcelm added 4 commits June 27, 2023 23:10

Factor out an is_proper_nam_pair() function

23b245f

This changes behavior slightly as the conditions were slightly different at the two call sites (a-b < mu+10*sigma vs a-b < 2000), but it seems to make more sense to have consistent behavior. Also, accuracy is nearly unchanged (+/-0.0001)

Update baseline commit

afe08e4

Consider R1 and R2 mapping to same location as proper pair

cd26078

Closes #317

Update baseline commit

35cbc09

marcelm mentioned this pull request Jun 28, 2023

prioritize properly mapped read pairs. #317

Closed

More consistent proper-pair threshold

47cb5c2

Update baseline commit

1c2ee05

marcelm mentioned this pull request Jun 29, 2023

Scoring of read pairs on same vs. different chromosome #321

Closed

marcelm merged commit b3bb190 into main Jun 30, 2023

marcelm deleted the proper-pair branch June 30, 2023 09:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

If R1 and R2 map to same position, treat them as proper pair #319

If R1 and R2 map to same position, treat them as proper pair #319

marcelm commented Jun 28, 2023

ksahlin commented Jun 28, 2023

ksahlin commented Jun 28, 2023 •

edited

Loading

marcelm commented Jun 28, 2023

marcelm commented Jun 28, 2023

ksahlin commented Jun 28, 2023

marcelm commented Jun 30, 2023

If R1 and R2 map to same position, treat them as proper pair #319

If R1 and R2 map to same position, treat them as proper pair #319

Conversation

marcelm commented Jun 28, 2023

ksahlin commented Jun 28, 2023

ksahlin commented Jun 28, 2023 • edited Loading

marcelm commented Jun 28, 2023

marcelm commented Jun 28, 2023

ksahlin commented Jun 28, 2023

marcelm commented Jun 30, 2023

ksahlin commented Jun 28, 2023 •

edited

Loading