Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

If R1 and R2 map to same position, treat them as proper pair #319

Merged
merged 6 commits into from
Jun 30, 2023

Conversation

marcelm
Copy link
Collaborator

@marcelm marcelm commented Jun 28, 2023

The first change is to factor out an is_proper_nam_pair() function. This changes behavior slightly as the code wasn’t identical at the two call sites. The condition for the distance was a-b < 2000 in one case and a-b < mu+10*sigma in the other. The factored-out function uses the latter because taking the distribution into account seems to be the intended behavior.

The second change is to treat a pair of NAM as being a "proper pair" even if the query start positions are the same.

Both commits change accuracy slightly. The overall difference is as follows:

dataset aed4738 cd26078 difference
drosophila-50 90.1908 90.19065 -0.0002
drosophila-100 92.3875 92.38865 +0.0012
drosophila-200 93.5184 93.52115 +0.0028
drosophila-300 95.3617 95.3612 -0.0005
maize-50 71.47185 71.472 +0.0002
maize-100 87.13235 87.13245 +0.0001
maize-200 92.921 92.92045 -0.0005
maize-300 96.7109 96.7086 -0.0023
CHM13-50 90.6348 90.63435 -0.0004
CHM13-100 93.2195 93.21995 +0.0004
CHM13-200 94.43375 94.43415 +0.0004
CHM13-300 95.62465 95.62665 +0.0020

The case that R1 and R2 map to the same location doesn’t occur in our test datasets as far as I can tell, so it’s expected that the numbers don’t improve. The actual benefit would be on real data.

Closes #317

marcelm added 4 commits June 27, 2023 23:10
This changes behavior slightly as the conditions were slightly different at
the two call sites (a-b < mu+10*sigma vs a-b < 2000), but it seems to make
more sense to have consistent behavior. Also, accuracy is nearly unchanged
(+/-0.0001)
@ksahlin
Copy link
Owner

ksahlin commented Jun 28, 2023

Very nice! Approved.

@ksahlin
Copy link
Owner

ksahlin commented Jun 28, 2023

btw, the change in cd26078 highlights that we use mu +5sigma and mu +10sigma for proper pair at two different places. Should we unify? If sigma is low or poorly estimated (underestimated), it would be very bad. So we would win on having a lenient cutoff I think.

@marcelm
Copy link
Collaborator Author

marcelm commented Jun 28, 2023

Sure, I can make that more consistent. There’s also the is_proper_pair_function, which uses mu + 6*sigma:

strobealign/src/sam.cpp

Lines 311 to 321 in aed4738

bool is_proper_pair(const Alignment& sam_aln1, const Alignment& sam_aln2, float mu, float sigma) {
const int dist = sam_aln2.ref_start - sam_aln1.ref_start;
const bool same_reference = sam_aln1.ref_id == sam_aln2.ref_id;
const bool both_aligned = same_reference && !sam_aln1.is_unaligned && !sam_aln2.is_unaligned;
const bool r1_r2 = !sam_aln1.is_rc && sam_aln2.is_rc && dist >= 0; // r1 ---> <---- r2
const bool r2_r1 = !sam_aln2.is_rc && sam_aln1.is_rc && dist <= 0; // r2 ---> <---- r1
const bool rel_orientation_good = r1_r2 || r2_r1;
const bool insert_good = std::abs(dist) <= mu + 6 * sigma;
return both_aligned && insert_good && rel_orientation_good;
}

But that one is only for setting the "properly paired" flag in SAM output, not for making any mapping decisions, so I guess it’s ok if it is a bit more restrictive.

@marcelm
Copy link
Collaborator Author

marcelm commented Jun 28, 2023

I added 47cb5c2 to make things more consistent. For completeness, these are the accuracy changes from that commit alone:

dataset cd26078 47cb5c2 difference
drosophila-50 90.19065 90.189 -0.0016
drosophila-100 92.38865 92.38785 -0.0008
drosophila-200 93.52115 93.52085 -0.0003
drosophila-300 95.3612 95.36085 -0.0003
maize-50 71.472 71.4703 -0.0017
maize-100 87.13245 87.13175 -0.0007
maize-200 92.92045 92.92045 +0.0000
maize-300 96.7086 96.70835 -0.0002
CHM13-50 90.63435 90.635 +0.0007
CHM13-100 93.21995 93.21985 -0.0001
CHM13-200 94.43415 94.43405 -0.0001
CHM13-300 95.62665 95.62665 +0.0000

@ksahlin
Copy link
Owner

ksahlin commented Jun 28, 2023

But that one is only for setting the "properly paired" flag in SAM output, not for making any mapping decisions, so I guess it’s ok if it is a bit more restrictive.

Ah I see. yes agreed. Most important is that pairs within 10sigma is scored higher than e.g. reads on different chromosomes.

Related: do you think it would make sense / be worth to score read pairs on the same chromosome (but far away) higher than reads on separate chromosomes? Currently I think there is only the binary 'close and proper' or 'not close and proper' classification.

@marcelm
Copy link
Collaborator Author

marcelm commented Jun 30, 2023

I think we’re on the same page about this PR, so merging it now.

@marcelm marcelm merged commit b3bb190 into main Jun 30, 2023
@marcelm marcelm deleted the proper-pair branch June 30, 2023 09:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

prioritize properly mapped read pairs.
2 participants