-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
prioritize properly mapped read pairs. #317
Comments
Excellent with these kind of examples, thank you! Could we also get the reference sequences you are mapping to? it would be easier to fix if we can reproduce it. Also, do you have an idea of what the expected alignment should be? For example: |
Btw, I should say that strobealign prioritises properly mapped reads, so this is likely a bug or some weird corner case. |
Hi @ksahlin, thank you for your reply. The testing file is attached. |
R1 and R2 are reverse and complement, so they should be mapped to exact the same reference position, rather than two different positions. |
I see. Thanks for providing us with data. We will have a look at what causes this. Currently strobealign checks if two reads within a pair is proper or not. This is based on deviation from insert size, relative orientation etc. A guess is that the paired mapping to the same chromosome was too far out the insert size threshold and was deemed 'not proper pair'. All non-proper pairs are scored the same, which could be why this happens. We could have a more nuanced approach that also takes the closest ones, occurring on the same chromosome, etc if all are non-proper. |
Thank you for the information. But in this example, the distance of read pair is very close and not 'far away'. I am curious why this one was deemed 'not proper pair'. |
Yes, I hope @marcelm can take a look in the coming week. I should have clarified it better: I meant too far away from the mean of the fragment size distribution, i.e either |
Interesting! I debugged this and found the following. Both reads produce 316 seeds (NAMs) and both of the lists contain both of the mapping locations that we see in the SAM output (the one on This is the relevant code: Lines 612 to 618 in 93be0c2
All except one condition are fulfilled:
The result is that neither Also, we should think about the case that reads overlap in this way:
That is, the TLEN is shorter than the read length. Then the read ends should contain adapter sequence and should not contain seeds, but possibly it could happen by chance that a seed extends a little bit into the adapter. Note that the above does not check whether TLEN is |
Great, so <= and >= sounds good for fixing this.
Is this something that happens with PE reads? I did not implement a check for |
A molecule prepared for paired-end sequencing on an Illumina sequencer will look like this:
The arrows show where R1 and R2 are read off: When reading R1, the sequencer starts just after the 5'-sequencing-adapter. When reading R2, it starts on the opposite strand right before the 3'-sequencing-adapter. Usually, the insert is so long that R1 is a prefix of the insert and R2 is a prefix of the reverse-complemented insert. But if the insert is short, R1 contains the entire insert plus the beginning of the 3' adapter, and R2 contains the entire reverse-complemented insert plus the beginning of the reverse-complemented 5' adapter:
When adapters have been removed using a program such as Cutadapt, R1 should be identical to the insert and R2 identical to the reverse-complemented insert and we get the case this bug report is about. But it would also be nice to handle the case with adapters still attached. We would just need to fix this when testing whether the pairing of reads makes sense. At the alignment stage, the adapters would then be soft clipped. On another note: Maybe it would be good to change strobealign such that it actually makes computations based on the insert size (TLEN). Currently, it uses the difference between the start positions of R1 and R2 on the reference, but the insert size would be the distance between the 5' ends of the two reads. This could be an issue when processing reads of varying lengths because the estimated insert size distribution will be a bit off. |
This bug remind me the similar issue in bwa (lh3/bwa#259). |
cutadapt can get rid of most of the 'readthrough', but still some exceptions.
|
Yes, sure. Sorry if it sounded like it; I didn’t want to suggest that using cutadapt would be the solution here. This needs to be fixed in strobealign. |
Okay @marcelm , you know more about me on this one so I'll let you decide. I thought |
I understand and remember being confused about this as well. Quoting from http://thegenomefactory.blogspot.com/2013/08/paired-end-read-confusion-library.html:
"Fragment size" appears to be a bit ambiguous because for some people, the fragment includes the adapters at either end, but I consider the terms insert size, fragment size and TLEN to mean the same thing. |
With the fix suggested in #319, SAM output for the test read pair is as follows:
|
Hi @marcelm, I quote the specifications of TLEN in the sam format document, which might be helpful.
![]() In this criteria, the TLEN equal to the distance between R1 original start (R1 mapping start) and R2 original start (R2 mappinig end). But I found some tools (such as hisat2) calculate the TLEN as the distance of R1 mapping start and R2 mapping start. |
Thanks for the pointer. So if we come back to the terms, only the older TLEN definition ("TLEN#1") would be equivalent to "insert size". (And of course TLEN can be negative as well.) I fixed the TLEN computation in strobealign a while ago. I did not take into account the ambiguous scenario above at the time, but I checked the code again just now and it appears that we actually compute TLEN#2. That said, the scenario cannot arise in strobealign at the moment because reads that overlap in such a way are not treated as a proper pair. This issue here is about the special case that R1 is identical to the reverse complement of R2, which will be fixed by #319, so to keep track of the more general issue of allowing the above "ambiguous scenario", I have opened issue #320. I suggest we continue discussion over there if necessary. |
Hi @marcelm, I would like to thank you for this important update. Using the lastest branch. The ratio of properly mapped ratio increases significantly. (79.71% to 97.76%) Before:
After (I do not rerun all the data):
|
I tested a read pair that both R1 and R2 are exactly reverse complement. This sequence has multiple copies in the reference genome. When mapped the read, strobealign return a paired of location that not properly mapped (flag 81 and 161). They are located in different chromosomes. But it would be more reasonable if a properly aligned location can be returned.
The text was updated successfully, but these errors were encountered: