-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PE reads - bwa mem alignment and softclipping issues #88
Comments
@irosenboom are there any public PE data which this has been observed on ? Or any data which you can upload after removing the human reads ? |
Hi @colindaven , I uploaded three anonymized PE samples after removal of human reads to my academic cloud. Please feel free to use them for testing: https://sync.academiccloud.de/index.php/s/n6eDexgWjq3M75R |
Where exactly did you find information on the clipping penalty (clipping penalty # was 5, try 20)? The only options set for bwa in |
Adding a |
Thanks for the data Ilona. Will have a look when I get a chance. Half is good news - but 800 species is still bad news ! From what I remember, Kraken, metaphlan and centrifuge also found high numbers of alignments with these data - correct ? So it is unlikely we will ever get to perfection. Because I believe you never saw this with single read data, can you try to just use the R1 reads in single ended alignment mode for your test sample ? Is was thinking about this, and it cannot be the case that bwa mem just aligns 18bp segments of 100bp single reads, and soft masks the rest, and treats all this as a unique alignment. It must be that one of the paired reads eg R1 is aligned (with high score, eg low number of mismatches?), and the other eg R2 is then fit with extensive soft clipping. Yes, I think 5 is the default clipping penalty according to the manual above. You could try "-L 40" or "-L 60" as well to see what happens. |
Hi, I have just repeated the analysis with the R1 reads in single ended alignment mode and wow, this also solved the problem! |
Nice, still need to solve this though as PE reads are the most common entry point I'd guess. Can you please repeat your single ended alignments with the R2 - if this works. Maybe rename them to R1 in a separate directory to trick the tool (it always expects R1 reads is SE alignment mode is selected). I wonder if there is a quality problem with the R2 reads, or if it really a problem with the alignment criteria for PE reads only. It could be a good hit from a poor R2 read, and a very very poor 18bp hit (with the rest soft masked) from a good R1 read. Thanks! |
The files are too big, I can't transfer them reliably. Can you please just supply one sample with R1 and R2 with max size about 50 MB each ? To do this please gunzip the files,
Then check file size with Thanks and apologies. |
Smaller files are now available in the cloud :) Wochenende is also running with the R2 files in SE mode, I'll let you know when it's done. |
Thanks, I had some weird issues today caused by too many nested conda envs. My bad. Anyway, thanks for the new data. These are judged by haybaler output. Keep in mind I am using the user settings in the nextflow.config, not the permissive dev config.
Results
minimap2 params:
cheers |
|
I downloaded alternative PE metagenome data from EBI and can confirm the soft-clipping problem again. Data : first 200k reads , 800k lines from
CIGAR string show 126 soft clipped bases, 21 matches and 3 soft clipped bases. See https://www.drive5.com/usearch/manual/cigar.html I'll try to optimize using this dataset since the tiny ones you gave me lead to 0 alignments with bwa-mem. I even get no alignments in the bam - they might have been eliminated by the non-supplementary non-secondary code I just recently introduced. I'll try to check this. |
I set the bwa mem -L parameter to eg this read pair - alignment built on 41 matches and 19 matches. This has a MQ of 40 so will likely be output ...
Potentially we need to set the scoring differently for PE reads
|
@irosenboom please avoid using the |
After not seeing much improvement despite setting -T 60 (default is 30) for PE reads, I would like to disallow bwa mem for PE reads due to the strong variation in results. Is this ok with you Ilona ? Details here: https://github.com/MHH-RCUG/nf_wochenende/wiki/Troubleshooting#bwa-mem-with-paired-end-pe-reads Also - this is a major issue - minimap2short attributes reads to common airway species from the datasets you gave me in PE mode, but bwa mem does not.
here bwa mem - different datasets, no mock comm, but illustrates my point - tmp_sample should have some reads attributed.
|
Bug report by @irosenboom
It was reported for 2x150bp Novaseq reads that bwa mem performed alignments with extensive softclipping.
For example, some alignments were of 18bp with 82 bp simply softclipped off.
This led to serious problems with the species detected. We never noticed this before with single ended 75bp reads (confirmed here by test with the same datasets).
For these datasets, minimap2short led to much better alignments (provided the supplementary and secondary alignments were removed).
Solutions
bwa mem
@irosenboom can you please try this and see if your results improve ? I don't have comparable datasetsBWA mem options to test:
-M
-P
-L
clipping penalty # was 5, try 20 ? This is the most likely option to solve the issueFrom https://www.mankier.com/1/bwa
The text was updated successfully, but these errors were encountered: