Add --removeDuplicatesProb and --removeDuplicatesByShortId to filter-fasta.py #626

Open
terrycojones opened this issue Sep 19, 2018 · 0 comments

I would like two more methods for removing duplicates.

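--removeDuplicatesProb will remove duplicates by sequence, but will store the MD5 sum of each sequence rather than the sequence itself. De-duplication is therefore only probabilistic: two different sequences that happen to share an MD5 sum would be treated as duplicates. Storing only the sums helps to avoid running out of memory.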
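A minimal sketch of the idea, not the eventual implementation: the function name `remove_duplicates_by_md5` and the use of plain `(id, sequence)` tuples are assumptions made for illustration, not part of the existing filter-fasta.py code.

```python
import hashlib


def remove_duplicates_by_md5(reads):
    """Yield (read_id, sequence) pairs, skipping repeated sequences.

    Hypothetical helper: only the 16-byte MD5 digest of each sequence is
    remembered, so memory use does not grow with sequence length. Two
    different sequences with the same digest would be (wrongly) treated
    as duplicates, which is why the method is only probabilistic.
    """
    seen = set()
    for read_id, sequence in reads:
        digest = hashlib.md5(sequence.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield read_id, sequence


reads = [("id1", "ACGT"), ("id2", "ACGT"), ("id3", "GGCC")]
unique = list(remove_duplicates_by_md5(reads))
```

Here `unique` keeps `id1` and `id3`; `id2` is dropped because its sequence hashes to a digest that has already been seen.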

--removeDuplicatesByShortId will de-duplicate based on the first part of the read id (up to the first space, if any). This is needed because if you combine output from (say) BLAST or DIAMOND with output from an aligner that produces SAM/BAM, the read ids won't match: in a SAM/BAM file, read ids are kept only up to the first space. This option makes it possible to de-duplicate combined reads from these different matchers.
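A rough sketch of short-id de-duplication, under the same assumptions as before: `remove_duplicates_by_short_id` is a hypothetical name, and reads are represented as plain `(id, sequence)` tuples rather than the project's own read objects.

```python
def remove_duplicates_by_short_id(reads):
    """Yield (read_id, sequence) pairs, skipping repeated short ids.

    Hypothetical helper: the short id is the part of the read id up to
    the first space, or the whole id if it contains no space.
    """
    seen = set()
    for read_id, sequence in reads:
        short_id = read_id.split(" ", 1)[0]
        if short_id not in seen:
            seen.add(short_id)
            yield read_id, sequence


combined = [
    ("read1 length=100", "ACGT"),  # Full id, e.g. from BLAST or DIAMOND output.
    ("read1", "ACGT"),             # Truncated id, e.g. from a SAM/BAM file.
    ("read2 length=80", "GGCC"),
]
deduplicated = list(remove_duplicates_by_short_id(combined))
```

The first read seen for each short id is the one kept, so in this example the `read1` entry with the full id survives and the truncated one is dropped.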
