seqkit grep consumes large amounts of memory. #487
Comments
Weird. Try seqkit grep -r -p "^CM".
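To illustrate what the `^` anchor changes, here is a minimal sketch using plain `grep` rather than seqkit. The three sequence IDs are made up for the example, and seqkit's own matching of `-r -p` patterns should be checked against its documentation.

```shell
# Hypothetical sequence IDs, used only to illustrate regex anchoring.
ids='CM051234.1
NW_CM0001.1
JACM01000001.1'

# Unanchored pattern: counts IDs that contain "CM" anywhere.
unanchored=$(printf '%s\n' "$ids" | grep -c 'CM')

# Anchored pattern: counts only IDs that begin with "CM".
anchored=$(printf '%s\n' "$ids" | grep -c '^CM')

echo "unanchored=$unanchored anchored=$anchored"
```

With an unanchored pattern, scaffold or contig IDs that merely contain "CM" would also be kept, which is usually not what is wanted when selecting chromosomes by accession prefix.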
Thanks for the prompt answer! Anchoring the regular expression at the beginning of the string with [...]
Oh, I see, it's quite a big genome. But in my test, the peak RAM is not as high as yours. And there are no results for IDs starting with "CM".
Data from yours.
I used seqkit v2.8.0.
Here's a low-memory solution with seqtk:
Thanks for the tip about using [...]. By the way, quick comments on my benchmarks:
If you find a quick fix for [...]. Thanks for your care and comments so far!
Hi Charles, seqkit uses more RAM than the size of the largest sequence, usually about 4X, because it keeps several (2) buffers to speed up FASTA/Q record parsing. When the output sequence is wrapped (-w 60), another buffer is used. There should be only one such buffer, but I found 4-5 of them (this can be fixed by adding [...]). In conclusion, please add [...]
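The rule of thumb above can be turned into a rough estimate. This is a minimal sketch, assuming a plain (uncompressed) FASTA file; `longest_seq_length` is a hypothetical helper written for this example, not a seqkit command, and the 4X factor is only the approximate figure quoted in this thread.

```shell
# longest_seq_length FILE
# Hypothetical helper: prints the length (in bases) of the longest
# sequence in a FASTA file, reading it line by line with awk.
longest_seq_length() {
    awk '
        /^>/ { if (len > max) max = len; len = 0; next }  # new record: close the previous one
        { len += length($0) }                             # sequence line: add its length
        END { if (len > max) max = len; print max + 0 }   # close the last record
    ' "$1"
}
```

For a file called `genome.fa` (a placeholder name), `echo $(( $(longest_seq_length genome.fa) * 4 ))` would then print an approximate peak memory figure in bytes under the 4X assumption.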
Hello,
I am using seqkit to extract full chromosome sequences from vertebrate genome assemblies, keeping only the sequences whose ID starts with "CM". For instance:
seqkit grep -I -r -p "CM"
However, it takes surprising amounts of memory, causing my HPC jobs to crash.
I am using seqkit 2.8.0 from the Galaxy image for Singularity, depot.galaxyproject.org-singularity-seqkit-2.8.1--h9ee0642_0.img.
Running seqkit 2.3.0 locally (Debian) also shows high memory consumption.
This large genome can be downloaded from: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_027579735.1/
I looked for a command-line switch that would force seqkit to act like a true filter, without keeping large amounts of data in memory, but -I did not seem to help. Do you have a suggestion?
Best,
Charles Plessy
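For reference, the "true filter" behaviour asked for above can be sketched with awk, which processes one line at a time. This is an illustration only, not the seqkit or seqtk approach discussed in this thread; `filter_fasta_by_prefix` is a hypothetical helper name, and the input is assumed to be an uncompressed FASTA file.

```shell
# filter_fasta_by_prefix PREFIX FILE
# Hypothetical helper: prints only the FASTA records whose ID starts
# with PREFIX. Lines are streamed, so whole records are never buffered.
filter_fasta_by_prefix() {
    awk -v prefix="$1" '
        # Header line: keep the record if the text after ">" begins with the prefix.
        /^>/ { keep = (index(substr($0, 2), prefix) == 1) }
        # Print the header and every following sequence line while keep is true.
        keep
    ' "$2"
}
```

Usage would look like `filter_fasta_by_prefix CM genome.fa > chromosomes.fa`, where both file names are placeholders. Memory use is bounded by the longest line, so a FASTA file with unwrapped, single-line chromosomes would still need roughly one chromosome's worth of memory.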