seqkit grep consumes large amounts of memory. #487

Open
charles-plessy opened this issue Sep 27, 2024 · 6 comments

@charles-plessy

Hello,

I am using seqkit to filter vertebrate genome assemblies, keeping only the full chromosome sequences whose ID starts with "CM", for instance: seqkit grep -I -r -p "CM". However, it takes a surprising amount of memory, causing my HPC jobs to crash:

I am using seqkit 2.8.0 from the Galaxy Singularity image depot.galaxyproject.org-singularity-seqkit-2.8.1--h9ee0642_0.img.

./depot.galaxyproject.org-singularity-seqkit-2.8.1--h9ee0642_0.img /usr/bin/time -v seqkit grep -I -r -p "CM" GCA_027579735.1_aBomBom1.pri_genomic.fna.gz > /dev/null 
Command terminated by signal 9
	Command being timed: "seqkit grep -I -r -p CM GCA_027579735.1_aBomBom1.pri_genomic.fna.gz"
	User time (seconds): 33.13
	System time (seconds): 5.27
	Percent of CPU this job got: 109%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0m 34.97s
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 78569696
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 154
	Minor (reclaiming a frame) page faults: 1688681
	Voluntary context switches: 28374
	Involuntary context switches: 619
	Swaps: 0
	File system inputs: 3903404
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

Running seqkit 2.3.0 locally (Debian) also shows high memory consumption:

/usr/bin/time -v seqkit grep -r -p "^(CM|CP|FR|L[R-T]|O[U-Z])" GCA_027579735.1_aBomBom1.pri_genomic.fna.gz > /dev/null 
	Command being timed: "seqkit grep -r -p ^(CM|CP|FR|L[R-T]|O[U-Z]) GCA_027579735.1_aBomBom1.pri_genomic.fna.gz"
	User time (seconds): 48.80
	System time (seconds): 7.51
	Percent of CPU this job got: 115%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:48.77
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 9496532
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 826218
	Voluntary context switches: 38384
	Involuntary context switches: 751
	Swaps: 0
	File system inputs: 4949024
	File system outputs: 0
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

This large genome can be downloaded from: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_027579735.1/

I looked for a command-line switch that would force seqkit to act like a true filter, without keeping large amounts of data in memory, but -I did not seem to help. Do you have a suggestion?

Best,

Charles Plessy


Please check the items below before submitting an issue.
They help to improve the communication efficiency between us.
Thanks!

Prerequisites

  • Make sure you've installed the correct executable binary file.
    For Mac users, please download
    • seqkit_darwin_amd64.tar.gz for Mac with Intel CPUs.
    • seqkit_darwin_arm64.tar.gz for Mac with M series CPUs.
  • Make sure you are using the latest version by running seqkit version -u.
  • Read the usage and examples for the specific subcommand.

Describe your issue in detail

  • Please copy and paste the command you ran and the error information if reported.
  • It would be more helpful to provide as much information as you can:
    • Are you running on a personal computer or a server?
    • What's the operating system, and how much RAM (memory) is available?
    • Show the types and sizes of input files with file xxx and ls -lh xxx.
    • Show some lines of input files with head -n 5 xxx or zcat xxx.gz | head -n 5.
  • Provide a reproducible example.
    • Has this problem happened many times?
    • Or did it only fail with this input file and/or these command/parameters?
@shenwei356
Owner

weird.

try seqkit grep -r -p "^CM".

@charles-plessy
Author

Thanks for the prompt answer! Anchoring the regular expression at the beginning of the string with ^ did not reduce memory usage (same figures as I reported earlier).

@shenwei356
Owner

shenwei356 commented Sep 30, 2024

Oh, I see, it's quite a big genome.

But in my test, the peak RAM is not as high as yours. And there are no results for IDs starting with "CM".

$ memusg -t -s "seqkit stats GCF_027579735.1_aBomBom1.pri_genomic.fna.gz"

file                                         format  type  num_seqs         sum_len  min_len    avg_len        max_len
GCF_027579735.1_aBomBom1.pri_genomic.fna.gz  FASTA   DNA      2,467  10,019,663,791      711  4,061,477  1,601,664,463

elapsed time: 51.976s
peak rss: 9.21 GB


$ memusg -t -s "seqkit grep -r -p ^CM GCF_027579735.1_aBomBom1.pri_genomic.fna.gz  -o t.gz"

elapsed time: 51.165s
peak rss: 9.21 GB

From your data:

Maximum resident set size (kbytes): 78,569,696 ≈ 78.5 GB.

I used seqkit v2.8.0.

@shenwei356
Owner

Here's a low-memory solution with seqtk:

  1. Extracting and filtering sequence IDs.
$ zcat GCF_027579735.1_aBomBom1.pri_genomic.fna.gz | grep '>' | sed 's/^>//' | awk '{print $1}' > ids.txt

$ head -n 3 ids.txt
NC_069499.1
NC_069500.1
NC_069501.1


$ cat ids.txt | csvtk mutate -Ht -p '^(...)' | csvtk freq -Ht -f 2 -nr
NW_     2455
NC_     12

# note: there are no IDs starting with "CM".

  2. Extracting sequences with target sequence IDs.
seqtk subseq xxx.fasta ids.txt
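
For this particular use case, the ID list would first be restricted to the wanted names, for example (a sketch; cm_ids.txt and chromosomes.fna are hypothetical file names, and the pattern should be adjusted to your target IDs):

$ grep '^CM' ids.txt > cm_ids.txt
$ seqtk subseq GCA_027579735.1_aBomBom1.pri_genomic.fna.gz cm_ids.txt > chromosomes.fna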

@charles-plessy
Author

Thanks for the tip about using seqtk. That said, I wish I could use a single-command tool like seqkit, because my use case is a Nextflow pipeline, where using standard modules without modifications makes life easier; and there is already a nice seqkit/grep module available. Are you sure you do not have a memory leak of some kind? It is strange that 9 GB of memory are needed to produce an empty file even when there is nothing to match…

By the way, quick comments on my benchmarks:

  • I used the GenBank (not RefSeq) version of the aBomBom1 genome (GCA_027579735.1_aBomBom1.pri_genomic.fna.gz), and this one has contigs whose names start with CM.
  • In my benchmark that uses ~80 GB, seqkit runs from a Singularity image; this probably explains the difference, so please disregard that one if you like. The benchmark I made on a Debian machine with seqkit 2.3.0 has a very similar memory footprint (~9 GB) to your recent test with 2.8.0. Therefore, the high memory usage is probably not a recent regression.
  • Of course it is not hard to find a machine with plenty of memory, so that using 9 GB is not a problem in itself, but in the context of HPC computations it is important to request a reasonable amount of memory. In my Nextflow pipeline, I run seqkit grep in parallel on hundreds of genomes. If I need to request ~100 × 10 GB = 1 TB of memory, then I hit the maximum I can request and the parallel computation gets queued in multiple chunks, which is inefficient. Also, other users may have better things to do with that memory in the meantime.

If you find a quick fix for seqkit, please let me know. Otherwise I will follow your advice and write a local module using seqtk plus shell commands instead.

Thanks for your care and comments so far!

@shenwei356
Owner

Hi Charles, seqkit occupies more RAM than the largest sequence, usually about 4X, because it uses several (2) buffers to improve FASTA/Q record parsing speed.
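
For reference, the largest sequence in the stats above is about 1.6 Gbp (max_len 1,601,664,463), so 4 × 1.6 GB ≈ 6.4 GB, which is roughly in line with the ~9 GB peaks reported.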

For output, when wrapping sequences (-w 60), another buffer is used; there should be only one of these, but I found 4-5 (this can be fixed by adding -j 1, and the speed is nearly the same). You can also disable sequence wrapping with -w 0 if that is acceptable, which can slightly improve the speed.

In conclusion, please add -j 1 -w 0, which improves the speed, reduces the memory usage, and does not change the result.
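
Applied to the command from the original report, that would look something like this (a sketch; the input file name is taken from the thread and the output file name is only illustrative):

$ seqkit grep -I -r -p "^CM" -j 1 -w 0 GCA_027579735.1_aBomBom1.pri_genomic.fna.gz > chromosomes.fna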
