Kraken2_build step stalling #27

Open
dgolden96 opened this issue Jul 14, 2022 · 7 comments
Comments

@dgolden96

Hi there,

I'm continuing to troubleshoot the db-update process for a kraken2 database, and I've hit a wall at the kraken2_build step. The pipeline doesn't throw any errors; it just continues to run indefinitely (12+ hours without failure or completion). It seems similar to the problem described here: DerrickWood/kraken2#428

So far, I've tried the workaround mentioned in the comments of the issue I linked, which is to add the --fast-build flag to the kraken2 call in the db-update snakefile, but it doesn't seem to have solved the problem. Any chance you've seen this before and/or have any thoughts on what might be causing it? I definitely have enough RAM: I'm using 28 cores with 16 GB per core.

Thanks!

@nick-youngblut
Contributor

I've (thankfully) never experienced that issue. How many genomes are included in the build?

@dgolden96
Author

The database to be updated is the full GTDB_release207, and the sample TSV I'm trying to add includes ~4,000 genomes.

@zoey-rw

zoey-rw commented Jul 15, 2022

A related question: if we instead passed the reads left unclassified by the GTDB database to a second database (built with db-create from only the non-GTDB genomes), should that give results similar to those from a single database built via the db-update workflow? There are methods for combining outputs for the same sample from different databases, though I imagine there could be downstream effects on Bracken estimates.

@nick-youngblut
Contributor

The downside of a 2-step classification approach versus a 1-step is that there is no direct "competition" during classification across the 2 steps. So, some reads could be falsely classified in the 1st step when they would actually be classified as something in the 2nd step if the 2 reference databases were combined.
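
To make the "competition" point concrete, here is a toy sketch in Python. It is not how kraken2 classifies reads (kraken2 uses minimizers and an LCA-based scoring over the taxonomy); it only uses a simple majority vote over k-mer hits, and every name in it (`classify`, `two_step`, the k-mer labels, `taxon_A`, `taxon_B`) is hypothetical.

```python
from collections import Counter

def classify(read_kmers, db):
    """Toy classifier: return the taxon with the most k-mer hits, or None."""
    hits = Counter(db[k] for k in read_kmers if k in db)
    if not hits:
        return None  # read stays unclassified
    return hits.most_common(1)[0][0]

def two_step(read_kmers, db_first, db_second):
    """Only reads left unclassified by db_first are passed to db_second."""
    first = classify(read_kmers, db_first)
    return first if first is not None else classify(read_kmers, db_second)

# Hypothetical data: one k-mer of the read matches the first database,
# three k-mers match a genome that exists only in the second database.
db_first = {"k1": "taxon_A"}
db_second = {"k2": "taxon_B", "k3": "taxon_B", "k4": "taxon_B"}
combined = {**db_first, **db_second}

read = ["k1", "k2", "k3", "k4"]

print("two-step:", two_step(read, db_first, db_second))  # weak hit wins in step 1
print("combined:", classify(read, combined))             # stronger match wins
```

In this toy case the 2-step route assigns the read to `taxon_A` on a single weak hit, while the combined database assigns it to `taxon_B`, which is the kind of false classification described above.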

@MixalisSn

Same problem here. I ran the kraken2 database building using 40 cores (7 GB each), and after 24 hours the process stalled at this point:

Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map already present, skipping map creation.
Estimating required capacity (step 2)...
Estimated hash table requirement: 75566900660 bytes
Capacity estimation complete. [37m21.355s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 16 bits reserved for taxid.
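
For scale, the hash table estimate in the log above can be converted to GiB and compared with the memory allocated to the job. The sketch below is only arithmetic: the `headroom` fraction and the 120 GiB figure are placeholder assumptions, and the hash table is not the only thing kraken2-build keeps in memory, so a passing check does not rule out a memory problem.

```python
GIB = 1024 ** 3  # bytes per GiB

def hash_table_gib(n_bytes):
    """Convert a byte count from the kraken2-build log to GiB."""
    return n_bytes / GIB

def fits_in_memory(n_bytes, available_gib, headroom=0.8):
    """Return True if the table fits within a fraction of available memory.

    headroom is an assumed safety margin, not a kraken2 parameter.
    """
    return hash_table_gib(n_bytes) <= available_gib * headroom

estimated_bytes = 75566900660  # "Estimated hash table requirement" from the log

print(f"{hash_table_gib(estimated_bytes):.1f} GiB")  # about 70.4 GiB
print(fits_in_memory(estimated_bytes, available_gib=120))  # placeholder allocation
```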

@nick-youngblut
Contributor

@MixalisSn do you think that the stalling could be due to limited memory?

@MixalisSn

@nick-youngblut I thought 120 GB was enough. Anyway, I added the --fast-build flag, using the same resources, and the build completed successfully.
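
For anyone trying the same workaround, a minimal sketch of a build call with the flag added might look like the following. The helper name `kraken2_build_cmd` and the database path are hypothetical, and the actual rule in the db-update snakefile may pass other options; only `--build`, `--db`, `--threads`, and `--fast-build` are taken from the kraken2-build command line.

```python
import shlex

def kraken2_build_cmd(db_dir, threads, fast_build=True):
    """Assemble a kraken2-build command as an argument list."""
    cmd = ["kraken2-build", "--build", "--db", db_dir, "--threads", str(threads)]
    if fast_build:
        # Skips the deterministic-build requirement for multithreaded builds.
        cmd.append("--fast-build")
    return cmd

cmd = kraken2_build_cmd("path/to/kraken2_db", threads=40)  # placeholder path
print(shlex.join(cmd))

# To actually run the build (requires kraken2 to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Note that, per the kraken2 documentation, a database built with --fast-build and multiple threads is not guaranteed to be identical from one build to the next.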


4 participants