
Run times slow with 76 genomes #298

Closed

brettChapman opened this issue May 17, 2023 · 13 comments

Comments

@brettChapman
Hi

Continuing on discussion from waveygang/wfmash#171

PGGB is running slowly with 76 haplotypes, run per chromosome on assembled pseudomolecules, for genomes that are around 4-5 Gbp in size.

-s 100Kbp
-p 93
-k 316
poa_params="asm20"
poa_length_target="700,900,1100"
transclose_batch=10000000

Remaining parameters default.
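
For reference, those settings correspond roughly to the following pggb invocation (the input file name, -n, -t, and -o values here are placeholders, and the flag names should be double-checked against pggb --help for the installed version):

pggb -i barley_pangenome_1H.fasta \
     -s 100000 -p 93 -k 316 \
     -G 700,900,1100 -P asm20 -B 10000000 \
     -n 76 -t 32 -o pggb_1H_out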

@brettChapman
Author

The current run time is nearing 50 days, and may end up longer.

@ekg
Collaborator

ekg commented May 17, 2023

My first question is whether the installed version of pggb was built for an older instruction set, leading to narrower vector instructions and slower processing.

We have dealt with this before. I'm not sure of the current state of the binaries available from Docker Hub and conda.
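
One rough way to check (a sketch, not a definitive recipe): compare the vector extensions the node's CPU exposes with what the wfmash binary inside the container actually uses, for example:

# which vector extensions does the node's CPU expose?
grep -o 'avx512[a-z]*\|avx2\|sse4_2' /proc/cpuinfo | sort -u

# crude heuristic, assuming objdump exists in the image (pggb.sif is a placeholder):
# count AVX-512 (zmm-register) instructions in the wfmash binary
singularity exec pggb.sif bash -c 'objdump -d $(which wfmash) | grep -c zmm'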

@brettChapman
Author

My installed version, which I pulled from Docker Hub using Singularity, is 0.5.3 from February 10.

@AndreaGuarracino
Member

AndreaGuarracino commented May 17, 2023

Docker/Singularity should be about 30% slower than building from GitHub source (at least on our cluster).

Can you also share your current (I suppose very long) PGGB .log file?

@brettChapman
Author

Surprisingly, chromosome 1 has just completed after running for 50 days. Chromosome 1 usually completes first; I expect the other chromosomes will take an additional week or two.

I can provide the log file, but it's 1 GB in size. How could I get it to you?

@brettChapman
Author

I just gzipped the log file; it's now down to 25 MB. What's the limit for file attachments on here?

@AndreaGuarracino
Member

LOL! Now it is a nice size! I think sharing it on GitHub could work. Or you could put the file temporarily on Google Drive or similar.

I would like to check whether your bottlenecks are in the wfmash mapping and/or alignment, the GFA->ODGI conversion (which happens in smoothxg), the partial order alignment in smoothxg, etc.
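
A rough way to see where the time goes is to grep the log for the main stages (the patterns below are my guess at typical pggb log content; adjust them to what actually appears in yours):

zgrep -E 'wfmash|seqwish|smoothxg|odgi|gfaffix' barley_pangenome_1H.fasta.*.smooth.*.log.gz | less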

@brettChapman
Author

I've attached the gzipped log file. Hopefully there are no issues with the attachment.

barley_pangenome_1H.fasta.4f79ff6.371d99c.2f0e65c.smooth.03-29-2023_07:39:12.log.gz

@AndreaGuarracino
Member

AndreaGuarracino commented May 18, 2023

The log doesn't look 100% healthy, but I can see that the 1st round of "path fragments embedding" took ~18 days! I suppose the other 2 rounds took similar times too. That's surprising.

zgrep 'path fragments' barley_pangenome_1H.fasta.4f79ff6.371d99c.2f0e65c.smooth.03-29-2023_07.39.12.log.gz

[smoothxg::(1-3)::smooth_and_lace] embedding 135544223 path fragments: 100.00% @ 8.57e+01/s elapsed: 18:07:07:39 remain: 00:00:00:00
[smoothxg::(2-3)::smooth_and_lace] embedding 108563793 path fragments:  1.96% @ 2.56e+01/s elapsed: 00:23:07:27 remain: 48:04:05:45

gfaffix barley_pangenome_1H_s100000_l0_p93_k316_B10000000_G700-900-1100_Pasm20/barley_pangenome_1H.fasta.4f79ff6.371d99c.2f0e65c.smooth.gfa -o barley_pangenome_1H_s100000_l0_p93_k316_B10000000_G700-900-1100_Pasm20/barley_pangenome_1H.fasta.4f79ff6.371d99c.2f0e65c.smooth.fix.gfa
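
That throughput is consistent with the elapsed time shown: at ~85.7 fragments per second, the first round alone works out to roughly 18 days (just arithmetic, nothing pggb-specific):

echo '135544223 / 85.7 / 86400' | bc -l    # seconds -> days, ~18.3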

@AndreaGuarracino
Member

AndreaGuarracino commented Aug 20, 2023

Hi @brettChapman, sorry for the extremely long wait. I worked on smoothxg recently and I am still finalizing several hacks to improve both memory usage and runtime. In your case, the "path fragments embedding" takes a lot of time; that step is currently single-threaded. In pangenome/smoothxg#197 there is a version of smoothxg that parallelizes that step and introduces several memory optimizations too.

If you can also work with GitHub branches, it would be helpful if you could run the same smoothxg command line using the avoid_2_graphs_in_memory branch. With 32 threads, I hope the path fragments embedding will finish within a reasonable number of hours.
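
A rough sketch of trying that branch (the build steps follow smoothxg's usual cmake layout as I remember it; check the repository README in case they have changed):

git clone --recursive --branch avoid_2_graphs_in_memory https://github.com/pangenome/smoothxg.git
cd smoothxg
cmake -H. -Bbuild -DCMAKE_BUILD_TYPE=Release && cmake --build build -- -j 8
# the resulting binary should land in bin/smoothxg; rerun your smoothxg command with it, adding -t 32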

@AndreaGuarracino
Member

@brettChapman were you lucky enough to try the updated smoothxg (or pggb) with fewer issues?

@brettChapman
Author

Hi @AndreaGuarracino

Yes, I've used the latest version now and found that smoothxg runs a lot faster.

Recently we've gained access to a larger cluster, paid for at a higher cost, with SSDs and 2 TB of RAM. Our PGGB jobs ran significantly faster there, cutting months off the run time. The systems we had access to previously were limited to mechanical drives and less RAM, but they were publicly funded resources.

@AndreaGuarracino
Member

Thanks for the update! Saving months of compute is also good news for the environment and global warming xD
