
Run times slow with 76 genomes #298

Closed

brettChapman opened this issue May 17, 2023 · 13 comments

Comments

@brettChapman
Hi

Continuing on discussion from waveygang/wfmash#171

PGGB is running slowly with 76 haplotypes, run per chromosome on assembled pseudomolecules, for genomes that are around 4-5 Gbp in size.

-s 100Kbp
-p 93
-k 316
poa_params="asm20"
poa_length_target="700,900,1100"
transclose_batch=10000000

Remaining parameters default.
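
For reference, those settings correspond roughly to the following pggb invocation (the input file name, -n, -t, and -o values here are placeholders, and the flag names should be double-checked against pggb --help for the installed version):

pggb -i barley_pangenome_1H.fasta \
     -s 100000 -p 93 -k 316 \
     -G 700,900,1100 -P asm20 -B 10000000 \
     -n 76 -t 32 -o pggb_1H_out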

@brettChapman
Author

The current run time is nearing 50 days, and may end up longer.

@ekg
Collaborator

ekg commented May 17, 2023

My first question is whether the installed version of pggb was built for an older instruction set, leading to narrower vector instructions and slower processing.

We have dealt with this before. I'm not sure of the current state of the binaries available from Docker Hub and conda.
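
One rough way to check (a sketch, not a definitive recipe): compare the vector extensions the node's CPU exposes with what the wfmash binary inside the container actually uses, for example:

# which vector extensions does the node's CPU expose?
grep -o 'avx512[a-z]*\|avx2\|sse4_2' /proc/cpuinfo | sort -u

# crude heuristic, assuming objdump exists in the image (pggb.sif is a placeholder):
# count AVX-512 (zmm-register) instructions in the wfmash binary
singularity exec pggb.sif bash -c 'objdump -d $(which wfmash) | grep -c zmm'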

@brettChapman
Author

My installed version, which I pulled from Docker Hub using Singularity, is 0.5.3 from February 10.

@AndreaGuarracino
Member

AndreaGuarracino commented May 17, 2023

Docker/Singularity should be about 30% slower than building from GitHub source (at least on our cluster).

Can you also share your current (I suppose very long) PGGB .log file?

@brettChapman
Author

Surprisingly, chromosome 1 has just completed after running for 50 days. Chromosome 1 usually completes first; I expect the other chromosomes will take an additional week or two.

I can provide the log file, but it's 1 GB in size. How could I get it to you?

@brettChapman
Author

I just gzipped the log file; it's now down to 25 MB. What's the limit for file attachments on here?

@AndreaGuarracino
Member

LOL! Now it is a nice size! I think sharing it on GitHub could work. Or you could put the file temporarily on Google Drive or similar.

I would like to check whether your bottlenecks are in the wfmash mapping and/or alignment, the GFA->ODGI conversion (which happens in smoothxg), the partial order alignment in smoothxg, etc.
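
A rough way to see where the time goes is to grep the log for the main stages (the patterns below are my guess at typical pggb log content; adjust them to what actually appears in yours):

zgrep -E 'wfmash|seqwish|smoothxg|odgi|gfaffix' barley_pangenome_1H.fasta.*.smooth.*.log.gz | less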

@brettChapman
Author

I've attached the gzipped log file. Hopefully there are no issues with the attachment.

barley_pangenome_1H.fasta.4f79ff6.371d99c.2f0e65c.smooth.03-29-2023_07:39:12.log.gz

@AndreaGuarracino
Member

AndreaGuarracino commented May 18, 2023

The log doesn't look 100% healthy, but I can see that the 1st round of "path fragments embedding" took ~18 days! I suppose the other 2 rounds took similar times too. That's surprising.

zgrep 'path fragments' barley_pangenome_1H.fasta.4f79ff6.371d99c.2f0e65c.smooth.03-29-2023_07.39.12.log.gz

[smoothxg::(1-3)::smooth_and_lace] embedding 135544223 path fragments: 100.00% @ 8.57e+01/s elapsed: 18:07:07:39 remain: 00:00:00:00
[smoothxg::(2-3)::smooth_and_lace] embedding 108563793 path fragments:  1.96% @ 2.56e+01/s elapsed: 00:23:07:27 remain: 48:04:05:45

gfaffix barley_pangenome_1H_s100000_l0_p93_k316_B10000000_G700-900-1100_Pasm20/barley_pangenome_1H.fasta.4f79ff6.371d99c.2f0e65c.smooth.gfa -o barley_pangenome_1H_s100000_l0_p93_k316_B10000000_G700-900-1100_Pasm20/barley_pangenome_1H.fasta.4f79ff6.371d99c.2f0e65c.smooth.fix.gfa
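
That throughput is consistent with the elapsed time shown: at ~85.7 fragments per second, the first round alone works out to roughly 18 days (just arithmetic, nothing pggb-specific):

echo '135544223 / 85.7 / 86400' | bc -l    # seconds -> days, ~18.3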

@AndreaGuarracino
Member

AndreaGuarracino commented Aug 20, 2023

Hi @brettChapman, sorry for the extremely long wait. I worked on smoothxg recently and I am still finalizing several hacks to improve both memory usage and runtime. In your case, the "path fragments embedding" takes a lot of time; that step is currently single-threaded. In pangenome/smoothxg#197 there is a version of smoothxg that parallelizes that step and introduces several memory optimizations too.

If you can also work with GitHub branches, it would be helpful if you could run the same smoothxg command line using the avoid_2_graphs_in_memory branch. With 32 threads, I hope the path fragments embedding will finish within a reasonable number of hours.
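
A rough sketch of trying that branch (the build steps follow smoothxg's usual cmake layout as I remember it; check the repository README in case they have changed):

git clone --recursive --branch avoid_2_graphs_in_memory https://github.com/pangenome/smoothxg.git
cd smoothxg
cmake -H. -Bbuild -DCMAKE_BUILD_TYPE=Release && cmake --build build -- -j 8
# the resulting binary should land in bin/smoothxg; rerun your smoothxg command with it, adding -t 32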

@AndreaGuarracino
Member

@brettChapman were you lucky enough to try the updated smoothxg (or pggb) with fewer issues?

@brettChapman
Author

Hi @AndreaGuarracino

Yes, I've used the latest version now and found that smoothxg runs a lot faster.

Recently we've gained access to a larger cluster, paid for at a higher cost, with SSDs and 2 TB of RAM. Our PGGB jobs ran significantly faster there, cutting months off the run time. The systems we had access to previously were limited to mechanical drives and less RAM, but they were publicly funded resources.

@AndreaGuarracino
Member

Thanks for the update! Saving months of compute is also good news for the environment and global warming xD
