
Please consider adding multi-threading support #60

Open · dbrami opened this issue Jul 5, 2022 · 10 comments

Comments

dbrami commented Jul 5, 2022

Being able to scatter/gather the work over multiple CPU cores would really help speed up your script.

susannasiebert (Collaborator) commented:

Thank you for your interest in VAtools. All of the tools in this toolkit are mostly just I/O, so they shouldn't take very long. Is there a specific tool that has been running slowly for you?

dbrami (Author) commented Jul 13, 2022

Hi,
Indeed, "vcf-info-annotator" has been running much more slowly than hoped. My decompressed VCF file is about 75 GB, and the program was writing about 1 MB every 5 minutes on a beefy AWS machine.
I know that's not a lot of information. Let me know what you would like to see in order to figure out how to use your script more efficiently, or whether some code fixes can speed things up.
Thanks

susannasiebert (Collaborator) commented:

Ah, OK. I've definitely never run it on a file that large. I'll have a look to see how things can be improved.

lukaas33 commented:

I am encountering the same issue. An 11 GB VCF has been running for over 12 hours now.

susannasiebert (Collaborator) commented Aug 29, 2024

How big is your TSV? That file is read into memory, so you'll want to make sure you have at least that much memory available. Your process is probably stuck swapping memory and not actually doing any/much work.
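If you want to confirm that swapping is the bottleneck, standard Linux tools are enough for a quick check (nothing here is specific to VAtools):

```bash
# overall RAM and swap usage; very little "available" memory plus heavy swap use
# suggests the process is thrashing rather than doing useful work
free -h

# sample memory activity every 5 seconds; sustained non-zero "si"/"so"
# (swap-in/swap-out) columns indicate active swapping
vmstat 5
```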

lukaas33 commented Aug 29, 2024

The TSV is only 3 MB, the GTF around 300 MB.

Here is my top output:
[screenshot of top output]

It seems the CPU is fully used, as is the memory (64 GB total).

The exact command that I am using is:

vcf-expression-annotator -s sample -o /shared_dir/temp.vcf /shared_dir/neoantigen.vep.vcf /shared_dir/neoantigen.transcript.abundance.gtf stringtie transcript && \
vcf-expression-annotator -s sample -o /shared_dir/neoantigen.stringtie.vcf /shared_dir/temp.vcf /shared_dir/neoantigen.gene.abundance.tsv stringtie gene

I am using the Docker container.

susannasiebert (Collaborator) commented:

That's strange. I'm not sure why you are seeing multiple processes either. Do you see the same behavior when you run the two steps as separate commands?

The GTF parsing library we use relies on pandas under the hood, which, unfortunately, can use a lot more memory than expected because of the way it stores some data. Would you mind sending me your GTF file so I can play around with it?

lukaas33 commented:

When using the && separator, these commands are executed separately.
I also tried running only the first command, but that also took too long.

I will email you these files.

susannasiebert (Collaborator) commented:

OK, this is definitely not an issue with the GTF file. It is read in just fine, but you have over 4.5 million VCF entries, so processing simply takes a while. I'm not sure there is a good programmatic way to fix this, to be honest, while still preserving the ordering of the VCF. You could try manually splitting the VCF into smaller subsets before running them through the annotator.
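As a rough sketch of that manual splitting with plain Unix tools (chunk size and file names are placeholders, and the annotator arguments just mirror the command earlier in this thread):

```bash
# separate the VCF header from the records
grep '^#' input.vcf > header.txt
grep -v '^#' input.vcf | split -l 500000 - records_

# prepend the header to each chunk and annotate the pieces
for chunk in records_??; do
  cat header.txt "$chunk" > "$chunk.vcf"
  vcf-expression-annotator -s sample -o "$chunk.annotated.vcf" \
    "$chunk.vcf" neoantigen.transcript.abundance.gtf stringtie transcript
done

# stitch the annotated pieces back together in the original order
# (split names its outputs records_aa, records_ab, ..., so the glob order matches)
{ grep '^#' records_aa.annotated.vcf; grep -hv '^#' records_??.annotated.vcf; } > annotated.vcf
```

Since the per-chunk runs are independent, they could also be launched in parallel (e.g. with xargs -P or GNU parallel), which would give you roughly the scatter/gather behaviour requested at the top of this thread, as long as the pieces are concatenated back in their original order.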

lukaas33 commented:

Ah, thank you.
Perhaps another approach would be to filter the VCF, or to limit the contigs I align to upstream.
So this isn't really an issue with this tool then.
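For reference, limiting the VCF to specific contigs can also be done after the fact, e.g. with bcftools or a simple awk filter (a sketch only; the contig names are placeholders, and the bcftools route assumes the file is bgzipped and tabix-indexed):

```bash
# compress and index once, then extract only the contigs of interest
bgzip neoantigen.vep.vcf
tabix -p vcf neoantigen.vep.vcf.gz
bcftools view -r chr1,chr2 -o neoantigen.subset.vcf neoantigen.vep.vcf.gz

# alternative on the uncompressed file, no index needed:
# keep the header plus records on chr1
awk '/^#/ || $1 == "chr1"' neoantigen.vep.vcf > neoantigen.chr1.vcf
```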
