
vpf-class --chunk-size parameter #29

Open

wangyaxiang008 opened this issue Aug 3, 2022 · 6 comments

@wangyaxiang008

Hi,
Thank you very much for providing such a useful virus classification and host prediction software.
I have a question about the --chunk-size parameter. The GitHub documentation says that adding --chunk-size improves speed, so I compared the results with and without it. Adding the parameter does improve the speed, but the results are different: with it, fewer results are produced.
I now need to run a dataset of millions of contigs, so I have to use --chunk-size, but getting fewer results with it is really bothering me.
Hope to get your help, thanks.

Command one: vpf-class-x86_64-linux@dd88a54 -i ../DNA.DB.retain.contig.fa --workers 50 --data-index /hl/YaxiangWang/Soft/vpf-tools/data/index.yaml -o ./
Command two: vpf-class-x86_64-linux@dd88a54 -i ../DNA.DB.retain.contig.fa --workers 50 --chunk-size 1000 --data-index /hl/YaxiangWang/Soft/vpf-tools/data/index.yaml -o ./

Result one: [screenshot of vpf-class output]
Result two: [screenshot of vpf-class output]

@bielr
Collaborator

bielr commented Aug 3, 2022

Hi,

You're right, --chunk-size affects the number of results. This is because --chunk-size (which defaults to 1; the default clearly needs to be improved) determines the input of each execution of Prodigal, which returns a different number of results depending on the input size (since that is what it uses to train itself). My guess is that the larger the input, the better the results, which also means less noise and fewer false positives. The speed improvement comes from running fewer executions (again, the default is to split the input into individual sequences, which means one execution per input sequence).
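Roughly speaking (this is only a sketch of the behaviour described above, with made-up file names, not the tool's actual invocation), --chunk-size 1 behaves as if Prodigal were run once per contig, each run trained on a single sequence, while --chunk-size 1000 runs it once per 1000-contig chunk:

prodigal -i contig_000001.fa -a contig_000001.faa   # repeated once per contig when --chunk-size is 1
prodigal -i chunk_0001.fa -a chunk_0001.faa         # one run per 1000-contig chunk with --chunk-size 1000

Since the training input differs, the gene calls, and therefore the downstream classifications, can differ between the two settings.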

@wangyaxiang008
Author

Thank you for your reply.
If I input a dataset with 1,000,000 contigs and a dataset with 100,000 contigs, how should I choose this parameter to get the lowest error rate? Do you have any advice?

Thanks

@bielr
Collaborator

bielr commented Aug 3, 2022

I'm not an expert on Prodigal, but maximizing --chunk-size is probably your best bet. The point of this parameter is to decide how granular the parallelism is, so the largest value I would suggest is number of input sequences / number of workers.
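For example (illustrative numbers only, reusing your command from above): for the 1,000,000-contig dataset with --workers 50, that rule of thumb gives 1,000,000 / 50 = 20,000, i.e. something like:

grep -c '^>' ../DNA.DB.retain.contig.fa   # count the input sequences (one '>' header per contig)
vpf-class-x86_64-linux@dd88a54 -i ../DNA.DB.retain.contig.fa --workers 50 --chunk-size 20000 --data-index /hl/YaxiangWang/Soft/vpf-tools/data/index.yaml -o ./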

@wangyaxiang008
Author

Thanks for your help, I will test it as you suggested and let you know my results later.

@wangyaxiang008
Author

When I set --chunk-size to the number of contigs in my input dataset, I still get fewer results than when the parameter is set to 1.
[screenshot of vpf-class output]
When the task reached the second step (hmmsearch), I found that the command used only one thread, even though I set the --workers parameter to 40.

@bielr
Collaborator

bielr commented Aug 11, 2022

That makes sense. For the second part, the number of workers in use can't be greater than the number of chunks the input is split into, and when --chunk-size equals the sample size there is exactly one chunk (we disable multithreading in hmmsearch since it is quite limited compared to manually splitting the input).
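To put numbers on that (only an illustration): the number of chunks is roughly ceil(number of input sequences / chunk-size). With, say, 1,000,000 contigs and --chunk-size 1000000 there is a single chunk, so only one of your 40 workers ever has work to do; with --chunk-size 25000 there would be 1,000,000 / 25,000 = 40 chunks, one per worker.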

One possibility is to split the input after running Prodigal. What do you think?
