
vpf-class --chunk-size parameter #29

Open

wangyaxiang008 opened this issue Aug 3, 2022 · 6 comments

@wangyaxiang008

Hi,
Thank you very much for providing such a useful virus classification and host prediction software.
I have a question about the --chunk-size parameter. The GitHub documentation says that adding --chunk-size improves speed, so I compared the results with and without it. Adding the parameter does improve the speed, but the results are different: with it, fewer results are produced.
I now need to run a dataset of millions of contigs, so I have to use --chunk-size, but getting fewer results with it is really bothering me.
Hope to get your help, thanks.

Command one: vpf-class-x86_64-linux@dd88a54 -i ../DNA.DB.retain.contig.fa --workers 50 --data-index /hl/YaxiangWang/Soft/vpf-tools/data/index.yaml -o ./
Command two: vpf-class-x86_64-linux@dd88a54 -i ../DNA.DB.retain.contig.fa --workers 50 --chunk-size 1000 --data-index /hl/YaxiangWang/Soft/vpf-tools/data/index.yaml -o ./

Result one: [screenshot of vpf-class output]
Result two: [screenshot of vpf-class output]

@bielr
Collaborator

bielr commented Aug 3, 2022

Hi,

You're right, --chunk-size affects the number of results. This is because --chunk-size (which defaults to 1; the default clearly needs to be improved) determines the input of each execution of Prodigal, which returns a different number of results depending on the input size (since that is what it uses to train itself). My guess is that the larger the input, the better the results, which also means less noise and fewer false positives. The speed improvement comes from running fewer executions (again, the default is to split the input into individual sequences, which means one execution per input sequence).
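Roughly speaking (this is only a sketch of the behaviour described above, with made-up file names, not the tool's actual invocation), --chunk-size 1 behaves as if Prodigal were run once per contig, each run trained on a single sequence, while --chunk-size 1000 runs it once per 1000-contig chunk:

prodigal -i contig_000001.fa -a contig_000001.faa   # repeated once per contig when --chunk-size is 1
prodigal -i chunk_0001.fa -a chunk_0001.faa         # one run per 1000-contig chunk with --chunk-size 1000

Since the training input differs, the gene calls, and therefore the downstream classifications, can differ between the two settings.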

@wangyaxiang008
Author

Thank you for your reply.
If I input a dataset with 1,000,000 contigs and a dataset with 100,000 contigs, how should I choose this parameter to get the lowest error rate? Do you have any advice?

Thanks

@bielr
Collaborator

bielr commented Aug 3, 2022

I'm not an expert on Prodigal, but maximizing --chunk-size is probably your best bet. The point of this parameter is to decide how granular the parallelism is, so the largest value I would suggest is number of input sequences / number of workers.
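For example (illustrative numbers only, reusing your command from above): for the 1,000,000-contig dataset with --workers 50, that rule of thumb gives 1,000,000 / 50 = 20,000, i.e. something like:

grep -c '^>' ../DNA.DB.retain.contig.fa   # count the input sequences (one '>' header per contig)
vpf-class-x86_64-linux@dd88a54 -i ../DNA.DB.retain.contig.fa --workers 50 --chunk-size 20000 --data-index /hl/YaxiangWang/Soft/vpf-tools/data/index.yaml -o ./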

@wangyaxiang008
Author

Thanks for your help, I will test it as you suggested and let you know my results later.

@wangyaxiang008
Author

When I set --chunk-size to the number of contigs in my input dataset, I still get fewer results than when the parameter is set to 1.
[screenshot of vpf-class output]
When the task reached the second step (hmmsearch), I found that the command used only one thread, even though I set the --workers parameter to 40.

@bielr
Collaborator

bielr commented Aug 11, 2022

That makes sense. For the second part, the number of workers in use can't be greater than the number of chunks the input is split into, and when --chunk-size equals the sample size there is exactly one chunk (we disable multithreading in hmmsearch since it is quite limited compared to manually splitting the input).
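To put numbers on that (only an illustration): the number of chunks is roughly ceil(number of input sequences / chunk-size). With, say, 1,000,000 contigs and --chunk-size 1000000 there is a single chunk, so only one of your 40 workers ever has work to do; with --chunk-size 25000 there would be 1,000,000 / 25,000 = 40 chunks, one per worker.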

One possibility is to split the input after running Prodigal. What do you think?
