vpf-class --chunk-size parameter #29
Comments
Hi. You're right, --chunk-size affects the number of results. This is because --chunk-size (which defaults to 1; the default clearly needs to be improved) determines the input of each execution of Prodigal, which gives a different number of results depending on the input size (as the input is what it uses to train itself). My guess is that the larger the input size, the better the results are, which also means less noise and fewer false positives. The speed improvement comes from fewer executions (again, the default is to split the input into individual sequences, so that means one execution per input sequence).
Thank you for your reply.
I'm not an expert on Prodigal, but maximizing --chunk-size is probably your best bet. The point of this parameter is to decide how granular the parallelism is, so the largest value I would suggest is the number of input sequences divided by the number of workers.
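To make that rule of thumb concrete, here is a minimal Python sketch. It is not part of vpf-class; the function names are hypothetical, and it assumes the input is a plain FASTA file where every sequence starts with a `>` header line. It rounds down so that, when there are at least as many sequences as workers, there are at least as many chunks as workers.

```python
from typing import Iterable


def count_fasta_records(lines: Iterable[str]) -> int:
    """Count sequences by counting FASTA header lines (lines starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


def suggested_chunk_size(n_sequences: int, n_workers: int) -> int:
    """Upper bound from the comment above: input sequences / workers.

    Floor division keeps the number of chunks at or above the number of
    workers whenever n_sequences >= n_workers; the result is never below 1.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return max(1, n_sequences // n_workers)


if __name__ == "__main__":
    # Hypothetical figures: 1,000,000 sequences and the 50 workers used in this issue.
    print(suggested_chunk_size(1_000_000, 50))
```

To count the sequences in a real file, pass an open file handle, for example `count_fasta_records(open("input.fa"))`, where `input.fa` is a placeholder path.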
Thanks for your help. I will test it as you suggested and let you know my results later.
That makes sense. For the second part, the number of workers doing useful work can't possibly be greater than the number of chunks into which the input is split, and when --chunk-size equals the sample size there is exactly one chunk (we disable multithreading in hmmsearch as it is quite limited compared to manually splitting the input). One possibility is to split the input after running Prodigal. What do you think?
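The relationship between chunk size, chunk count, and usable workers can be sketched as follows. This is only a model of the behaviour described in this thread, not code from vpf-class, and the function names are hypothetical.

```python
def n_chunks(n_sequences: int, chunk_size: int) -> int:
    """Number of chunks when the input is split into groups of chunk_size sequences."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    # Integer ceiling division: the last chunk may be smaller than chunk_size.
    return (n_sequences + chunk_size - 1) // chunk_size


def busy_workers(n_sequences: int, chunk_size: int, n_workers: int) -> int:
    """Workers that can have a chunk to process at the same time."""
    return min(n_workers, n_chunks(n_sequences, chunk_size))


if __name__ == "__main__":
    # Hypothetical sample of 1000 sequences with 50 workers.
    for size in (1, 20, 1000):
        print(size, n_chunks(1000, size), busy_workers(1000, size, 50))
```

With a chunk size equal to the sample size, the model gives a single chunk and therefore a single busy worker, which is the case described above.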
Hi,
Thank you very much for providing such useful virus classification and host prediction software.
I have a question about the --chunk-size parameter. Your GitHub page says that adding --chunk-size improves the speed, so I compared the results with and without this parameter. Adding it does improve the speed, but the results are different: with the parameter, there are fewer results.
I now need to run a dataset of millions, so I have to use --chunk-size, but it reduces the number of results, which is really bothering me.
I hope you can help. Thanks.
command one: vpf-class-x86_64-linux@dd88a54 -i ../DNA.DB.retain.contig.fa --workers 50 --data-index /hl/YaxiangWang/Soft/vpf-tools/data/index.yaml -o ./
command two: vpf-class-x86_64-linux@dd88a54 -i ../DNA.DB.retain.contig.fa --workers 50 --chunk-size 1000 --data-index /hl/YaxiangWang/Soft/vpf-tools/data/index.yaml -o ./
result one:
![image](https://user-images.githubusercontent.com/54664252/182576855-18f427b7-d5fb-43e0-a281-a1538fd6a158.png)
![image](https://user-images.githubusercontent.com/54664252/182577097-72d9716f-a463-44fa-889d-0702958c8596.png)
result two: