Total number of attention heads (X) must be divisible by tensor parallel size (Y). #1041
Replies: 5 comments
-
Same problem for me with --tensor-parallel-size.
-
Same error: 32 heads on 3 GPUs.
-
Same error; I'm using starcoder2 and it gives me the same message.
-
What controls the total number of attention heads? Can I decrease or change that number rather than changing the number of GPUs? It doesn't seem to be an available vLLM argument.
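The head count isn't a vLLM flag; it comes from the model checkpoint's own config (config.json on the Hugging Face Hub). A minimal sketch, assuming the transformers library is installed (attribute names vary by architecture, e.g. num_attention_heads vs. n_head), that reads the head count and lists the tensor-parallel sizes that divide it:

```python
# Minimal sketch: read the attention-head count from the model's own config
# and list the tensor-parallel sizes that divide it evenly.
# Assumes `transformers` is installed; attribute names vary by architecture.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("tiiuae/falcon-7b", trust_remote_code=True)
num_heads = getattr(config, "num_attention_heads", None) or getattr(config, "n_head", None)

print(f"attention heads: {num_heads}")
print("valid tensor-parallel sizes:",
      [tp for tp in range(1, num_heads + 1) if num_heads % tp == 0])
```

For falcon-7b this prints 71 and [1, 71], which is exactly the constraint hit in the comment below.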
-
I'm trying to run falcon-7b on multiple nodes, but I'm getting the error below. Which is funny, since 71 is a prime number, so I can run it on either 1 GPU (1 node) or on 71 GPUs (71 nodes). Is there any way to avoid this problem?
My config is:
The Ray head is running on one node (actually a container within Kubernetes), started via
ray start --head --dashboard-host 0.0.0.0 --num-gpus 1 --num-cpus 7
and the Ray worker is running in another container, started via
ray start --disable-usage-stats --num-gpus 1 --num-cpus 7 --address <address>
ray status looks fine.
But when trying to run falcon-7b via
python -m vllm.entrypoints.api_server --model tiiuae/falcon-7b --trust-remote-code --tensor-parallel-size 2 --port 8080 --engine-use-ray --worker-use-ray
the error below is raised:
Total number of attention heads (71) must be divisible by tensor parallel size (2).
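For reference, the check behind this message is plain integer divisibility: tensor parallelism splits the attention heads evenly across the tensor-parallel ranks, so the head count has to be a multiple of the tensor-parallel size. A rough sketch of the same check (not vLLM's actual source, just the same arithmetic):

```python
# Rough sketch of the divisibility check behind this error (not vLLM's actual
# code, just the same arithmetic). falcon-7b has 71 heads, and 71 is prime,
# so only tensor-parallel sizes 1 and 71 pass.
def check_tensor_parallel(total_num_attention_heads: int, tensor_parallel_size: int) -> None:
    if total_num_attention_heads % tensor_parallel_size != 0:
        raise ValueError(
            f"Total number of attention heads ({total_num_attention_heads}) "
            f"must be divisible by tensor parallel size ({tensor_parallel_size})."
        )

for tp in (1, 2, 71):
    try:
        check_tensor_parallel(71, tp)
        print(f"tensor-parallel-size={tp}: OK")
    except ValueError as exc:
        print(f"tensor-parallel-size={tp}: {exc}")
```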