Problem Description
I am using DeepSpeed MII to perform sharding and multi-node inference with generative models. The objective is to distribute a model across two nodes (2 GPUs per node, total of 4 GPUs, each with ~24GB VRAM) and read prompts from JSON files in an input folder to generate responses, which are then saved in an output folder.
However, depending on the model used, I encounter various issues:
1. With the Qwen-32B model:
Initial responses are correct.
After a random number of iterations (even with the same prompt), the code hangs indefinitely during the response generation step, with no errors.
2. With Llama 3.1 8B:
In single-node mode, everything works perfectly.
In multi-node mode, the code does not hang as with Qwen, but the responses are garbled or incorrect. For example:
Prompt: "What is the sun?"
Response: "The sun is a str comTi asTur forBas al aaall wehnd us" (randomly scrambled words).
3. With Mistral 7B Instruct v0.3:
The code hangs after only a few iterations.
Responses are partially scrambled, similar to the Llama case.
Troubleshooting Attempts:
I have tried several things to address these issues, but the following attempts were particularly confusing because they raised more questions than they answered:
Adding/removing torch.distributed.barrier(): I tried synchronizing the processes with torch.distributed.barrier() both before and after the inference step, but this resolved neither the hanging nor the garbled responses.
Modifying the all_rank_output parameter: I experimented with enabling and disabling all_rank_output during pipeline initialization. This did not resolve the issues either (a rough sketch of both attempts is shown after this list).
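Roughly, the two attempts above looked like the following. This is a trimmed sketch rather than the full script; the model name and generation arguments are placeholders.

```python
import mii
import torch

# Attempt 2: toggled all_rank_output between True and False at pipeline init.
pipe = mii.pipeline(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",  # placeholder model name
    all_rank_output=True,
)

prompts = ["What is the sun?"]

torch.distributed.barrier()  # Attempt 1: synchronize ranks before inference
responses = pipe(prompts, max_new_tokens=256)
torch.distributed.barrier()  # Attempt 1: synchronize ranks after inference
```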
System Configuration:
- hostfile:
xxxx.xxx.xxx.xxx slots=2
yyyy.yyy.yyy.yyy slots=2
- Execution Commands:
Node0: deepspeed --hostfile=hostfile --no_ssh --node_rank=0 --master_addr=xxxx.xxx.xxx.xxx --master_port=xxxx multinode_dynamic_inference.py
Node1: deepspeed --hostfile=hostfile --no_ssh --node_rank=1 --master_addr=xxxx.xxx.xxx.xxx --master_port=xxxx multinode_dynamic_inference.py
- Code Used
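In outline, the script builds a sharded MII pipeline, reads each JSON prompt file from the input folder, and writes the generated responses to the output folder. A minimal sketch follows; the folder names, model name, and generation settings are placeholders, not the exact values used.

```python
import json
from pathlib import Path

import mii

INPUT_DIR = Path("input")    # placeholder: folder containing prompt JSON files
OUTPUT_DIR = Path("output")  # placeholder: folder for generated responses
OUTPUT_DIR.mkdir(exist_ok=True)

# The model is sharded across all ranks started by the deepspeed commands above.
pipe = mii.pipeline("Qwen/Qwen1.5-32B-Chat")  # placeholder model name

for prompt_file in sorted(INPUT_DIR.glob("*.json")):
    with open(prompt_file) as f:
        prompts = json.load(f)  # assumed to be a list of prompt strings

    responses = pipe(prompts, max_new_tokens=512)

    # Unless all_rank_output=True, only rank 0 receives the generated text.
    if responses:
        out_path = OUTPUT_DIR / f"{prompt_file.stem}_response.json"
        with open(out_path, "w") as f:
            json.dump([r.generated_text for r in responses], f, indent=2)
```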