Issue: Multi-node and Multi-GPU Inference Problems with DeepSpeed MII #545

Open
lcnmzz00 opened this issue Nov 20, 2024 · 0 comments

Problem Description
I am using DeepSpeed MII to perform sharding and multi-node inference with generative models. The objective is to distribute a model across two nodes (2 GPUs per node, total of 4 GPUs, each with ~24GB VRAM) and read prompts from JSON files in an input folder to generate responses, which are then saved in an output folder.

However, depending on the model used, I encounter various issues:

1. With the Qwen-32B model:

  • Initial responses are correct.
  • After a random number of iterations (even with the same prompt), the code hangs indefinitely during the response generation step, with no errors.

2. With Llama 3.1 8B:

  • In single-node mode, everything works perfectly.
  • In multi-node mode, the code does not hang as with Qwen, but the responses are garbled or incorrect. For example:

Prompt: "What is the sun?"
Response: "The sun is a str comTi asTur forBas al aaall wehnd us" (randomly scrambled words).

3. With Mistral 7B Instruct v0.3:

  • The code hangs after only a few iterations.
  • Responses are partially scrambled, similar to the Llama case.

Troubleshooting Attempts:

I have tried several things to address these issues. The following two attempts are particularly confusing and raised more doubts than they resolved:

  • Adding/removing torch.distributed.barrier(): I attempted to synchronize the processes by calling torch.distributed.barrier() both before and after the inference step. This resolved neither the hanging nor the garbled responses.
  • Modifying the all_rank_output parameter: I experimented with enabling and disabling all_rank_output when initializing the pipeline. This did not resolve the issues either.
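One further check that might help narrow this down: every rank lists the input folder on its own, so in principle two ranks could see a different set or order of prompts in the same iteration. I have not confirmed that this happens. The snippet below is only a minimal sketch of how it could be detected; `prompt_fingerprint` is a hypothetical helper name, not part of MII or DeepSpeed.

```python
import hashlib
import json


def prompt_fingerprint(prompts):
    """Return a short, order-sensitive digest of a list of prompt strings.

    Hypothetical helper: if all ranks print the same digest in a given
    iteration, they were handed the same prompts in the same order.
    """
    payload = json.dumps(list(prompts), ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


# Intended use inside the loop, right before calling the pipeline:
#   print(f"RANK {global_rank} prompts digest: {prompt_fingerprint(prompts)}")
```

If the digests differ between ranks for the same iteration, the inputs are out of sync before inference even starts; if they always match, the problem lies elsewhere.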

System Configuration:

- hostfile:
xxxx.xxx.xxx.xxx slots=2
yyyy.yyy.yyy.yyy slots=2

- Execution Commands:
Node0: deepspeed --hostfile=hostfile --no_ssh --node_rank=0 --master_addr=xxxx.xxx.xxx.xxx --master_port=xxxx multinode_dynamic_inference.py

Node1: deepspeed --hostfile=hostfile --no_ssh --node_rank=1 --master_addr=xxxx.xxx.xxx.xxx --master_port=xxxx multinode_dynamic_inference.py
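As a sanity check on this configuration, the number of processes implied by the hostfile (2 + 2 = 4) can be compared with the world size each process actually sees. The following is a minimal sketch, assuming the `host slots=N` line format shown above and assuming the launcher exports a `WORLD_SIZE` environment variable; `total_slots` is a hypothetical helper name.

```python
def total_slots(hostfile_text):
    """Sum the slots=N entries of a hostfile given as a string.

    Hypothetical helper: assumes one 'host slots=N' entry per line and
    ignores blank lines and '#' comments.
    """
    total = 0
    for line in hostfile_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        _host, _, rest = line.partition(" ")
        for field in rest.split():
            key, _, value = field.partition("=")
            if key == "slots":
                total += int(value)
    return total


# Intended use at the top of the script (WORLD_SIZE is an assumption):
#   expected = total_slots(Path("hostfile").read_text())
#   actual = int(os.getenv("WORLD_SIZE", "-1"))
#   print(f"expected {expected} ranks, launcher reports {actual}")
```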

- Code Used

import gc
import json
import os
import time
from pathlib import Path

import mii
import torch

# Paths for input and output files
IN_REQUEST_PATH = Path("/path/to/input/")
OUT_REQUEST_PATH = Path("/path/to/output/")

# Local and global rank, as set by the launcher (-1 if not set)
local_rank = int(os.getenv("LOCAL_RANK", "-1"))
global_rank = int(os.getenv("RANK", "-1"))

# Initialize the model pipeline
pipe = mii.pipeline("/path/to/model/", all_rank_output=True)

iteration = 0

while True:
    print(iteration)
    iteration += 1

    print(f"GPU memory allocated: {torch.cuda.memory_allocated()}")
    print(f"GPU memory reserved: {torch.cuda.memory_reserved()}")

    # Process input files (every rank lists the folder independently)
    request_paths = list(IN_REQUEST_PATH.iterdir())
    print(f"LOCAL RANK {local_rank}, GLOBAL RANK {global_rank}")

    if request_paths:
        requests = [json.loads(path.read_text(encoding="utf-8")) for path in request_paths]
        prompts = [r["prompt"] for r in requests]

        # Perform inference
        start_time = time.time()
        responses = pipe(prompts, max_new_tokens=128)
        end_time = time.time()
        print(f"Inference time: {end_time - start_time:.2f} seconds")

        # Write results (global rank 0 only)
        if global_rank == 0:
            print("Printing output")
            Path("./responses.json").write_text("\n\n\n".join(r.generated_text for r in responses))

            for request, response in zip(requests, responses):
                request["response"] = response.generated_text
                (OUT_REQUEST_PATH / f"{request['id']}.json").write_text(
                    json.dumps(request, ensure_ascii=False), encoding="utf-8"
                )

    # Clear GPU cache, then wait before polling the input folder again
    torch.cuda.empty_cache()
    gc.collect()
    time.sleep(10)
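
Since the hang produces no error, a stack dump from each rank at the moment it is stuck would probably make this report more useful. Below is a minimal sketch using Python's standard `faulthandler` module; `hang_watchdog` is a hypothetical helper name and the 300-second threshold in the usage comment is an arbitrary assumption, not a recommended value.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s, file=sys.stderr):
    """Dump the stacks of all threads to `file` if the body takes too long.

    Hypothetical helper: the process is not killed (exit=False), so the
    dump only shows where each thread is waiting once the timeout passes.
    """
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=file)
    try:
        yield
    finally:
        # Disarm the watchdog once the body has finished.
        faulthandler.cancel_dump_traceback_later()


# Intended use around the inference call:
#   with hang_watchdog(300):
#       responses = pipe(prompts, max_new_tokens=128)
```

Comparing the dumps from all four ranks should show whether every rank is waiting inside the same call or whether one rank has moved on while the others are still blocked.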