
Integrate vLLM Evaluator #23

Open
adivekar-utexas opened this issue Feb 15, 2025 · 2 comments
Labels: enhancement (New feature or request)

Comments

@adivekar-utexas (Contributor):

vLLM is a high-throughput LLM inference engine that runs HuggingFace models and shards them across GPUs in various ways using a Ray backend.
Even in its basic form, vLLM is a great speedup over AccelerateEvaluator, which is quite slow.

Basic requirements (a rough interface sketch follows the list):

  1. Should be compatible with RayEvaluator (and GenerativeLM if needed).
  2. Should support only single-node models; scaling to larger models should require a larger node (a design choice that favors execution speed).
  3. Should integrate with all HF transformers LLMs.
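
A very rough sketch of what the wrapper's core could look like (placeholder class and method names; the real implementation would plug into RayEvaluator / GenerativeLM, whose interfaces aren't shown here):

# Placeholder sketch only; not the actual RayEvaluator / GenerativeLM API.
from typing import List

from vllm import LLM, SamplingParams

class VLLMEvaluatorSketch:
    def __init__(self, model_name: str, tensor_parallel_size: int = 1, max_model_len: int = 4000):
        # One engine per single-node actor; tensor-parallel across that node's GPUs.
        self.llm = LLM(
            model=model_name,
            tensor_parallel_size=tensor_parallel_size,
            max_model_len=max_model_len,
        )

    def generate(self, prompts: List[str], **sampling_kwargs) -> List[str]:
        # vLLM batches the prompts internally; return only the generated text.
        outputs = self.llm.generate(prompts, SamplingParams(**sampling_kwargs))
        return [out.outputs[0].text for out in outputs]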
@adivekar-utexas adivekar-utexas self-assigned this Feb 15, 2025
@adivekar-utexas adivekar-utexas added the enhancement New feature or request label Feb 15, 2025
@adivekar-utexas (Contributor, Author):

Initial exploration

Seems like vLLM can run inside a Ray cluster just fine.

Basic working code example

import ray
from vllm import LLM

@ray.remote(num_gpus=2, num_cpus=2)
class VLLMActor:
    def __init__(self, model_name: str, tensor_parallel_size: int):
        # Pre-download the model weights to the local HF cache (token is a placeholder).
        from huggingface_hub import snapshot_download
        snapshot_download(
            model_name,
            token="hf_YOUR_KEY_HERE",
        )
        # Create a vLLM engine that shards the model across the actor's GPUs;
        # tensor_parallel_size should match num_gpus above.
        self.llm = LLM(
            model=model_name,
            tensor_parallel_size=tensor_parallel_size,
            max_model_len=4000,
        )
        print(f"Loaded model {model_name} across {tensor_parallel_size} GPUs.")

    def generate_text(self, prompt: str, **kwargs):
        # llm.generate() returns a list of RequestOutput objects;
        # the generated text is at result[i].outputs[0].text.
        return self.llm.generate(prompt, **kwargs)

Usage:

# Create two VLLM actors, each loading DeepSeek-R1-Distill-Qwen-14B across 2 GPUs.
actors = [
    VLLMActor.remote(
        model_name="deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
        tensor_parallel_size=2,
    )
    for _ in range(2)
]

from bears.util import accumulate
from vllm.sampling_params import SamplingParams

prompt = ["Explain the theory of relativity in simple terms.", "Explain the theory of evolution in simple terms."]

# Generate text: each call returns a Ray future immediately.
res = actors[0].generate_text.remote(prompt[0], sampling_params=SamplingParams(max_tokens=3000, temperature=0.5))
res2 = actors[1].generate_text.remote(prompt[1], sampling_params=SamplingParams(max_tokens=3000, temperature=0.5))

# accumulate() resolves the Ray futures; each result is a list of RequestOutput objects.
print(accumulate(res)[0].outputs[0].text)

print(accumulate(res2)[0].outputs[0].text)

@adivekar-utexas (Contributor, Author):

Notes on initial exploration:

  • Downloading the model is pretty slow (it took an hour using snapshot_download). Can we speed this up somehow? I was downloading to the SSD of a g5.12xlarge, not EFS. (See the hf_transfer sketch after this list.)
  • vLLM itself works very smoothly: I was able to run DeepSeek-R1-Distill-Qwen-14B across 2 GPUs with about 90% vRAM usage per GPU (~20 GB used) and ~100% GPU utilization.
  • Token generation speed was decent (~30 tokens/sec). Since this is a reasoning model, it generated 933 tokens for the first prompt above.
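
One thing that might help with the slow download (an assumption, not something benchmarked here): enable huggingface_hub's optional hf_transfer backend and parallel file downloads. Minimal sketch, reusing the model and placeholder token from above:

# Requires `pip install hf_transfer`; the env var must be set before importing huggingface_hub.
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

snapshot_download(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
    token="hf_YOUR_KEY_HERE",
    max_workers=16,  # parallel file downloads
)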

I think this approach is good enough to use. VLLMEvaluator can be a thin wrapper around vLLM, but it will need some adapters for sampling parameters and for returning logprobs (see the sketch below).
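
For the logprobs adapter: vLLM's SamplingParams already exposes a logprobs field, so the evaluator mostly needs to unpack it from the RequestOutput. A minimal sketch (extract_logprobs is an illustrative helper, not an existing API):

# Sketch of a logprobs adapter; `extract_logprobs` is illustrative only.
from vllm import LLM, SamplingParams

def extract_logprobs(request_output):
    # Each generated token carries a dict {token_id: Logprob} with up to `logprobs` entries.
    completion = request_output.outputs[0]
    return [
        {token_id: lp.logprob for token_id, lp in token_logprobs.items()}
        for token_logprobs in (completion.logprobs or [])
    ]

llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-14B", tensor_parallel_size=2, max_model_len=4000)
params = SamplingParams(max_tokens=100, temperature=0.5, logprobs=5)  # top-5 logprobs per generated token
outputs = llm.generate(["Explain the theory of relativity in simple terms."], params)
print(outputs[0].outputs[0].text)
print(extract_logprobs(outputs[0])[:3])  # logprobs for the first three generated tokens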
