
Integrate vLLM Evaluator #23

Open
adivekar-utexas opened this issue Feb 15, 2025 · 2 comments
Labels: enhancement (New feature or request)

Comments

@adivekar-utexas (Contributor):

vLLM is a high-throughput LLM inference engine that runs HuggingFace models and shards them across GPUs in various ways using a Ray backend.
Even in its basic form, vLLM is a great speedup over AccelerateEvaluator, which is quite slow.

Basic requirements (a rough interface sketch follows the list):

  1. Should be compatible with RayEvaluator (and GenerativeLM if needed).
  2. Should support only single-node models; scaling to larger models should require a larger node (a design choice that favors execution speed).
  3. Should integrate with all HF transformers LLMs.
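
A very rough sketch of what the wrapper's core could look like (placeholder class and method names; the real implementation would plug into RayEvaluator / GenerativeLM, whose interfaces aren't shown here):

# Placeholder sketch only; not the actual RayEvaluator / GenerativeLM API.
from typing import List

from vllm import LLM, SamplingParams

class VLLMEvaluatorSketch:
    def __init__(self, model_name: str, tensor_parallel_size: int = 1, max_model_len: int = 4000):
        # One engine per single-node actor; tensor-parallel across that node's GPUs.
        self.llm = LLM(
            model=model_name,
            tensor_parallel_size=tensor_parallel_size,
            max_model_len=max_model_len,
        )

    def generate(self, prompts: List[str], **sampling_kwargs) -> List[str]:
        # vLLM batches the prompts internally; return only the generated text.
        outputs = self.llm.generate(prompts, SamplingParams(**sampling_kwargs))
        return [out.outputs[0].text for out in outputs]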
@adivekar-utexas adivekar-utexas self-assigned this Feb 15, 2025
@adivekar-utexas adivekar-utexas added the enhancement New feature or request label Feb 15, 2025
@adivekar-utexas (Contributor, Author):

Initial exploration

Seems like vLLM can run inside a Ray cluster just fine.

Basic working code example

import ray
from vllm import LLM

@ray.remote(num_gpus=2, num_cpus=2)
class VLLMActor:
    def __init__(self, model_name: str, tensor_parallel_size: int):
        # Pre-download the model weights to the local HF cache (token is a placeholder).
        from huggingface_hub import snapshot_download
        snapshot_download(
            model_name,
            token="hf_YOUR_KEY_HERE",
        )
        # Create a vLLM engine that shards the model across the actor's GPUs;
        # tensor_parallel_size should match num_gpus above.
        self.llm = LLM(
            model=model_name,
            tensor_parallel_size=tensor_parallel_size,
            max_model_len=4000,
        )
        print(f"Loaded model {model_name} across {tensor_parallel_size} GPUs.")

    def generate_text(self, prompt: str, **kwargs):
        # llm.generate() returns a list of RequestOutput objects;
        # the generated text is at result[i].outputs[0].text.
        return self.llm.generate(prompt, **kwargs)

Usage:

# Create two VLLM actors, each loading DeepSeek-R1-Distill-Qwen-14B across 2 GPUs.
actors = [
    VLLMActor.remote(
        model_name="deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
        tensor_parallel_size=2,
    )
    for _ in range(2)
]

from bears.util import accumulate
from vllm.sampling_params import SamplingParams

prompt = ["Explain the theory of relativity in simple terms.", "Explain the theory of evolution in simple terms."]

# Generate text: each call returns a Ray future immediately.
res = actors[0].generate_text.remote(prompt[0], sampling_params=SamplingParams(max_tokens=3000, temperature=0.5))
res2 = actors[1].generate_text.remote(prompt[1], sampling_params=SamplingParams(max_tokens=3000, temperature=0.5))

# accumulate() resolves the Ray futures; each result is a list of RequestOutput objects.
print(accumulate(res)[0].outputs[0].text)

print(accumulate(res2)[0].outputs[0].text)

@adivekar-utexas (Contributor, Author):

Notes on initial exploration:

  • Downloading the model is pretty slow (it took an hour using snapshot_download). Can we speed this up somehow? I was downloading to the SSD of a g5.12xlarge, not EFS. (See the hf_transfer sketch after this list.)
  • vLLM itself works very smoothly: I was able to run DeepSeek-R1-Distill-Qwen-14B across 2 GPUs with about 90% vRAM usage per GPU (~20 GB used) and ~100% GPU utilization.
  • Token generation speed was decent (~30 tokens/sec). Since this is a reasoning model, it generated 933 tokens for the first prompt above.
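
One thing that might help with the slow download (an assumption, not something benchmarked here): enable huggingface_hub's optional hf_transfer backend and parallel file downloads. Minimal sketch, reusing the model and placeholder token from above:

# Requires `pip install hf_transfer`; the env var must be set before importing huggingface_hub.
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

snapshot_download(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
    token="hf_YOUR_KEY_HERE",
    max_workers=16,  # parallel file downloads
)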

I think this approach is good enough to use. VLLMEvaluator can be a thin wrapper around vLLM, but it will need some adapters for sampling parameters and for returning logprobs (see the sketch below).
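
For the logprobs adapter: vLLM's SamplingParams already exposes a logprobs field, so the evaluator mostly needs to unpack it from the RequestOutput. A minimal sketch (extract_logprobs is an illustrative helper, not an existing API):

# Sketch of a logprobs adapter; `extract_logprobs` is illustrative only.
from vllm import LLM, SamplingParams

def extract_logprobs(request_output):
    # Each generated token carries a dict {token_id: Logprob} with up to `logprobs` entries.
    completion = request_output.outputs[0]
    return [
        {token_id: lp.logprob for token_id, lp in token_logprobs.items()}
        for token_logprobs in (completion.logprobs or [])
    ]

llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-14B", tensor_parallel_size=2, max_model_len=4000)
params = SamplingParams(max_tokens=100, temperature=0.5, logprobs=5)  # top-5 logprobs per generated token
outputs = llm.generate(["Explain the theory of relativity in simple terms."], params)
print(outputs[0].outputs[0].text)
print(extract_logprobs(outputs[0])[:3])  # logprobs for the first three generated tokens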
