vLLM Integration #1336

Closed
jjovalle99 opened this issue Feb 4, 2025 · 2 comments

Labels: help wanted (Extra attention is needed), question (Further information is requested)

Comments

@jjovalle99

Hello!

I am wondering if there is a recommended way to use Instructor with vLLM.

I have been doing:

vllm_client = OpenAI(...)
client = instructor.from_openai(vllm_client, mode=instructor.Mode.JSON)

But in theory instructor.Mode.TOOLS should also work, shouldn't it? What has your experience been with this?

@github-actions bot added the help wanted and question labels on Feb 4, 2025
@ivanleomk (Collaborator)

I got it to work with a model hosted on Modal running an OpenAI-compatible server (https://modal.com/docs/examples/vllm_inference); it worked out of the box with TOOLS mode.

I tested it last week with Qwen-2-VL. Going to close this for now since it doesn't look like a bug, but feel free to reopen if you run into the same problem.
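
For reference, a minimal sketch of that kind of setup (the base URL and model name here are placeholders, not the actual Modal deployment):

import instructor
from openai import OpenAI
from pydantic import BaseModel


class User(BaseModel):
    name: str
    age: int


# vLLM's OpenAI-compatible server only validates the key if it was
# started with --api-key, so a dummy value is fine.
client = instructor.from_openai(
    OpenAI(base_url="http://localhost:8000/v1", api_key="empty"),
    mode=instructor.Mode.TOOLS,
)

user = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",  # placeholder model name
    response_model=User,
    messages=[{"role": "user", "content": "Extract: John is 25 years old"}],
)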

@jjovalle99 (Author) commented Feb 8, 2025

Hello @ivanleomk!

Sorry to reopen this, but I tested it with Qwen 2.5 VL 72B and it didn't work in TOOLS mode. Here is how I deployed it:

vllm serve Qwen/Qwen2.5-VL-72B-Instruct --port 8000 --host 0.0.0.0 --dtype bfloat16 --tensor-parallel-size 4 \
--limit-mm-per-prompt image=5,video=0 --enable-auto-tool-choice --tool-call-parser hermes

(I also tested without --enable-auto-tool-choice --tool-call-parser hermes)
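
A raw tool-calling request against the same endpoint (bypassing instructor entirely) is a useful sanity check that the server's tool-call parser works at all; a sketch, with get_weather as a made-up example function:

from openai import OpenAI

client = OpenAI(base_url="http://192.153.62.139:8000/v1", api_key="empty")
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-72B-Instruct",
    messages=[{"role": "user", "content": "What is the weather in Paris?"}],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
)
print(resp.choices[0].message.tool_calls)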

This is the Python code:

import instructor
from pathlib import Path

from openai import AsyncOpenAI
from pydantic import BaseModel


class Response(BaseModel):
    reasoning: str
    answer: str


images_path = Path(
    "/Users/juanovalle/Informa Repositories/ingestion_pipeline/data/images_inference/2023"
)
image1 = instructor.Image.from_path(images_path / "2023_0002.png")

vllm_url = "http://192.153.62.139:8000/v1"
vllm_api_key = "empty"  # vLLM ignores the key unless the server was started with --api-key
model_name = "Qwen/Qwen2.5-VL-72B-Instruct"

vllm_client = AsyncOpenAI(base_url=vllm_url, api_key=vllm_api_key)
instructor_client = instructor.from_openai(
    client=vllm_client  # no mode argument, so this defaults to Mode.TOOLS
)

# Run inside an async context (e.g. a notebook); create_with_completion
# returns both the parsed model and the raw completion.
response, completion = await instructor_client.chat.completions.create_with_completion(
    model=model_name,
    response_model=Response,
    messages=[
        {
            "role": "user",
            "content": ["How many colleagues doe sinforma have", image1],
        },
    ],
    max_tokens=1024,
    temperature=0.0,
)

And I got this error:

RetryError: RetryError[<Future at 0x12975dc40 state=finished raised BadRequestError>]
[...]
InstructorRetryException: Error code: 400 - {'object': 'error', 'message': 'Expecting value: line 1 column 1 (char 0)', 'type': 'BadRequestError', 'param': None, 'code': 400}

These are the logs from the server:

INFO 02-08 11:19:07 logger.py:39] Received request chatcmpl-ae0907b2968c4822a9962a1949cae300: prompt: '<|im_start|>system\nYou are a helpful assistant.\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{"type": "function", "function": {"name": "Response", "description": "Correctly extracted `Response` with all the required parameters with correct types", "parameters": {"properties": {"reasoning": {"title": "Reasoning", "type": "string"}, "answer": {"title": "Answer", "type": "string"}}, "required": ["answer", "reasoning"], "type": "object"}}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{"name": <function-name>, "arguments": <args-json-object>}\n</tool_call><|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>\nHow many colleagues doe sinforma have<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1024, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=GuidedDecodingParams(json={'properties': {'reasoning': {'title': 'Reasoning', 'type': 'string'}, 'answer': {'title': 'Answer', 'type': 'string'}}, 'required': ['answer', 'reasoning'], 'type': 'object'}, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None)), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 02-08 11:19:07 async_llm.py:161] Added request chatcmpl-ae0907b2968c4822a9962a1949cae300.
INFO 02-08 11:19:09 loggers.py:72] Avg prompt throughput: 2587.6 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs GPU KV cache usage: 4.2%.
INFO:     100.67.5.15:1886 - "POST /v1/chat/completions HTTP/1.1" 200 OK
WARNING 02-08 11:19:09 chat_utils.py:825] Skipping multimodal part (type: 'text')with empty / unparsable content.
ERROR 02-08 11:19:09 serving_chat.py:193] Error in preprocessing prompt inputs
ERROR 02-08 11:19:09 serving_chat.py:193] Traceback (most recent call last):
ERROR 02-08 11:19:09 serving_chat.py:193]   File "/home/ubuntu/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/serving_chat.py", line 177, in create_chat_completion
ERROR 02-08 11:19:09 serving_chat.py:193]     ) = await self._preprocess_chat(
ERROR 02-08 11:19:09 serving_chat.py:193]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-08 11:19:09 serving_chat.py:193]   File "/home/ubuntu/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/serving_engine.py", line 386, in _preprocess_chat
ERROR 02-08 11:19:09 serving_chat.py:193]     conversation, mm_data_future = parse_chat_messages_futures(
ERROR 02-08 11:19:09 serving_chat.py:193]                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-08 11:19:09 serving_chat.py:193]   File "/home/ubuntu/.venv/lib/python3.12/site-packages/vllm/entrypoints/chat_utils.py", line 959, in parse_chat_messages_futures
ERROR 02-08 11:19:09 serving_chat.py:193]     _postprocess_messages(conversation)
ERROR 02-08 11:19:09 serving_chat.py:193]   File "/home/ubuntu/.venv/lib/python3.12/site-packages/vllm/entrypoints/chat_utils.py", line 914, in _postprocess_messages
ERROR 02-08 11:19:09 serving_chat.py:193]     item["function"]["arguments"] = json.loads(
ERROR 02-08 11:19:09 serving_chat.py:193]                                     ^^^^^^^^^^^
ERROR 02-08 11:19:09 serving_chat.py:193]   File "/home/ubuntu/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/json/__init__.py", line 346, in loads
ERROR 02-08 11:19:09 serving_chat.py:193]     return _default_decoder.decode(s)
ERROR 02-08 11:19:09 serving_chat.py:193]            ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-08 11:19:09 serving_chat.py:193]   File "/home/ubuntu/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/json/decoder.py", line 338, in decode
ERROR 02-08 11:19:09 serving_chat.py:193]     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
ERROR 02-08 11:19:09 serving_chat.py:193]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-08 11:19:09 serving_chat.py:193]   File "/home/ubuntu/.local/share/uv/python/cpython-3.12.8-linux-x86_64-gnu/lib/python3.12/json/decoder.py", line 356, in raw_decode
ERROR 02-08 11:19:09 serving_chat.py:193]     raise JSONDecodeError("Expecting value", s, err.value) from None
ERROR 02-08 11:19:09 serving_chat.py:193] json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
INFO:     100.67.5.15:1886 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request
[The retry at 11:19:12 logs the identical warning and traceback, ending in another 400 Bad Request.]
INFO 02-08 11:19:14 loggers.py:72] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2.4 tokens/s, Running: 0 reqs, Waiting: 0 reqs GPU KV cache usage: 0.0%.
INFO 02-08 11:19:19 loggers.py:72] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs GPU KV cache usage: 0.0%.

P.S.: It works with JSON mode.
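
For anyone hitting the same error, the JSON-mode variant that works is a one-line change to the client construction above:

instructor_client = instructor.from_openai(
    client=vllm_client,
    mode=instructor.Mode.JSON,  # JSON mode instead of the default TOOLS mode
)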
