
Use correct model_source for loading GPTQ models with FMS #61

Merged

tdoublep merged 2 commits into main from tpa-gptq-fix on Jan 17, 2025

Conversation

tdoublep (Member)

No description provided.

👋 Hi! Thank you for contributing to the vLLM project.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run the other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

tdoublep requested a review from maxdebayser on January 16, 2025 at 21:27
maxdebayser (Contributor) left a comment:

LGTM

Signed-off-by: Thomas Parnell <[email protected]>
tdoublep merged commit 7e068b5 into main on Jan 17, 2025
10 checks passed
tdoublep deleted the tpa-gptq-fix branch on January 17, 2025 at 19:49
tdoublep pushed a commit that referenced this pull request Jan 20, 2025
Supports issue #878

This PR comprises an initial set of five end-to-end tests (pytest) to
verify the operation of vLLM on Spyre:

1) verification of correct output by comparing against output generated
using HF transformers (a sketch of this comparison follows the list)
2) verification of correct handling of multiple "overlapping" warmup
shapes
3) verification of correct handling of prompts that exceed the
maximum prompt length defined by the warmup shapes
4) verification of tensor-parallel operation
5) verification that seeded sampling results in deterministic
output (covers a recent issue)
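
For illustration, here is a minimal, hypothetical sketch of how test (1) might compare vLLM output against HF transformers. The model path is the placeholder from the example output further below, the `VLLM_SPYRE_DYNAMO_BACKEND` environment variable is assumed to select the backend, and the actual tests in this PR may be structured differently:

```python
# Hypothetical sketch, not the actual test code from this PR: greedy-decode
# the same prompt with HF transformers and with vLLM, then compare the text.
import pytest
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "/models/llama-194m"  # placeholder path from the example output below
PROMPT = "What is the weather today like?"
MAX_NEW_TOKENS = 20


@pytest.mark.parametrize("backend", ["eager", "inductor", "sendnn_decoder"])
def test_output_matches_hf(backend, monkeypatch):
    # Assumption: the Spyre plugin selects its compilation backend via an
    # environment variable; the real selection mechanism may differ.
    monkeypatch.setenv("VLLM_SPYRE_DYNAMO_BACKEND", backend)

    # Reference output: greedy decoding with HF transformers.
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    hf_model = AutoModelForCausalLM.from_pretrained(MODEL)
    inputs = tokenizer(PROMPT, return_tensors="pt")
    hf_ids = hf_model.generate(
        **inputs, max_new_tokens=MAX_NEW_TOKENS, do_sample=False
    )
    hf_text = tokenizer.decode(hf_ids[0][inputs.input_ids.shape[1]:])

    # Candidate output: the same prompt through vLLM (temperature 0 = greedy).
    llm = LLM(model=MODEL)
    params = SamplingParams(temperature=0.0, max_tokens=MAX_NEW_TOKENS)
    vllm_text = llm.generate([PROMPT], params)[0].outputs[0].text

    assert vllm_text == hf_text
```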

The tests cover the different backends ('eager', 'inductor',
'sendnn_decoder'), multiple warmup shapes and combinations thereof, and
different models. They also generate additional output that can be used
to analyze and debug any problems efficiently.

An example of such output is shown below:

```
model:         /models/llama-194m
warmup shapes: [(64, 20, 8)]
backend:       sendnn_decoder

#prompts:      4
#HF results:   4
#vLLM results: 4

....

prompt:        'What is the weather today like?'
generated:
        HF:    ' What is the weather like tomorrow? What is the weather like today? What is the weather like tomorrow'
        vLLM:  ' What is the temperature? What is the humidity? What is the wind? What is the wind direction'  ERROR

   token id. token               logprob         token id. token               logprob
HF:     3639 ' What'             -1.592316  vLLM:     3639 ' What'             -1.583668
HF:      374 ' is'               -1.394676  vLLM:      374 ' is'               -1.388251
HF:      279 ' the'              -0.357713  vLLM:      279 ' the'              -0.350707
HF:     9282 ' weather'          -1.251681  vLLM:     9499 ' temperature'      -1.276881    ERROR
HF:     1093 ' like'             -0.403686  vLLM:       30 '?'                 -1.256650    ERROR
HF:    16986 ' tomorrow'         -1.232682  vLLM:     3639 ' What'             -0.992781    ERROR
HF:       30 '?'                 -0.236499  vLLM:      374 ' is'               -0.772252    ERROR
HF:     3639 ' What'             -0.647572  vLLM:      279 ' the'              -0.110132    ERROR
HF:      374 ' is'               -0.691557  vLLM:    38193 ' humidity'         -1.615233    ERROR
HF:      279 ' the'              -0.176384  vLLM:       30 '?'                 -0.366836    ERROR
HF:     9282 ' weather'          -0.283581  vLLM:     3639 ' What'             -0.616249    ERROR
HF:     1093 ' like'             -0.266174  vLLM:      374 ' is'               -0.546297    ERROR
HF:     3432 ' today'            -0.595149  vLLM:      279 ' the'              -0.066663    ERROR
HF:       30 '?'                 -0.449156  vLLM:    10160 ' wind'             -1.652243    ERROR
HF:     3639 ' What'             -1.047424  vLLM:       30 '?'                 -1.023496    ERROR
HF:      374 ' is'               -0.569301  vLLM:     3639 ' What'             -0.602964    ERROR
HF:      279 ' the'              -0.122663  vLLM:      374 ' is'               -0.443599    ERROR
HF:     9282 ' weather'          -0.208832  vLLM:      279 ' the'              -0.075392    ERROR
HF:     1093 ' like'             -0.267763  vLLM:    10160 ' wind'             -1.916859    ERROR
HF:    16986 ' tomorrow'         -0.968443  vLLM:     5216 ' direction'        -1.399925    ERROR

logprob absolute differences: average=0.413219  maximum=1.649096
logprob relative differences: average=1.180087  maximum=6.158796
```
As this example illustrates, the test compares the text/tokens and the
corresponding logprobs generated by HF and vLLM.
In this example, the logprobs for 'weather' and 'temperature' are very
close for the 4th generated token (as can be seen above). Due to
differences in processing, precision, etc. between the CPU and the Spyre
card, 'weather' had the highest probability for HF, while 'temperature'
had the highest probability for vLLM/Spyre. The test output makes it
possible to analyze such cases quickly. Based on it, the test can then be
adapted to use different prompts so that only 'real' errors are reported.
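
For reference, the summary lines at the bottom of the example output can be reproduced with a small helper like the following (a hypothetical sketch; the function and variable names are illustrative, not from the PR):

```python
# Hypothetical helper mirroring the summary lines in the example output:
# absolute differences are |hf - vllm| per token; relative differences are
# the absolute differences scaled by the magnitude of the HF logprob.
def logprob_differences(hf_logprobs: list[float], vllm_logprobs: list[float]):
    abs_diffs = [abs(h - v) for h, v in zip(hf_logprobs, vllm_logprobs)]
    rel_diffs = [d / abs(h) for d, h in zip(abs_diffs, hf_logprobs) if h != 0.0]
    return {
        "abs_average": sum(abs_diffs) / len(abs_diffs),
        "abs_maximum": max(abs_diffs),
        "rel_average": sum(rel_diffs) / len(rel_diffs),
        "rel_maximum": max(rel_diffs),
    }
```

Under these definitions, the reported maximum absolute difference of 1.649096 corresponds to the ' like'/' wind' token pair above (-0.267763 vs. -1.916859), and dividing it by |-0.267763| gives the reported maximum relative difference of 6.158796.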