Use correct model_source for loading GPTQ models with FMS #61
Conversation
Signed-off-by: Thomas Parnell <[email protected]>
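For context, the change can be pictured along the lines of the sketch below. This is a hypothetical illustration, not the actual diff: the GPTQ source identifier, the helper function, and the surrounding architecture/variant names are all assumptions. The only grounded piece is that FMS selects its weight-conversion adapter via the `source` argument of `fms.models.get_model`, so loading a GPTQ checkpoint with the generic `"hf"` source would map the quantized state dict incorrectly.

```python
# Hypothetical sketch only: identifiers such as "hf_gptq" and this helper
# are assumptions for illustration, not taken from the actual diff.
from typing import Optional

from fms.models import get_model


def load_fms_model(model_path: str, quantization: Optional[str]):
    # A GPTQ checkpoint stores quantized tensors (qweight, scales, zeros, ...),
    # so the generic "hf" weight adapter would not map its state dict
    # correctly; a GPTQ-aware source must be selected instead.
    source = "hf_gptq" if quantization == "gptq" else "hf"  # assumed name
    return get_model(
        architecture="llama",  # assumed architecture/variant for the sketch
        variant="7b",
        model_path=model_path,
        source=source,
    )
```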
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
LGTM
Signed-off-by: Thomas Parnell <[email protected]>
Supports issue #878.

This PR comprises an initial set of 5 end-to-end tests (pytest) to verify the operation of vLLM on Spyre:

1) verification of correct output by comparing against output generated using HF transformers
2) verification of correct handling of multiple "overlapping" warmup shapes
3) verification of correct handling of prompts that exceed the maximum prompt length defined by the warmup shapes
4) verification of tensor parallel operation
5) verification that seeded sampling results in deterministic output (covers a recent issue)

The tests cover the various backends ('eager', 'inductor', 'sendnn_decoder'), multiple warmup shapes and combinations thereof, and multiple models. They also generate additional output that can be used to analyze and debug any problems efficiently. An example of such output is shown below:

```
model:         /models/llama-194m
warmup shapes: [(64, 20, 8)]
backend:       sendnn_decoder

#prompts:      4
#HF results:   4
#vLLM results: 4
....

prompt: 'What is the weather today like?'

generated:
HF:   ' What is the weather like tomorrow? What is the weather like today? What is the weather like tomorrow'
vLLM: ' What is the temperature? What is the humidity? What is the wind? What is the wind direction'
ERROR

      token id.  token           logprob           token id.  token           logprob
HF:   3639       ' What'         -1.592316   vLLM: 3639       ' What'         -1.583668
HF:   374        ' is'           -1.394676   vLLM: 374        ' is'           -1.388251
HF:   279        ' the'          -0.357713   vLLM: 279        ' the'          -0.350707
HF:   9282       ' weather'      -1.251681   vLLM: 9499       ' temperature'  -1.276881   ERROR
HF:   1093       ' like'         -0.403686   vLLM: 30         '?'             -1.256650   ERROR
HF:   16986      ' tomorrow'     -1.232682   vLLM: 3639       ' What'         -0.992781   ERROR
HF:   30         '?'             -0.236499   vLLM: 374        ' is'           -0.772252   ERROR
HF:   3639       ' What'         -0.647572   vLLM: 279        ' the'          -0.110132   ERROR
HF:   374        ' is'           -0.691557   vLLM: 38193      ' humidity'     -1.615233   ERROR
HF:   279        ' the'          -0.176384   vLLM: 30         '?'             -0.366836   ERROR
HF:   9282       ' weather'      -0.283581   vLLM: 3639       ' What'         -0.616249   ERROR
HF:   1093       ' like'         -0.266174   vLLM: 374        ' is'           -0.546297   ERROR
HF:   3432       ' today'        -0.595149   vLLM: 279        ' the'          -0.066663   ERROR
HF:   30         '?'             -0.449156   vLLM: 10160      ' wind'         -1.652243   ERROR
HF:   3639       ' What'         -1.047424   vLLM: 30         '?'             -1.023496   ERROR
HF:   374        ' is'           -0.569301   vLLM: 3639       ' What'         -0.602964   ERROR
HF:   279        ' the'          -0.122663   vLLM: 374        ' is'           -0.443599   ERROR
HF:   9282       ' weather'      -0.208832   vLLM: 279        ' the'          -0.075392   ERROR
HF:   1093       ' like'         -0.267763   vLLM: 10160      ' wind'         -1.916859   ERROR
HF:   16986      ' tomorrow'     -0.968443   vLLM: 5216       ' direction'    -1.399925   ERROR

logprob absolute differences: average=0.413219  maximum=1.649096
logprob relative differences: average=1.180087  maximum=6.158796
```

As this example illustrates, the test compares the text/tokens and the corresponding logprobs generated by HF and vLLM. Here, the logprobs for ' weather' and ' temperature' are very close for the 4th generated token (as can be seen above). Due to differences in processing/precision on the CPU versus the Spyre card, ' weather' had the highest probability under HF while ' temperature' had the highest probability under vLLM/Spyre, and the two generations diverge from that token onward. The output of the test makes it possible to analyze such cases quickly; based on it, the test can then be adapted to use different prompts so that only 'real' errors are reported.
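For reference, a condensed sketch of how one of these comparisons can be structured is shown below. This is not the code of this PR: the model path is taken from the example output above, the tolerance is an assumption, and the real tests additionally parametrize over backends, warmup shapes, and tensor-parallel size.

```python
# Minimal sketch of the HF-vs-vLLM comparison (model path from the example
# above; the tolerance and helper names are assumptions, not the PR's values).
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "/models/llama-194m"
PROMPTS = ["What is the weather today like?"]
MAX_NEW_TOKENS = 20


def hf_generate(prompts):
    """Greedy generation with HF transformers; returns (token_ids, logprobs)."""
    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL)
    results = []
    for prompt in prompts:
        inputs = tok(prompt, return_tensors="pt")
        out = model.generate(
            **inputs,
            max_new_tokens=MAX_NEW_TOKENS,
            do_sample=False,
            return_dict_in_generate=True,
            output_scores=True,
        )
        gen_ids = out.sequences[0, inputs.input_ids.shape[1]:]
        # Per-token logprobs of the generated ids under the HF model.
        logprobs = model.compute_transition_scores(
            out.sequences, out.scores, normalize_logits=True
        )[0]
        results.append((gen_ids.tolist(), logprobs.tolist()))
    return results


def vllm_generate(prompts):
    """Greedy generation with vLLM; logprobs=1 returns the sampled token's logprob."""
    llm = LLM(model=MODEL)
    params = SamplingParams(temperature=0.0, max_tokens=MAX_NEW_TOKENS, logprobs=1)
    results = []
    for req in llm.generate(prompts, params):
        seq = req.outputs[0]
        ids = list(seq.token_ids)
        logprobs = [lp[tid].logprob for tid, lp in zip(ids, seq.logprobs)]
        results.append((ids, logprobs))
    return results


def test_output_matches_hf():
    for (hf_ids, hf_lp), (v_ids, v_lp) in zip(hf_generate(PROMPTS),
                                              vllm_generate(PROMPTS)):
        assert hf_ids == v_ids
        # Token ids can match while logprobs still drift slightly due to
        # precision differences between the CPU and the Spyre card.
        diffs = [abs(a - b) for a, b in zip(hf_lp, v_lp)]
        assert max(diffs) < 0.1  # assumed tolerance
```

The per-token report shown above is produced from exactly this kind of data: rows where the HF and vLLM token ids disagree are flagged with ERROR, and the absolute/relative logprob differences are aggregated across all generated tokens.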