
Use correct model_source for loading GPTQ models with FMS #61

Merged

tdoublep merged 2 commits into main from tpa-gptq-fix on Jan 17, 2025

Conversation

tdoublep (Member)

No description provided.

👋 Hi! Thank you for contributing to the vLLM project.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run the other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

tdoublep requested a review from maxdebayser on January 16, 2025 at 21:27
maxdebayser (Contributor) left a comment:

LGTM

Signed-off-by: Thomas Parnell <[email protected]>
tdoublep merged commit 7e068b5 into main on Jan 17, 2025
10 checks passed
tdoublep deleted the tpa-gptq-fix branch on January 17, 2025 at 19:49
tdoublep pushed a commit that referenced this pull request Jan 20, 2025
Supports issue #878

This PR comprises an initial set of five end-to-end tests (pytest) to
verify the operation of vLLM on Spyre:

1) verification of correct output by comparing against output generated
using HF transformers (a sketch of this comparison follows the list)
2) verification of correct handling of multiple "overlapping" warmup
shapes
3) verification of correct handling of prompts that exceed the
maximum prompt length defined by the warmup shapes
4) verification of tensor-parallel operation
5) verification that seeded sampling results in deterministic
output (covers a recent issue)
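
For illustration, here is a minimal, hypothetical sketch of how test (1) might compare vLLM output against HF transformers. The model path is the placeholder from the example output further below, the `VLLM_SPYRE_DYNAMO_BACKEND` environment variable is assumed to select the backend, and the actual tests in this PR may be structured differently:

```python
# Hypothetical sketch, not the actual test code from this PR: greedy-decode
# the same prompt with HF transformers and with vLLM, then compare the text.
import pytest
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "/models/llama-194m"  # placeholder path from the example output below
PROMPT = "What is the weather today like?"
MAX_NEW_TOKENS = 20


@pytest.mark.parametrize("backend", ["eager", "inductor", "sendnn_decoder"])
def test_output_matches_hf(backend, monkeypatch):
    # Assumption: the Spyre plugin selects its compilation backend via an
    # environment variable; the real selection mechanism may differ.
    monkeypatch.setenv("VLLM_SPYRE_DYNAMO_BACKEND", backend)

    # Reference output: greedy decoding with HF transformers.
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    hf_model = AutoModelForCausalLM.from_pretrained(MODEL)
    inputs = tokenizer(PROMPT, return_tensors="pt")
    hf_ids = hf_model.generate(
        **inputs, max_new_tokens=MAX_NEW_TOKENS, do_sample=False
    )
    hf_text = tokenizer.decode(hf_ids[0][inputs.input_ids.shape[1]:])

    # Candidate output: the same prompt through vLLM (temperature 0 = greedy).
    llm = LLM(model=MODEL)
    params = SamplingParams(temperature=0.0, max_tokens=MAX_NEW_TOKENS)
    vllm_text = llm.generate([PROMPT], params)[0].outputs[0].text

    assert vllm_text == hf_text
```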

The tests cover the different backends ('eager', 'inductor',
'sendnn_decoder'), multiple warmup shapes and combinations thereof, and
different models. They also generate additional output that can be used
to analyze and debug any problems efficiently.

An example of such output is shown below:

```
model:         /models/llama-194m
warmup shapes: [(64, 20, 8)]
backend:       sendnn_decoder

#prompts:      4
#HF results:   4
#vLLM results: 4

....

prompt:        'What is the weather today like?'
generated:
        HF:    ' What is the weather like tomorrow? What is the weather like today? What is the weather like tomorrow'
        vLLM:  ' What is the temperature? What is the humidity? What is the wind? What is the wind direction'  ERROR

   token id. token               logprob         token id. token               logprob
HF:     3639 ' What'             -1.592316  vLLM:     3639 ' What'             -1.583668
HF:      374 ' is'               -1.394676  vLLM:      374 ' is'               -1.388251
HF:      279 ' the'              -0.357713  vLLM:      279 ' the'              -0.350707
HF:     9282 ' weather'          -1.251681  vLLM:     9499 ' temperature'      -1.276881    ERROR
HF:     1093 ' like'             -0.403686  vLLM:       30 '?'                 -1.256650    ERROR
HF:    16986 ' tomorrow'         -1.232682  vLLM:     3639 ' What'             -0.992781    ERROR
HF:       30 '?'                 -0.236499  vLLM:      374 ' is'               -0.772252    ERROR
HF:     3639 ' What'             -0.647572  vLLM:      279 ' the'              -0.110132    ERROR
HF:      374 ' is'               -0.691557  vLLM:    38193 ' humidity'         -1.615233    ERROR
HF:      279 ' the'              -0.176384  vLLM:       30 '?'                 -0.366836    ERROR
HF:     9282 ' weather'          -0.283581  vLLM:     3639 ' What'             -0.616249    ERROR
HF:     1093 ' like'             -0.266174  vLLM:      374 ' is'               -0.546297    ERROR
HF:     3432 ' today'            -0.595149  vLLM:      279 ' the'              -0.066663    ERROR
HF:       30 '?'                 -0.449156  vLLM:    10160 ' wind'             -1.652243    ERROR
HF:     3639 ' What'             -1.047424  vLLM:       30 '?'                 -1.023496    ERROR
HF:      374 ' is'               -0.569301  vLLM:     3639 ' What'             -0.602964    ERROR
HF:      279 ' the'              -0.122663  vLLM:      374 ' is'               -0.443599    ERROR
HF:     9282 ' weather'          -0.208832  vLLM:      279 ' the'              -0.075392    ERROR
HF:     1093 ' like'             -0.267763  vLLM:    10160 ' wind'             -1.916859    ERROR
HF:    16986 ' tomorrow'         -0.968443  vLLM:     5216 ' direction'        -1.399925    ERROR

logprob absolute differences: average=0.413219  maximum=1.649096
logprob relative differences: average=1.180087  maximum=6.158796
```
As this example illustrates, the test compares the text/tokens and the
corresponding logprobs generated by HF and vLLM.
In this example, the logprobs for 'weather' and 'temperature' are very
close for the 4th generated token (as can be seen above). Due to
differences in processing, precision, etc. between the CPU and the Spyre
card, 'weather' had the highest probability for HF, while 'temperature'
had the highest probability for vLLM/Spyre. The test output makes it
possible to analyze such cases quickly. Based on it, the test can then be
adapted to use different prompts so that only 'real' errors are reported.
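
For reference, the summary lines at the bottom of the example output can be reproduced with a small helper like the following (a hypothetical sketch; the function and variable names are illustrative, not from the PR):

```python
# Hypothetical helper mirroring the summary lines in the example output:
# absolute differences are |hf - vllm| per token; relative differences are
# the absolute differences scaled by the magnitude of the HF logprob.
def logprob_differences(hf_logprobs: list[float], vllm_logprobs: list[float]):
    abs_diffs = [abs(h - v) for h, v in zip(hf_logprobs, vllm_logprobs)]
    rel_diffs = [d / abs(h) for d, h in zip(abs_diffs, hf_logprobs) if h != 0.0]
    return {
        "abs_average": sum(abs_diffs) / len(abs_diffs),
        "abs_maximum": max(abs_diffs),
        "rel_average": sum(rel_diffs) / len(rel_diffs),
        "rel_maximum": max(rel_diffs),
    }
```

Under these definitions, the reported maximum absolute difference of 1.649096 corresponds to the ' like'/' wind' token pair above (-0.267763 vs. -1.916859), and dividing it by |-0.267763| gives the reported maximum relative difference of 6.158796.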