Add vllm_worker support for lora_modules #3534

x22x22 · 2024-09-24T05:27:21Z

usage

start the worker

export VLLM_WORKER_MULTIPROC_METHOD=spawn
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 -m fastchat.serve.vllm_worker \
    --model-path /data/models/Qwen/Qwen2-72B-Instruct \
    --tokenizer /data/models/Qwen/Qwen2-72B-Instruct \
    --enable-lora \
    --lora-modules m1=/data/modules/lora/adapter/m1 m2=/data/modules/lora/adapter/m2 m3=/data/modules/lora/adapter/m3 \
    --model-names qwen2-72b-instruct,m1,m2,m3 \
    --controller http://localhost:21001 \
    --host 0.0.0.0 \
    --num-gpus 8 \
    --port 31034 \
    --limit-worker-concurrency 100 \
    --worker-address http://localhost:31034
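
In the command above, every adapter name passed to `--lora-modules` also appears in `--model-names`, next to the base model name. A minimal sketch of keeping the two flags in sync is shown below. The variable names (`base_model_name`, `adapter_root`, `lora_modules`, `model_names`) are hypothetical helpers for illustration and not part of FastChat or vLLM. The sketch also assumes every adapter lives in a directory named after it under one common root, as in the paths above.

```bash
# Hypothetical helper variables, not part of FastChat or vLLM.
base_model_name="qwen2-72b-instruct"
adapter_root="/data/modules/lora/adapter"

lora_modules=""
model_names="$base_model_name"

# Build "name=path" pairs for --lora-modules and a comma-separated
# list for --model-names from the same adapter names.
for name in m1 m2 m3; do
  lora_modules="${lora_modules:+$lora_modules }${name}=${adapter_root}/${name}"
  model_names="${model_names},${name}"
done

printf '%s\n' "--lora-modules ${lora_modules}"
printf '%s\n' "--model-names ${model_names}"
```

When splicing these into the worker command, `$lora_modules` would need to be left unquoted so that the shell splits it into one argument per adapter.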

post a request

example1: request the LoRA adapter m1

curl --location --request POST 'http://fastchat_address:port/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer sk-xxx' \
--data-raw '{
    "model": "m1",
    "stream": false,
    "temperature": 0.7,
    "top_p": 0.1,
    "max_tokens": 4096,
    "messages": [
      {
        "role": "user",
        "content": "Hi?"
      }
    ]
  }'

example2: request the base model

curl --location --request POST 'http://fastchat_address:port/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer sk-xxx' \
--data-raw '{
    "model": "qwen2-72b-instruct",
    "stream": false,
    "temperature": 0.7,
    "top_p": 0.1,
    "max_tokens": 4096,
    "messages": [
      {
        "role": "user",
        "content": "Hi?"
      }
    ]
  }'
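
The two requests above differ only in the `model` field: `m1` selects a LoRA adapter and `qwen2-72b-instruct` selects the base model. A rough sketch of a wrapper that makes this explicit follows. `chat_payload`, `chat`, `API_BASE`, and `API_KEY` are hypothetical names introduced for illustration. The payload is trimmed to the required fields, and the message text is not JSON-escaped, so it should only be used with simple strings.

```bash
# Hypothetical helper: print a chat-completions payload for the given
# model name (base model or LoRA adapter) and a simple message string.
chat_payload() {
  printf '{"model": "%s", "stream": false, "messages": [{"role": "user", "content": "%s"}]}\n' "$1" "$2"
}

# Hypothetical wrapper: POST the payload to the API server.
# API_BASE and API_KEY are placeholders that you must set yourself.
chat() {
  curl --silent --location --request POST "${API_BASE}/v1/chat/completions" \
    --header 'Content-Type: application/json' \
    --header "Authorization: Bearer ${API_KEY}" \
    --data-raw "$(chat_payload "$1" "$2")"
}

# Build (but do not send) one payload per target model.
payload_m1=$(chat_payload m1 "Hi?")
payload_base=$(chat_payload qwen2-72b-instruct "Hi?")
printf '%s\n' "$payload_m1" "$payload_base"
```

With `API_BASE` and `API_KEY` exported, `chat m1 "Hi?"` would target the adapter and `chat qwen2-72b-instruct "Hi?"` the base model.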

Why are these changes needed?

Related issue number (if applicable)

Checks

I've run format.sh to lint the changes in this PR.
I've included any doc changes needed.
I've made sure the relevant tests are passing (if applicable).


x22x22 added 3 commits September 24, 2024 13:23

add doc

d36dc74

x22x22 changed the title ~~## Add vllm_worker support for lora_modules~~ Add vllm_worker support for lora_modules Sep 27, 2024
