Add continuous batch size benchmark to LLM guide
matthewkotila committed Sep 28, 2023
1 parent e10c266 commit 4bbcf05
Showing 1 changed file with 75 additions and 0 deletions: src/c++/perf_analyzer/docs/llm.md
@@ -194,3 +194,78 @@ python3 examples/calculate_avg_token_to_token_latency.py
```

#### 6. Repeat steps 3-5 with different prompt lengths to measure effects of initial prompt size (prefill) on token-to-token latency (generation).

### Benchmark 3: Profiling Continuous Batch Size

In this benchmarking scenario, we want to measure the effect of continuous
batch size on token-to-token latency. We systematically issue requests of a
fixed input size to the server and ask the model to generate a fixed number of
output tokens, so that the continuous batch size grows over time.

#### (Optional) Stop Triton Server if already running

```bash
pkill tritonserver
```

#### 1. Run the following commands to set the `max_tokens` to `100` and `ignore_eos` to `true`

```bash
PATH_TO_MODEL_PY="model_repository/vllm/1/model.py"
MAX_TOKENS=100
# Replace line 128 with two lines: one setting max_tokens and one setting ignore_eos
sed -i "128s/.*/\ \ \ \ \ \ \ \ params_dict[\"max_tokens\"] = ${MAX_TOKENS}\n\ \ \ \ \ \ \ \ params_dict[\"ignore_eos\"] = True/" ${PATH_TO_MODEL_PY}
```
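
As an optional sanity check (not part of the original guide), you can print the edited region of `model.py` to confirm both settings are present; the exact line numbers depend on your copy of the file:

```bash
# Show the lines around the substitution target and search for the two settings
sed -n '125,132p' ${PATH_TO_MODEL_PY}
grep -n "max_tokens\|ignore_eos" ${PATH_TO_MODEL_PY}
```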

#### 2. Start Triton Server

```bash
docker run --gpus all -it --rm -p 8001:8001 --shm-size=1G --ulimit memlock=-1 --ulimit stack=67108864 -v ${PWD}:/work -w /work tritonserver_vllm tritonserver --model-store ./model_repository
# this will run continuously in the current shell
```
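
Before moving on, it can help to confirm from another shell that the server container is up; this quick check is not part of the original guide and assumes the image was built with the name `tritonserver_vllm` as in the command above:

```bash
# List running containers started from the tritonserver_vllm image
docker ps --filter "ancestor=tritonserver_vllm" --format "table {{.ID}}\t{{.Status}}\t{{.Ports}}"
```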

#### 3. Generate prompts input data JSON

```bash
# open a new shell in the same directory you were in when running the above command
echo '
{
  "data": [
    {
      "PROMPT": [
        "Hello, my name is"
      ],
      "STREAM": [
        true
      ]
    }
  ]
}
' > prompts.json
```
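
If you want to verify that the file is well-formed JSON before profiling (an optional check, not part of the original guide), you can run it through Python's built-in formatter:

```bash
# Pretty-print prompts.json; a parse error here means the file is malformed
python3 -m json.tool prompts.json
```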

#### 4. Run Perf Analyzer

```bash
perf_analyzer \
-m vllm \
-i grpc \
--async \
--streaming \
--input-data=prompts.json \
--profile-export-file=profile_export.json \
--measurement-mode=count_windows \
--measurement-request-count=10 \
--stability-percentage=999 \
--periodic-concurrency-range=1:20:1 \
--request-period=10
```
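
The two new flags drive the ramp described at the top of this benchmark: `--periodic-concurrency-range=1:20:1` ramps the number of in-flight requests from 1 to 20 in steps of 1, and `--request-period=10` controls how many responses must be received before each step of the ramp. The snippet below is only an illustrative calculation of that ramp under those assumed semantics, not part of the original guide:

```bash
# Illustrative arithmetic for --periodic-concurrency-range=1:20:1 and --request-period=10
# (assumed semantics: launch STEP new requests after every REQUEST_PERIOD responses
# until END concurrent requests are in flight)
START=1; END=20; STEP=1; REQUEST_PERIOD=10
NUM_STEPS=$(( (END - START) / STEP ))
echo "Concurrency ramps from ${START} to ${END} in-flight requests via ${NUM_STEPS} incremental launches, one after every ${REQUEST_PERIOD} responses."
```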

#### 5. Calculate average token-to-token latency

```bash
python3 examples/calculate_avg_token_to_token_latency.py
# Average token-to-token latency: 0.003090155677419355 s
```

#### 6. Repeat steps 4-5 with different periodic concurrency range start/end/step values and different request periods to measure the effect of continuous batch size on token-to-token latency (generation).
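
For example, a more aggressive ramp might look like the run below; the start/end/step and request period values here are illustrative, not values from the original guide:

```bash
perf_analyzer \
-m vllm \
-i grpc \
--async \
--streaming \
--input-data=prompts.json \
--profile-export-file=profile_export.json \
--measurement-mode=count_windows \
--measurement-request-count=10 \
--stability-percentage=999 \
--periodic-concurrency-range=5:50:5 \
--request-period=20
```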
