Add periodic concurrency mode to inference_load_modes.md (#402)
* Initial draft

* Small edit

* Add note

* Address feedback

* Minor fix
nv-hwoo authored Sep 28, 2023
1 parent 2b522e6 commit e10c266
Showing 1 changed file with 34 additions and 0 deletions.
34 changes: 34 additions & 0 deletions src/c++/perf_analyzer/docs/inference_load_modes.md
@@ -40,6 +40,40 @@ example, when using
will attempt to have 4 outgoing inference requests at all times during
profiling.

## Periodic Concurrency Mode

In periodic concurrency mode, Perf Analyzer periodically launches a new set of
inference requests until the total number of requests launched since the start
of profiling reaches N.

For example, when using `--periodic-concurrency-range 10:100:30`, Perf Analyzer
starts with 10 concurrent requests and launches 30 new inference requests at
every step until the total number of requests launched since the beginning
reaches 100. The user can also control *when* each new set is launched by
specifying `--request-period M`. With this option, Perf Analyzer launches the
next set of requests once *all* requests in the most recently launched set have
received M responses back from the server.
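
The launch arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the `start:end:step` semantics, not Perf Analyzer's implementation. The function name `launch_schedule` is hypothetical, and capping the final batch when the step does not evenly divide the range is an assumption; the real tool may handle or reject such ranges differently.

```python
def launch_schedule(start: int, end: int, step: int) -> list[tuple[int, int]]:
    """Return (batch_size, cumulative_total) pairs for a start:end:step range.

    Hypothetical helper for illustration only; it models the counting
    described in the text, not when each batch is actually launched.
    """
    if start <= 0 or step <= 0 or end < start:
        raise ValueError("expected 0 < start <= end and step > 0")

    total = start
    schedule = [(start, total)]  # the initial set of concurrent requests
    while total < end:
        # Assumption: never launch past `end`; cap the final batch if needed.
        batch = min(step, end - total)
        total += batch
        schedule.append((batch, total))
    return schedule


if __name__ == "__main__":
    for batch, total in launch_schedule(10, 100, 30):
        print(f"launch {batch:>3} requests, {total:>3} launched in total")
```

For `10:100:30` this yields four launches with cumulative totals of 10, 40, 70, and 100 requests. The `--request-period` value decides when each of these launches happens and does not change the counts.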

The user can also pass custom parameters to the model using the
`--request-parameter <name:value:type>` option.
For instance, passing `--request-parameter max_tokens:256:int` will set an
additional parameter `max_tokens` of type `int` to 256 as part of the request.
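
To make the `name:value:type` format concrete, here is a minimal Python sketch of how such a string could be split and converted. It is not Perf Analyzer's parser. The helper name `parse_request_parameter` is hypothetical, and the set of accepted type names (`int`, `string`, `bool`) is an assumption made for the example; only `int` is confirmed by the text above.

```python
from typing import Callable, Union

ParamValue = Union[int, str, bool]

# Assumption: these type names are illustrative; only `int` appears in the docs.
_CONVERTERS: dict[str, Callable[[str], ParamValue]] = {
    "int": int,
    "string": str,
    "bool": lambda raw: raw.lower() == "true",
}


def parse_request_parameter(spec: str) -> tuple[str, ParamValue]:
    """Split a `name:value:type` string and convert the value to its type."""
    try:
        name, rest = spec.split(":", 1)          # name is the first field
        raw_value, type_name = rest.rsplit(":", 1)  # type is the last field
    except ValueError:
        raise ValueError(f"expected <name:value:type>, got {spec!r}") from None

    if type_name not in _CONVERTERS:
        raise ValueError(f"unsupported type {type_name!r} in {spec!r}")
    return name, _CONVERTERS[type_name](raw_value)


if __name__ == "__main__":
    print(parse_request_parameter("max_tokens:256:int"))
```

Splitting the name from the left and the type from the right keeps a value that itself contains a colon intact, which is a design choice of this sketch rather than documented behavior.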

```bash
perf_analyzer -m <model_name> -i grpc --async --streaming \
--profile-export-file profile.json \
--periodic-concurrency-range 10:100:30 \
--request-period 10 \
--request-parameter max_tokens:256:int
```

> **Note**
>
> Periodic concurrency mode is currently supported only with the gRPC protocol
> and with [decoupled models](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/decoupled_models.md).
> Additionally, the user must specify a file where Perf Analyzer can write all
> the profiled data using `--profile-export-file`.

## Request Rate Mode

In request rate mode, Perf Analyzer attempts to send N inference requests per