From e10c2667abd00f8f7b3c128fbd25a0be4bd183eb Mon Sep 17 00:00:00 2001
From: Hyunjae Woo <107147848+nv-hwoo@users.noreply.github.com>
Date: Wed, 27 Sep 2023 17:06:45 -0700
Subject: [PATCH] Add periodic concurrency mode to inference_load_modes.md
 (#402)

* Initial draft

* Small edit

* Add note

* Address feedback

* Minor fix
---
 .../docs/inference_load_modes.md              | 34 +++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/src/c++/perf_analyzer/docs/inference_load_modes.md b/src/c++/perf_analyzer/docs/inference_load_modes.md
index 10c4a8038..83fa83eb0 100644
--- a/src/c++/perf_analyzer/docs/inference_load_modes.md
+++ b/src/c++/perf_analyzer/docs/inference_load_modes.md
@@ -40,6 +40,40 @@ example, when using
 will attempt to have 4 outgoing inference requests at all times during
 profiling.
 
+## Periodic Concurrency Mode
+
+In periodic concurrency mode, Perf Analyzer periodically launches a new set of
+inference requests until the total number of requests launched since the
+beginning reaches N.
+
+For example, when using `--periodic-concurrency-range 10:100:30`, Perf Analyzer
+will start with 10 concurrent requests and, at every step, launch 30 new
+inference requests until the total number of requests launched since the
+beginning reaches 100. Additionally, the user can specify *when* to launch the
+new requests via `--request-period M`. This sets Perf Analyzer to launch a new
+set of requests whenever *all* of the most recently launched requests have
+each received M responses back from the server.
+
+The user can also pass custom parameters to the model using the
+`--request-parameter <name:value:type>` option.
+For instance, passing `--request-parameter max_tokens:256:int` will set an
+additional parameter `max_tokens` of type `int` to 256 as part of the request.
+
+```bash
+perf_analyzer -m <model_name> -i grpc --async --streaming \
+    --profile-export-file profile.json \
+    --periodic-concurrency-range 10:100:30 \
+    --request-period 10 \
+    --request-parameter max_tokens:256:int
+```
+
+> **Note**
+>
+> The periodic concurrency mode is currently supported only with the gRPC
+> protocol and only with [decoupled models](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/decoupled_models.md).
+> Additionally, the user must specify, via `--profile-export-file`, a file
+> where Perf Analyzer will dump all the profiled data.
+
 ## Request Rate Mode
 
 In request rate mode, Perf Analyzer attempts to send N inference requests per
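
To make the launch schedule described in the patch concrete, here is a minimal Python sketch (illustrative only, not part of Perf Analyzer) of how `--periodic-concurrency-range <start:end:step>` could break the total request budget into launch sets. The handling of a final partial step is an assumption; the patch text does not specify it.

```python
def launch_schedule(start: int, end: int, step: int) -> list[int]:
    """Model the per-period launch sizes of --periodic-concurrency-range.

    Returns how many new requests are launched in each period, assuming
    a possibly smaller final set when `step` does not divide evenly.
    """
    sets = [start]
    total = start
    while total < end:
        nxt = min(step, end - total)  # assumed: clamp the final step
        sets.append(nxt)
        total += nxt
    return sets


# 10:100:30 --> [10, 30, 30, 30]; cumulative totals 10, 40, 70, 100.
# With --request-period 10, each later set is launched only after every
# request in the previous set has received 10 responses.
print(launch_schedule(10, 100, 30))
```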
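
Because this mode requires `--profile-export-file`, a rough sketch of consuming the exported `profile.json` follows. The schema assumed here (a top-level `experiments` list whose entries hold `requests`, each with a send `timestamp` and streamed `response_timestamps` in nanoseconds) is an assumption drawn from recent Perf Analyzer exports, not something this patch specifies; verify it against the file your version actually writes.

```python
import json

# Assumed export layout -- verify against your Perf Analyzer version.
with open("profile.json") as f:
    data = json.load(f)

for experiment in data.get("experiments", []):
    for request in experiment.get("requests", []):
        sent_ns = request["timestamp"]                     # assumed field
        responses_ns = request.get("response_timestamps")  # assumed field
        if responses_ns:
            # Latency from sending the request to its first streamed
            # response (useful for decoupled/streaming models).
            first_ms = (responses_ns[0] - sent_ns) / 1e6
            print(f"first response after {first_ms:.2f} ms, "
                  f"{len(responses_ns)} responses total")
```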