From e10c2667abd00f8f7b3c128fbd25a0be4bd183eb Mon Sep 17 00:00:00 2001
From: Hyunjae Woo <107147848+nv-hwoo@users.noreply.github.com>
Date: Wed, 27 Sep 2023 17:06:45 -0700
Subject: [PATCH] Add periodic concurrency mode to inference_load_modes.md
 (#402)

* Initial draft

* Small edit

* Add note

* Address feedback

* Minor fix
---
 .../docs/inference_load_modes.md              | 34 +++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/src/c++/perf_analyzer/docs/inference_load_modes.md b/src/c++/perf_analyzer/docs/inference_load_modes.md
index 10c4a8038..83fa83eb0 100644
--- a/src/c++/perf_analyzer/docs/inference_load_modes.md
+++ b/src/c++/perf_analyzer/docs/inference_load_modes.md
@@ -40,6 +40,40 @@ example, when using
 will attempt to have 4 outgoing inference requests at all times during
 profiling.
 
+## Periodic Concurrency Mode
+
+In periodic concurrency mode, Perf Analyzer periodically launches a new set of
+inference requests until the total number of requests launched since the
+beginning reaches N.
+
+For example, when using `--periodic-concurrency-range 10:100:30`, Perf Analyzer
+will start with 10 concurrent requests and, at every step, launch 30 new
+inference requests until the total number of requests launched since the
+beginning reaches 100. Additionally, the user can specify *when* to launch the
+new requests via `--request-period M`. This sets Perf Analyzer to launch a new
+set of requests whenever *all* of the most recently launched requests have
+each received M responses back from the server.
+
+The user can also pass custom parameters to the model using the
+`--request-parameter <name:value:type>` option.
+For instance, passing `--request-parameter max_tokens:256:int` will set an
+additional parameter `max_tokens` of type `int` to 256 as part of the request.
+
+```bash
+perf_analyzer -m <model_name> -i grpc --async --streaming \
+    --profile-export-file profile.json \
+    --periodic-concurrency-range 10:100:30 \
+    --request-period 10 \
+    --request-parameter max_tokens:256:int
+```
+
+> **Note**
+>
+> The periodic concurrency mode is currently supported only with the gRPC
+> protocol and only with [decoupled models](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/decoupled_models.md).
+> Additionally, the user must specify, via `--profile-export-file`, a file
+> where Perf Analyzer will dump all the profiled data.
+
 ## Request Rate Mode
 
 In request rate mode, Perf Analyzer attempts to send N inference requests per
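
To make the launch schedule described in the patch concrete, here is a minimal Python sketch (illustrative only, not part of Perf Analyzer) of how `--periodic-concurrency-range <start:end:step>` could break the total request budget into launch sets. The handling of a final partial step is an assumption; the patch text does not specify it.

```python
def launch_schedule(start: int, end: int, step: int) -> list[int]:
    """Model the per-period launch sizes of --periodic-concurrency-range.

    Returns how many new requests are launched in each period, assuming
    a possibly smaller final set when `step` does not divide evenly.
    """
    sets = [start]
    total = start
    while total < end:
        nxt = min(step, end - total)  # assumed: clamp the final step
        sets.append(nxt)
        total += nxt
    return sets


# 10:100:30 --> [10, 30, 30, 30]; cumulative totals 10, 40, 70, 100.
# With --request-period 10, each later set is launched only after every
# request in the previous set has received 10 responses.
print(launch_schedule(10, 100, 30))
```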
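
Because this mode requires `--profile-export-file`, a rough sketch of consuming the exported `profile.json` follows. The schema assumed here (a top-level `experiments` list whose entries hold `requests`, each with a send `timestamp` and streamed `response_timestamps` in nanoseconds) is an assumption drawn from recent Perf Analyzer exports, not something this patch specifies; verify it against the file your version actually writes.

```python
import json

# Assumed export layout -- verify against your Perf Analyzer version.
with open("profile.json") as f:
    data = json.load(f)

for experiment in data.get("experiments", []):
    for request in experiment.get("requests", []):
        sent_ns = request["timestamp"]                     # assumed field
        responses_ns = request.get("response_timestamps")  # assumed field
        if responses_ns:
            # Latency from sending the request to its first streamed
            # response (useful for decoupled/streaming models).
            first_ms = (responses_ns[0] - sent_ns) / 1e6
            print(f"first response after {first_ms:.2f} ms, "
                  f"{len(responses_ns)} responses total")
```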