feat: Add histogram support and TTFT histogram metric #396
Conversation
Force-pushed from 0c2893d to ba4495d.
Force-pushed from ba4495d to b9231d8.
LGTM! @rmccorm4 should also take a look before merging.
@@ -214,6 +232,10 @@ InferenceResponse::Send(
       TRITONSERVER_TRACE_TENSOR_BACKEND_OUTPUT, "InferenceResponse Send");
 #endif  // TRITON_ENABLE_TRACING

+#ifdef TRITON_ENABLE_METRICS
+  response->UpdateResponseMetrics();
For an initial merge with histograms disabled by default, I think this is fine to go ahead with if we need to cherry-pick. However, please take note of the following:

I think this is relatively on the hot path and could impact latency, compared to our other inference metrics (TRITONBACKEND_ReportStatistics), which are generally reported after response sending in backends (impacting throughput but not response latency).

You can find perf numbers for each prometheus-cpp metric type at the bottom of the README here: https://github.com/jupp0r/prometheus-cpp

One observation on a single metric with a small number of buckets may not be impactful for one request, but as we scale up to high concurrency, more metrics, more buckets, etc., this could present a noticeable latency impact.

It would be good to do some light validation of overall latency before/after the feature via genai-perf, especially for high concurrency and streaming many responses/tokens, since there can also be some synchronization in the interaction with the prometheus registry across many concurrent responses.

It would probably be advantageous to do the actual prometheus registry interaction after sending the response if possible: do only the bare minimum up front to determine whether we should report metrics (check whether this is the first response and record the latency), then use that information to report the metric after initiating the response send. A minimal sketch of that ordering follows.
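To make the suggested ordering concrete, here is a minimal sketch, not the PR's implementation: only the cheap first-response check and timestamp capture happen before the send, and the registry interaction (the Observe() call) is deferred until after the send has been initiated. `RequestState`, `SendAndReport`, and `send_response` are hypothetical placeholders for whatever the real code path provides.

```cpp
// Sketch of the suggested ordering: cheap bookkeeping -> initiate send ->
// prometheus registry interaction. Not the PR's code; RequestState and
// send_response are placeholders.
#include <atomic>
#include <chrono>
#include <functional>

#include <prometheus/histogram.h>

struct RequestState {
  std::chrono::steady_clock::time_point start;   // set when the request arrived
  std::atomic<bool> first_response_sent{false};  // flips on the first response
};

void SendAndReport(
    RequestState& state, prometheus::Histogram& ttft_histogram,
    const std::function<void()>& send_response)
{
  // 1. Pre-send work is just a flag flip and a timestamp read.
  const bool is_first =
      !state.first_response_sent.exchange(true, std::memory_order_relaxed);
  const auto now = std::chrono::steady_clock::now();

  // 2. Kick off the response send before touching any metrics machinery,
  //    so the registry stays off the response-latency critical path.
  send_response();

  // 3. Only now interact with the prometheus registry.
  if (is_first) {
    const double ttft_ms =
        std::chrono::duration<double, std::milli>(now - state.start).count();
    ttft_histogram.Observe(ttft_ms);
  }
}
```

Capturing the timestamp before initiating the send keeps the measured latency honest even though the Observe() call happens afterwards.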
Force-pushed from 1b4f571 to b6b5af9.
Deferring to Jacky as I'll be OOO, but let's make sure the test cases give us confidence in the feature and that we're addressing the spirit of the use case that motivates it.
…into DLIS-7383-yinggeh-metrics-standardization-TTFT
LGTM! The test cases give us confidence in the feature, and the feature addresses the use case.
Please make sure CI passes with all the final changes before merging.
What does the PR do?
Adds a new histogram metric, nv_inference_first_response_histogram_ms, that records time to first response (TTFT). The new histogram metric is specific to decoupled models only. The --metrics-config histogram_latencies=<bool> option can be used to disable histograms. A rough registration sketch follows the examples below.

Example: a request sent to the TRT-LLM backend ensemble model.
Example: a request sent to the TRT-LLM backend tensorrt_llm_bls model.
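As a rough illustration only, not the PR's actual code, this is what registering and observing such a histogram looks like with prometheus-cpp, the library referenced in the review above. Only the metric name comes from this PR; the labels, bucket boundaries, and observed value are made-up assumptions.

```cpp
// Sketch only: registering a TTFT histogram with prometheus-cpp and
// recording one observation. Labels, buckets, and the observed value
// are illustrative; only the metric name comes from this PR.
#include <prometheus/family.h>
#include <prometheus/histogram.h>
#include <prometheus/registry.h>

int main()
{
  prometheus::Registry registry;

  // One histogram family, one labeled series with a few millisecond buckets.
  auto& family =
      prometheus::BuildHistogram()
          .Name("nv_inference_first_response_histogram_ms")
          .Help("Time from request receipt to first response, in ms")
          .Register(registry);
  auto& ttft = family.Add(
      {{"model", "example_model"}, {"version", "1"}},
      prometheus::Histogram::BucketBoundaries{100, 500, 2000, 5000});

  // A single observation: the first response arrived 237.5 ms after the
  // request was received (illustrative value only).
  ttft.Observe(237.5);
  return 0;
}
```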
Checklist
<commit_type>: <Title>
Commit Type:
Related PRs:
triton-inference-server/server#7694
Where should the reviewer start?
n/a
Test plan:
L0_metrics--base
L0_response_cache--base
19614087
Caveats:
Background
Standardizing Large Model Server Metrics in Kubernetes