feat: Add histogram support and TTFT histogram metric #396
For an initial merge with histograms disabled by default, I think this is fine to go ahead with if we need to cherry-pick. However, please take note of the following:
I think this is relatively on the hot path and may impact latency, unlike our other inference metrics (`TRITONBACKEND_ReportStatistics`), which backends generally report after response sending (impacting throughput but not response latency). You can find perf numbers for each prometheus-cpp metric type at the bottom of the README here: https://github.com/jupp0r/prometheus-cpp
One individual observation on a single metric with a small number of buckets may not be impactful for one request, but as we scale up to high concurrency, more metrics, more buckets, etc., this could present a noticeable latency impact.
It would be good to do some light validation of overall latency before/after the feature via genai-perf, especially at high concurrency and when streaming many responses/tokens, as there can also be some synchronization in the interaction with the prometheus registry when many responses are in flight.

It would probably be advantageous to do the actual prometheus registry interaction after sending the response if possible: on the hot path, only do the bare minimum of determining whether we should report metrics (check whether this is the first response and record the latency), then use that information to report the metric after initiating the response send.