Collect telemetry metrics from Triton metrics endpoint #26

Merged 9 commits on Aug 14, 2024

@lkomali lkomali commented Aug 5, 2024

This PR is a part of adding telemetry metrics to GenAI-perf.

Changes implemented in this PR:

  1. Added a new telemetry metrics class that holds the GPU metrics listed below:
    • gpu_power_usage
    • gpu_power_limit
    • energy_consumption
    • gpu_utilization
    • total_gpu_memory
    • gpu_memory_used
  2. Added a new folder to hold all telemetry data collection files.
  3. Added a base telemetry data collector class that runs a thread to gather metrics from the endpoint.
  4. Added a Triton telemetry data collector class with an implementation specific to Triton metrics.
  5. Modified wrapper.py to start the TelemetryDataCollector thread once the perf_analyzer subprocess starts running.
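
As a rough illustration of items 3 and 4, here is a minimal sketch of a base collector that polls on a background thread, plus a Triton-flavored subclass. It is not the code from this PR: the class names, `parse_triton_metrics`, the `TRITON_TO_FIELD` mapping, and the polling interval are all hypothetical, and the `nv_*` names are assumed to match what Triton's Prometheus-style metrics endpoint reports.

```python
import threading
import urllib.request
from abc import ABC, abstractmethod
from typing import Dict, List, Optional

# Assumed mapping from Triton metric names to the field names listed in this PR.
TRITON_TO_FIELD = {
    "nv_gpu_power_usage": "gpu_power_usage",
    "nv_gpu_power_limit": "gpu_power_limit",
    "nv_energy_consumption": "energy_consumption",
    "nv_gpu_utilization": "gpu_utilization",
    "nv_gpu_memory_total_bytes": "total_gpu_memory",
    "nv_gpu_memory_used_bytes": "gpu_memory_used",
}


def parse_triton_metrics(raw: str) -> Dict[str, List[float]]:
    """Parse Prometheus-style text into {field: [value per GPU]}.

    Assumes one 'name{labels} value' sample per line with no trailing timestamp.
    """
    values: Dict[str, List[float]] = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE lines
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name in TRITON_TO_FIELD:
            values.setdefault(TRITON_TO_FIELD[name], []).append(float(value))
    return values


class BaseCollectorSketch(ABC):
    """Hypothetical base class: scrapes a source periodically on a thread."""

    def __init__(self, interval_s: float = 1.0) -> None:
        self._interval_s = interval_s
        self._stop_event = threading.Event()
        self._thread: Optional[threading.Thread] = None
        self.samples: List[Dict[str, List[float]]] = []  # one entry per scrape

    @abstractmethod
    def _fetch(self) -> str:
        """Return the raw text of one scrape."""

    @abstractmethod
    def _parse(self, raw: str) -> Dict[str, List[float]]:
        """Turn raw text into per-GPU values for each metric."""

    def _run(self) -> None:
        while not self._stop_event.is_set():
            self.samples.append(self._parse(self._fetch()))
            # wait() returns early if stop() is called during the interval
            self._stop_event.wait(self._interval_s)

    def start(self) -> None:
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self) -> None:
        self._stop_event.set()
        if self._thread is not None:
            self._thread.join()


class TritonCollectorSketch(BaseCollectorSketch):
    """Hypothetical Triton-specific collector built on the base class."""

    def __init__(self, url: str, interval_s: float = 1.0) -> None:
        super().__init__(interval_s)
        self._url = url

    def _fetch(self) -> str:
        with urllib.request.urlopen(self._url, timeout=5) as resp:
            return resp.read().decode("utf-8")

    def _parse(self, raw: str) -> Dict[str, List[float]]:
        return parse_triton_metrics(raw)
```

In a wrapper along the lines of item 5, one would presumably call `start()` right after launching the perf_analyzer subprocess and `stop()` once it exits, so that collection covers the whole benchmark run.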

This is how the metrics are stored.

[Image omitted: example of the stored metrics]

For every metric, the values are stored as a list of lists.
The outer list represents the sequence of metric measurements over time.
Each inner list contains the metric values for each GPU at a particular time point.
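
To make the list-of-lists layout concrete, here is a small sketch with made-up readings for a two-GPU machine. The helper names (`append_sample`, `per_gpu_series`, `per_gpu_mean`) and the numbers are illustrative assumptions, not part of the PR.

```python
from typing import List

# Hypothetical gpu_power_usage readings: outer list = time, inner list = GPUs.
gpu_power_usage: List[List[float]] = [
    [10.0, 20.0],  # first measurement:  [GPU 0, GPU 1]
    [20.0, 40.0],  # second measurement: [GPU 0, GPU 1]
]


def append_sample(series: List[List[float]], per_gpu_values: List[float]) -> None:
    """Record one more time point (one value per GPU)."""
    series.append(list(per_gpu_values))


def per_gpu_series(series: List[List[float]]) -> List[List[float]]:
    """Transpose so each inner list is one GPU's values over time."""
    return [list(column) for column in zip(*series)]


def per_gpu_mean(series: List[List[float]]) -> List[float]:
    """Average each GPU's readings across all time points."""
    return [sum(column) / len(column) for column in zip(*series)]


append_sample(gpu_power_usage, [30.0, 60.0])
print(per_gpu_series(gpu_power_usage))  # [[10.0, 20.0, 30.0], [20.0, 40.0, 60.0]]
print(per_gpu_mean(gpu_power_usage))    # [20.0, 40.0]
```

Indexing is `series[time_index][gpu_index]`, so `gpu_power_usage[1][0]` is GPU 0's reading at the second measurement.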

@lkomali lkomali marked this pull request as draft August 5, 2024 22:38

@debermudez debermudez left a comment

Nice job with this.
I want to go through some of it with you, but this is a great step toward wrapping this up and getting it out to our users. Thanks for working so hard on this.

genai-perf/genai_perf/constants.py (outdated, resolved)
genai-perf/genai_perf/wrapper.py (outdated, resolved)
genai-perf/genai_perf/wrapper.py (resolved)
@lkomali lkomali marked this pull request as ready for review August 8, 2024 07:27
dyastremsky previously approved these changes Aug 8, 2024
@dyastremsky dyastremsky left a comment

Excellent work, Harshini!

debermudez previously approved these changes Aug 8, 2024
@debermudez debermudez left a comment

Just wait to merge until @nv-hwoo closes his comments.
Nice work!

@nv-hwoo nv-hwoo left a comment

Awesome work @lkomali! Thanks for adding such a detailed docstring 👍 And sorry for getting back so late 😅 Everything looks good; I have just one small comment about the optional argument for TelemetryDataCollector in the run function.

genai-perf/genai_perf/wrapper.py (outdated, resolved)
@lkomali lkomali dismissed stale reviews from debermudez and dyastremsky via f85bac3 August 13, 2024 19:39
@lkomali lkomali merged commit e67f9ca into main Aug 14, 2024
7 checks passed
@lkomali lkomali deleted the lkomali-tpa-192 branch August 14, 2024 15:03
lkomali added a commit that referenced this pull request Aug 15, 2024
* Collect telemetry metrics from Triton metrics endpoint

* Remove one of the print statements

* Fix comments

* Fix pre-commit errors

* Fix test errors

* Add unit tests and fix code

* Fix pre-commit error

* Fix codeql warnings

* Fix comments
4 participants