New Features
JetStream Checkpoint Converter Support: Added support for JetStream checkpoint conversion for Llama models on MaxText. (#840)
Automatic Deployment of CMSA and Autoscaling Config: Enabled automatic deployment of the Custom Metrics Stackdriver Adapter (CMSA) and autoscaling configurations for custom metrics with vLLM. (#825)
Network Optimization DaemonSet: Introduced a new DaemonSet that applies network optimizations to improve performance, including IRQ spreading, TCP settings, and larger GVE driver packet buffers. (#805)
Server-Side Metrics Scraping: Added an initial implementation for scraping server-side metrics for analysis, with support for vLLM and JetStream. (#804)
Pod Label Copying to Node: The TPU Provisioner can now be configured to copy specific Pod labels to Node labels at Node Pool creation time. (#788)
Configurable Prompt Dataset: The prompt dataset is now configurable, allowing you to customize the prompts used for benchmarking and analysis. (#844)
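A configurable prompt dataset might be consumed along these lines; this is a minimal sketch for illustration only, and the file format, field name, and function name are assumptions, not the tool's actual interface:

```python
import json

def load_prompts(path, max_prompts=None):
    """Load benchmark prompts from a user-supplied JSON Lines file.

    Each line is assumed to be an object like {"prompt": "..."}; this
    schema is hypothetical, not the benchmarking tool's real format.
    """
    prompts = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            prompts.append(json.loads(line)["prompt"])
            if max_prompts is not None and len(prompts) >= max_prompts:
                break
    return prompts
```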
Improvements
Benchmarking Script Enhancements:
The benchmarking script now uses data gathered directly by the script itself instead of relying on Prometheus, yielding more accurate and user-relevant metrics. (#836)
Added request_rate to the summary statistics generated by the benchmarking script. (#837)
Made the benchmark time configurable and increased the default time to 2 minutes for improved steady-state analysis. (#833)
Included additional metrics in the Load Profile Generator (LPG) script output for more comprehensive analysis. (#832)
Ensured that the LPG script output can be collected by converting the LPG into a Deployment and enabling --save-json-results by default. (#811)
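The summary-statistics changes above (script-gathered data, a request_rate field, throughput measured in output tokens per second) can be sketched roughly as follows; the record fields and function name are hypothetical, chosen only to illustrate the computation:

```python
def summarize(records, duration_s):
    """Compute summary statistics from per-request records gathered by
    the benchmark itself (rather than scraped from Prometheus).

    Each record is assumed to look like {"output_tokens": int}; the
    field name is illustrative, not the script's actual schema.
    """
    total_requests = len(records)
    total_output_tokens = sum(r["output_tokens"] for r in records)
    return {
        "request_rate": total_requests / duration_s,     # requests per second
        "throughput": total_output_tokens / duration_s,  # output tokens per second
    }
```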
MLFlow Fixes: Resolved issues with multiple experiment versions, duplicate model registrations, and missing system metrics in multi-node scenarios. (#813)
Removed Unavailable JetStream Metrics: Removed unavailable JetStream metrics from the monitoring system. (#841)
Fixed Throughput Metric: Corrected the throughput metric to be in output tokens per second. (#839)
Fixed Missing JSON Fields: Added missing JSON fields to the benchmarking script output. (#835)
Fixed Single GPU Training Job Example: Corrected an issue in the single GPU training job example where the model directory was not being created. (#831)
Re-enabled CI and Fixed Flaky Tests: Re-enabled continuous integration and addressed OOMKills in the FUSE sidecar and database connection flakiness. (#827)
Removed GKE Cluster Version: Removed the specified GKE cluster version to allow the Terraform configuration to use the latest version in the REGULAR channel. (#817)
Updated Pause Container Image Location: Updated the location of the pause container image to a more stable and accessible source. (#814)
Upgraded GKE Module: Upgraded the GKE module to avoid compatibility issues with Terraform Provider Google v5.44.0 on Autopilot. (#806)
Other Changes
Moved the ml-platform to the GoogleCloudPlatform/accelerated-platforms repository (#828) and included it as a submodule. (#829)
Added a tutorial for installing KServe on GKE Autopilot. (#812)
Removed unneeded dependencies in the LPG container image. (#830)