
v1.6

BogdanAtWork released this on 08 Oct 20:00 · 27 commits to main since this release · b0588cc

New Features

  • JetStream Checkpoint Converter Support: Added support for JetStream checkpoint conversion for Llama models on MaxText. (#840)
  • Automatic Deployment of CMSA and Autoscaling Config: Enabled automatic deployment of the Custom Metrics Stackdriver Adapter (CMSA) and autoscaling configurations for custom metrics with vLLM. (#825)
  • Network Optimization DaemonSet: Introduced a new DaemonSet that applies node-level network optimizations to improve performance, including IRQ spreading, tuned TCP settings, and larger GVE driver packet buffers; a sketch of this kind of tuning follows this list. (#805)
  • Server-Side Metrics Scraping: Added an initial implementation for scraping server-side metrics for analysis, with support for vLLM and JetStream. (#804)
  • Pod Label Copying to Node: The TPU Provisioner can now be configured to copy specific Pod labels to Node labels at Node Pool creation time. (#788)
  • Configurable Prompt Dataset: The prompt dataset used for benchmarking and analysis is now configurable. (#844)
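
The sketch below illustrates the kind of node-level tuning the network-optimization DaemonSet (#805) performs: spreading NIC interrupt handling across CPUs and raising TCP buffer limits. This is a minimal sketch under stated assumptions; the sysctl values, the "gve" interrupt-name match, and the script structure are illustrative and are not the manifest shipped in this release. A real DaemonSet would run logic like this in a privileged container with the host's /proc mounted.

```python
# Illustrative sketch only (not the actual DaemonSet from #805): spread NIC IRQs
# across CPUs and raise TCP buffer limits. Values and the "gve" match string are
# assumptions; running this requires root and access to the host's /proc.
import itertools
import os

# Assumed TCP buffer settings; the values applied by the real DaemonSet may differ.
TCP_SYSCTLS = {
    "/proc/sys/net/core/rmem_max": "134217728",
    "/proc/sys/net/core/wmem_max": "134217728",
    "/proc/sys/net/ipv4/tcp_rmem": "4096 131072 134217728",
    "/proc/sys/net/ipv4/tcp_wmem": "4096 131072 134217728",
}


def apply_tcp_settings() -> None:
    """Write the assumed TCP sysctl values into the host's /proc/sys tree."""
    for path, value in TCP_SYSCTLS.items():
        with open(path, "w") as f:
            f.write(value)


def spread_irqs(match: str = "gve") -> None:
    """Round-robin IRQs whose /proc/interrupts line mentions `match`
    (e.g. the GVE NIC queues) across all online CPUs."""
    cpus = itertools.cycle(range(os.cpu_count() or 1))
    with open("/proc/interrupts") as f:
        for line in f:
            if match not in line:
                continue
            irq = line.split(":", 1)[0].strip()
            if not irq.isdigit():
                continue
            with open(f"/proc/irq/{irq}/smp_affinity_list", "w") as aff:
                aff.write(str(next(cpus)))


if __name__ == "__main__":
    apply_tcp_settings()
    spread_irqs()
```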

Improvements

  • Benchmarking Script Enhancements:
    • The benchmarking script now computes metrics from data it gathers directly instead of relying on Prometheus, resulting in more accurate and user-relevant results (see the sketch after this list). (#836)
    • Added request_rate to the summary statistics generated by the benchmarking script. (#837)
    • Made the benchmark time configurable and increased the default time to 2 minutes for improved steady-state analysis. (#833)
    • Included additional metrics in the Load Profile Generator (LPG) script output for more comprehensive analysis. (#832)
    • Ensured that the LPG script output can be collected by changing the LPG to a Deployment and enabling --save-json-results by default. (#811)
  • MLflow Fixes: Resolved issues with multiple experiment versions, duplicate model registrations, and missing system metrics in multi-node scenarios. (#813)
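
As a hedged illustration of the benchmarking changes above (#836, #837, #839), the sketch below shows how summary statistics such as request_rate and output-token throughput can be computed directly from per-request data collected by the benchmarking client instead of being scraped from Prometheus. The RequestRecord fields and output key names are assumptions for illustration, not the actual schema produced by the LPG script.

```python
# Hedged sketch: derive summary statistics from per-request records gathered by
# the benchmark client itself. Field and key names are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class RequestRecord:
    start_s: float       # request start time (seconds, monotonic clock)
    end_s: float         # time the response completed
    output_tokens: int   # tokens generated for this request


def summarize(records: list[RequestRecord]) -> dict:
    """Compute request rate, output-token throughput, and mean latency."""
    if not records:
        return {}
    duration_s = max(r.end_s for r in records) - min(r.start_s for r in records)
    if duration_s <= 0:
        return {}
    total_output_tokens = sum(r.output_tokens for r in records)
    return {
        "benchmark_time_s": duration_s,
        # Requests issued per second over the measured window (#837).
        "request_rate": len(records) / duration_s,
        # Throughput reported in output tokens per second (#839).
        "output_tokens_per_s": total_output_tokens / duration_s,
        "mean_latency_s": sum(r.end_s - r.start_s for r in records) / len(records),
    }


if __name__ == "__main__":
    demo = [RequestRecord(0.0, 1.2, 64), RequestRecord(0.5, 2.0, 128)]
    print(summarize(demo))
```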

Bug Fixes

  • Fixed Internal Links: Updated internal links to publicly accessible Cloud Console links. (#843)
  • Removed Unavailable JetStream Metrics: Removed unavailable JetStream metrics from the monitoring system. (#841)
  • Fixed Throughput Metric: Corrected the throughput metric to be reported in output tokens per second. (#839)
  • Fixed Missing JSON Fields: Added missing JSON fields to the benchmarking script output. (#835)
  • Fixed Single GPU Training Job Example: Corrected an issue in the single GPU training job example where the model directory was not being created. (#831)
  • Re-enabled CI and Fixed Flaky Tests: Re-enabled continuous integration and addressed OOM kills in the FUSE sidecar and flaky database connections. (#827)
  • Removed GKE Cluster Version: Removed the pinned GKE cluster version so the Terraform configuration uses the latest version in the REGULAR release channel. (#817)
  • Updated Pause Container Image Location: Updated the location of the pause container image to a more stable and accessible source. (#814)
  • Upgraded GKE Module: Upgraded the GKE module to avoid compatibility issues with Terraform Provider Google v5.44.0 on Autopilot. (#806)

Other Changes

  • Moved the ml-platform to the GoogleCloudPlatform/accelerated-platforms repository (#828) and included it as a submodule (#829); see the note after this list.
  • Added a tutorial for installing KServe on GKE Autopilot. (#812)
  • Removed unneeded dependencies in the LPG container image. (#830)
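
Note on the submodule change above: after #829, a fresh clone needs `git clone --recurse-submodules` (or `git submodule update --init` in an existing checkout) to fetch the ml-platform content from GoogleCloudPlatform/accelerated-platforms.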