Releases: GoogleCloudPlatform/ai-on-gke

v1.10

13 Feb 20:02
d64781d
Pre-release

Release v1.10 - 2024-01-07 Onwards

This release focuses on improving TPU support, enhancing security, and refining benchmarking tools for AI workloads on GKE.

New Features:

  • Added Terraform configurations to create A3U and A4M clusters, along with infrastructure switches for A3U and A4U (#969).
  • Introduced a multi-cluster batch processing platform example using GKE Autopilot, DWS, and Kueue with Multikueue enabled (#949).
  • Added a guide for running Skypilot on GKE with Dynamic Workload Scheduling and Kueue (#942).
  • Integrated the Kubernetes Security Validation Service (Shipshape) cluster scan into the project to enhance security validation and compliance (#935).
  • Benchmarking tool now scrapes more vLLM metrics for detailed performance analysis (#937).
  • Benchmarking script's request timeout is now configurable via the --request-timeout flag (#932).
  • Added NCCL test switching logic (#954).
  • Added recipe switching logic (#950).
  • Converted the A3 Mega NCCL test from Pods to a JobSet (#961).
  • Added missing GCS module (#959).
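For instance, the new --request-timeout flag (#932) would be passed wherever the benchmarking script is invoked. A minimal sketch of a Job wrapping the script — the image, container name, and the other flags shown are placeholders, not part of this release:

```yaml
# Hypothetical Job running the benchmarking script; only --request-timeout
# is confirmed by this release -- all other names and values are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: vllm-benchmark
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: benchmark
          image: us-docker.pkg.dev/my-project/benchmarks/lpg:latest  # placeholder
          args:
            - --request-timeout=600   # seconds before an in-flight request is abandoned
            - --host=vllm-service     # placeholder target
            - --port=8000
```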

Improvements:

  • Updated the default TPU webhook image to v1.2.2, which includes a fix for incorrect TPU_WORKER_HOSTNAMES caused by KubeRay controller truncation (#972).
  • Improved TPU provisioning by adding support for v6e and cross-project reservations (#851).
  • Prefixed GCS bucket names with project_id to avoid naming conflicts (#975).
  • Replaced the subnetwork name with vpc (#971).
  • Added a note to use the nightly Skypilot version when using serve on Autopilot, along with troubleshooting steps (#970).
  • Clarified that container-image should have a tag in disk image building (#939).
  • Made the GCS bucket optional in the benchmarking script (#924).
  • Added "models" to LPG sample.tfvars (#926).
  • Merged Helm scan with cluster scan (#957).
  • Made small fixes and temporarily disabled UI visibility toggle (#955).
  • Fixed header on variables file (#974).
  • Fixed duplicate deployment_name variable (#944).

Bug Fixes:

  • Fixed a bug in the TPU webhook where KubeRay's service name truncation resulted in incorrectly generated TPU_WORKER_HOSTNAMES (#963).
  • Addressed CVE-2024-45338 in the TPU webhook image by fixing an upstream golang.org/x/net vulnerability (#968).
  • Handled the bucket not found exception in the benchmarking script (#929).
  • Reverted the TGI image version to address out-of-GPU memory issues on L4 nodes (#931).

v1.9

14 Jan 22:15
768bd8a

This release includes a number of new features, improvements, and bug fixes.

New Features

  • Automated Security Scanning: Introduced a Cloud Build workflow to automatically perform Helm scans on the ai-on-gke repository using the Shipshape validation service along with documentation to guide users on how to handle security violations identified by Shipshape. (#918, #920)
  • 65k Node GKE Benchmark: Added a benchmark for GKE on a simulated AI workload using Terraform and ClusterLoader2, supporting clusters with 65,000 nodes. (#898)

Improvements

  • DWS-Kueue Example: Bumped Kueue quotas to 1 billion per resource to provide almost unlimited admission in examples. (#911)
  • Infrastructure Module Queued Provisioning: Added queued_provisioning setting on all node pools for compatibility with DWS. (#909)
  • TCP Receive Buffer Limit Configuration: Added a daemonset to configure TCP receive buffer limits for improved DCN performance. (#906)
  • SkyPilot Tutorial: Improved formatting and fixed typos in the SkyPilot tutorial. (#905, #907)
  • Documentation Clarity: Clarified instructions on obtaining the billing account ID and folder ID. (#914)
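The TCP receive buffer DaemonSet (#906) follows a common pattern: a privileged init container writes the host sysctl, then a pause container keeps the Pod alive. A sketch of that pattern — the specific buffer values and names here are assumptions, not the values shipped in the release:

```yaml
# Sketch of a sysctl-tuning DaemonSet; buffer sizes are illustrative only.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: tcp-rmem-tuner
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: tcp-rmem-tuner
  template:
    metadata:
      labels:
        app: tcp-rmem-tuner
    spec:
      hostNetwork: true          # write the node's network namespace, not the Pod's
      initContainers:
        - name: set-rmem
          image: busybox:1.36
          securityContext:
            privileged: true     # required to modify host sysctls
          command: ["sh", "-c", "sysctl -w net.ipv4.tcp_rmem='4096 87380 16777216'"]
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9   # keeps the DaemonSet Pod running
```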

Bug Fixes

  • E2E Test Flakiness: Increased timeout to mitigate intermittent failures in E2E tests caused by delays in Ray pod startup. (#919)
  • Slurm-on-GKE Container Image: Fixed shebang line position in Slurm-on-GKE container image. (#917)
  • Directory Path: Fixed an incorrect directory path in the Slurm guide. (#912)

Full Changelog: v1.8...v1.9

v1.8

06 Dec 15:26
412621f

This release includes a number of new features, improvements and bug fixes.

New Features

  • Add HTTP streaming support for JetStream: Added functionality for streaming responses via HTTP, enhancing real-time data processing. (#877)
  • SkyPilot tutorial: Created a tutorial to demonstrate using SkyPilot to launch batch workloads across regions. (#887)
  • Define ephemeral-storage in ClusterQueue: Introduced support for workloads requesting ephemeral storage. (#872)
  • Slurm on GKE Guide: Published a guide for deploying Slurm clusters on GKE for AI/ML workloads. (#864)
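Covering ephemeral storage in a ClusterQueue (#872) means adding it to the queue's covered resources so Kueue admits workloads that request it. A minimal sketch, assuming a ResourceFlavor named default-flavor exists; the quota values are placeholders:

```yaml
# Sketch of a ClusterQueue covering ephemeral-storage; quotas are illustrative.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {}   # admit workloads from all namespaces
  resourceGroups:
    - coveredResources: ["cpu", "memory", "ephemeral-storage"]
      flavors:
        - name: default-flavor   # must match an existing ResourceFlavor
          resources:
            - name: cpu
              nominalQuota: 64
            - name: memory
              nominalQuota: 256Gi
            - name: ephemeral-storage
              nominalQuota: 1Ti
```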

Improvements

  • Benchmarking improvements:
    • Avoid repeating identical prompts by making prompts reusable for large datasets. (#880)
    • Added --output-bucket flag for uploading benchmark results to Google Cloud Storage. (#871)
  • Ray job optimization: Explicitly shut down Ray jobs after completing vector embedding tasks to avoid unnecessary runtime. (#735)
  • Update permissions for guides:
    • Added storage.objectViewer permission to boot disk guide to resolve access issues. (#893)
  • Update Jupyter Notebook image: Introduced a new tag prefix to mitigate internal vulnerability checks. (#886)
  • Use TPU network optimizer image: Shifted to an image-based approach for network optimization, improving maintainability. (#870)

Bug Fixes

  • Fix vLLM PodMonitoring: Addressed issues related to vLLM monitoring configurations. (#889)
  • Fix jupyter hub helm chart version: Pinned the JupyterHub helm chart version to mitigate server spawning errors. (#879)
  • TF version dependency mismatch: Resolved TensorFlow version mismatch issues. (#885)
  • Ray TPU webhook image update: Bumped image version to fix inconsistent PodInformer updates for large TPU slices. (#891)

Full Changelog: v1.7...v1.8

v1.7

04 Nov 21:11
0127782

This release includes a number of new features, improvements and bug fixes.

New Features

  • Added a benchmarking tool for measuring data loading performance with gcsfuse. (#863)
  • Added a Prometheus server to the Latency Profile Generator (LPG) running on port 9090, along with new metrics for prompt_length, response_length, and time_per_output_token. (#857)
  • Added support for Google Cloud Monitoring and Managed Collection for the gke-batch-refarch. (#856)
  • Added a tutorial on packaging models and low-rank adapters (LoRA) from Hugging Face as images, pushing them to Artifact Registry, and deploying them in GKE. (#855)
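With the LPG now serving Prometheus metrics on port 9090 (#857), they can be collected by Google Managed Prometheus via a PodMonitoring resource. A sketch — the selector label is an assumption about how the LPG Pods are labeled:

```yaml
# Sketch of a Managed Prometheus scrape config for the LPG metrics endpoint.
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: lpg-metrics
spec:
  selector:
    matchLabels:
      app: latency-profile-generator   # placeholder label
  endpoints:
    - port: 9090    # Prometheus server added in #857
      interval: 15s
```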

Improvements

  • Updated outdated references to the Text Generation Inference (TGI) container to use the Hugging Face Deep Learning Containers (DLCs) hosted on Google Cloud's Artifact Registry. (#816)
  • Added the ability to benchmark multiple models concurrently in the LPG. (#850)
  • Added support for "inf" (infinity) request rate and number of prompts in the LPG. (#847)
  • Fixed the latency_throughput_curve.sh script to correctly parse non-integer request rates and added "errors" to the benchmark results. (#850)

Bug Fixes

  • Fixed an issue where the README was not rendering correctly. (#862)

Full Changelog: v1.6...v1.7

v1.6

08 Oct 20:00
b0588cc

New Features

  • JetStream Checkpoint Converter Support: Added support for JetStream checkpoint conversion for Llama models on MaxText. (#840)
  • Automatic Deployment of CMSA and Autoscaling Config: Enabled automatic deployment of the Custom Metrics Stackdriver Adapter (CMSA) and autoscaling configurations for custom metrics with vLLM. (#825)
  • Network Optimization DaemonSet: Introduced a new DaemonSet that applies network optimizations to improve performance, including IRQ spreading, TCP settings, and larger GVE driver packet buffers. (#805)
  • Server-Side Metrics Scraping: Added initial implementation for scraping server-side metrics for analysis, with support for vLLM and Jetstream. (#804)
  • Pod Label Copying to Node: The TPU Provisioner can now be configured to copy specific Pod labels to Node labels at Node Pool creation time. (#788)
  • Configurable Prompt Dataset: The prompt dataset is now configurable, allowing you to customize the prompts used for benchmarking and analysis. (#844)
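Once the CMSA and autoscaling configs are deployed (#825), a vLLM Deployment can scale on a custom metric through a standard HorizontalPodAutoscaler. A sketch, assuming the metric is exported through Managed Prometheus — the metric path, target value, and Deployment name are illustrative, not the shipped configuration:

```yaml
# Sketch of an HPA on a vLLM custom metric via the Stackdriver adapter;
# the metric name and thresholds are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm            # placeholder Deployment name
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: prometheus.googleapis.com|vllm:num_requests_waiting|gauge  # assumed metric path
        target:
          type: AverageValue
          averageValue: "10"
```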

Improvements

  • Benchmarking Script Enhancements:
    • The benchmarking script now uses data gathered directly from the script instead of relying on Prometheus, resulting in more accurate and user-relevant metrics. (#836)
    • Added request_rate to the summary statistics generated by the benchmarking script. (#837)
    • Made the benchmark time configurable and increased the default time to 2 minutes for improved steady-state analysis. (#833)
    • Included additional metrics in the Load Profile Generator (LPG) script output for more comprehensive analysis. (#832)
    • Ensured that the LPG script output can be collected by changing the LPG to a Deployment and enabling --save-json-results by default. (#811)
  • MLFlow Fixes: Resolved issues with multiple experiment versions, duplicate model registrations, and missing system metrics in multi-node scenarios. (#813)

Bug Fixes

  • Fixed Internal Links: Updated internal links to publicly accessible Cloud Console links. (#843)
  • Removed Unavailable Jetstream Metrics: Removed unavailable Jetstream metrics from the monitoring system. (#841)
  • Fixed Throughput Metric: Corrected the throughput metric to be in output tokens per second. (#839)
  • Fixed Missing JSON Fields: Added missing JSON fields to the benchmarking script output. (#835)
  • Fixed Single GPU Training Job Example: Corrected an issue in the single GPU training job example where the model directory was not being created. (#831)
  • Re-enabled CI and Fixed Flaky Tests: Re-enabled continuous integration and addressed issues with OOMKill in the fuse sidecar and database connection flakiness. (#827)
  • Removed GKE Cluster Version: Removed the specified GKE cluster version to allow the Terraform configuration to use the latest version in the REGULAR channel. (#817)
  • Updated Pause Container Image Location: Updated the location of the pause container image to a more stable and accessible source. (#814)
  • Upgraded GKE Module: Upgraded the GKE module to avoid compatibility issues with Terraform Provider Google v5.44.0 on Autopilot. (#806)

Other Changes

  • Moved the ml-platform to the GoogleCloudPlatform/accelerated-platforms repository (#828) and included it as a submodule (#829)
  • Added a tutorial for installing KServe on GKE Autopilot. (#812)
  • Removed unneeded dependencies in the LPG container image. (#830)

v1.5

09 Sep 18:05
c1633fa

Ray

  • Ray on GKE Terraform now uses the GKE Ray add-on when creating GKE clusters (#781)

GKE image builder

  • Added mirror.gcr.io to the containerd configuration to reduce Docker Hub rate limiting (#764)

Benchmarks

  • Added a latency profile generator (#775)
  • Decreased the scrape interval of metrics from TGI and DCGM to 15s (#772)
  • Enabled Pod monitoring for vLLM (#796)

Testing

  • Added e2e tests for the Hugging Face TGI tutorial (#780)

v1.4

09 Aug 00:40
3345790

Quick start solutions:

Ray

  • Released v1.2.0, supporting autoscaling RayClusters (#740) and adding reliability improvements (#723)
  • Added a helm-chart (#745)
  • Bump Ray TPU webhook image (#763)

Rag

  • Updated the RAG frontend Docker image (#762)

TPU

  • Added Hugging Face support for automated inference checkpoint conversion (#712)
  • JetStream MaxText Deployment Module: all scale rules are now in a single HPA (#730)
  • Updated pip in the JetStream PyTorch and checkpoint Dockerfiles (#739)
  • Fixed a faulty HPA in the JetStream MaxText module (#741)
  • Corrected the tokenizer for the JetStream module (#742)
  • Made image names optional in the JetStream MaxText module (#744)
  • Terraform modules cleanup (#758)
  • TPU metrics improvements (#727, #761, #770)

Benchmark

  • Updated the main README.md quickstart guide (#734)
  • Added quantization support for TGI (#757)
  • Updated the README with the latest input variables (#759)

Tutorials and Examples

  • Updated the image URL for the Gemma finetune YAML (#729)
  • NIM on GKE tutorial (#737)
  • Added an exemplary Kueue setup for reservation and DWS (#746)

Full Changelog: v1.3...v1.4

v1.3

12 Jul 17:18
2bfbcd7

Quick start solutions

  1. Add finetuning gemma on GKE with L4 GPUs example (#697)
  2. Jetstream autoscaling guide (#703)
  3. Enable Ray autoscaler for RAG example application (#722)
  4. Fix GKE training tutorial (#706)
  5. Update Kueue to 0.7.0 (#707)

ML Platform release (#715)

  • Documentation
    1. Add notebook packaging guide to docs (#690)
    2. Added enhancements to the data processing use cases
  • Infrastructure
    1. Added H100 and A100 40GB DWS node pools
    2. Moved cpu node pool from n2 to n4 machines
    3. Updated Kueue to 0.7.0
    4. Added initial test harness
  • Configuration Management
    1. ConfigSync git repository name allows for easier use of multiple environments. Standardized GitOps scripts
    2. Added a GitLab project module and allowed users to choose between GitHub and GitLab
  • Observability
    1. Added NVIDIA DCGM
    2. Added environment_name to the Ray dashboard endpoint
    3. Added Config Controller Terraform module
  • Security
    1. Added allow KubeRay Operator to the namespace network policy
    2. Added Secret Manager add-on to the cluster

TPU Provisioner

  1. Add admission label selectors and e2e test script (#702)

Benchmarking

  1. Add temperature and power usage DCGM metrics (#710)
  2. Support for csv output (#705)

General

  • Added a Custom Metrics Stackdriver Adapter Terraform module (#718)
  • Added a Prometheus adapter Terraform module (#716)
  • Added a JetStream MaxText Terraform module (#719)

v1.2

11 Jun 15:05
218cc62

Quick start solutions

Ray

  • Enabled TPU webhook on GKE Autopilot (#585)
  • Support Multi-slice TPU groups (#453)
  • Support multiple worker groups requesting TPUs (#467)
  • Added unit tests for Ray TPU webhook (#578)
  • Fix GMP on GKE Standard (#689)
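The multi-slice and multiple-worker-group support above is expressed through RayCluster worker groups that request TPU resources; the webhook then injects the TPU environment for each group. A sketch of one such worker group — the Ray image version, accelerator type, and topology are placeholders, not values prescribed by this release:

```yaml
# Sketch of a RayCluster worker group requesting TPUs; versions, accelerator
# type, and topology below are illustrative assumptions.
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: tpu-cluster
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0   # placeholder version
  workerGroupSpecs:
    - groupName: tpu-group
      replicas: 1
      numOfHosts: 2   # multi-host slice; the webhook injects TPU_WORKER_HOSTNAMES
      rayStartParams: {}
      template:
        spec:
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice  # placeholder
            cloud.google.com/gke-tpu-topology: 2x2x2               # placeholder
          containers:
            - name: ray-worker
              image: rayproject/ray:2.9.0   # placeholder version
              resources:
                limits:
                  google.com/tpu: "4"       # TPU chips per host
```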

RAG

  • README updates to fix broken links (#664) and better describe custom domain usage (#681)

ML Platform

  • Initial release! (#690) Intended for platform admins who want a multi-tenant AI/ML platform running on GKE.

TPU Provisioner

  • Add fixes relating to interacting with JobSets (#645)
  • Allow forcing use of on-demand nodes and disable auto upgrade for node pools (#656)
  • Support location hint label (#666)
  • Update usage instructions (#684)

Benchmarking

  • Support for Measuring Time to First Token (#650)
  • Allow private clusters (#669)
  • Fix hardcoded gRPC request type (#670)
  • Add JetStream support (#677)
  • Support for vLLM OpenAI API Server (#694)

v1.1.2

06 May 21:31
e7b191a

Highlights

  • The RAG, Ray, and Jupyter Terraform solutions now support GKE Autopilot as the default cluster type (#635)
  • The RAG solution has improved test coverage: (1) validating the notebook that generates vector embeddings as part of the E2E tests (#524), and (2) validating prompt responses from the LLM with context (#511)

What's Changed

Full Changelog: v1.1.0...v1.1.2