Releases: GoogleCloudPlatform/ai-on-gke
v1.10
Release v1.10 - 2025-01-07 Onwards
This release focuses on improving TPU support, enhancing security, and refining benchmarking tools for AI workloads on GKE.
New Features:
- Added Terraform configurations to create A3U and A4M clusters, along with infrastructure switches for A3U and A4U (#969).
- Introduced a multi-cluster batch processing platform example using GKE Autopilot, DWS, and Kueue with Multikueue enabled (#949).
- Added a guide for running Skypilot on GKE with Dynamic Workload Scheduling and Kueue (#942).
- Integrated the Kubernetes Security Validation Service (Shipshape) cluster scan into the project to enhance security validation and compliance (#935).
- Benchmarking tool now scrapes more vLLM metrics for detailed performance analysis (#937).
- Benchmarking script's request timeout is now configurable via the --request-timeout flag (#932); see the sketch after this list.
- Added NCCL test switching logic (#954).
- Added recipe switching logic (#950).
- Converted the A3 Mega NCCL test from pods to a JobSet (#961).
- Added missing GCS module (#959).
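For readers wiring the new timeout into their own runs, below is a minimal sketch of how a per-request timeout behind a --request-timeout flag is typically plumbed through a benchmarking client. Only the flag name comes from #932; the script structure, default value, endpoint path, and the send_request helper are illustrative assumptions rather than the repository's actual benchmarking code.

```python
import argparse

import requests  # assumed HTTP client; the real script may use aiohttp or similar


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Benchmarking client sketch")
    parser.add_argument("--host", default="http://localhost:8000",
                        help="Hypothetical model-server endpoint")
    # Flag name taken from the v1.10 notes (#932); the default here is an assumption.
    parser.add_argument("--request-timeout", type=float, default=600.0,
                        help="Per-request timeout in seconds")
    return parser.parse_args()


def send_request(host: str, prompt: str, timeout: float) -> str:
    # Illustrative payload shape; the real request depends on the target model server.
    resp = requests.post(f"{host}/generate", json={"prompt": prompt}, timeout=timeout)
    resp.raise_for_status()
    return resp.text


if __name__ == "__main__":
    args = parse_args()
    print(send_request(args.host, "Hello, world", args.request_timeout))
```

Passing, for example, --request-timeout 120 bounds how long each request may block before the client gives up on it.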
Improvements:
- Updated the default TPU webhook image to v1.2.2, which includes a fix for incorrect TPU_WORKER_HOSTNAMES caused by KubeRay controller truncation (#972).
- Improved TPU provisioning by adding support for v6e and cross-project reservations (#851).
- Prefixed GCS bucket names with project_id to avoid conflicts (#975).
- Replaced the subnetwork name with the VPC (#971).
- Added a note to use the nightly SkyPilot version when using serve on Autopilot, along with troubleshooting steps (#970).
- Clarified that container-image should have a tag in disk image building (#939).
- Made the GCS bucket optional in the benchmarking script (#924).
- Added "models" to the LPG sample.tfvars (#926).
- Merged the Helm scan with the cluster scan (#957).
- Made small fixes and temporarily disabled the UI visibility toggle (#955).
- Fixed the header on the variables file (#974).
- Fixed the duplicate deployment_name variable (#944).
Bug Fixes:
- Fixed a bug in the TPU webhook where KubeRay's service name truncation resulted in incorrectly generated TPU_WORKER_HOSTNAMES (#963).
- Addressed CVE-2024-45338 in the TPU webhook image by fixing an upstream golang.org/x/net vulnerability (#968).
- Handled the bucket-not-found exception in the benchmarking script (#929).
- Reverted the TGI image version to address out-of-GPU-memory issues on L4 nodes (#931).
v1.9
This release includes a number of new features, improvements, and bug fixes.
New Features
- Automated Security Scanning: Introduced a Cloud Build workflow to automatically perform Helm scans on the ai-on-gke repository using the Shipshape validation service along with documentation to guide users on how to handle security violations identified by Shipshape. (#918, #920)
- 65k Node GKE Benchmark: Added a benchmark for GKE on a simulated AI workload using Terraform and ClusterLoader2, supporting clusters with 65,000 nodes. (#898)
Improvements
- DWS-Kueue Example: Bumped Kueue quotas to 1 billion per resource to provide almost unlimited admission in examples. (#911)
- Infrastructure Module Queued Provisioning: Added queued_provisioning setting on all node pools for compatibility with DWS. (#909)
- TCP Receive Buffer Limit Configuration: Added a DaemonSet to configure TCP receive buffer limits for improved DCN performance; see the sketch after this list. (#906)
- SkyPilot Tutorial: Improved formatting and fixed typos in the SkyPilot tutorial. (#905, #907)
- Documentation Clarity: Clarified instructions on obtaining the billing account ID and folder ID. (#914)
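For orientation, here is a minimal sketch of the kind of per-node tuning such a DaemonSet performs, assuming the standard Linux net.ipv4.tcp_rmem sysctl; the buffer values below are placeholders, and the actual sysctls, values, and mechanism used in #906 may differ.

```python
from pathlib import Path

# Placeholder min/default/max TCP receive buffer sizes in bytes; the values
# applied by the DaemonSet in #906 may differ.
TCP_RMEM = "4096 87380 268435456"


def set_tcp_rmem(value: str = TCP_RMEM) -> None:
    """Apply a tcp_rmem setting on the local node (requires root)."""
    path = Path("/proc/sys/net/ipv4/tcp_rmem")
    print(f"current tcp_rmem: {path.read_text().strip()}")
    path.write_text(value)
    print(f"updated tcp_rmem: {path.read_text().strip()}")


if __name__ == "__main__":
    set_tcp_rmem()
```

In a DaemonSet, logic of this kind would typically run in a privileged container on every node so the limits are applied to new nodes created by autoscaling as well.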
Bug Fixes
- E2E Test Flakiness: Increased timeout to mitigate intermittent failures in E2E tests caused by delays in Ray pod startup. (#919)
- Slurm-on-GKE Container Image: Fixed shebang line position in Slurm-on-GKE container image. (#917)
- Directory Path: Fixed an incorrect directory path in the Slurm guide. (#912)
New Contributors
- @volatilemolotov made their first contribution in #905
- @besher-massri made their first contribution in #898
- @janetkuo made their first contribution in #914
Full Changelog: v1.8...v1.9
v1.8
This release includes a number of new features, improvements and bug fixes.
New Features
- Add HTTP streaming support for JetStream: Added functionality for streaming responses via HTTP, enhancing real-time data processing. (#877)
- SkyPilot tutorial: Created a tutorial to demonstrate using SkyPilot to launch batch workloads across regions. (#887)
- Define ephemeral-storage in ClusterQueue: Introduced support for workloads requesting ephemeral storage. (#872)
- Slurm on GKE Guide: Published a guide for deploying Slurm clusters on GKE for AI/ML workloads. (#864)
Improvements
- Benchmarking improvements:
- Ray job optimization: Explicitly shut down Ray jobs after completing vector embedding tasks to avoid unnecessary runtime. (#735)
- Update permissions for guides:
- Added storage.objectViewer permission to boot disk guide to resolve access issues. (#893)
- Update Jupyter Notebook image: Introduced a new tag prefix to mitigate internal vulnerability checks. (#886)
- Use TPU network optimizer image: Shifted to an image-based approach for network optimization, improving maintainability. (#870)
Bug Fixes
- Fix vLLM PodMonitoring: Addressed issues related to vLLM monitoring configurations. (#889)
- Fix jupyter hub helm chart version: Pinned the JupyterHub helm chart version to mitigate server spawning errors. (#879)
- TF version dependency mismatch: Resolved TensorFlow version mismatch issues. (#885)
- Ray TPU webhook image update: Bumped image version to fix inconsistent PodInformer updates for large TPU slices. (#891)
New Contributors
- @dsafdsa1 made their first contribution in #872
- @danielmarzini made their first contribution in #864
- @darinpeetz made their first contribution in #870
Full Changelog: v1.7...v1.8
v1.7
This release includes a number of new features, improvements and bug fixes.
New Features
- Added a benchmarking tool for measuring data loading performance with gcsfuse. (#863)
- Added a Prometheus server to the Latency Profile Generator (LPG) running on port 9090, along with new metrics for prompt_length, response_length, and time_per_output_token; see the sketch after this list. (#857)
- Added support for Google Cloud Monitoring and Managed Collection for the gke-batch-refarch. (#856)
- Added a tutorial on packaging models and low-rank adapters (LoRA) from Hugging Face as images, pushing them to Artifact Registry, and deploying them in GKE. (#855)
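To illustrate the metrics surface described for #857, here is a hedged sketch using the prometheus_client library; the metric names and port come from the release note above, but the histogram shape, default buckets, and recording logic are assumptions, not the LPG's actual implementation.

```python
import random
import time

from prometheus_client import Histogram, start_http_server

# Metric names follow the note for #857; using default histogram buckets is an assumption.
PROMPT_LENGTH = Histogram("prompt_length", "Prompt length in tokens")
RESPONSE_LENGTH = Histogram("response_length", "Response length in tokens")
TIME_PER_OUTPUT_TOKEN = Histogram("time_per_output_token", "Seconds per generated output token")


def record_request(prompt_tokens: int, output_tokens: int, latency_s: float) -> None:
    """Record one benchmark request in all three histograms."""
    PROMPT_LENGTH.observe(prompt_tokens)
    RESPONSE_LENGTH.observe(output_tokens)
    if output_tokens > 0:
        TIME_PER_OUTPUT_TOKEN.observe(latency_s / output_tokens)


if __name__ == "__main__":
    start_http_server(9090)  # port from the release note
    while True:
        # Simulated traffic so the /metrics endpoint has something to scrape.
        record_request(random.randint(16, 512), random.randint(16, 256), random.uniform(0.5, 5.0))
        time.sleep(1)
```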
Improvements
- Updated outdated references to the Text Generation Inference (TGI) container to use the Hugging Face Deep Learning Containers (DLCs) hosted on Google Cloud's Artifact Registry. (#816)
- Added the ability to benchmark multiple models concurrently in the LPG. (#850)
- Added support for an "inf" (infinity) request rate and number of prompts in the LPG; see the sketch after this list. (#847)
- Fixed the latency_throughput_curve.sh script to correctly parse non-integer request rates and added "errors" to the benchmark results. (#850)
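To make the "inf" request rate concrete: at a finite rate, requests are paced by an inter-arrival delay derived from the rate, while an infinite rate degenerates to dispatching every prompt back-to-back. The sketch below shows that general technique and is an assumption about the approach, not the LPG's exact scheduling code.

```python
import asyncio
import random


async def fake_request(prompt: str) -> None:
    # Stand-in for the real HTTP call the benchmark would make.
    await asyncio.sleep(0.01)


async def issue_requests(prompts: list[str], request_rate: float) -> None:
    """Dispatch one task per prompt, pacing arrivals by the request rate."""
    tasks = []
    for prompt in prompts:
        tasks.append(asyncio.create_task(fake_request(prompt)))
        if request_rate != float("inf"):
            # Poisson-style pacing; with an infinite rate the delay is skipped entirely.
            await asyncio.sleep(random.expovariate(request_rate))
    await asyncio.gather(*tasks)


if __name__ == "__main__":
    asyncio.run(issue_requests([f"prompt-{i}" for i in range(100)], float("inf")))
```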
Bug Fixes
- Fixed an issue where the README was not rendering correctly. (#862)
New Contributors
- @alvarobartt made their first contribution in #816
- @liu-cong made their first contribution in #850
- @coolkp made their first contribution in #855
- @JamesDuncanNz made their first contribution in #856
Full Changelog: v1.6...v1.7
v1.6
New Features
- JetStream Checkpoint Converter Support: Added support for JetStream checkpoint conversion for Llama models on MaxText. (#840)
- Automatic Deployment of CMSA and Autoscaling Config: Enabled automatic deployment of the Custom Metrics Stackdriver Adapter (CMSA) and autoscaling configurations for custom metrics with vLLM. (#825)
- Network Optimization DaemonSet: Introduced a new DaemonSet that applies network optimizations to improve performance, including IRQ spreading, TCP settings, and larger GVE driver packet buffers. (#805)
- Server-Side Metrics Scraping: Added initial implementation for scraping server-side metrics for analysis, with support for vLLM and Jetstream. (#804)
- Pod Label Copying to Node: The TPU Provisioner can now be configured to copy specific Pod labels to Node labels at Node Pool creation time. (#788)
- Configurable Prompt Dataset: The prompt dataset is now configurable, allowing you to customize the prompts used for benchmarking and analysis. (#844)
Improvements
- Benchmarking Script Enhancements:
- The benchmarking script now uses data gathered directly from the script instead of relying on Prometheus, resulting in more accurate and user-relevant metrics; see the sketch after this list. (#836)
- Added request_rate to the summary statistics generated by the benchmarking script. (#837)
- Made the benchmark time configurable and increased the default time to 2 minutes for improved steady-state analysis. (#833)
- Included additional metrics in the Load Profile Generator (LPG) script output for more comprehensive analysis. (#832)
- Ensured that the LPG script output can be collected by changing the LPG to a Deployment and enabling --save-json-results by default. (#811)
- MLFlow Fixes: Resolved issues with multiple experiment versions, duplicate model registrations, and missing system metrics in multi-node scenarios. (#813)
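As a rough sketch of computing summary statistics directly from per-request records rather than from Prometheus: request_rate and output tokens per second correspond to metrics named in these notes (#837, #839), while the record fields, percentile choices, and function shape below are illustrative assumptions.

```python
import statistics
from dataclasses import dataclass


@dataclass
class RequestResult:
    latency_s: float    # end-to-end latency of a single request
    output_tokens: int  # tokens generated for that request


def summarize(results: list[RequestResult], benchmark_duration_s: float) -> dict:
    """Summarize a benchmark run from the raw per-request data."""
    latencies = [r.latency_s for r in results]
    total_output_tokens = sum(r.output_tokens for r in results)
    return {
        # Achieved request rate, as added to the summary stats in #837.
        "request_rate": len(results) / benchmark_duration_s,
        # Throughput in output tokens per second, the unit corrected in #839.
        "output_tokens_per_s": total_output_tokens / benchmark_duration_s,
        "p50_latency_s": statistics.median(latencies),
        "p90_latency_s": statistics.quantiles(latencies, n=10)[-1],
    }


if __name__ == "__main__":
    sample = [RequestResult(latency_s=1.0 + 0.01 * i, output_tokens=64) for i in range(100)]
    print(summarize(sample, benchmark_duration_s=120.0))
```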
Bug Fixes
- Fixed Internal Links: Updated internal links to publicly accessible Cloud Console links. (#843)
- Removed Unavailable Jetstream Metrics: Removed unavailable Jetstream metrics from the monitoring system. (#841)
- Fixed Throughput Metric: Corrected the throughput metric to be in output tokens per second. (#839)
- Fixed Missing JSON Fields: Added missing JSON fields to the benchmarking script output. (#835)
- Fixed Single GPU Training Job Example: Corrected an issue in the single GPU training job example where the model directory was not being created. (#831)
- Re-enabled CI and Fixed Flaky Tests: Re-enabled continuous integration and addressed issues with OOMKill in the fuse sidecar and database connection flakiness. (#827)
- Removed GKE Cluster Version: Removed the specified GKE cluster version to allow the Terraform configuration to use the latest version in the REGULAR channel. (#817)
- Updated Pause Container Image Location: Updated the location of the pause container image to a more stable and accessible source. (#814)
- Upgraded GKE Module: Upgraded the GKE module to avoid compatibility issues with Terraform Provider Google v5.44.0 on Autopilot. (#806)
v1.5
Ray
Ray on GKE Terraform now uses the GKE Ray Add-on when creating GKE clusters (#781)
GKE image builder
Add mirror.gcr.io in containerd configuration to reduce docker rate limiting (#764)
Benchmarks
Add latency profile generator (#775)
Decrease scrape interval of metrics from TGI and DCGM to 15s (#772)
Enable Pod monitoring for vLLM (#796)
Testing
Add e2e tests for Hugging Face TGI tutorial (#780)
v1.4
Quick start solutions:
Ray
- Released v1.2.0, supporting autoscaling RayClusters (#740) and adding reliability improvements (#723)
- Added a helm-chart (#745)
- Bump Ray TPU webhook image (#763)
RAG
- Update RAG frontend Docker image (#762)
TPU
- Add HuggingFace support for automated inference checkpoint conversion (#712)
- Jetstream Maxtext Deployment Module: All scale rules now in a single HPA (#730)
- Update pip in JetStream Pytorch and checkpoint Dockerfiles (#739)
- Fix faulty HPA in Jetstream Maxtext module (#741)
- Correct tokenizer for Jetstream Module (#742)
- Make image names optional in Jetstream Maxtext module (#744)
- Terraform modules cleanup (#758)
- TPU Metrics Improvements (#727, #761, #770)
Benchmark
- Update the main README.md quickstart guide (#734)
- Add Quantization support for TGI (#757)
- Update the README with the latest input variables (#759)
Tutorials and Examples
- Update the image URL for the Gemma finetune YAML (#729)
- NIM on GKE Tutorial (#737)
- Add Kueue exemplary setup for reservation and DWS (#746)
Full Changelog: v1.3...v1.4
v1.3
Quick start solutions
- Add finetuning gemma on GKE with L4 GPUs example (#697)
- Jetstream autoscaling guide (#703)
- Enable Ray autoscaler for RAG example application (#722)
- Fix GKE training tutorial (#706)
- Update Kueue to 0.7.0 (#707)
ML Platform release (#715)
- Documentation
- Add notebook packaging guide to docs (#690)
- Added enhancements to the data processing use cases
- Infrastructure
- Added H100 and A100 40GB DWS node pools
- Moved cpu node pool from n2 to n4 machines
- Updated Kueue to 0.7.0
- Added initial test harness
- Configuration Management
- ConfigSync Git repository name allows for easier use of multiple environments; standardized GitOps scripts
- Added a GitLab project module and allowed users to choose between GitHub and GitLab
- Observability
- Added NVIDIA DCGM
- Added environment_name to the Ray dashboard endpoint
- Added Config Controller Terraform module
- Security
- Added an allow rule for the KubeRay Operator to the namespace network policy
- Added Secret Manager add-on to the cluster
TPU Provisioner
- Add admission label selectors and e2e test script (#702)
Benchmarking
General
- Add custom metrics stackdriver adapter Terraform module (#718)
- Add prometheus adapter Terraform module (#716)
- Add Jetstream MaxText Terraform module (#719)
v1.2
Quick start solutions
Ray
- Enabled TPU webhook on GKE Autopilot (#585)
- Support Multi-slice TPU groups (#453)
- Support multiple worker groups requesting TPUs (#467)
- Added unit tests for Ray TPU webhook (#578)
- Fix GMP on GKE Standard (#689)
RAG
ML Platform (intended for platform admins to have a multi-tenant AI/ML platform running on GKE)
- Initial release! (#690)
TPU Provisioner
- Add fixes relating to interacting with JobSets (#645)
- Allow forcing use of on-demand nodes and disable auto upgrade for node pools (#656)
- Support location hint label (#666)
- Update usage instructions (#684)
Benchmarking
v1.1.2
Highlights
- RAG, Ray & Jupyter terraform solutions now support GKE Autopilot as the default cluster type #635
- The RAG solution has improved test coverage to (1) validate the notebook that generates vector embeddings as part of the E2E tests #524 (2) validate prompt responses from the LLM with context #511
What's Changed
- Cherrypick AP cloud build stockout mitigation onto release-1.1 by @artemvmin in #580
- Jupyter notebook cherry pick by @chiayi in #600
- quick fix for rag prompt test output by @chiayi in #612
- Fetch the cached weights for Mistral-7B-Instruct-v0.1 from GCS bucket… by @gongmax in #621
- Cherry-pick #599 and #618 to release-1.1 by @roberthbailey in #627
- Cherry-pick #631 to release-1.1 branch by @roberthbailey in #632
- Cherry-pick #635 to release-1.1 branch by @roberthbailey in #637
Full Changelog: v1.1.0...v1.1.2