Releases: GoogleCloudPlatform/ai-on-gke
v1.10
Release v1.10 - 2025-01-07 Onwards
This release focuses on improving TPU support, enhancing security, and refining benchmarking tools for AI workloads on GKE.
New Features:
- Added Terraform configurations to create A3U and A4M clusters, along with infrastructure switches for A3U and A4U (#969).
- Introduced a multi-cluster batch processing platform example using GKE Autopilot, DWS, and Kueue with Multikueue enabled (#949).
- Added a guide for running Skypilot on GKE with Dynamic Workload Scheduling and Kueue (#942).
- Integrated the Kubernetes Security Validation Service (Shipshape) cluster scan into the project to enhance security validation and compliance (#935).
- Benchmarking tool now scrapes more vLLM metrics for detailed performance analysis (#937).
- Benchmarking script's request timeout is now configurable via the --request-timeout flag (#932); see the sketch after this list.
- Added NCCL test switching logic (#954).
- Added recipe switching logic (#950).
- Converted the A3 Mega NCCL test from pods to a JobSet (#961).
- Added missing GCS module (#959).
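For readers wiring the new timeout into their own runs, below is a minimal sketch of how a per-request timeout behind a --request-timeout flag is typically plumbed through a benchmarking client. Only the flag name comes from #932; the script structure, default value, endpoint path, and the send_request helper are illustrative assumptions rather than the repository's actual benchmarking code.

```python
import argparse

import requests  # assumed HTTP client; the real script may use aiohttp or similar


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Benchmarking client sketch")
    parser.add_argument("--host", default="http://localhost:8000",
                        help="Hypothetical model-server endpoint")
    # Flag name taken from the v1.10 notes (#932); the default here is an assumption.
    parser.add_argument("--request-timeout", type=float, default=600.0,
                        help="Per-request timeout in seconds")
    return parser.parse_args()


def send_request(host: str, prompt: str, timeout: float) -> str:
    # Illustrative payload shape; the real request depends on the target model server.
    resp = requests.post(f"{host}/generate", json={"prompt": prompt}, timeout=timeout)
    resp.raise_for_status()
    return resp.text


if __name__ == "__main__":
    args = parse_args()
    print(send_request(args.host, "Hello, world", args.request_timeout))
```

Passing, for example, --request-timeout 120 bounds how long each request may block before the client gives up on it.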
Improvements:
- Updated the default TPU webhook image to v1.2.2, which includes a fix for incorrect TPU_WORKER_HOSTNAMES caused by KubeRay controller truncation (#972).
- Improved TPU provisioning by adding support for v6e and cross-project reservations (#851).
- Prefixed GCS bucket names with project_id to avoid conflicts (#975).
- Replaced the subnetwork name with the VPC (#971).
- Added a note to use the nightly SkyPilot version when using serve on Autopilot, along with troubleshooting steps (#970).
- Clarified that container-image should have a tag in disk image building (#939).
- Made the GCS bucket optional in the benchmarking script (#924).
- Added "models" to the LPG sample.tfvars (#926).
- Merged the Helm scan with the cluster scan (#957).
- Made small fixes and temporarily disabled the UI visibility toggle (#955).
- Fixed the header on the variables file (#974).
- Fixed the duplicate deployment_name variable (#944).
Bug Fixes:
- Fixed a bug in the TPU webhook where KubeRay's service name truncation resulted in incorrectly generated TPU_WORKER_HOSTNAMES (#963).
- Addressed CVE-2024-45338 in the TPU webhook image by fixing an upstream golang.org/x/net vulnerability (#968).
- Handled the bucket-not-found exception in the benchmarking script (#929).
- Reverted the TGI image version to address out-of-GPU-memory issues on L4 nodes (#931).
v1.9
This release includes a number of new features, improvements, and bug fixes.
New Features
- Automated Security Scanning: Introduced a Cloud Build workflow to automatically perform Helm scans on the ai-on-gke repository using the Shipshape validation service along with documentation to guide users on how to handle security violations identified by Shipshape. (#918, #920)
- 65k Node GKE Benchmark: Added a benchmark for GKE on a simulated AI workload using Terraform and ClusterLoader2, supporting clusters with 65,000 nodes. (#898)
Improvements
- DWS-Kueue Example: Bumped Kueue quotas to 1 billion per resource to provide almost unlimited admission in examples. (#911)
- Infrastructure Module Queued Provisioning: Added queued_provisioning setting on all node pools for compatibility with DWS. (#909)
- TCP Receive Buffer Limit Configuration: Added a DaemonSet to configure TCP receive buffer limits for improved DCN performance; see the sketch after this list. (#906)
- SkyPilot Tutorial: Improved formatting and fixed typos in the SkyPilot tutorial. (#905, #907)
- Documentation Clarity: Clarified instructions on obtaining the billing account ID and folder ID. (#914)
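For orientation, here is a minimal sketch of the kind of per-node tuning such a DaemonSet performs, assuming the standard Linux net.ipv4.tcp_rmem sysctl; the buffer values below are placeholders, and the actual sysctls, values, and mechanism used in #906 may differ.

```python
from pathlib import Path

# Placeholder min/default/max TCP receive buffer sizes in bytes; the values
# applied by the DaemonSet in #906 may differ.
TCP_RMEM = "4096 87380 268435456"


def set_tcp_rmem(value: str = TCP_RMEM) -> None:
    """Apply a tcp_rmem setting on the local node (requires root)."""
    path = Path("/proc/sys/net/ipv4/tcp_rmem")
    print(f"current tcp_rmem: {path.read_text().strip()}")
    path.write_text(value)
    print(f"updated tcp_rmem: {path.read_text().strip()}")


if __name__ == "__main__":
    set_tcp_rmem()
```

In a DaemonSet, logic of this kind would typically run in a privileged container on every node so the limits are applied to new nodes created by autoscaling as well.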
Bug Fixes
- E2E Test Flakiness: Increased timeout to mitigate intermittent failures in E2E tests caused by delays in Ray pod startup. (#919)
- Slurm-on-GKE Container Image: Fixed shebang line position in Slurm-on-GKE container image. (#917)
- Directory Path: Fixed an incorrect directory path in the Slurm guide. (#912)
New Contributors
- @volatilemolotov made their first contribution in #905
- @besher-massri made their first contribution in #898
- @janetkuo made their first contribution in #914
Full Changelog: v1.8...v1.9
v1.8
This release includes a number of new features, improvements and bug fixes.
New Features
- Add HTTP streaming support for JetStream: Added functionality for streaming responses via HTTP, enhancing real-time data processing. (#877)
- SkyPilot tutorial: Created a tutorial to demonstrate using SkyPilot to launch batch workloads across regions. (#887)
- Define ephemeral-storage in ClusterQueue: Introduced support for workloads requesting ephemeral storage. (#872)
- Slurm on GKE Guide: Published a guide for deploying Slurm clusters on GKE for AI/ML workloads. (#864)
Improvements
- Benchmarking improvements:
- Ray job optimization: Explicitly shut down Ray jobs after completing vector embedding tasks to avoid unnecessary runtime. (#735)
- Update permissions for guides:
- Added storage.objectViewer permission to boot disk guide to resolve access issues. (#893)
- Update Jupyter Notebook image: Introduced a new tag prefix to mitigate internal vulnerability checks. (#886)
- Use TPU network optimizer image: Shifted to an image-based approach for network optimization, improving maintainability. (#870)
Bug Fixes
- Fix vLLM PodMonitoring: Addressed issues related to vLLM monitoring configurations. (#889)
- Fix jupyter hub helm chart version: Pinned the JupyterHub helm chart version to mitigate server spawning errors. (#879)
- TF version dependency mismatch: Resolved TensorFlow version mismatch issues. (#885)
- Ray TPU webhook image update: Bumped image version to fix inconsistent PodInformer updates for large TPU slices. (#891)
New Contributors
- @dsafdsa1 made their first contribution in #872
- @danielmarzini made their first contribution in #864
- @darinpeetz made their first contribution in #870
Full Changelog: v1.7...v1.8
v1.7
This release includes a number of new features, improvements and bug fixes.
New Features
- Added a benchmarking tool for measuring data loading performance with gcsfuse. (#863)
- Added a Prometheus server to the Latency Profile Generator (LPG) running on port 9090, along with new metrics for prompt_length, response_length, and time_per_output_token; see the sketch after this list. (#857)
- Added support for Google Cloud Monitoring and Managed Collection for the gke-batch-refarch. (#856)
- Added a tutorial on packaging models and low-rank adapters (LoRA) from Hugging Face as images, pushing them to Artifact Registry, and deploying them in GKE. (#855)
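To illustrate the metrics surface described for #857, here is a hedged sketch using the prometheus_client library; the metric names and port come from the release note above, but the histogram shape, default buckets, and recording logic are assumptions, not the LPG's actual implementation.

```python
import random
import time

from prometheus_client import Histogram, start_http_server

# Metric names follow the note for #857; using default histogram buckets is an assumption.
PROMPT_LENGTH = Histogram("prompt_length", "Prompt length in tokens")
RESPONSE_LENGTH = Histogram("response_length", "Response length in tokens")
TIME_PER_OUTPUT_TOKEN = Histogram("time_per_output_token", "Seconds per generated output token")


def record_request(prompt_tokens: int, output_tokens: int, latency_s: float) -> None:
    """Record one benchmark request in all three histograms."""
    PROMPT_LENGTH.observe(prompt_tokens)
    RESPONSE_LENGTH.observe(output_tokens)
    if output_tokens > 0:
        TIME_PER_OUTPUT_TOKEN.observe(latency_s / output_tokens)


if __name__ == "__main__":
    start_http_server(9090)  # port from the release note
    while True:
        # Simulated traffic so the /metrics endpoint has something to scrape.
        record_request(random.randint(16, 512), random.randint(16, 256), random.uniform(0.5, 5.0))
        time.sleep(1)
```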
Improvements
- Updated outdated references to the Text Generation Inference (TGI) container to use the Hugging Face Deep Learning Containers (DLCs) hosted on Google Cloud's Artifact Registry. (#816)
- Added the ability to benchmark multiple models concurrently in the LPG. (#850)
- Added support for an "inf" (infinity) request rate and number of prompts in the LPG; see the sketch after this list. (#847)
- Fixed the latency_throughput_curve.sh script to correctly parse non-integer request rates and added "errors" to the benchmark results. (#850)
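To make the "inf" request rate concrete: at a finite rate, requests are paced by an inter-arrival delay derived from the rate, while an infinite rate degenerates to dispatching every prompt back-to-back. The sketch below shows that general technique and is an assumption about the approach, not the LPG's exact scheduling code.

```python
import asyncio
import random


async def fake_request(prompt: str) -> None:
    # Stand-in for the real HTTP call the benchmark would make.
    await asyncio.sleep(0.01)


async def issue_requests(prompts: list[str], request_rate: float) -> None:
    """Dispatch one task per prompt, pacing arrivals by the request rate."""
    tasks = []
    for prompt in prompts:
        tasks.append(asyncio.create_task(fake_request(prompt)))
        if request_rate != float("inf"):
            # Poisson-style pacing; with an infinite rate the delay is skipped entirely.
            await asyncio.sleep(random.expovariate(request_rate))
    await asyncio.gather(*tasks)


if __name__ == "__main__":
    asyncio.run(issue_requests([f"prompt-{i}" for i in range(100)], float("inf")))
```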
Bug Fixes
- Fixed an issue where the README was not rendering correctly. (#862)
New Contributors
- @alvarobartt made their first contribution in #816
- @liu-cong made their first contribution in #850
- @coolkp made their first contribution in #855
- @JamesDuncanNz made their first contribution in #856
Full Changelog: v1.6...v1.7
v1.6
New Features
- JetStream Checkpoint Converter Support: Added support for JetStream checkpoint conversion for Llama models on MaxText. (#840)
- Automatic Deployment of CMSA and Autoscaling Config: Enabled automatic deployment of the Custom Metrics Stackdriver Adapter (CMSA) and autoscaling configurations for custom metrics with vLLM. (#825)
- Network Optimization DaemonSet: Introduced a new DaemonSet that applies network optimizations to improve performance, including IRQ spreading, TCP settings, and larger GVE driver packet buffers. (#805)
- Server-Side Metrics Scraping: Added initial implementation for scraping server-side metrics for analysis, with support for vLLM and Jetstream. (#804)
- Pod Label Copying to Node: The TPU Provisioner can now be configured to copy specific Pod labels to Node labels at Node Pool creation time. (#788)
- Configurable Prompt Dataset: The prompt dataset is now configurable, allowing you to customize the prompts used for benchmarking and analysis. (#844)
Improvements
- Benchmarking Script Enhancements:
- The benchmarking script now uses data gathered directly from the script instead of relying on Prometheus, resulting in more accurate and user-relevant metrics; see the sketch after this list. (#836)
- Added request_rate to the summary statistics generated by the benchmarking script. (#837)
- Made the benchmark time configurable and increased the default time to 2 minutes for improved steady-state analysis. (#833)
- Included additional metrics in the Load Profile Generator (LPG) script output for more comprehensive analysis. (#832)
- Ensured that the LPG script output can be collected by changing the LPG to a Deployment and enabling --save-json-results by default. (#811)
- MLFlow Fixes: Resolved issues with multiple experiment versions, duplicate model registrations, and missing system metrics in multi-node scenarios. (#813)
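As a rough sketch of computing summary statistics directly from per-request records rather than from Prometheus: request_rate and output tokens per second correspond to metrics named in these notes (#837, #839), while the record fields, percentile choices, and function shape below are illustrative assumptions.

```python
import statistics
from dataclasses import dataclass


@dataclass
class RequestResult:
    latency_s: float    # end-to-end latency of a single request
    output_tokens: int  # tokens generated for that request


def summarize(results: list[RequestResult], benchmark_duration_s: float) -> dict:
    """Summarize a benchmark run from the raw per-request data."""
    latencies = [r.latency_s for r in results]
    total_output_tokens = sum(r.output_tokens for r in results)
    return {
        # Achieved request rate, as added to the summary stats in #837.
        "request_rate": len(results) / benchmark_duration_s,
        # Throughput in output tokens per second, the unit corrected in #839.
        "output_tokens_per_s": total_output_tokens / benchmark_duration_s,
        "p50_latency_s": statistics.median(latencies),
        "p90_latency_s": statistics.quantiles(latencies, n=10)[-1],
    }


if __name__ == "__main__":
    sample = [RequestResult(latency_s=1.0 + 0.01 * i, output_tokens=64) for i in range(100)]
    print(summarize(sample, benchmark_duration_s=120.0))
```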
Bug Fixes
- Fixed Internal Links: Updated internal links to publicly accessible Cloud Console links. (#843)
- Removed Unavailable Jetstream Metrics: Removed unavailable Jetstream metrics from the monitoring system. (#841)
- Fixed Throughput Metric: Corrected the throughput metric to be in output tokens per second. (#839)
- Fixed Missing JSON Fields: Added missing JSON fields to the benchmarking script output. (#835)
- Fixed Single GPU Training Job Example: Corrected an issue in the single GPU training job example where the model directory was not being created. (#831)
- Re-enabled CI and Fixed Flaky Tests: Re-enabled continuous integration and addressed issues with OOMKill in the fuse sidecar and database connection flakiness. (#827)
- Removed GKE Cluster Version: Removed the specified GKE cluster version to allow the Terraform configuration to use the latest version in the REGULAR channel. (#817)
- Updated Pause Container Image Location: Updated the location of the pause container image to a more stable and accessible source. (#814)
- Upgraded GKE Module: Upgraded the GKE module to avoid compatibility issues with Terraform Provider Google v5.44.0 on Autopilot. (#806)
v1.5
Ray
Ray on GKE Terraform now uses the GKE Ray Add-on when creating GKE clusters (#781)
GKE image builder
Add mirror.gcr.io in containerd configuration to reduce docker rate limiting (#764)
Benchmarks
Add latency profile generator (#775)
Decrease scrape interval of metrics from TGI and DCGM to 15s (#772)
Enable Pod monitoring for vLLM (#796)
Testing
Add e2e tests for Hugging Face TGI tutorial (#780)
v1.4
Quick start solutions:
Ray
- Released v1.2.0, supporting autoscaling RayClusters (#740) and adding reliability improvements (#723)
- Added a helm-chart (#745)
- Bump Ray TPU webhook image (#763)
RAG
- Update RAG frontend Docker image (#762)
TPU
- Add HuggingFace support for automated inference checkpoint conversion (#712)
- Jetstream Maxtext Deployment Module: All scale rules now in a single HPA (#730)
- Update pip in JetStream Pytorch and checkpoint Dockerfiles (#739)
- Fix faulty HPA in Jetstream Maxtext module (#741)
- Correct tokenizer for Jetstream Module (#742)
- Make image names optional in Jetstream Maxtext module (#744)
- Terraform modules cleanup (#758)
- TPU Metrics Improvements (#727, #761, #770)
Benchmark
- Update the main README.md quickstart guide (#734)
- Add Quantization support for TGI (#757)
- Update the README with the latest input variables (#759)
Tutorials and Examples
- Update the image URL for the Gemma finetune YAML (#729)
- NIM on GKE Tutorial (#737)
- Add Kueue exemplary setup for reservation and DWS (#746)
Full Changelog: v1.3...v1.4
v1.3
Quick start solutions
- Add finetuning gemma on GKE with L4 GPUs example (#697)
- Jetstream autoscaling guide (#703)
- Enable Ray autoscaler for RAG example application (#722)
- Fix GKE training tutorial (#706)
- Update Kueue to 0.7.0 (#707)
ML Platform release (#715)
- Documentation
- Add notebook packaging guide to docs (#690)
- Added enhancements to the data processing use cases
- Infrastructure
- Added H100 and A100 40GB DWS node pools
- Moved cpu node pool from n2 to n4 machines
- Updated Kueue to 0.7.0
- Added initial test harness
- Configuration Management
- ConfigSync Git repository name allows for easier use of multiple environments; standardized GitOps scripts
- Added a GitLab project module and allowed users to choose between GitHub and GitLab
- Observability
- Added NVIDIA DCGM
- Added environment_name to the Ray dashboard endpoint
- Added Config Controller Terraform module
- Security
- Added an allow rule for the KubeRay Operator to the namespace network policy
- Added Secret Manager add-on to the cluster
TPU Provisioner
- Add admission label selectors and e2e test script (#702)
Benchmarking
General
- Add custom metrics stackdriver adapter Terraform module (#718)
- Add prometheus adapter Terraform module (#716)
- Add Jetstream MaxText Terraform module (#719)
v1.2
Quick start solutions
Ray
- Enabled TPU webhook on GKE Autopilot (#585)
- Support Multi-slice TPU groups (#453)
- Support multiple worker groups requesting TPUs (#467)
- Added unit tests for Ray TPU webhook (#578)
- Fix GMP on GKE Standard (#689)
RAG
ML Platform (intended for platform admins to have a multi-tenant AI/ML platform running on GKE)
- Initial release! (#690)
TPU Provisioner
- Add fixes relating to interacting with JobSets (#645)
- Allow forcing use of on-demand nodes and disable auto upgrade for node pools (#656)
- Support location hint label (#666)
- Update usage instructions (#684)
Benchmarking
v1.1.2
Highlights
- RAG, Ray & Jupyter terraform solutions now support GKE Autopilot as the default cluster type #635
- The RAG solution has improved test coverage to (1) validate the notebook that generates vector embeddings as part of the E2E tests #524 (2) validate prompt responses from the LLM with context #511
What's Changed
- Cherrypick AP cloud build stockout mitigation onto release-1.1 by @artemvmin in #580
- Jupyter notebook cherry pick by @chiayi in #600
- quick fix for rag prompt test output by @chiayi in #612
- Fetch the cached weights for Mistral-7B-Instruct-v0.1 from GCS bucket… by @gongmax in #621
- Cherry-pick #599 and #618 to release-1.1 by @roberthbailey in #627
- Cherry-pick #631 to release-1.1 branch by @roberthbailey in #632
- Cherry-pick #635 to release-1.1 branch by @roberthbailey in #637
Full Changelog: v1.1.0...v1.1.2