
Releases: ray-project/kuberay

v1.2.2 release

29 Sep 08:40
0ea404b

Highlights

  • (alpha) Ray kubectl plugin
    • get, session, log, job submit
  • (alpha) Kubernetes events: emit Kubernetes events recording important interactions between KubeRay and the Kubernetes API server
  • (alpha) Apache YuniKorn integration
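The plugin subcommands listed above could be exercised roughly as follows (a sketch, assuming the plugin is installed and a RayCluster named raycluster-sample exists; exact flags may differ across plugin versions):

```shell
# List RayClusters in the current namespace
kubectl ray get cluster

# Port-forward a local session to the cluster's head (dashboard, client ports)
kubectl ray session raycluster-sample

# Stream logs from the cluster's Pods
kubectl ray log raycluster-sample

# Submit a job to run on the cluster
kubectl ray job submit --name rayjob-sample -- python my_script.py
```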

Changelog

v1.2.1 release

31 Aug 06:43
fa3d8ee

Compared to KubeRay v1.2.0, KubeRay v1.2.1 includes one additional commit (#2243). It fixes an issue where a RayService created by a KubeRay version older than v1.2.0 did not support zero-downtime upgrades after upgrading to KubeRay v1.2.0.

v1.2.0 release

29 Aug 21:44
58ba733

Highlights

  • RayCluster CRD status observability improvement: design doc
  • Support retry in RayJob: #2192
  • Coding style improvement

RayCluster

RayCluster CRD status improvement

Coding style improvement

RayJob

  • [RayJob] allow create verb for services/proxy, which is required for HTTPMode (#2321, @andrewsykim)
  • [Fix][Sample-Yaml] Increase ray head CPU resource for pytorch mnist (#2330, @MortalHappiness)
  • Support Apache YuniKorn as one batch scheduler option (#2184, @yangwwei)
  • [RayJob] add RayJob pass Deadline e2e-test with retry (#2241, @karta1502545)
  • add feature gate mechanism to ray-operator (#2219, @andrewsykim)
  • [RayJob] add Failing RayJob in HTTPMode e2e test for rayjob with retry (#2242, @tinaxfwu)
  • [Feat][RayJob] Delete RayJob CR after job termination (#2225, @MortalHappiness)
  • reconcile concurrency flag should apply for RayJob and RayService controllers (#2228, @andrewsykim)
  • [RayJob] add Failing submitter K8s Job e2e ...
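The YuniKorn batch-scheduler option (#2184) is wired up through labels on the RayCluster. A minimal sketch, assuming the operator is configured with YuniKorn as its batch scheduler and that the label keys follow the upstream YuniKorn convention (both are assumptions; consult the KubeRay docs for your version):

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-yunikorn
  labels:
    # Illustrative label keys; verify against your KubeRay/YuniKorn versions.
    ray.io/gang-scheduling-enabled: "true"
    yunikorn.apache.org/app-id: raycluster-yunikorn
    yunikorn.apache.org/queue: root.default
spec:
  # headGroupSpec and workerGroupSpecs omitted for brevity
```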

v1.1.1 release

08 May 20:14
f460fda

Compared to KubeRay v1.1.0, KubeRay v1.1.1 includes four cherry-picked commits.

  • [Bug] Ray operator crashes when specifying RayCluster with resources.limits but no resources.requests (#2077, @kevin85421)
  • [CI] Pin kustomize to v5.3.0 (#2067, @kevin85421)
  • [Bug] All worker Pods are deleted if using KubeRay v1.0.0 CRD with KubeRay operator v1.1.0 image (#2087, @kevin85421)
  • [Hotfix][CI] Pin setup-envtest dep (#2038, @kevin85421)

v1.1.0 release

23 Mar 04:05
8adc538

Highlights

  • RayJob improvements

  • Structured logging

    • In KubeRay v1.1.0, KubeRay logs are emitted in JSON format, and each log message includes context such as the custom resource's name and reconcileID. Users can therefore filter the logs associated with a RayCluster, RayJob, or RayService CR by its name.
  • RayService improvements

    • Refactor the health check mechanism to improve stability.
    • Deprecate deploymentUnhealthySecondThreshold and serviceUnhealthySecondThreshold to avoid unintentionally preparing a new RayCluster custom resource.
  • TPU multi-host PodSlice support

    • The KubeRay team is actively working with the Google GKE and TPU teams on this integration. The required KubeRay changes are already complete; once the GKE team finishes the remaining work on its side, users will be able to use multi-host TPU PodSlices with a static RayCluster (without autoscaling).
  • Stop publishing images on DockerHub; images are now published only on Quay.
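The structured-logging change makes per-CR filtering straightforward. A small self-contained sketch (the field names RayCluster, reconcileID, and msg are assumptions about the log schema, not the operator's exact output):

```python
import json

# Hypothetical KubeRay-style JSON log lines for illustration only.
log_lines = [
    '{"level":"info","RayCluster":{"name":"raycluster-a","namespace":"default"},"reconcileID":"123","msg":"reconciling"}',
    '{"level":"info","RayCluster":{"name":"raycluster-b","namespace":"default"},"reconcileID":"456","msg":"reconciling"}',
]

def logs_for(name, lines):
    """Return parsed log records whose RayCluster name matches."""
    records = (json.loads(line) for line in lines)
    return [r for r in records if r.get("RayCluster", {}).get("name") == name]

matches = logs_for("raycluster-a", log_lines)
print(len(matches))  # 1
```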

RayJob

RayJob state machine refactor

  • [RayJob][Status][1/n] Redefine the definition of JobDeploymentStatusComplete (#1719, @kevin85421)
  • [RayJob][Status][2/n] Redefine ready for RayCluster to avoid using HTTP requests to check dashboard status (#1733, @kevin85421)
  • [RayJob][Status][3/n] Define JobDeploymentStatusInitializing (#1737, @kevin85421)
  • [RayJob][Status][4/n] Remove some JobDeploymentStatus and updateState function calls (#1743, @kevin85421)
  • [RayJob][Status][5/n] Refactor getOrCreateK8sJob (#1750, @kevin85421)
  • [RayJob][Status][6/n] Redefine JobDeploymentStatusComplete and clean up K8s Job after TTL (#1762, @kevin85421)
  • [RayJob][Status][7/n] Define JobDeploymentStatusNew explicitly (#1772, @kevin85421)
  • [RayJob][Status][8/n] Only a RayJob with the status Running can transition to Complete at this moment (#1774, @kevin85421)
  • [RayJob][Status][9/n] RayJob should not pass any changes to RayCluster (#1776, @kevin85421)
  • [RayJob][10/n] Add finalizer to the RayJob when the RayJob status is JobDeploymentStatusNew (#1780, @kevin85421)
  • [RayJob][Status][11/n] Refactor the suspend operation (#1782, @kevin85421)
  • [RayJob][Status][12/n] Resume suspended RayJob (#1783, @kevin85421)
  • [RayJob][Status][13/n] Make suspend operation atomic by introducing the new status Suspending (#1798, @kevin85421)
  • [RayJob][Status][14/n] Decouple the Initializing status and Running status (#1801, @kevin85421)
  • [RayJob][Status][15/n] Unify the codepath for the status transition to Suspended (#1805, @kevin85421)
  • [RayJob][Status][16/n] Refactor Running status (#1807, @kevin85421)
  • [RayJob][Status][17/n] Unify the codepath for status updates (#1814, @kevin85421)
  • [RayJob][Status][18/n] Control the entire lifecycle of the Kubernetes submitter Job using KubeRay (#1831, @kevin85421)
  • [RayJob][Status][19/n] Transition to Complete if the K8s Job fails (#1833, @kevin85421)

Others

  • [Refactor] Remove global utils.GetRayXXXClientFuncs (#1727, @rueian)
  • [Feature] Warn Users When Updating the RayClusterSpec in RayJob CR (#1778, @Yicheng-Lu-llll)
  • Add apply configurations to generated client (#1818, @astefanutti)
  • RayJob: inject RAY_DASHBOARD_ADDRESS environment variable for user-provided submitter templates (#1852, @andrewsykim)
  • [Bug] Submitter K8s Job fails even though the RayJob has a JobDeploymentStatus Complete and a JobStatus SUCCEEDED (#1919, @kevin85421)
  • add toleration for GPUs in sample pytorch RayJob (#1914, @andrewsykim)
  • Add a sample RayJob to fine-tune a PyTorch lightning text classifier with Ray Data (#1891, @andrewsykim)
  • rayjob controller: refactor environment variable check in unit tests (#1870, @andrewsykim)
  • RayJob: don't delete submitter job when ShutdownAfterJobFinishes=true (#1881, @andrewsykim)
  • rayjob controller: update EndTime to always be the time when the job deployment transitions to Complete status (#1872, @andrewsykim)
  • chore: remove ConfigMap from ray-job.kueue-toy-sample.yaml (#1976, @kevin85421)
  • [Kueue] Add a sample YAML for Kueue toy sample (#1956, @kevin85421)
  • [RayJob] Support ActiveDeadlineSeconds (#1933, @kevin85421)
  • [Feature][RayJob] Support light-weight job submission (#1893, @kevin85421)
  • [RayJob] Add JobDeploymentStatusFailed Status and Reason Field to Enhance Observability for Flyte/RayJob Integration (#1942, @Yicheng-Lu-llll)
  • [RayJob] Refactor Rayjob E2E Tests to Use Server-Side Apply (#1927, @Yicheng-Lu-llll)
  • [RayJob] Rewrite RayJob envtest (#1916, @kevin85421)
  • [Chore][RayJob] Remove the TODO of verifying the schema of RayJobInfo because it is already correct (#1911, @rueian)
  • [RayJob] Set missing CPU limit (#1899, @kevin85421)
  • [RayJob] Set the timeout of the HTTP client from 2 mins to 2 seconds (#1910, @kevin85421)
  • [Feature][RayJob] Support light-weight job submission with entrypoint_num_cpus, entrypoint_num_gpus and entrypoint_resources (#1904, @rueian)
  • [RayJob] Improve dashboard client log (#1903, @kevin85421)
  • [RayJob] Validate whether runtimeEnvYAML is a valid YAML string (#1898, @kevin85421)
  • [RayJob] Add additional print columns for RayJob (#1895, @andrewsykim)
  • [Test][RayJob] Transition to Complete if the JobStatus is STOPPED (#1871, @kevin85421)
  • [RayJob] Inject RAY_SUBMISSION_ID env variable for user provided submitter template (#1868, @kevin85421)
  • [RayJob] Transition to Complete if the JobStatus is STOPPED (#1855, @kevin85421)
  • [RayJob][Kueue] Move limitation check to validateRayJobSpec (#1854, @kevin85421)
  • [RayJob] Validate RayJob spec (#1813, @kevin85421)
  • [Test][RayJob] Kueue happy-path scenario (#1809, @kevin85421)
  • [RayJob] Delete the Kubernetes Job and its Pods immediately when suspending (#1791, @rueian)
  • [Feature][RayJob] Remove the deprecated RuntimeEnv from CRD. Use RuntimeEnvYAML instead. (#1792, @rueian)
  • [Bug][RayJob] Avoid nil pointer dereference ([#1756](https://github.c...
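Several RayJob features above (ActiveDeadlineSeconds #1933, light-weight HTTP-mode submission #1893, runtimeEnvYAML validation #1898) surface as CRD fields. A hedged sketch of how they might combine (field names as we understand the CRD; verify against the version installed in your cluster):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  entrypoint: python -c "import ray; ray.init()"
  # Mark the job failed if it does not finish within this window (#1933)
  activeDeadlineSeconds: 600
  # Light-weight submission: no separate submitter K8s Job (#1893)
  submissionMode: HTTPMode
  # Must be a valid YAML string; KubeRay validates it (#1898)
  runtimeEnvYAML: |
    pip:
      - requests
  shutdownAfterJobFinishes: true
  rayClusterSpec:
    # head and worker group specs omitted for brevity
```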

v1.0.0 release

07 Nov 06:12
1add258

KubeRay is officially in General Availability!

  • Bump the CRD version from v1alpha1 to v1.
  • Relocate almost all documentation to the Ray website.
  • Improve RayJob UX.
  • Improve GCS fault tolerance.

GCS fault tolerance

CRD versioning

RayService

  • [Hotfix][Bug] Avoid unnecessary zero-downtime upgrade (#1581, @kevin85421)
  • [Feature] Add an example for RayService high availability (#1566, @kevin85421)
  • [Feature] Add a flag to make zero downtime upgrades optional (#1564, @kevin85421)
  • [Bug][RayService] KubeRay does not recreate Serve applications if a head Pod without GCS FT recovers from a failure. (#1420, @kevin85421)
  • [Bug] Fix the filename of text summarizer YAML (#1415, @kevin85421)
  • [serve] Change text ml yaml to use french in user config (#1403, @zcin)
  • [services] Add text ml rayservice yaml (#1402, @zcin)
  • [Bug] Fix flakiness of RayService e2e tests (#1385, @kevin85421)
  • Add RayService sample test (#1377, @Darren221)
  • [RayService] Revisit the conditions under which a RayService is considered unhealthy and the default threshold (#1293, @kevin85421)
  • [RayService][Observability] Add more loggings about networking issues (#1282, @kevin85421)

RayJob

RayCluster

  • [Enhancement] Remove unused variables in constant.go (#1474, @evalaiyc98)
  • [Enhancement] GPU RayCluster doesn't work on GKE Autopilot (#1470, @kevin85421)
  • [Refactor] Parameterize TestGetAndCheckServeStatus (#1450, @evalaiyc98)
  • [Feature] Make replicas optional for WorkerGroupSpec (#1443, @kevin85421)
  • use raycluster app's name as podgroup name key word (#1446, @lowang-bh)
  • [Refactor] Make port name variables consistent and meaningful (#1389, @evalaiyc98)
  • [Feature] Use image of Ray head container as the default Ray Autoscaler container (#1401, @kevin85421)
  • Update Autoscaler YAML for the Autoscaler tutorial (#1400, @kevin85421)
  • [Feature] Ray container must be the first application container (#1379, @kevin85421)
  • [release blocker][Feature] Only Autoscaler can make decisions to delete Pods (#1253, @kevin85421)
  • [release blocker][Autoscaler] Randomly delete Pods when scaling down the cluster (#1251, @kevin85421)

Helm charts

KubeRay API Server

Documentation


v0.6.0 release

26 Jul 22:35
9b21af9

Highlights

RayService

  • [RayService][Observability] Add more logging for RayService troubleshooting (#1230, @kevin85421)
  • [Bug] Long image pull time will trigger blue-green upgrade after the head is ready (#1231, @kevin85421)
  • [RayService] Stable Diffusion example (#1181, @kevin85421)
  • [RayService] Update docs to use multi-app (#1179, @zcin)
  • [RayService] Change runtime env for e2e autoscaling test (#1178, @zcin)
  • [RayService] Add e2e tests (#1167, @zcin)
  • [RayService][docs] Improve explanation for config file and in-place updates (#1229, @zcin)
  • [RayService][Doc] RayService troubleshooting handbook (#1221, @kevin85421)
  • [Doc] Improve RayService doc (#1235, @kevin85421)
  • [Doc] Improve FAQ page and RayService troubleshooting guide (#1225, @kevin85421)
  • [RayService] Add RayService alb ingress CR (#1169, @sihanwang41)
  • [RayService] Add support for multi-app config in yaml-string format (#1156, @zcin)
  • [rayservice] Add support for getting multi-app status (#1136, @zcin)
  • [Refactor] Remove Dashboard Agent service (#1207, @kevin85421)
  • [Bug] KubeRay operator fails to get serve deployment status due to 500 Internal Server Error (#1173, @kevin85421)
  • MobileNet example (#1175, @kevin85421)
  • [Bug] fix RayActorOptionSpec.items.spec.serveConfig.deployments.rayActorOptions.memory int32 data type (#1220, @kevin85421)

RayJob

  • [RayJob] Submit job using K8s job instead of checking Status and using DashboardHTTPClient (#1177, @architkulkarni)
  • [Doc] [RayJob] Add documentation for submitterPodTemplate (#1228, @architkulkarni)
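With K8s-Job-based submission (#1177), the Pod that runs the submission command can be customized via submitterPodTemplate (#1228). A minimal sketch (the container name, image, and resource values are placeholders; see the linked documentation for the exact contract):

```yaml
apiVersion: ray.io/v1alpha1  # v0.6.0 predates the v1 CRD
kind: RayJob
metadata:
  name: rayjob-custom-submitter
spec:
  entrypoint: python my_script.py
  submitterPodTemplate:
    spec:
      restartPolicy: Never
      containers:
        - name: ray-job-submitter
          image: rayproject/ray:2.5.0
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
  rayClusterSpec:
    # omitted for brevity
```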

Autoscaler

  • [release blocker][Feature] Only Autoscaler can make decisions to delete Pods (#1253, @kevin85421)
  • [release blocker][Autoscaler] Randomly delete Pods when scaling down the cluster (#1251, @kevin85421)

Helm

  • [Helm][RBAC] Introduce the option crNamespacedRbacEnable to enable or disable the creation of Role/RoleBinding for RayCluster preparation (#1162, @kevin85421)
  • [Bug] Allow zero replica for workers for Helm (#968, @ducviet00)
  • [Bug] KubeRay tries to create ClusterRoleBinding when singleNamespaceInstall and rbacEnable are set to true (#1190, @kevin85421)

KubeRay API Server

Documentation

CI

Others

v0.5.2 release

14 Jun 21:10
aeed3cd

Changelog for v0.5.2

Highlights

The KubeRay 0.5.2 patch release includes the following improvements.

  • Allow specifying the entire headService and serveService YAML spec. Previously, only certain special fields such as labels and annotations were exposed to the user.
  • RayService stability improvements
    • RayService object’s Status is being updated due to frequent reconciliation (#1065, @kevin85421)
    • [RayService] Submit requests to the Dashboard after the head Pod is running and ready (#1074, @kevin85421)
    • Fix in HeadPod Service Generation logic which was causing frequent reconciliation (#1056, @msumitjain)
  • Allow watching multiple namespaces
    • [Feature] Watch CR in multiple namespaces with namespaced RBAC resources (#1106, @kevin85421)
  • Autoscaler stability improvements
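The multi-namespace watch (#1106) pairs namespaced RBAC resources with a list of namespaces to watch. A sketch of the kuberay-operator Helm values this might correspond to (the key names are assumptions; check the chart for your release):

```yaml
# kuberay-operator Helm values (illustrative)
singleNamespaceInstall: true   # namespaced Role/RoleBinding instead of ClusterRole
watchNamespace:
  - team-a
  - team-b
```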

Contributors

We'd like to thank the following contributors for their contributions to this release:

@ByronHsu, @Yicheng-Lu-llll, @anishasthana, @architkulkarni, @blublinsky, @chrisxstyles, @dirtyValera, @ecurtin, @jasoonn, @jjyao, @kevin85421, @kodwanis, @msumitjain, @oginskis, @psschwei, @scarlet25151, @sihanwang41, @tedhtchang, @varungup90, @xubo245

Features

Fixes

  • Fix CI (#1145, @kevin85421)
  • Fix config frequent update (#1014, @sihanwang41)
  • Fix for Sample YAML Config Test - 2.4.0 Failure due to 'suspend' Field (#1096, @Yicheng-Lu-llll)
  • Fix in HeadPod Service Generation logic which was causing frequent reconciliation (#1056, @msumitjain)
  • [Bug] Autoscaler doesn't support TLS (#1119, @chrisxstyles)
  • [Bug] Enable ResourceQuota by adding Resources for the health-check init container (#1043, @kevin85421)
  • [Bug] Fix null map handling in BuildServiceForHeadPod function (#1095, @architkulkarni)
  • [Bug] RayService restarts repeatedly with Autoscaler (#1037, @kevin85421)
  • [Bug] Service (Serve) changing port from 8000 to 9000 doesn't work (#1081, @kevin85421)
  • [Bug] autoscaler not working properly in rayjob (#1064, @Yicheng-Lu-llll)
  • [Bug] compatibility test for the nightly Ray image fails (#1055, @kevin85421)
  • [Bug] rayStartParams is required at this moment. (#1031, @kevin85421)
  • [Bug][Autoscaler] Operator does not remove workers (#1139, @kevin85421)
  • [Bug][Doc] fix the link error of operator document (#1046, @xubo245)
  • [Bug][GCS FT] Worker pods crash unexpectedly when gcs_server on head pod is killed (#1036, @kevin85421)
  • [Bug][breaking change] Unauthorized 401 error on fetching Ray Custom Resources from K8s API server (#1128, @kevin85421)
  • [Bug][k8s compatibility] k8s v1.20.7 ClusterIP svc do not updated under RayService (#1110, @kevin85421)
  • [Helm][ray-cluster] Fix parsing envFrom field in additionalWorkerGroups (#1039, @dirtyValera)

Documentation

v0.5.0 release

11 Apr 07:52
ee982a3

Highlights

The KubeRay 0.5.0 release includes the following improvements.

  • Interact with KubeRay via a Python client
  • Integrate KubeRay with Kubeflow to provide an interactive development environment.
  • Integrate KubeRay with Ray TLS authentication
  • Improve the user experience for KubeRay on AWS EKS
  • Fix some Kubernetes networking issues
  • Fix some stability bugs in RayJob and RayService

Contributors

The following individuals contributed to KubeRay 0.5.0. This list is alphabetical and incomplete.

@akanso @alex-treebeard @architkulkarni @cadedaniel @cskornel-doordash @davidxia @DmitriGekhtman @ducviet00 @gvspraveen @harryge00 @jasoonn @Jeffwan @kevin85421 @psschwei @scarlet25151 @sihanwang41 @wilsonwang371 @Yicheng-Lu-llll

Python client (alpha)(New!)

Kubeflow (New!)

  • [Feature][Doc] Kubeflow integration (#937, @kevin85421)
  • [Feature] Ray restricted podsecuritystandards for enterprise security and Kubeflow integration (#750, @kevin85421)

TLS authentication (New!)

AWS EKS (New!)

Kubernetes networking (New!)

  • Read cluster domain from resolv.conf or env (#951, @harryge00)
  • [Feature] Replace service name with Fully Qualified Domain Name (#938, @kevin85421)
  • [Feature] Add default init container in workers to wait for GCS to be ready (#973, @kevin85421)
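Taken together, these networking changes mean a worker addresses the head by fully qualified domain name and waits for the GCS before starting. Conceptually, the injected init container does something like the following (illustrative only, not KubeRay's exact generated command; the service name and port are placeholders):

```shell
# Wait until the GCS at the head service's FQDN answers, then proceed.
until ray health-check --address raycluster-sample-head-svc.default.svc.cluster.local:6379
do
  echo "waiting for GCS to be ready..."
  sleep 5
done
```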

Observability

RayCluster

RayJob (alpha)

  • [Hotfix][release blocker][RayJob] Prevent the HTTP client from submitting jobs before dashboard initialization completes (#1000, @kevin85421)
  • [RayJob] Propagate error traceback string when GetJobInfo doesn't return valid JSON (#943, @architkulkarni)
  • [RayJob][Doc] Fix RayJob sample config. (#807, @DmitriGekhtman)

RayService (alpha)

Helm

CI

Sample YAML files


v0.4.0 release

12 Dec 05:36
7374e2c

Highlights

The KubeRay 0.4.0 release includes the following improvements.

Contributors

The following individuals contributed to KubeRay 0.4.0. This list is alphabetical and incomplete.

@AlessandroPomponio @architkulkarni @Basasuya @DmitriGekhtman @IceKhan13 @asm582 @davidxia @dhaval0108 @haoxins @iycheng @jasoonn @Jeffwan @jianyuan @kaushik143 @kevin85421 @lizzzcai @orcahmlee @pcmoritz @peterghaddad @rafvasq @scarlet25151 @shrekris-anyscale @sigmundv @sihanwang41 @simon-mo @tbabej @tgaddair @ulfox @wilsonwang371 @wuisawesome

New features and integrations

Helm

These changes pertain to KubeRay's Helm charts.

  • [Bug] Remove an unused field (ingress.enabled) from KubeRay operator chart (#812, @kevin85421)
  • [helm] Add memory limits and resource documentation. (#789, @DmitriGekhtman)
  • [Helm] Expose security context in helm chart. (#773, @DmitriGekhtman)
  • [Helm] Clean up RayCluster Helm chart ahead of KubeRay 0.4.0 release (#751, @DmitriGekhtman)
  • [Feature] Expose initContainer image in RayCluster chart (#674, @kevin85421)
  • [Feature][Helm] Expose the autoscalerOptions (#666, @orcahmlee)
  • [Feature][Helm] Align the key of minReplicas and maxReplicas (#663, @orcahmlee)
  • Helm: add service type configuration to head group for ray-cluster (#614, @IceKhan13)
  • Allow annotations in ray cluster helm chart (#574, @sigmundv)
  • [Feature][Helm] Enable sidecar configuration in Helm chart (#604, @kevin85421)
  • [bugfix][apiserver helm]: Adding missing rbacenable value (#594, @dhaval0108)
  • [Bug] Modification of nameOverride will cause label selector mismatch for head node (#572, @kevin85421)
  • [Helm][minor] Make "disabled" flag for worker groups optional (#548, @kevin85421)
  • helm: Uncomment the disabled key for the default workergroup (#543, @tbabej)
  • Fix Helm chart default configuration (#530, @kevin85421)
  • helm-chart/ray-cluster: Allow setting pod lifecycle (#494, @ulfox)

CI

The changes in this section pertain to KubeRay CI, testing, and developer workflows.

KubeRay Operator deployment

The changes in this section pertain to deployment of the KubeRay Operator.

RayCluster controller

The changes in this section pertain to the RayCluster controller sub-component of the KubeRay Operator.
