Commit
Merge LM-Eval dev branch (trustyai-explainability#337)
* Add lm-eval-service controller (trustyai-explainability#258)
* feat: Initial database support (trustyai-explainability#246)
* Initial database support
  - Add status checking
  - Add better storage flags
  - Add spec.storage.format validation
  - Add DDL
  - Add HIBERNATE format to DB (test)
  - Update service image
  - Revert identifier to DATABASE
  - Update CR options (remove mandatory data)
* Remove default DDL generation env var
* Update service image to latest tag
* Add migration awareness
* Add updating pods for migration
* Change JDBC url from mysql to mariadb
* Fix TLS mount
* Revert images
* Remove redundant logic
* Fix comments
* feat: Add TLS certificate mount on ModelMesh (trustyai-explainability#255)
* feat: Add TLS certificate mount on ModelMesh
* Revert from http to https until kserve/modelmesh#147 is merged
* Add lm-eval-service controller
  Refactor the existing TrustyAIService controller and add the LMEvalService controller.
  Signed-off-by: Yihong Wang <[email protected]>
---------
Signed-off-by: Yihong Wang <[email protected]>
Co-authored-by: Rui Vieira <[email protected]>
* fix: Fix typo in operator's arguments (trustyai-explainability#261)
  Operator's arguments changed from `--eanble-services` to `--enable-services`. trustyai.opendatahub.io_lmevaljobs.yaml and zz_generated.deepcopy.go regenerated.
* feat: Add LMES driver build to GHA (trustyai-explainability#272)
* sync: sync dev/lm-eval with main branch (trustyai-explainability#271)
* feat: Initial database support (trustyai-explainability#246)
* Initial database support
  - Add status checking
  - Add better storage flags
  - Add spec.storage.format validation
  - Add DDL
  - Add HIBERNATE format to DB (test)
  - Update service image
  - Revert identifier to DATABASE
  - Update CR options (remove mandatory data)
* Remove default DDL generation env var
* Update service image to latest tag
* Add migration awareness
* Add updating pods for migration
* Change JDBC url from mysql to mariadb
* Fix TLS mount
* Revert images
* Remove redundant logic
* Fix comments
* feat: Add TLS certificate mount on ModelMesh (trustyai-explainability#255)
* feat: Add TLS certificate mount on ModelMesh
* Revert from http to https until kserve/modelmesh#147 is merged
* Pin oc version, ubi version (trustyai-explainability#263)
* Restore checkout of trustyai-exp (trustyai-explainability#265)
* Add operator installation robustness (trustyai-explainability#266)
* fix: Skip InferenceService patching for KServe RawDeployment (trustyai-explainability#262)
* feat: ConfigMap key to disable KServe Serverless configuration (trustyai-explainability#267)
* feat: Add support for custom certificates in database connection (trustyai-explainability#259)
* Add TLS endpoint for ModelMesh payload processors. (trustyai-explainability#268)
  Keep non-TLS endpoint for KServe Serverless (disabled by default)
---------
Signed-off-by: Yihong Wang <[email protected]>
Co-authored-by: Rui Vieira <[email protected]>
Co-authored-by: Rob Geada <[email protected]>
* Weekly sync up of dev/lm-eval branch (trustyai-explainability#278)
* feat: Initial database support (trustyai-explainability#246)
* Initial database support
  - Add status checking
  - Add better storage flags
  - Add spec.storage.format validation
  - Add DDL
  - Add HIBERNATE format to DB (test)
  - Update service image
  - Revert identifier to DATABASE
  - Update CR options (remove mandatory data)
* Remove default DDL generation env var
* Update service image to latest tag
* Add migration awareness
* Add updating pods for migration
* Change JDBC url from mysql to mariadb
* Fix TLS mount
* Revert images
* Remove redundant logic
* Fix comments
* feat: Add TLS certificate mount on ModelMesh (trustyai-explainability#255)
* feat: Add TLS certificate mount on ModelMesh
* Revert from http to https until kserve/modelmesh#147 is merged
* Pin oc version, ubi version (trustyai-explainability#263)
* Restore checkout of trustyai-exp (trustyai-explainability#265)
* Add operator installation robustness (trustyai-explainability#266)
* fix: Skip InferenceService patching for KServe RawDeployment (trustyai-explainability#262)
* feat: ConfigMap key to disable KServe Serverless configuration (trustyai-explainability#267)
* feat: Add support for custom certificates in database connection (trustyai-explainability#259)
* Add TLS endpoint for ModelMesh payload processors. (trustyai-explainability#268)
  Keep non-TLS endpoint for KServe Serverless (disabled by default)
* fix: Correct maxSurge and maxUnavailable (trustyai-explainability#275)
* feat: Add support for custom DB names (trustyai-explainability#257)
* feat: Add support for custom DB names
* fix: Correct custom DB name
---------
Signed-off-by: Yihong Wang <[email protected]>
Co-authored-by: Rui Vieira <[email protected]>
Co-authored-by: Rob Geada <[email protected]>
* Driver updates job's status periodically (trustyai-explainability#280)
  The driver periodically updates the LMEvalJob.Status.Message field with the output from lm-eval. The message pattern the driver captures looks like `Running text generation: 81%|`. Users can then use this information to check the progress of the job.
  Signed-off-by: Yihong Wang <[email protected]>
* Add Dockerfile for LMES job image (trustyai-explainability#276)
  Add Dockerfile for LMES job image and the needed files.
  Signed-off-by: Yihong Wang <[email protected]>
* feat: Add overlays (trustyai-explainability#283)
* feat: Add overlays
* Remove redundant lmes-tas overlay. Change job image name.
* Add job image build (trustyai-explainability#284)
* Change job image use midstream lm-evaluation-harness (trustyai-explainability#285)
* feat: support batch size (trustyai-explainability#290)
  Add batch size support in the LMEvalJob, which leverages the `--batch_size` option of `lm-evaluation-harness`. This only affects local models; `--batch_size` doesn't work for remote inference APIs.
  Signed-off-by: Yihong Wang <[email protected]>
* Add the `openai` package into the lmes job image (trustyai-explainability#292)
  Update the LMES job's Dockerfile to include the `openai` package.
  Signed-off-by: Yihong Wang <[email protected]>
* fix: fix dependency error in the job image (trustyai-explainability#296)
  Split up the unitxt and openai dependencies to avoid the conflict.
  Signed-off-by: Yihong Wang <[email protected]>
* feat: add device detection in lmes driver (trustyai-explainability#298)
  Added a new feature in the LMES driver to detect the available devices by using the PyTorch API. This feature can be disabled by passing the `--detect-device false` option.
  Signed-off-by: Yihong Wang <[email protected]>
* feat: support unitxt recipes (trustyai-explainability#301)
  Add new fields in the CRD to support unitxt recipes and leverage the driver to create the corresponding yaml files of the unitxt recipes.
  Signed-off-by: Yihong Wang <[email protected]>
* feat: support custom dataset (trustyai-explainability#309)
  Updated the CRD data struct to allow users to specify a custom Unitxt card in JSON format. The custom Unitxt card is equivalent to a custom dataset definition. Also restructured and updated the CRD to support Volumes, VolumeMounts, Env, Resources, Labels, and Annotations.
  Signed-off-by: Yihong Wang <[email protected]>
* feat: new pulling mechanism for job statuses (trustyai-explainability#314)
  Update the driver to keep running even after the user program finishes. The driver provides two APIs:
  - GetStatus(): retrieve job status
  - Shutdown(): properly tear down the driver
  On the controller side, it uses the `pod/exec` resource to run the driver command, which invokes the driver APIs to retrieve the job status and to shut down the driver when the job is done.
  Signed-off-by: Yihong Wang <[email protected]>
* Move operator's cmd/operator/main.go to cmd/main.go to keep operator-sdk compatibility (trustyai-explainability#295)
* Remove hardcoded job's user ID (trustyai-explainability#322)
* Fix mkdir command in Job dockerfile (trustyai-explainability#330)
* Refactor some lmesreconcile methods (trustyai-explainability#323)
* Refactor lmes reconcile options
  Signed-off-by: ted chang <[email protected]>
* Update controllers/lmes/lmevaljob_controller.go
  Co-authored-by: Yihong Wang <[email protected]>
* Update controllers/lmes/lmevaljob_controller.go
  Co-authored-by: Yihong Wang <[email protected]>
  Signed-off-by: ted chang <[email protected]>
---------
Signed-off-by: ted chang <[email protected]>
Co-authored-by: Yihong Wang <[email protected]>
* tidy: clean up lmes-job image (trustyai-explainability#333)
  Remove BAM related packages and patch.
  Signed-off-by: Yihong Wang <[email protected]>
* Enable job suspend for Kueue (trustyai-explainability#317)
* Refactor lmes reconcile options
  Signed-off-by: ted chang <[email protected]>
* Update controllers/lmes/lmevaljob_controller.go
  Co-authored-by: Yihong Wang <[email protected]>
* Update controllers/lmes/lmevaljob_controller.go
  Co-authored-by: Yihong Wang <[email protected]>
  Signed-off-by: ted chang <[email protected]>
* Enable job suspend for Kueue
  Signed-off-by: ted chang <[email protected]>
---------
Signed-off-by: ted chang <[email protected]>
Co-authored-by: Yihong Wang <[email protected]>
* Add overlay placeholders for main merge (trustyai-explainability#334)
* sync: sync up dev/lm-eval branch with main branch (trustyai-explainability#336)
* [CI] Run tests from trustyai-tests (trustyai-explainability#279)
* Change Dockerfile to clone trustyai-tests
* Add PYTEST_MARKERS env and remove TESTS_REGEX
* RHOAIENG-12274: Update operator's overlays (trustyai-explainability#287)
* Update operator's overlays
* Update kustomization.yaml
* Add devflag printout to GH Action comment (trustyai-explainability#289)
* Add timeout loop to DSC install (trustyai-explainability#305)
* RHOAIENG-13625: Add DBAvailable status to CR (trustyai-explainability#304)
* Add DBAvailable status to CR
* Remove probes
* Add KServe destination rule for Inference Services in the ServiceMesh (trustyai-explainability#315)
* Add DestinationRule creation for KServe serverless
* Add permissions for destination rules
* Add role for destination rules
* Add missing role for creating destination rules
* Fix spacing in DestinationRule template
* Add check if DestinationRule CRD is present before creating it (trustyai-explainability#316)
* Add check for DestinationRule CRD
* Add API extensions to operator's scheme
* Add permission for CRD resource
* Fix operator metrics service target port (trustyai-explainability#320)
* Add readiness probes (trustyai-explainability#312)
* Enable KServe serverless in the rhoai overlay (trustyai-explainability#321)
* Update overlay images (trustyai-explainability#331)
* Add correct CA cert to JDBC (trustyai-explainability#324)
* Add correct CA cert to JDBC
* Add require SSL
* Support for VirtualServices for InferenceLogger traffic (trustyai-explainability#332)
* Generate KServe Inference Logger in conformance with DestinationRule and VirtualService
* Add VirtualService creation for models in the mesh
* Add permissions for VirtualServices
* Update manifests for VirtualServices
* Fix VirtualServiceName variable
* fix yaml linter after the sync
  Signed-off-by: Yihong Wang <[email protected]>
* tidy the go.mod and go.sum as well
  Signed-off-by: Yihong Wang <[email protected]>
---------
Signed-off-by: Yihong Wang <[email protected]>
Co-authored-by: Adolfo Aguirrezabal <[email protected]>
Co-authored-by: Rui Vieira <[email protected]>
Co-authored-by: Rob Geada <[email protected]>
Co-authored-by: Rui Vieira <[email protected]>
---------
Signed-off-by: Yihong Wang <[email protected]>
Signed-off-by: ted chang <[email protected]>
Co-authored-by: Yihong Wang <[email protected]>
Co-authored-by: Rob Geada <[email protected]>
Co-authored-by: ted chang <[email protected]>
Co-authored-by: Adolfo Aguirrezabal <[email protected]>
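
The status update described for trustyai-explainability#280 captures progress lines such as `Running text generation: 81%|` from the lm-eval output. As a rough illustration only, the Go sketch below shows one way such a line could be extracted with a regular expression. The pattern, the `latestProgress` helper, and the sample output are assumptions made for this example; they are not taken from the driver's source, which may match the output differently.

```go
package main

import (
	"fmt"
	"regexp"
)

// progressPattern is a hypothetical pattern for progress-bar output such as
// "Running text generation: 81%|". It is an assumption for illustration,
// not the expression used by the actual LMES driver.
var progressPattern = regexp.MustCompile(`([A-Za-z][A-Za-z ]*): +(\d{1,3})%\|`)

// latestProgress scans raw program output and returns the most recent
// progress message in the form "<label>: <percent>%". The boolean is false
// when no progress line is present.
func latestProgress(output string) (string, bool) {
	matches := progressPattern.FindAllStringSubmatch(output, -1)
	if len(matches) == 0 {
		return "", false
	}
	last := matches[len(matches)-1]
	return fmt.Sprintf("%s: %s%%", last[1], last[2]), true
}

func main() {
	// Progress bars usually redraw on one line using carriage returns,
	// so a single chunk of output can hold several updates.
	out := "Running text generation:  12%|#         |\rRunning text generation:  81%|########  |"
	if msg, ok := latestProgress(out); ok {
		// A driver could write a string like this into a status message field.
		fmt.Println(msg)
	}
}
```

Taking the last match rather than the first means that a buffered chunk containing several redraws yields the newest percentage.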