
Take env vars from .env or shell vars #543

Status: Open — wants to merge 2 commits into base: main
Conversation

javiermtorres (Contributor)

What's changing

The Makefile will now incorporate the RAY_WORKER_GPUS, RAY_WORKER_GPUS_FRACTION, and INFERENCE_PIP_REQS environment variables from the shell or from the .env file (if it exists). This allows selective GPU use and prevents tasks from getting stuck in a pending state during CPU-only test runs.
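The mechanism can be sketched roughly as follows (a simplified sketch; the PR's actual Makefile may differ). "-include" silently skips a missing .env, and "?=" assigns only when the variable was not already set by the shell environment or by the included file:

```makefile
# Pull in .env if present; the leading "-" suppresses the error when it is missing.
-include .env

# Defaults, used only when the variable is not already set
# (either by the shell environment or by .env above).
RAY_WORKER_GPUS ?= "0.0"
RAY_WORKER_GPUS_FRACTION ?= "0.0"
INFERENCE_PIP_REQS ?= ../jobs/inference/requirements_cpu.txt
```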

How to test it

From the base directory:

  1. export RAY_WORKER_GPUS="0.0"
  2. export RAY_WORKER_GPUS_FRACTION="0.0"
  3. make test-backend-integration-target

Tests should pass, but the Ray dashboard will show failed jobs (instead of pending ones).
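Why exporting the variables first changes the outcome: make's "?=" operator defers to anything already set in the environment. The same precedence can be illustrated with the analogous POSIX shell expansion — a sketch for illustration, not code from this PR:

```shell
# Shell analogue of make's "?=": assign a default only if the variable is unset.
unset RAY_WORKER_GPUS
: "${RAY_WORKER_GPUS:=0.0}"   # unset, so the default applies
echo "$RAY_WORKER_GPUS"       # prints 0.0

RAY_WORKER_GPUS="1.0"
: "${RAY_WORKER_GPUS:=0.0}"   # already set, so the default is ignored
echo "$RAY_WORKER_GPUS"       # prints 1.0
```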

Additional notes for reviewers

N/A

I already...

  • Tested the changes in a working environment to ensure they work as expected
  • Added some tests for any new functionality
    • N/A
  • Updated the documentation (both comments in code and product documentation under /docs)
  • Checked if a (backend) DB migration step was required and included it if required
    • N/A

RAY_WORKER_GPUS ?= "0.0"
RAY_WORKER_GPUS_FRACTION ?= "0.0"
INFERENCE_PIP_REQS ?= ../jobs/inference/requirements_cpu.txt

agpituk (Contributor) — Dec 20, 2024:

I think including the .env here is perfect, but shouldn't we also put the vars you just added in the .env? You can put defaults in the .env.example file.

Contributor:

How would this affect cloud/hybrid settings? On those, the user should be able to use a GPU.

javiermtorres (Contributor, Author):

This relates to #453 (comment) from @aittalam. If we keep the PR as is, then I'll include the .env entries. We may adopt some other, more flexible, per-job way of specifying non-GPU requirements at a later stage.

@ividal this is currently resolved at lumigator startup time. Whoever creates the cloud service would be responsible for setting this value appropriately (e.g. via the helm charts provided with lumigator). AFAICT these Makefiles would not be used in those cases.

ividal (Contributor) left a comment:

Thanks! FYI, explicitly setting RAY_WORKER_GPUS and RAY_WORKER_GPUS_FRACTION solved the integration test problems on #519 .

My only concern is how test/regular use differ from each other and how that affects GPU users.


-include .env
# GPU related settings
#
RAY_WORKER_GPUS ?= "0.0"
Contributor:

Suggestion to avoid issues with users who do have a GPU: check whether "nvidia-smi" is present, and if it isn't, force CPU mode.

This would not work for AMD GPU users, but my assumption is that the README's request to set the vars via the .env file, coupled with your "?=", takes care of them.
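The suggested check could look something like this hypothetical shell sketch (not part of the PR; as noted above, detecting nvidia-smi only covers NVIDIA hardware):

```shell
# Hypothetical wrapper: force CPU-only Ray settings when nvidia-smi is absent.
if command -v nvidia-smi >/dev/null 2>&1; then
    # Keep whatever the user configured, falling back to one full GPU.
    export RAY_WORKER_GPUS="${RAY_WORKER_GPUS:-1.0}"
    export RAY_WORKER_GPUS_FRACTION="${RAY_WORKER_GPUS_FRACTION:-1.0}"
else
    echo "nvidia-smi not found; forcing CPU mode" >&2
    export RAY_WORKER_GPUS="0.0"
    export RAY_WORKER_GPUS_FRACTION="0.0"
fi
```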

javiermtorres (Contributor, Author):

> Suggestion to avoid issues with users who do have a GPU: check whether "nvidia-smi" is present, and if it isn't, force CPU mode.

My impression is that we have a few settings (like the Ray GPU env vars and the requirements.txt repo URL prefix lines) that we need to coordinate. I don't really like auto-detection, since IMHO the tools themselves (vendor tools like this, Python settings, Ray itself...) don't seem well coordinated, but issuing a warning or some logging seems OK to me ("GPU selected, but no nvidia-smi present" or the like).

Contributor:

So...

  • Is the user responsible for setting GPUs? [Y/n]
  • Is the default behaviour CPU? [Y/n]
  • How exactly does a user force GPU usage? [On the .env via RAY_WORKER_GPU*/...]
    We just need to be explicit in our guide about how the user can run lumigator with GPU support.
  • TODO:
    • Here: add a log line ("setting GPU/CPU") so the user knows what's going on.
    • Separate issue: document GPU/CPU usage in our user docs.

Does this sound about right?
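The "add a log line" TODO could be as small as a single $(info ...) line in the Makefile — a hypothetical sketch, not what the PR ships:

```makefile
# Hypothetical: surface the effective setting so users know which mode is active.
$(info RAY_WORKER_GPUS=$(RAY_WORKER_GPUS) - set to "0.0" for CPU-only runs)
```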

@@ -0,0 +1,11 @@
--extra-index-url https://download.pytorch.org/whl/cpu
Contributor:

Are we sure this is only used in tests? What gets installed for regular usage and what happens to people who do not have a GPU in that case?

javiermtorres (Contributor, Author):

This is somewhat addressed in:

  • INFERENCE_PIP_REQS (default value `../jobs/inference/requirements_cpu.txt`)

although a more verbose description would probably be desirable.
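For context on the line under discussion: --extra-index-url tells pip to consult PyTorch's CPU-only wheel index in addition to PyPI, so torch resolves without pulling in CUDA dependencies. A hypothetical minimal CPU requirements file along these lines (the "torch" entry is illustrative, not the PR's actual pin list):

```text
--extra-index-url https://download.pytorch.org/whl/cpu
torch
```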
