
Take env vars from .env or shell vars #543

Status: Open — wants to merge 2 commits into base: main
Conversation

javiermtorres (Contributor)

What's changing

The Makefile will now incorporate the RAY_WORKER_GPUS, RAY_WORKER_GPUS_FRACTION, and INFERENCE_PIP_REQS environment variables from the shell or from the .env file (if it exists). This allows selective GPU use and prevents tasks from getting stuck in a pending state during CPU-only test runs.
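The mechanism can be sketched roughly as follows (a simplified sketch; the PR's actual Makefile may differ). "-include" silently skips a missing .env, and "?=" assigns only when the variable was not already set by the shell environment or by the included file:

```makefile
# Pull in .env if present; the leading "-" suppresses the error when it is missing.
-include .env

# Defaults, used only when the variable is not already set
# (either by the shell environment or by .env above).
RAY_WORKER_GPUS ?= "0.0"
RAY_WORKER_GPUS_FRACTION ?= "0.0"
INFERENCE_PIP_REQS ?= ../jobs/inference/requirements_cpu.txt
```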

How to test it

From the base directory:

  1. export RAY_WORKER_GPUS="0.0"
  2. export RAY_WORKER_GPUS_FRACTION="0.0"
  3. make test-backend-integration-target

Tests should pass, but the Ray dashboard will show failed jobs (instead of pending ones).
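Why exporting the variables first changes the outcome: make's "?=" operator defers to anything already set in the environment. The same precedence can be illustrated with the analogous POSIX shell expansion — a sketch for illustration, not code from this PR:

```shell
# Shell analogue of make's "?=": assign a default only if the variable is unset.
unset RAY_WORKER_GPUS
: "${RAY_WORKER_GPUS:=0.0}"   # unset, so the default applies
echo "$RAY_WORKER_GPUS"       # prints 0.0

RAY_WORKER_GPUS="1.0"
: "${RAY_WORKER_GPUS:=0.0}"   # already set, so the default is ignored
echo "$RAY_WORKER_GPUS"       # prints 1.0
```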

Additional notes for reviewers

N/A

I already...

  • Tested the changes in a working environment to ensure they work as expected
  • Added some tests for any new functionality
    • N/A
  • Updated the documentation (both comments in code and product documentation under /docs)
  • Checked if a (backend) DB migration step was required and included it if required
    • N/A

RAY_WORKER_GPUS ?= "0.0"
RAY_WORKER_GPUS_FRACTION ?= "0.0"
INFERENCE_PIP_REQS ?= ../jobs/inference/requirements_cpu.txt

agpituk (Contributor) — Dec 20, 2024:

I think including the .env here is perfect, but shouldn't we also put the vars you just added in the .env? You can put defaults in the .env.example file.

Contributor:

How would this affect cloud/hybrid settings? On those, the user should be able to use a GPU.

javiermtorres (Contributor, Author):

This relates to #453 (comment) from @aittalam. If we keep the PR as is, then I'll include the .env entries. We may adopt some other, more flexible, per-job way of specifying non-GPU requirements at a later stage.

@ividal this is currently resolved at lumigator startup time. Whoever creates the cloud service would be responsible for setting this value appropriately (e.g. via the helm charts provided with lumigator). AFAICT these Makefiles would not be used in those cases.

ividal (Contributor) left a comment:

Thanks! FYI, explicitly setting RAY_WORKER_GPUS and RAY_WORKER_GPUS_FRACTION solved the integration test problems on #519 .

My only concern is how test/regular use differ from each other and how that affects GPU users.


-include .env
# GPU related settings
#
RAY_WORKER_GPUS ?= "0.0"
Contributor:

Suggestion to avoid issues with users who do have a GPU: check whether "nvidia-smi" is present, and if it isn't, force CPU mode.

This would not work for AMD GPU users, but my assumption is that the README's request to set the vars via the .env file, coupled with your "?=", takes care of them.
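The suggested check could look something like this hypothetical shell sketch (not part of the PR; as noted above, detecting nvidia-smi only covers NVIDIA hardware):

```shell
# Hypothetical wrapper: force CPU-only Ray settings when nvidia-smi is absent.
if command -v nvidia-smi >/dev/null 2>&1; then
    # Keep whatever the user configured, falling back to one full GPU.
    export RAY_WORKER_GPUS="${RAY_WORKER_GPUS:-1.0}"
    export RAY_WORKER_GPUS_FRACTION="${RAY_WORKER_GPUS_FRACTION:-1.0}"
else
    echo "nvidia-smi not found; forcing CPU mode" >&2
    export RAY_WORKER_GPUS="0.0"
    export RAY_WORKER_GPUS_FRACTION="0.0"
fi
```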

javiermtorres (Contributor, Author):

> Suggestion to avoid issues with users who do have a GPU: check whether "nvidia-smi" is present, and if it isn't, force CPU mode.

My impression is that we have a few settings (like the Ray GPU env vars and the requirements.txt repo URL prefix lines) that we need to coordinate. I don't really like auto-detection, since IMHO the tools themselves (vendor tools like this, Python settings, Ray itself...) don't seem well coordinated, but issuing a warning or some logging seems OK to me ("GPU selected, but no nvidia-smi present" or the like).

Contributor:

So...

  • Is the user responsible for setting GPUs? [Y/n]
  • Is the default behaviour CPU? [Y/n]
  • How exactly does a user force GPU usage? [On the .env via RAY_WORKER_GPU*/...]
    We just need to be explicit in our guide about how the user can run lumigator with GPU support.
  • TODO:
    • Here: add a log line ("setting GPU/CPU") so the user knows what's going on.
    • Separate issue: document GPU/CPU usage in our user docs.

Does this sound about right?
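The "add a log line" TODO could be as small as a single $(info ...) line in the Makefile — a hypothetical sketch, not what the PR ships:

```makefile
# Hypothetical: surface the effective setting so users know which mode is active.
$(info RAY_WORKER_GPUS=$(RAY_WORKER_GPUS) - set to "0.0" for CPU-only runs)
```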

@@ -0,0 +1,11 @@
--extra-index-url https://download.pytorch.org/whl/cpu
Contributor:

Are we sure this is only used in tests? What gets installed for regular usage and what happens to people who do not have a GPU in that case?

javiermtorres (Contributor, Author):

This is somewhat addressed in:

  • INFERENCE_PIP_REQS (default value `../jobs/inference/requirements_cpu.txt`)

although a more verbose description would probably be desirable.
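For context on the line under discussion: --extra-index-url tells pip to consult PyTorch's CPU-only wheel index in addition to PyPI, so torch resolves without pulling in CUDA dependencies. A hypothetical minimal CPU requirements file along these lines (the "torch" entry is illustrative, not the PR's actual pin list):

```text
--extra-index-url https://download.pytorch.org/whl/cpu
torch
```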
