Ray Launcher: Additional properties are not allowed #2927

Open
2 tasks done
philkohl opened this issue Jul 21, 2024 · 0 comments
Labels
bug Something isn't working


🐛 Bug

Description

Hi,

Thank you for this beautiful library. I have been enjoying it for years!
Over the last few days I wanted to use the multirun command on an AWS cluster, and I noticed that you provide a Ray launcher: https://hydra.cc/docs/plugins/ray_launcher/

I started with the "simple app" example from the documentation (https://github.com/facebookresearch/hydra/tree/main/plugins/hydra_ray_launcher/examples/simple). Running it out of the box resulted in the error jsonschema.exceptions.ValidationError: Additional properties are not allowed ('autoscaling_mode', 'initial_workers', 'target_utilization_fraction' were unexpected). See the full output under "Stack trace/error message".

At that point I created my own custom Ray AWS config and tried commenting out the additional properties. To follow along and reproduce my steps, I created a small repo: https://github.com/philkohl/hydra-ray-aws-example
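
The workaround described above amounts to dropping the three rejected keys from the cluster config before it reaches Ray's schema validation. A minimal sketch of that idea on a plain dict follows; the function name `strip_rejected_keys` and the contents of `example_cfg` are hypothetical and for illustration only, and only the three key names are taken from the error message. It is not the plugin's code.

```python
# Keys that the ValidationError reports as unexpected.
REJECTED_KEYS = ("autoscaling_mode", "initial_workers", "target_utilization_fraction")


def strip_rejected_keys(cluster_cfg: dict) -> dict:
    """Return a copy of cluster_cfg without the rejected top-level keys."""
    return {k: v for k, v in cluster_cfg.items() if k not in REJECTED_KEYS}


# Hypothetical cluster config; the values are made up for this example.
example_cfg = {
    "cluster_name": "default",
    "max_workers": 1,
    "autoscaling_mode": "default",
    "initial_workers": 0,
    "target_utilization_fraction": 0.8,
}

cleaned = strip_rejected_keys(example_cfg)
print(sorted(cleaned))  # ['cluster_name', 'max_workers']
```

The function returns a new dict and leaves the input untouched, so the original config can still be inspected when comparing against the schema.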

With this workaround I was able to start a Ray head node, but I could not submit the tasks because of an import error: ImportError: attempted relative import with no known parent package. For details, see the second stack trace below.

Is there something wrong with my config, or is there an issue in the plugin?

EDIT:
I think I found a problem in my config for creating the Python environment, and I pushed a fix to my repo. However, I still face the import problem.

Therefore, I tested a workaround: replacing all relative imports with absolute imports (package notation) in the hydra_ray_launcher files in my site-packages. E.g.:

from hydra_plugins.hydra_ray_launcher._launcher_util import (
    JOB_RETURN_PICKLE,
    JOB_SPEC_PICKLE,
    launch_job_on_ray,
    start_ray,
)

instead of

from ._launcher_util import (
    JOB_RETURN_PICKLE,
    JOB_SPEC_PICKLE,
    launch_job_on_ray,
    start_ray,
)

With this change, it seems to work.
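
For context, the general Python behaviour behind this error can be reproduced in isolation. In the second stack trace, `_remote_invoke.py` is executed by file path from a temporary directory, and a file run that way has no parent package, so `from ._launcher_util import ...` cannot resolve. The sketch below uses a throwaway package (`demo_pkg`, `_util`, and `invoke` are hypothetical names, not part of the plugin) and runs the same file once by path and once with `-m`. It illustrates the mechanism only and is not a confirmed diagnosis of the plugin.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def run_demo() -> tuple:
    """Run one file by path and as a module; return (script_stderr, module_stdout)."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "demo_pkg"  # hypothetical package name
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "_util.py").write_text("VALUE = 42\n")
        (pkg / "invoke.py").write_text("from ._util import VALUE\nprint(VALUE)\n")

        # Executed by path: the file has no parent package, so the
        # relative import raises ImportError.
        as_script = subprocess.run(
            [sys.executable, str(pkg / "invoke.py")],
            capture_output=True,
            text=True,
        )

        # Executed as a module inside the package: the relative import resolves.
        env = {**os.environ, "PYTHONPATH": tmp}
        as_module = subprocess.run(
            [sys.executable, "-m", "demo_pkg.invoke"],
            capture_output=True,
            text=True,
            cwd=tmp,
            env=env,
        )
        return as_script.stderr, as_module.stdout


script_err, module_out = run_demo()
print("attempted relative import" in script_err)
print(module_out.strip())
```

Absolute imports, as in the workaround above, avoid the problem because they do not depend on how the file was started, only on the package being importable on the remote node.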

Checklist

  • I checked on the latest version of Hydra
  • I created a minimal repro (See this for tips).

To reproduce

** Minimal Code/Config snippet to reproduce **

  1. Follow the documentation for the "Simple app" (https://hydra.cc/docs/plugins/ray_launcher/)
  2. I also created a repository for reproduction: https://github.com/philkohl/hydra-ray-aws-example

** Stack trace/error message **

[2024-07-21 10:43:47,244][HYDRA] Ray Launcher is launching 3 jobs, 
[2024-07-21 10:43:47,244][HYDRA]        #0 : task=1
[2024-07-21 10:43:47,319][HYDRA]        #1 : task=2
[2024-07-21 10:43:47,391][HYDRA]        #2 : task=3
[2024-07-21 10:43:47,469][HYDRA] Pickle for jobs: /tmp/tmp6574t01r/job_spec.pkl
Cluster: default

2024-07-21 10:43:47,480 INFO util.py:382 -- setting max workers for head node type to 0
Checking AWS environment settings
Traceback (most recent call last):
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib/python3.12/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
           ^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib/python3.12/site-packages/hydra/_internal/utils.py", line 466, in <lambda>
    lambda: hydra.multirun(
            ^^^^^^^^^^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib/python3.12/site-packages/hydra/_internal/hydra.py", line 162, in multirun
    ret = sweeper.sweep(arguments=task_overrides)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib/python3.12/site-packages/hydra/_internal/core_plugins/basic_sweeper.py", line 177, in sweep
    results = self.launcher.launch(batch, initial_job_idx=initial_job_idx)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib/python3.12/site-packages/hydra_plugins/hydra_ray_launcher/ray_aws_launcher.py", line 62, in launch
    return _core_aws.launch(
           ^^^^^^^^^^^^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib/python3.12/site-packages/hydra_plugins/hydra_ray_launcher/_core_aws.py", line 106, in launch
    return launch_jobs(
           ^^^^^^^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib/python3.12/site-packages/hydra_plugins/hydra_ray_launcher/_core_aws.py", line 117, in launch_jobs
    sdk.create_or_update_cluster(
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib64/python3.12/site-packages/ray/autoscaler/sdk/sdk.py", line 38, in create_or_update_cluster
    return commands.create_or_update_cluster(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib64/python3.12/site-packages/ray/autoscaler/_private/commands.py", line 314, in create_or_update_cluster
    config = _bootstrap_config(config, no_config_cache=no_config_cache)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib64/python3.12/site-packages/ray/autoscaler/_private/commands.py", line 408, in _bootstrap_config
    validate_config(config)
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib64/python3.12/site-packages/ray/autoscaler/_private/util.py", line 162, in validate_config
    raise e from None
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib64/python3.12/site-packages/ray/autoscaler/_private/util.py", line 160, in validate_config
    jsonschema.validate(config, schema)
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib/python3.12/site-packages/jsonschema/validators.py", line 1332, in validate
    raise error
jsonschema.exceptions.ValidationError: Additional properties are not allowed ('autoscaling_mode', 'initial_workers', 'target_utilization_fraction' were unexpected)
Traceback (most recent call last):
  File "/tmp/tmp.tTUvsfPQn6/_remote_invoke.py", line 18, in <module>
    from ._launcher_util import (
ImportError: attempted relative import with no known parent package
Shared connection to 3.79.57.96 closed.
Traceback (most recent call last):
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib/python3.12/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
           ^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib/python3.12/site-packages/hydra/_internal/utils.py", line 466, in <lambda>
    lambda: hydra.multirun(
            ^^^^^^^^^^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib/python3.12/site-packages/hydra/_internal/hydra.py", line 162, in multirun
    ret = sweeper.sweep(arguments=task_overrides)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib/python3.12/site-packages/hydra/_internal/core_plugins/basic_sweeper.py", line 177, in sweep
    results = self.launcher.launch(batch, initial_job_idx=initial_job_idx)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib/python3.12/site-packages/hydra_plugins/hydra_ray_launcher/ray_aws_launcher.py", line 62, in launch
    return _core_aws.launch(
           ^^^^^^^^^^^^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib/python3.12/site-packages/hydra_plugins/hydra_ray_launcher/_core_aws.py", line 106, in launch
    return launch_jobs(
           ^^^^^^^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib/python3.12/site-packages/hydra_plugins/hydra_ray_launcher/_core_aws.py", line 154, in launch_jobs
    sdk.run_on_cluster(
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib64/python3.12/site-packages/ray/autoscaler/sdk/sdk.py", line 109, in run_on_cluster
    return commands.exec_cluster(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib64/python3.12/site-packages/ray/autoscaler/_private/commands.py", line 1167, in exec_cluster
    result = _exec(
             ^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib64/python3.12/site-packages/ray/autoscaler/_private/commands.py", line 1233, in _exec
    return updater.cmd_runner.run(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib64/python3.12/site-packages/ray/autoscaler/_private/command_runner.py", line 383, in run
    return self._run_helper(final_cmd, with_output, exit_on_fail, silent=silent)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib64/python3.12/site-packages/ray/autoscaler/_private/command_runner.py", line 291, in _run_helper
    raise click.ClickException(
click.exceptions.ClickException: Command failed:

  ssh -tt -i /home/philipp/.ssh/hydra-philipp.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_7aa2b466ee/7505d64a54/%C -o ControlPersist=10s -o ConnectTimeout=120s [email protected] bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (python /tmp/tmp.tTUvsfPQn6/_remote_invoke.py /tmp/tmp.tTUvsfPQn6)'

Expected Behavior

  1. Spin up Ray cluster
  2. Submit tasks
  3. See output similar to the one shown in the documentation:
$ python my_app.py --multirun task=1,2,3
[HYDRA] Ray Launcher is launching 3 jobs, 
[HYDRA]        #0 : task=1
[HYDRA]        #1 : task=2
[HYDRA]        #2 : task=3
[HYDRA] Pickle for jobs: /var/folders/n_/9qzct77j68j6n9lh0lw3vjqcn96zxl/T/tmpqqg4v4i7/job_spec.pkl
Cluster: default
...
INFO services.py:1172 -- View the Ray dashboard at http://localhost:8265
(pid=3374) [__main__][INFO] - Executing task 1
(pid=3374) [__main__][INFO] - Executing task 2
(pid=3374) [__main__][INFO] - Executing task 3
...
[HYDRA] Stopping cluster now. (stop_cluster=true)
[HYDRA] Deleted the cluster (provider.cache_stopped_nodes=false)
Destroying cluster. Confirm [y/N]: y [automatic, due to --yes]
...
No nodes remaining.

System information

  • Hydra Version : 1.3.2
  • Python version : 3.9 / 3.11 / 3.12
  • Virtual environment type and version : Poetry
  • Operating system : Linux (Fedora 40)
@philkohl philkohl added the bug Something isn't working label Jul 21, 2024