Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Add thread check to job mgr callback #48005

Conversation

dentiny
Copy link
Contributor

@dentiny dentiny commented Oct 13, 2024

This PR followup for comment #47793 (comment), and adds a thread checking to GCS job manager callback to make sure no concurrent access for data members.

@dentiny dentiny requested a review from a team as a code owner October 13, 2024 02:17
@dentiny
Copy link
Contributor Author

dentiny commented Oct 13, 2024

@rynewang Would appreciate a code review :)

@rynewang rynewang enabled auto-merge (squash) October 13, 2024 04:00
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Oct 13, 2024
@dentiny
Copy link
Contributor Author

dentiny commented Oct 13, 2024

@rynewang Could you please shed some lights on the CI pipeline?
Error message looks like

[2024-10-13T07:57:52Z] (IMPALA pid=7862)   File "/rayci/python/ray/rllib/utils/actor_manager.py", line 194, in apply
--
  | [2024-10-13T07:57:52Z] (IMPALA pid=7862)     raise e
  | [2024-10-13T07:57:52Z] (IMPALA pid=7862)   File "/rayci/python/ray/rllib/utils/actor_manager.py", line 183, in apply
  | [2024-10-13T07:57:52Z] (IMPALA pid=7862)     return func(self, *args, **kwargs)
  | [2024-10-13T07:57:52Z] (IMPALA pid=7862)   File "/rayci/python/ray/rllib/algorithms/impala/impala.py", line 781, in _remote_sample_get_state_and_metrics
  | [2024-10-13T07:57:52Z] (IMPALA pid=7862)     _episodes = _worker.sample()
  | [2024-10-13T07:57:52Z] (IMPALA pid=7862)   File "/rayci/python/ray/rllib/env/single_agent_env_runner.py", line 189, in sample
  | [2024-10-13T07:57:52Z] (IMPALA pid=7862)     samples = self._sample_timesteps(
  | [2024-10-13T07:57:52Z] (IMPALA pid=7862)   File "/rayci/python/ray/rllib/env/single_agent_env_runner.py", line 305, in _sample_timesteps
  | [2024-10-13T07:57:52Z] (IMPALA pid=7862)     to_env = self._module_to_env(
  | [2024-10-13T07:57:52Z] (IMPALA pid=7862)   File "/rayci/python/ray/rllib/connectors/connector_pipeline_v2.py", line 93, in __call__
  | [2024-10-13T07:57:52Z] (IMPALA pid=7862)     batch = connector(
  | [2024-10-13T07:57:52Z] (IMPALA pid=7862)   File "/rayci/python/ray/rllib/connectors/module_to_env/get_actions.py", line 62, in __call__
  | [2024-10-13T07:57:52Z] (IMPALA pid=7862)     self._get_actions(batch, rl_module, explore)
  | [2024-10-13T07:57:52Z] (IMPALA pid=7862)   File "/rayci/python/ray/rllib/connectors/module_to_env/get_actions.py", line 78, in _get_actions
  | [2024-10-13T07:57:52Z] (IMPALA pid=7862)     action_dist = action_dist_class.from_logits(
  | [2024-10-13T07:57:52Z] (IMPALA pid=7862)   File "/rayci/python/ray/rllib/models/torch/torch_distributions.py", line 148, in from_logits
  | [2024-10-13T07:57:52Z] (IMPALA pid=7862)     return TorchCategorical(logits=logits, **kwargs)
  | [2024-10-13T07:57:52Z] (IMPALA pid=7862)   File "/rayci/python/ray/rllib/models/torch/torch_distributions.py", line 116, in __init__
  | [2024-10-13T07:57:52Z] (IMPALA pid=7862)     super().__init__(logits=logits, probs=probs)
  | [2024-10-13T07:57:52Z] (IMPALA pid=7862)   File "/rayci/python/ray/rllib/models/torch/torch_distributions.py", line 28, in __init__
  | [2024-10-13T07:57:52Z] (IMPALA pid=7862)     self._dist = self._get_torch_distribution(*args, **kwargs)
  | [2024-10-13T07:57:52Z] (IMPALA pid=7862)   File "/rayci/python/ray/rllib/models/torch/torch_distributions.py", line 128, in _get_torch_distribution
  | [2024-10-13T07:57:52Z] (IMPALA pid=7862)     return torch.distributions.categorical.Categorical(logits=logits, probs=probs)
  | [2024-10-13T07:57:52Z] (IMPALA pid=7862)   File "/opt/miniconda/lib/python3.9/site-packages/torch/distributions/categorical.py", line 70, in __init__
  | [2024-10-13T07:57:52Z] (IMPALA pid=7862)     super().__init__(batch_shape, validate_args=validate_args)
  | [2024-10-13T07:57:52Z] (IMPALA pid=7862)   File "/opt/miniconda/lib/python3.9/site-packages/torch/distributions/distribution.py", line 68, in __init__
  | [2024-10-13T07:57:52Z] (IMPALA pid=7862)     raise ValueError(
  | [2024-10-13T07:57:52Z] (IMPALA pid=7862) ValueError: Expected parameter logits (Tensor of shape (1, 1, 2)) of distribution Categorical(logits: torch.Size([1, 1, 2])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
  | [2024-10-13T07:57:52Z] (IMPALA pid=7862) tensor([[[nan, nan]]])), taking actor 1 out of service.
  | [2024-10-13T07:57:52Z] (pid=8082) /opt/miniconda/lib/python3.9/site-packages/pygame/pkgdata.py:25: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  | [2024-10-13T07:57:52Z] (pid=8082)   from pkg_resources import resource_stream, resource_exists
  | [2024-10-13T07:57:52Z] (pid=8082) /opt/miniconda/lib/python3.9/site-packages/pkg_resources/__init__.py:2563: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`. [repeated 5x across cluster]
  | [2024-10-13T07:57:52Z] (pid=8082) Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages [repeated 13x across cluster]
  | [2024-10-13T07:57:52Z] (pid=8082)   declare_namespace(pkg) [repeated 10x across cluster]
  | [2024-10-13T07:57:52Z] (pid=8082) /opt/miniconda/lib/python3.9/site-packages/pkg_resources/__init__.py:3154: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google.logging')`.
  | [2024-10-13T07:57:52Z] (pid=8082)   declare_namespace(parent) [repeated 2x across cluster]
  | [2024-10-13T07:57:52Z] (pid=8082) /opt/miniconda/lib/python3.9/site-packages/pkg_resources/__init__.py:3154: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('mpl_toolkits')`.
  | [2024-10-13T07:57:52Z] (pid=8082) /opt/miniconda/lib/python3.9/site-packages/pkg_resources/__init__.py:3154: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('com')`. [repeated 4x across cluster]
  | [2024-10-13T07:57:52Z] (pid=8082) /opt/miniconda/lib/python3.9/site-packages/pkg_resources/__init__.py:3154: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('sphinxcontrib')`.
  | [2024-10-13T07:57:52Z] (pid=8082) /opt/miniconda/lib/python3.9/site-packages/google/rpc/__init__.py:20: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google.rpc')`.
  | [2024-10-13T07:57:52Z] (pid=8082)   pkg_resources.declare_namespace(__name__)

which looks completely unrelated to my change.

@rynewang
Copy link
Contributor

rllib CI is flaky these days. triggered a retry

@rynewang rynewang merged commit aca67b3 into ray-project:master Oct 14, 2024
6 checks passed
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
This PR followup for comment
ray-project#47793 (comment),
and adds a thread checking to GCS job manager callback to make sure no
concurrent access for data members.

Signed-off-by: dentiny <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
This PR followup for comment
ray-project#47793 (comment),
and adds a thread checking to GCS job manager callback to make sure no
concurrent access for data members.

Signed-off-by: dentiny <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
This PR followup for comment
ray-project#47793 (comment),
and adds a thread checking to GCS job manager callback to make sure no
concurrent access for data members.

Signed-off-by: dentiny <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
This PR followup for comment
ray-project#47793 (comment),
and adds a thread checking to GCS job manager callback to make sure no
concurrent access for data members.

Signed-off-by: dentiny <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
This PR followup for comment
ray-project#47793 (comment),
and adds a thread checking to GCS job manager callback to make sure no
concurrent access for data members.

Signed-off-by: dentiny <[email protected]>
Signed-off-by: ujjawal-khare <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants