After an experiment has been running for a week, it fails with the following error:
[2024-08-05 14:54:37] [8b3f458a] [rank=0] Traceback (most recent call last):
[2024-08-05 14:54:37] [8b3f458a] [rank=0] File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/pytorch/_trainer.py", line 310, in init
[2024-08-05 14:54:37] [8b3f458a] [rank=0] yield context
[2024-08-05 14:54:37] [8b3f458a] [rank=0] File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/exec/harness.py", line 177, in _run_pytorch_trial
[2024-08-05 14:54:37] [8b3f458a] [rank=0] trainer.fit(
[2024-08-05 14:54:37] [8b3f458a] [rank=0] File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/pytorch/_trainer.py", line 203, in fit
[2024-08-05 14:54:37] [8b3f458a] [rank=0] trial_controller.run()
[2024-08-05 14:54:37] [8b3f458a] [rank=0] File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/pytorch/_pytorch_trial.py", line 615, in run
[2024-08-05 14:54:37] [8b3f458a] [rank=0] self._run()
[2024-08-05 14:54:37] [8b3f458a] [rank=0] File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/pytorch/_pytorch_trial.py", line 650, in _run
[2024-08-05 14:54:37] [8b3f458a] [rank=0] self._train_for_op(
[2024-08-05 14:54:37] [8b3f458a] [rank=0] File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/pytorch/_pytorch_trial.py", line 775, in _train_for_op
[2024-08-05 14:54:37] [8b3f458a] [rank=0] self._report_searcher_progress(op, self.searcher_unit)
[2024-08-05 14:54:37] [8b3f458a] [rank=0] File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/pytorch/_pytorch_trial.py", line 521, in _report_searcher_progress
[2024-08-05 14:54:37] [8b3f458a] [rank=0] op.report_progress(self.state.batches_trained)
[2024-08-05 14:54:37] [8b3f458a] [rank=0] File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/core/_searcher.py", line 87, in report_progress
[2024-08-05 14:54:37] [8b3f458a] [rank=0] self._session.post(
[2024-08-05 14:54:37] [8b3f458a] [rank=0] File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/common/api/_session.py", line 212, in post
[2024-08-05 14:54:37] [8b3f458a] [rank=0] return self._do_request("POST", path, params, json, data, headers, timeout, False)
[2024-08-05 14:54:37] [8b3f458a] [rank=0] File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/common/api/_session.py", line 173, in _do_request
[2024-08-05 14:54:37] [8b3f458a] [rank=0] raise errors.UnauthenticatedException()
[2024-08-05 14:54:37] [8b3f458a] [rank=0] determined.common.api.errors.UnauthenticatedException: Unauthenticated: Please use 'det user login <username>' for password login, or for Enterprise users logging in with an SSO provider, use 'det auth login --provider=<provider>'.
The automatic retries will also fail:
[2024-08-05 14:58:32] [d2bcf554] Traceback (most recent call last):
[2024-08-05 14:58:32] [d2bcf554] File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[2024-08-05 14:58:32] [d2bcf554] return _run_code(code, main_globals, None,
[2024-08-05 14:58:32] [d2bcf554] File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
[2024-08-05 14:58:32] [d2bcf554] exec(code, run_globals)
[2024-08-05 14:58:32] [d2bcf554] File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/exec/prep_container.py", line 324, in <module>
[2024-08-05 14:58:32] [d2bcf554] download_context_directory(sess, info)
[2024-08-05 14:58:32] [d2bcf554] File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/exec/prep_container.py", line 29, in download_context_directory
[2024-08-05 14:58:32] [d2bcf554] b64_tgz = bindings.get_GetTaskContextDirectory(sess, taskId=info.task_id).b64Tgz
[2024-08-05 14:58:32] [d2bcf554] File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/common/api/bindings.py", line 19363, in get_GetTaskContextDirectory
[2024-08-05 14:58:32] [d2bcf554] _resp = session._do_request(
[2024-08-05 14:58:32] [d2bcf554] File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/common/api/_session.py", line 173, in _do_request
[2024-08-05 14:58:32] [d2bcf554] raise errors.UnauthenticatedException()
[2024-08-05 14:58:32] [d2bcf554] determined.common.api.errors.UnauthenticatedException: Unauthenticated: Please use 'det user login <username>' for password login, or for Enterprise users logging in with an SSO provider, use 'det auth login --provider=<provider>'.
I have not looked into this deeply, but it could be related to the following refactor: #8347
It may also be related to the session duration set at:
determined/master/internal/user/postgres_users.go
Line 24 in 3a91552
After forking the failed experiment, the fork runs without authentication issues, again for a week.
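The pattern described above (failure after a week, retries that also fail, a fork that works for another week) is what one would expect if the task keeps using a session with a fixed one-week lifetime that is only renewed when a new experiment is launched. The following is a minimal sketch of that hypothesis, not Determined's actual implementation: `Session`, `launch`, `request`, and `SESSION_DURATION` are hypothetical names, and the seven-day value is an assumption based on the observed timing.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumption: a fixed one-week session lifetime, chosen to match the
# failure time reported in this issue.
SESSION_DURATION = timedelta(days=7)


@dataclass
class Session:
    """Hypothetical stand-in for a server-side session record."""

    issued_at: datetime

    def is_valid(self, now: datetime) -> bool:
        return now < self.issued_at + SESSION_DURATION


def launch(now: datetime) -> Session:
    """Launching (or forking) an experiment issues a fresh session."""
    return Session(issued_at=now)


def request(session: Session, now: datetime) -> str:
    """Any API call made with the session; rejected once it has expired."""
    if not session.is_valid(now):
        raise PermissionError("Unauthenticated")
    return "ok"


# Arbitrary illustrative start time.
start = datetime(2024, 1, 1, 12, 0)
day_6 = start + timedelta(days=6)
day_8 = start + timedelta(days=8)

original = launch(start)
request(original, day_6)  # still inside the session lifetime

try:
    # A retry that reuses the original session fails the same way.
    request(original, day_8)
except PermissionError as exc:
    print(f"retry failed: {exc}")

# A fork is a new launch, so it receives a new session.
fork = launch(day_8)
request(fork, day_8)
```

If this hypothesis holds, the automatic retries fail because they reuse the expired session rather than obtaining a new one, while a fork starts with a fresh session and therefore runs for another full lifetime.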
The experiment should continue running without raising an exception.
Determined version 0.33.0
Thank you for the report. We believe it is a regression, and we'll try to address it as soon as possible.
#9860