🐛[bug] Experiments fail after running for a week #9856

Open
hilvi opened this issue Aug 22, 2024 · 2 comments

hilvi commented Aug 22, 2024

Describe the bug

After running for about a week, experiments fail with the following error:

[2024-08-05 14:54:37] [8b3f458a] [rank=0] Traceback (most recent call last):
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/pytorch/_trainer.py", line 310, in init
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]     yield context
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/exec/harness.py", line 177, in _run_pytorch_trial
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]     trainer.fit(
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/pytorch/_trainer.py", line 203, in fit
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]     trial_controller.run()
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/pytorch/_pytorch_trial.py", line 615, in run
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]     self._run()
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/pytorch/_pytorch_trial.py", line 650, in _run
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]     self._train_for_op(
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/pytorch/_pytorch_trial.py", line 775, in _train_for_op
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]     self._report_searcher_progress(op, self.searcher_unit)
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/pytorch/_pytorch_trial.py", line 521, in _report_searcher_progress
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]     op.report_progress(self.state.batches_trained)
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/core/_searcher.py", line 87, in report_progress
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]     self._session.post(
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/common/api/_session.py", line 212, in post
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]     return self._do_request("POST", path, params, json, data, headers, timeout, False)
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]   File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/common/api/_session.py", line 173, in _do_request
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0]     raise errors.UnauthenticatedException()
<none> [2024-08-05 14:54:37] [8b3f458a] [rank=0] determined.common.api.errors.UnauthenticatedException: Unauthenticated: Please use 'det user login <username>' for password login, or for Enterprise users logging in with an SSO provider, use 'det auth login --provider=<provider>'.

The automatic retries also fail:

[2024-08-05 14:58:32] [d2bcf554] Traceback (most recent call last):
<none> [2024-08-05 14:58:32] [d2bcf554]   File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
<none> [2024-08-05 14:58:32] [d2bcf554]     return _run_code(code, main_globals, None,
<none> [2024-08-05 14:58:32] [d2bcf554]   File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
<none> [2024-08-05 14:58:32] [d2bcf554]     exec(code, run_globals)
<none> [2024-08-05 14:58:32] [d2bcf554]   File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/exec/prep_container.py", line 324, in <module>
<none> [2024-08-05 14:58:32] [d2bcf554]     download_context_directory(sess, info)
<none> [2024-08-05 14:58:32] [d2bcf554]   File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/exec/prep_container.py", line 29, in download_context_directory
<none> [2024-08-05 14:58:32] [d2bcf554]     b64_tgz = bindings.get_GetTaskContextDirectory(sess, taskId=info.task_id).b64Tgz
<none> [2024-08-05 14:58:32] [d2bcf554]   File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/common/api/bindings.py", line 19363, in get_GetTaskContextDirectory
<none> [2024-08-05 14:58:32] [d2bcf554]     _resp = session._do_request(
<none> [2024-08-05 14:58:32] [d2bcf554]   File "/run/determined/pythonuserbase/lib/python3.10/site-packages/determined/common/api/_session.py", line 173, in _do_request
<none> [2024-08-05 14:58:32] [d2bcf554]     raise errors.UnauthenticatedException()
<none> [2024-08-05 14:58:32] [d2bcf554] determined.common.api.errors.UnauthenticatedException: Unauthenticated: Please use 'det user login <username>' for password login, or for Enterprise users logging in with an SSO provider, use 'det auth login --provider=<provider>'.

I have not looked into this too deeply, but it could be related to the following refactor:
#8347

And the session duration set at:

SessionDuration = 7 * 24 * time.Hour

After forking the failed experiment, the fork runs without authentication issues, again for a week.
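
To make the suspected mechanism concrete, here is a minimal Python sketch of the hypothesis above. It is a toy model, not Determined's code: `SessionToken`, `do_request`, `simulate`, and `UnauthenticatedError` are hypothetical names, and it assumes (unconfirmed) that a token is issued once when the experiment starts, carries a fixed 7-day lifetime, and is reused by the automatic retries, while a fork receives a fresh token.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Mirrors the value quoted in the issue: SessionDuration = 7 * 24 * time.Hour
SESSION_DURATION = timedelta(days=7)


class UnauthenticatedError(Exception):
    """Hypothetical stand-in for determined's UnauthenticatedException."""


@dataclass(frozen=True)
class SessionToken:
    """Hypothetical token whose lifetime is fixed at issue time."""

    issued_at: datetime

    @property
    def expires_at(self) -> datetime:
        return self.issued_at + SESSION_DURATION


def do_request(token: SessionToken, now: datetime) -> str:
    """Simulate an authenticated API call made at time `now`."""
    if now >= token.expires_at:
        raise UnauthenticatedError("session token expired")
    return "ok"


def simulate(start: datetime) -> dict:
    """Walk through the timeline described in the issue."""
    token = SessionToken(issued_at=start)  # assumed: issued at experiment start
    results = {"day_6": do_request(token, start + timedelta(days=6))}

    after_week = start + timedelta(days=7, minutes=1)
    # Assumed: the automatic retry reuses the same, already expired token.
    for label in ("day_7", "retry"):
        try:
            results[label] = do_request(token, after_week)
        except UnauthenticatedError:
            results[label] = "unauthenticated"

    # Assumed: forking creates a new experiment with a freshly issued token.
    fork_token = SessionToken(issued_at=after_week)
    results["fork"] = do_request(fork_token, after_week + timedelta(days=6))
    return results
```

Under these assumptions, calling `simulate` with any start time reports successful requests on day 6, failures for both the first request after day 7 and the retry, and success again for the fork, which matches the behavior observed above.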

Reproduction Steps

  1. Create a long-running experiment.
  2. Let it run for a week.
  3. The experiment fails with UnauthenticatedException.

Expected Behavior

The experiment should continue running without raising an exception.

Screenshot

Environment

Determined version 0.33.0

Additional Context

No response

@hilvi hilvi added the bug label Aug 22, 2024
ioga (Contributor) commented Aug 22, 2024

Thank you for the report. We believe this is a regression, and we'll try to address it as soon as possible.

ioga (Contributor) commented Aug 23, 2024

#9860
