
Allow hooking job failure for generic error handling #205

Merged: 7 commits, Sep 23, 2024

Conversation

@NeoLegends (Contributor) commented Aug 19, 2024:

Closes #179
Closes #204

now testing this

@NeoLegends NeoLegends added the enhancement New feature or request label Aug 19, 2024
@NeoLegends NeoLegends self-assigned this Aug 19, 2024
def handle_job_failure(self, prev_jobs: Dict[str, List[Job]], cur_jobs: Dict[str, List[Job]]):
    prev_jobs = set(prev_jobs.get(gs.STATE_ERROR, []))
    for job in cur_jobs.get(gs.STATE_ERROR, []):
        if job not in prev_jobs:

Contributor:
Is it possible that this line should be if job not in prev_jobs.get(gs.STATE_ERROR, []):?

Contributor Author (NeoLegends):
I think overwriting that variable is just confusing and not what I intended. I've since changed the var names, so it should be clearer now.

Contributor:
I missed that, but yes it's better to use a different name.

@critias (Contributor) left a comment:

Add another docstring, but besides that it looks good to me.

About the proposed usage in the linked issues: it might be worth having a more general way to detect broken nodes and exclude them for a certain time, instead of adding this to each job individually 🤔
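
For illustration, such a temporary exclusion could be tracked by a small helper that remembers which nodes recently produced failures and skips them for a cooldown period. This is only a sketch; BrokenNodeTracker, its names, and its integration point are hypothetical and not an existing sisyphus API:

import time


class BrokenNodeTracker:
    """Hypothetical helper: remember nodes that recently produced failures
    and report whether they should be excluded from scheduling for a while."""

    def __init__(self, cooldown: float = 3600.0):
        self.cooldown = cooldown  # seconds a node stays excluded after a failure
        self._last_failure = {}  # node name -> time of last observed failure

    def report_failure(self, node: str):
        self._last_failure[node] = time.monotonic()

    def is_excluded(self, node: str) -> bool:
        last = self._last_failure.get(node)
        if last is None:
            return False
        if time.monotonic() - last > self.cooldown:
            del self._last_failure[node]  # cooldown expired, allow the node again
            return False
        return True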

@@ -572,6 +576,12 @@ def maybe_clear_state(state, always_clear, action):
        self.job_cleaner.start()
        return True

    def handle_job_failure(self, prev_jobs: Dict[str, List[Job]], cur_jobs: Dict[str, List[Job]]):

Contributor:
Please add a short doc string here as well.

logic.

Sisyphus will call this function w/ the job instance if the job enters the
failure state. The callback itself is then responsible for any retry logic,

Contributor:

Suggested change:
- failure state. The callback itself is then responsible for any retry logic,
+ error state. The callback itself is then responsible for any retry logic,

I think it is called "error" everywhere else. Cf. also "interrupted_resumable" and "interrupted_non_resumable", which I would also call failures, but they are not handled here.

Maybe the function should be renamed as well.

Comment on lines 607 to 608
prev_jobs = self.jobs
cur_jobs = self.update_jobs()

Contributor:
How will this interact with the block directly below? self.clear_states removes some errors, calls self.update_jobs(), and then cycles the outer loop again. This would mess up the logic to detect new jobs in the error state.

Contributor Author (NeoLegends):
This is OK: if jobs were cleared, the loop cycles around before any of this logic is run.

for job in cur_jobs.get(gs.STATE_ERROR, []):
    if job not in prev_errored_jobs:
        gs.on_job_failure(job)

Contributor:
Would you call self.update_jobs() after finishing the loop, to account for changes in the state of the jobs?

Contributor Author (NeoLegends):
Not needed, since this will loop around immediately when the processing is done.

def handle_job_failure(self, prev_jobs: Dict[str, List[Job]], cur_jobs: Dict[str, List[Job]]):
    prev_errored_jobs = set(prev_jobs.get(gs.STATE_ERROR, []))
    for job in cur_jobs.get(gs.STATE_ERROR, []):
        if job not in prev_errored_jobs:

Contributor:
Do we need to keep track of which jobs are newly added to the error state, or could this be a stateless function that we always apply to all errored jobs?

Contributor Author (NeoLegends):
It could just as well be stateless.
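
For illustration, a stateless variant would simply invoke the hook for every job currently in the error state on each manager cycle; the user callback then has to be idempotent (e.g. via the ignore_error_cache pattern shown later in this thread). A minimal sketch:

def handle_job_failure(self, cur_jobs: Dict[str, List[Job]]):
    # Stateless variant: no tracking of previously errored jobs.
    # gs.on_job_failure is then invoked repeatedly for the same job on every
    # cycle until its error is cleared, so the callback must be idempotent.
    for job in cur_jobs.get(gs.STATE_ERROR, []):
        gs.on_job_failure(job)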

@NeoLegends NeoLegends marked this pull request as ready for review September 23, 2024 14:12

@NeoLegends (Contributor Author):
I tested this and it works well. Do you think it's worthwhile adding default handlers that check for e.g. certain log file substrings and then automatically clear the error if they are present?
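
For illustration, such a default handler could be built as a small factory that scans the run log for given substrings and clears the same marker files as the settings.py snippet further down in this thread. restart_on_log_substrings is a hypothetical name, not an existing sisyphus API; the log path, job.work_path(), and the cleared marker files are taken from that snippet:

import os
from typing import Callable, Sequence


def restart_on_log_substrings(substrings: Sequence[str]) -> Callable:
    """Hypothetical default handler factory: clear a job's error if its run
    log contains any of the given substrings (case-insensitive)."""

    lowered = [s.lower() for s in substrings]

    def handler(job) -> None:
        log_file_path = os.path.join(job.work_path(), "../log.run.1")
        try:
            with open(log_file_path, "rt", errors="ignore") as log_file:
                if not any(s in line.lower() for line in log_file for s in lowered):
                    return
        except FileNotFoundError:
            return
        # Remove the error marker and submit log so the manager re-submits the
        # job, mirroring the manual snippet below.
        for f in [
            os.path.join(job.work_path(), "../error.run.1"),
            os.path.join(job.work_path(), "../submit_log.run"),
        ]:
            try:
                os.remove(f)
            except FileNotFoundError:
                pass

    return handler


# e.g. in settings.py:
# on_job_failure = restart_on_log_substrings(["cuda error"])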

@michelwi (Contributor) left a comment:
I guess some changes can now be removed again

(3 review threads on sisyphus/manager.py, marked outdated and resolved)

@NeoLegends (Contributor Author) commented Sep 23, 2024:

As a first approximation of a proper error-handling implementation for the CUDA errors I'm encountering at the moment, adding this snippet to settings.py works reasonably well:

import gzip
import logging
import os
import shutil

# job ids whose error we have decided to leave for manual inspection
ignore_error_cache = set()

def on_job_failure(job: "Job"):
    from i6_core.returnn import ReturnnTrainingJob

    if not isinstance(job, ReturnnTrainingJob):
        logging.debug(f"{job.job_id()}: error, but not a {ReturnnTrainingJob.__name__}, so not doing anything.")
        return
    elif job.job_id() in ignore_error_cache:
        return

    log_file_path = os.path.join(job.work_path(), "../log.run.1")
    with open(log_file_path, "rt") as log_file:
        is_cuda_err = any("cuda error" in line.lower() for line in log_file)

    if not is_cuda_err:
        logging.debug(f"{job.job_id()}: died but probably not due to a CUDA error, better go check by hand.")
        ignore_error_cache.add(job.job_id())
        return

    logging.info(f"{job.job_id()}: CUDA 💥, re-starting... 🔁")

    # archive log file
    i = 1
    cleared_log_path = None
    while cleared_log_path is None or os.path.exists(cleared_log_path):
        cleared_log_path = os.path.join(job.work_path(), f"../log.run.cuda-cleared.{i:04}.gz")
        i += 1
    with open(log_file_path, "rb") as log_in, gzip.open(cleared_log_path, "wb") as log_out:
        shutil.copyfileobj(log_in, log_out)
    os.remove(log_file_path)

    # re-schedule job
    for f in [
        os.path.join(job.work_path(), "../error.run.1"),
        os.path.join(job.work_path(), "../submit_log.run"),
    ]:
        try:
            os.remove(f)
        except FileNotFoundError:
            pass

@NeoLegends NeoLegends merged commit c7de85e into master Sep 23, 2024
3 checks passed
@NeoLegends NeoLegends deleted the moritz-job-failure-hook branch September 23, 2024 15:29

@albertz (Member) commented Oct 4, 2024:

I wonder: this callback is called all the time, not just once on job failure? This is a bit unexpected to me, and it also makes the logic much more complicated on the user side, e.g. you need to add this ignore_error_cache logic here. Which is also wrong, because once the user clears the error for this job and it continues to run, it might later run into a CUDA error, but then you would ignore it, because you never clear the ignore_error_cache here.

@albertz (Member) commented Oct 4, 2024:

Well, OK, checking the mtime of the error file would probably be better, if you want to keep the callback logic this way. Like:

import os

ignore_error_cache = {}  # job_id -> err_mtime


# https://github.com/rwth-i6/sisyphus/pull/205#issuecomment-2368527715
def on_job_failure(job: "Job"):
    import logging
    import gzip
    from i6_core.returnn import ReturnnTrainingJob

    if not isinstance(job, ReturnnTrainingJob):
        return

    try:
        err_mtime = os.path.getmtime(os.path.join(job.work_path(), "../error.run.1"))
    except FileNotFoundError:
        return  # maybe was already cleared
    if ignore_error_cache.get(job.job_id()) == err_mtime:
        return

    log_file_path = os.path.join(job.work_path(), "../log.run.1")
    with open(log_file_path, "rt") as log_file:
        is_cuda_err = any(("cuda error" in line.lower() or "cuFFT error" in line) for line in log_file)

    if not is_cuda_err:
        logging.debug(f"{job.job_id()}: died but probably not due to a CUDA error, better go check by hand.")
        ignore_error_cache[job.job_id()] = err_mtime
        return

    ...

Labels: enhancement (New feature or request)

Successfully merging this pull request may close these issues:
- Hooking job exit/failure
- Auto-restart jobs on user-specified error conditions

4 participants