
Fix: terminate hanging jobs #187

Open · hbruch wants to merge 3 commits into main from fix/terminate-hanging-jobs

Conversation

@hbruch (Collaborator) commented Feb 23, 2025:

This PR

  • adds a script, executed on startup, that terminates runs still in the started state.
  • enables run monitoring to terminate runs that hang during startup/cancellation (after 180 s) or that run for more than 6 h (see the configuration sketch below).
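
For reference, Dagster's run monitoring is configured in dagster.yaml. A minimal sketch matching the timeouts described above — the exact configuration in this PR may differ:

```yaml
# dagster.yaml — sketch only; values mirror the timeouts from the PR description.
run_monitoring:
  enabled: true
  start_timeout_seconds: 180    # terminate runs hanging during startup
  cancel_timeout_seconds: 180   # terminate runs hanging during cancellation
  max_runtime_seconds: 21600    # terminate runs executing longer than 6 h
```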

@hbruch requested a review from the-infinity on February 23, 2025, 11:37.

@the-infinity (Collaborator) commented:
Everything else looks good.

@hbruch requested a review from the-infinity on February 24, 2025, 08:41.

@hbruch force-pushed the fix/terminate-hanging-jobs branch from ed005b1 to dab3b03 on February 24, 2025, 08:53.
```bash
set -x

echo "Starting terminate_starting_and_started_runs in background to terminate orphaned ones."
python /opt/dagster/app/scripts/terminate_starting_and_started_runs.py &
```
Review comment (Member):
The background job (subprocess) started here might fail for any reason, including /opt/dagster/app/scripts/terminate_starting_and_started_runs.py being missing. I'm afraid that this will cause further headaches down the road, at some inconvenient moment, in production. I strongly suggest that we change the script to crash the container (exit non-zero) if this happens.

It seems very hard to get the following behaviour solely with a Bash script:

  1. If the background job fails, also quit the main process.
  2. If the main process exits, don't leave the background job running.
  3. If both run through cleanly, exit as usual.

After experimenting for too long (with many variants of 1, 2 & 3), I propose just using tini in combination with kill "-$$" as a workaround.

```bash
#!/bin/bash

set -eo pipefail

echo 'Running terminate_starting_and_started_runs.py as a background job.' >&2

set -x

# In case the background job fails, we kill the entire shell ($$ has its PID) and all its children (by negating the PID).
# This *does not* work if `$$` evaluates to 1 (our shell is the init process), so we *must* run this script with an "external" init command.
python /opt/dagster/app/scripts/terminate_starting_and_started_runs.py \
	|| kill -TERM -- "-$$" &

exec "$@"
```


```python
def get_run_ids_of_runs(status: list[str], timeout: int = 20) -> list[str]:
```
Review comment (Member):
I think it's likely that we'll want to change this soon, so let's make it configurable via an environment variable.
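
A minimal sketch of that suggestion, assuming a hypothetical variable name TERMINATE_RUNS_TIMEOUT_SECONDS — the actual name and default are up to this PR:

```python
import os

# Hypothetical environment variable; falls back to the previous hard-coded default.
DEFAULT_TIMEOUT_SECONDS = 20
TIMEOUT_SECONDS = int(os.environ.get("TERMINATE_RUNS_TIMEOUT_SECONDS", DEFAULT_TIMEOUT_SECONDS))

def get_run_ids_of_runs(status: list[str], timeout: int = TIMEOUT_SECONDS) -> list[str]:
    ...
```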

```diff
@@ -27,6 +27,10 @@ LABEL org.opencontainers.image.licenses="(EUPL-1.2)"

 EXPOSE 3000

 COPY scripts/ /opt/dagster/app/scripts/
```
Review comment (Member):
(See the comment about background jobs above.)

Suggested change:

```diff
-COPY scripts/ /opt/dagster/app/scripts/
+ARG TARGETARCH
+ENV TINI_VERSION=v0.19.0
+ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini-${TARGETARCH} /tini
+RUN chmod +x /tini && /tini --version
+COPY scripts/ /opt/dagster/app/scripts/
+ENTRYPOINT [ "/tini", "--", "/opt/dagster/app/scripts/start_runs_termination_script_in_background.sh" ]
```

@derhuerst (Member) commented:

I'm afraid that this change will cause more headaches than it saves us. Effectively, we have two race conditions here:

  1. If Dagster and its GraphQL API don't start quickly enough, we either don't clean up hanging jobs (this PR's goal) or we crash (as proposed in my comments).
  2. If the Python script is for some reason too slow to act (it is meant to run right after Dagster has started, before auto-materialization has launched new jobs), it might accidentally kill legitimate jobs.
