Process hanging around after container stopped #1489

Open
luizkowalski opened this issue Sep 17, 2024 · 3 comments

@luizkowalski
Contributor

hey (again) 👋🏻

I'm seeing some strange behavior after a deployment with Kamal: the old process is still showing up, even though the container has stopped.

[screenshot: the old process still listed after its container stopped]

There are a number of old job containers, and they are all stopped:

CONTAINER ID   IMAGE                                                                                          COMMAND                  CREATED          STATUS                        PORTS            NAMES
7d91ae5526c8   luizkowalski/sumiu-app:6563f795f6a4b489e0e4bea52b682724a69c247c_uncommitted_62ecdc8cec127b89   "/rails/bin/docker-e…"   3 minutes ago    Up 3 minutes                  80/tcp           sumiu-job-6563f795f6a4b489e0e4bea52b682724a69c247c_uncommitted_62ecdc8cec127b89
b1b63fba1d87   luizkowalski/sumiu-app:6563f795f6a4b489e0e4bea52b682724a69c247c_uncommitted_62ecdc8cec127b89   "/rails/bin/docker-e…"   4 minutes ago    Up 4 minutes (healthy)        80/tcp           sumiu-web-6563f795f6a4b489e0e4bea52b682724a69c247c_uncommitted_62ecdc8cec127b89
90854613ed86   luizkowalski/sumiu-app:6563f795f6a4b489e0e4bea52b682724a69c247c_uncommitted_c84c94f7475e371d   "/rails/bin/docker-e…"   16 minutes ago   Exited (137) 3 minutes ago                     sumiu-job-6563f795f6a4b489e0e4bea52b682724a69c247c_uncommitted_c84c94f7475e371d
daaa1f28cb1b   luizkowalski/sumiu-app:6563f795f6a4b489e0e4bea52b682724a69c247c_uncommitted_c84c94f7475e371d   "/rails/bin/docker-e…"   17 minutes ago   Exited (137) 3 minutes ago                     sumiu-web-6563f795f6a4b489e0e4bea52b682724a69c247c_uncommitted_c84c94f7475e371d
ff4573af4493   4c6c00662b98                                                                                   "/rails/bin/docker-e…"   2 days ago       Exited (137) 16 minutes ago                    sumiu-job-latest
5cec3bfd1d0f   4c6c00662b98                                                                                   "/rails/bin/docker-e…"   2 days ago       Exited (137) 2 days ago                        sumiu-job-latest_replaced_40cad9c926616e5c
171ee6e3ce16   luizkowalski/sumiu-app:6563f795f6a4b489e0e4bea52b682724a69c247c                                "/rails/bin/docker-e…"   2 days ago       Exited (137) 2 days ago                        sumiu-job-6563f795f6a4b489e0e4bea52b682724a69c247c

After a couple of minutes, the old process disappears automagically.

Before I dig deeper into it, I need to ask a couple of questions:

  1. Is this a new behavior? The old container stops gracefully, from what I can tell, and I don't see any exceptions in its logs. It just stops and that's it.

  2. Is this view purely for presentation purposes, or does it affect processing (e.g. jobs being enqueued by this "ghost" process)?

  3. I initially suspected it had something to do with Kamal and its healthcheck, but I see from the logs that the container is indeed passing the healthcheck, and the old container is stopped with docker stop job-container-name. Could it have something to do with the way docker stops containers and the way GoodJob traps the signal? I don't know if I'm making sense hahahah

@bensheldon
Owner

Could it have something to do with the way docker stops and the way Good Job traps the signal?

I imagine this is the problem: Docker is sending a SIGKILL, which abruptly interrupts the process rather than giving it a chance to shut down gracefully.

I did move GoodJob over entirely to a heartbeat model for detecting active processes (previously the process would take an Advisory Lock: #1451). A heartbeat results in a slightly slower cleanup of process records... but that's more of an accounting issue. The underlying problem is that the process is being stopped ungracefully.
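Roughly, the heartbeat model works like the sketch below. This is just an illustration with made-up model and column names, not GoodJob's actual code: each process periodically refreshes a timestamp on its own row, and rows whose heartbeat has gone stale are only swept on a later pass, which is why a killed process can linger in the list for a few minutes before disappearing.

# Illustration only -- hypothetical names, not GoodJob's implementation.
# A process row is "alive" while its heartbeat is fresh; a SIGKILL-ed
# process simply stops beating and is reaped later.
require "socket"

class WorkerProcess < ActiveRecord::Base
  STALE_AFTER = 5 * 60 # seconds without a heartbeat before reaping

  def self.register!
    create!(hostname: Socket.gethostname, pid: Process.pid, heartbeat_at: Time.current)
  end

  def beat!
    update_column(:heartbeat_at, Time.current)
  end

  def self.reap_stale!
    where("heartbeat_at < ?", Time.current - STALE_AFTER).delete_all
  end
end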

Do you know what signal is being sent to the processes to shut them down? https://github.com/bensheldon/good_job?tab=readme-ov-file#Interrupts-graceful-shutdown-and-SIGKILL
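For illustration, graceful shutdown depends on the process actually receiving TERM (or INT). A minimal Ruby sketch of the pattern (illustration only, not GoodJob's actual code) looks like the following; a SIGKILL can't be trapped at all, so in that case none of it runs and the heartbeat just stops.

# Minimal sketch of SIGTERM-based graceful shutdown (illustration only).
shutdown = false
Signal.trap("TERM") { shutdown = true } # docker stop sends TERM first
Signal.trap("INT")  { shutdown = true }

sleep 0.1 until shutdown                # stand-in for the scheduler's work loop
puts "caught TERM/INT, shutting down cleanly"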

@luizkowalski
Contributor Author

here is what I've found so far:

  • Kamal uses docker stop with a timeout of 30s
Running docker container ls --all --filter name=^sumiu-job-f11a5b84a8dbfcbd5972457fa596059710dea2d3_uncommitted_c517826e379b8498$ --quiet | xargs docker stop -t 30 on web

INFO [16b022c8] Finished in 12.525 seconds with exit status 0 (successful).
  • docker stop sends SIGTERM, and after a grace period, SIGKILL
  • When I manually run docker stop, it takes no more than ~10 seconds for the container to stop, so there is no way the container is receiving a SIGKILL
  • Checking the logs after the container has stopped shows no mention of a shutdown. These are the last lines of docker logs:
I, [2024-10-09T21:15:00.156245 #1]  INFO -- : [GoodJob] [1] [GoodJob::Scheduler(queues=urgent max_threads=5)-thread-34] Executed GoodJob 69cd2dc4-7901-404f-9836-53d51e1f446e
I, [2024-10-09T21:20:00.016480 #1]  INFO -- : [ActiveJob] Enqueued CaptureQueryStatsJob (Job ID: 72f0fc26-6e24-4b28-85d6-5cad0d1a08db) to GoodJob(urgent)
I, [2024-10-09T21:20:00.047545 #1]  INFO -- : [ActiveJob] [CaptureQueryStatsJob] [72f0fc26-6e24-4b28-85d6-5cad0d1a08db] Performing CaptureQueryStatsJob (Job ID: 72f0fc26-6e24-4b28-85d6-5cad0d1a08db) from GoodJob(urgent) enqueued at 2024-10-09T21:20:00.004459967Z
I, [2024-10-09T21:20:00.121179 #1]  INFO -- : [ActiveJob] [CaptureQueryStatsJob] [72f0fc26-6e24-4b28-85d6-5cad0d1a08db] Performed CaptureQueryStatsJob (Job ID: 72f0fc26-6e24-4b28-85d6-5cad0d1a08db) from GoodJob(urgent) in 73.77ms
I, [2024-10-09T21:20:00.122605 #1]  INFO -- : [GoodJob] [1] [GoodJob::Scheduler(queues=urgent max_threads=5)-thread-36] Executed GoodJob 72f0fc26-6e24-4b28-85d6-5cad0d1a08db
  • docker inspect looks OK too, even though the exit code is 137 (decoded below)
"State": {
    "Status": "exited",
    "Running": false,
    "Paused": false,
    "Restarting": false,
    "OOMKilled": false,
    "Dead": false,
    "Pid": 0,
    "ExitCode": 137,
    "Error": "",
    "StartedAt": "2024-10-09T18:26:38.770016471Z",
    "FinishedAt": "2024-10-09T21:22:42.292104872Z"
},
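For reference, a container exit code above 128 decodes as 128 + signal number, so 137 corresponds to signal 9, i.e. SIGKILL. A quick sanity check in Ruby:

exit_code = 137
Signal.signame(exit_code - 128) # => "KILL"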

@luizkowalski
Contributor Author

luizkowalski commented Oct 11, 2024

ok, back at it, debugging some more and I think I'm close to finding the actual problem.
so GoodJob is indeed trapping the signals correctly and there is no reason for it not to work... unless GoodJob is not receiving the signal at all

there are more people with similar problems, apparently, and I think it all boils down to another process "wrapping" the GoodJob process and not passing the signal down. I thought that would be weird because I'm explicitly setting the command in my Kamal config:

job:
  hosts:
  - web
  env:
    clear:
      CRON_ENABLED: 1
      GOOD_JOB_MAX_THREADS: 10 # Number of threads to process background jobs
      DB_POOL: <%= 10 + 1 + 2 %> # GOOD_JOB_MAX_THREADS + 1 + 2
  cmd: bin/good_job start --probe-port=7001 
  options:
    health-cmd: curl -f http://localhost:7001/status/connected || exit 1
    health-retries: 15
    health-interval: 5s

Running docker top to confirm shows me this:

kamal@web:~$ docker top a7bd263ec291
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
kamal               4163342             4163321             3                   15:59               ?                   00:00:09            ruby /rails/bin/good_job start --probe-port=7001

Apparently, ruby is prepended to the cmd and could be the cause. I do have an ENTRYPOINT script in my Dockerfile like this:

#!/bin/bash -e

# If running the rails server then create or migrate existing database
if [ "${*}" == "bundle exec thrust ./bin/rails server" ]; then
  ./bin/rails db:prepare
fi

# Replace the shell with the given command so signals go straight to it
exec "${@}"

which I also tried removing, but that didn't change anything.

so, tldr:

  • I guess the SIGTERM is not being captured by GJ
  • GJ might not be the problem
  • the ruby command prepended to the cmd could be swallowing the signal

Does that make any sense? If so, it might be a Kamal thing and I will follow up with an issue on their side.
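In the meantime, one throwaway check I can run (a hypothetical diagnostic script, nothing to do with GoodJob or Kamal itself): point the job role's cmd at a tiny Ruby script that only logs the signals it receives, then docker stop the container. If "got SIGTERM" never shows up in the logs, the signal is not reaching PID 1.

# signal_probe.rb -- hypothetical throwaway diagnostic
puts "probe running as PID #{Process.pid}" # should be 1 inside the container
%w[TERM INT].each do |sig|
  Signal.trap(sig) do
    STDOUT.write("got SIG#{sig}\n") # keep the handler trivial
    exit 0
  end
end
sleep # block forever until a trapped signal arrives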
