Uncontrolled evaluation of periodic batch jobs during DST change #5410

Closed
boardwalk opened this issue Mar 12, 2019 · 6 comments · Fixed by #7894

@boardwalk

Nomad version

Nomad v0.8.7 (21a2d93+CHANGES)

Operating system and Environment details

Red Hat Enterprise Linux Server release 7.6 (Maipo)

Issue

I had a periodic job scheduled during the DST rollover this past weekend. It was scheduled for 2AM local time (ET) but was instead evaluated repeatedly and rapidly starting at 6AM UTC. For reference, ET was UTC-5 before the change and UTC-4 after. The job ran many thousands of times before it was caught, and I believe many times that number of allocations were created that never ended up getting placed (I posted about that in #4532). The flood eventually crippled the cluster: Nomad was tracking so much that the OOM killer came out in force, and even without that, Nomad was mostly unresponsive as far as I can tell. I was able to stop the stage-a-restart-services job, but I also seemed to have to change the eligibility of the nodes to false to get the allocations to drain from thousands per node down to the usual dozen or so.
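The offsets quoted above can be checked with a little arithmetic. The sketch below is illustrative only and is not Nomad code: it hard-codes the two Eastern offsets and assumes the 2019 spring-forward instant was 02:00 EST (07:00 UTC) on 2019-03-10. The helper names are made up for this example. It shows that no instant on that day has a local wall-clock reading of 02:00, and that 06:00 UTC was still 01:00 local time.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
EST = timezone(timedelta(hours=-5), "EST")  # Eastern offset before the change
EDT = timezone(timedelta(hours=-4), "EDT")  # Eastern offset after the change

# Assumption: the 2019 US spring-forward instant, 02:00 EST == 07:00 UTC.
TRANSITION_UTC = datetime(2019, 3, 10, 7, 0, tzinfo=UTC)


def eastern_wall_clock(instant):
    """Return the Eastern wall-clock time in effect at a UTC instant."""
    tz = EST if instant < TRANSITION_UTC else EDT
    return instant.astimezone(tz)


def utc_instants_for_local(hour, minute=0):
    """List every UTC minute on 2019-03-10 (Eastern) whose wall clock reads hour:minute."""
    start = datetime(2019, 3, 10, 5, 0, tzinfo=UTC)  # 00:00 EST on 2019-03-10
    matches = []
    for offset in range(23 * 60):  # the local day is only 23 hours long
        instant = start + timedelta(minutes=offset)
        local = eastern_wall_clock(instant)
        if (local.hour, local.minute) == (hour, minute):
            matches.append(instant)
    return matches


print(utc_instants_for_local(2))  # 02:00 local is skipped, so nothing matches
print(utc_instants_for_local(3))  # 03:00 EDT corresponds to 07:00 UTC
print(eastern_wall_clock(datetime(2019, 3, 10, 6, 0, tzinfo=UTC)))  # 01:00 EST
```

In other words, 6AM UTC only equals 2AM local if the post-change UTC-4 offset is applied before it has taken effect, which matches the time the flood of evaluations started.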

Reproduction steps

Running a job scheduled the same way across the DST change should reproduce this, but I honestly don't have time to do things like bring up another Nomad instance in a VM where I have the privileges to set the date.

Job file (if appropriate)

job "stage-a-restart-services" {
  type = "batch"
  periodic {
    cron = "0 0 2 * * * *"
    time_zone = "Local"
  }
  datacenters = ["a"]
  group "restart-services" {
    task "restart-services" {
      leader = true
      driver = "raw_exec"
      env {
        TIER = "stage"
        SITE = "a"
      }
      config {
        command = "/home/fds/dsotm/FDSdsotm_misc/bin/restart_services"
      }
    }
    task "store-logs" {
      driver = "raw_exec"
      config {
        command = "/home/fds/dsotm/FDSdsotm_misc/bin/store_logs"
        args = ["/home/fds/dsotm/log/${NOMAD_JOB_NAME}"]
      }
    }
  }
}

/home/fds/dsotm/FDSdsotm_misc/bin/restart_services is a short Python script that thankfully didn't actually do anything this time around.

@c2nes

c2nes commented Mar 15, 2019

We experienced the same issue with a periodic job configured with a 30 * * * * schedule and timezone "America/New_York" on Nomad 0.8.6.

The first problematic allocation was started at 2019-03-10T06:30:11.736Z. Once that allocation completed, a new allocation for the same job instance was created. This cycle continued until this morning, when we manually stopped the parent job and the child job instance. Ultimately, over 4000 allocations were created. Virtually all completed successfully.

Attempting to manually stop just the child job (while leaving the parent periodic job registered) was unsuccessful. Nomad would accept the DELETE request, and the child job would sometimes briefly be marked as dead, but would almost immediately return to a running state with all of the old allocations still in place.

After completely stopping the job (both parent and child) we were able to successfully re-register the job. However, it is now not being scheduled for execution at all. Looking at other periodic jobs, the ones that aren't stuck in continuous execution loops appear to not have run at all since the EST/EDT switch over.

@stale

stale bot commented Jun 13, 2019

Hey there

Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.

Thanks!

@stale

stale bot commented Jul 13, 2019

This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍

@notnoop
Contributor

notnoop commented Oct 24, 2019

Thank you so much for reporting this and sorry for taking a long time. We plan to investigate and remedy this soon.

The issue here is that our cron library doesn't handle daylight saving transitions well. We have two complications: the library we use is deprecated and unmaintained[1], and the daylight saving problem is a known unresolved issue[2]. We'll investigate our options and address it soon.

Meanwhile, we recommend using the UTC time zone for periodic jobs, either in general or at least around DST transitions, if possible.
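To illustrate the trade-off of that workaround, here is a small sketch (not Nomad code; the helper names are hypothetical, and the two Eastern offsets and the 2019-03-10 07:00 UTC transition instant are hard-coded assumptions). A job scheduled daily at 07:00 UTC always fires exactly 24 hours apart, but its local wall-clock time shifts from 02:00 to 03:00 once DST begins.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
EST = timezone(timedelta(hours=-5), "EST")  # Eastern offset before the change
EDT = timezone(timedelta(hours=-4), "EDT")  # Eastern offset after the change

# Assumption: the 2019 US spring-forward instant, 02:00 EST == 07:00 UTC.
TRANSITION_UTC = datetime(2019, 3, 10, 7, 0, tzinfo=UTC)


def eastern_wall_clock(instant):
    """Return the Eastern wall-clock time in effect at a UTC instant."""
    tz = EST if instant < TRANSITION_UTC else EDT
    return instant.astimezone(tz)


def daily_utc_runs(first_run, count):
    """Launch times of a job that runs once a day at a fixed UTC time."""
    return [first_run + timedelta(days=i) for i in range(count)]


runs = daily_utc_runs(datetime(2019, 3, 9, 7, 0, tzinfo=UTC), 3)

# Spacing between consecutive launches: constant, with no skipped or repeated hour.
gaps = [later - earlier for earlier, later in zip(runs, runs[1:])]

# The same launches as seen on an Eastern wall clock.
local_times = [eastern_wall_clock(r).strftime("%Y-%m-%d %H:%M %Z") for r in runs]

for run, local in zip(runs, local_times):
    print(run.isoformat(), "->", local)
```

For the job file in the original report, the equivalent change would presumably be time_zone = "UTC" with the hour field adjusted to 7 (or 6 after the change), at the cost of the local start time drifting by an hour twice a year.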

[1] https://github.com/gorhill/cronexpr
[2] gorhill/cronexpr#17

@notnoop
Contributor

notnoop commented May 4, 2020

Providing an update here with my notes.

We have two options:

We can fix the gorhill/cronexpr library to handle DST properly. Sadly, the DST PR gorhill/cronexpr#17 fails some of our tests, as it gets into infinite recursion and causes a stack overflow in some cases.

Alternatively, we can migrate to using another maintained library. https://github.com/robfig/cron is a very reasonable library. Its handling of DST passed our tests. The library is well maintained and commonly used.

The downside of switching libraries is that cronexpr supports some cron expression extensions not supported by any other library I looked at, so we risk introducing subtle compatibility changes:

  • Years: this is a simple thing to add
  • L (last day), W (week day), # (further constraints on days) - these are trickier to implement while ensuring that we adhere to gorhill/cronexpr semantics properly.
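For readers unfamiliar with these specifiers, the sketch below shows one common (Quartz-style) reading of L, W, and # in the day fields. It is an assumption-laden illustration with made-up function names, has not been checked against gorhill/cronexpr's exact semantics, and is not code from either library. The edge cases in nearest_weekday (not crossing a month boundary) hint at why matching existing behavior precisely is the tricky part.

```python
import calendar
from datetime import date


def last_day_of_month(year, month):
    """L: the last calendar day of the month."""
    return calendar.monthrange(year, month)[1]


def nearest_weekday(year, month, day):
    """W: the Monday-Friday day closest to `day`, staying inside the month."""
    weekday = date(year, month, day).weekday()  # Monday == 0 ... Sunday == 6
    last = last_day_of_month(year, month)
    if weekday == 5:  # Saturday: prefer Friday, unless that leaves the month
        return day - 1 if day > 1 else day + 2
    if weekday == 6:  # Sunday: prefer Monday, unless that leaves the month
        return day + 1 if day < last else day - 2
    return day


def nth_weekday(year, month, cron_dow, n):
    """#: the n-th occurrence of a weekday (cron numbering, 0 == Sunday)."""
    py_dow = (cron_dow - 1) % 7  # convert to Python numbering, Monday == 0
    first_dow, days = calendar.monthrange(year, month)
    day = 1 + (py_dow - first_dow) % 7 + 7 * (n - 1)
    return day if day <= days else None  # None: the month has no such day


print(last_day_of_month(2020, 2))   # last day of a leap-year February
print(nearest_weekday(2019, 6, 1))  # "1W" when the 1st falls on a Saturday
print(nth_weekday(2019, 3, 0, 2))   # "0#2": second Sunday of March 2019
```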

My current inclination is to check whether robfig/cron would welcome contributions for the extensions. Their SpecSchedule struct would need to change significantly. If not, I would suggest fixing cronexpr as-is.

@github-actions

github-actions bot commented Nov 7, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 7, 2022