Uncontrolled evaluation of periodic batch jobs during DST change #5410
Comments
We experienced the same issue with a periodic job configured with a cron expression. The first problematic allocation started at 2019-03-10T06:30:11.736Z. Once that allocation completed, a new allocation for the same job instance was created. This cycle continued until this morning, when we manually stopped the parent job and the child job instance. Ultimately, over 4000 allocations were created; virtually all completed successfully.

Attempting to manually stop just the child job (while leaving the parent periodic job registered) was unsuccessful. Nomad would accept the DELETE request, and the child job would sometimes briefly be marked as dead, but it would almost immediately return to a running state with all of the old allocations still in place.

After completely stopping the job (both parent and child) we were able to re-register it successfully. However, it is now not being scheduled for execution at all. Looking at our other periodic jobs, the ones that aren't stuck in continuous execution loops appear to not have run at all since the EST/EDT switchover.
Hey there! Since this issue hasn't had any activity in a while, we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at it. Thanks!
This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍
Thank you so much for reporting this, and sorry for taking so long. We plan to investigate and remedy this soon. The issue here is that our cron library doesn't handle daylight saving transitions well. We have two complications: the library we use is deprecated and unmaintained[1], and the daylight saving problem is a known unresolved issue[2]. We'll investigate our options and address it soon. Meanwhile, we recommend using …

[1] https://github.com/gorhill/cronexpr
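For anyone who wants to poke at the underlying scheduling math without standing up a cluster, here is a minimal sketch (not from this issue) that asks gorhill/cronexpr for the next fire times across the 2019-03-10 US spring-forward boundary. The `30 2 * * *` expression and the America/New_York zone are assumptions chosen to mirror the reports above; substitute your own job's schedule.

```go
package main

import (
	"fmt"
	"time"

	"github.com/gorhill/cronexpr"
)

func main() {
	loc, err := time.LoadLocation("America/New_York")
	if err != nil {
		panic(err)
	}

	// 2:30 AM local, every day; an assumed expression for illustration.
	expr := cronexpr.MustParse("30 2 * * *")

	// Start just before the transition: on 2019-03-10 clocks jump from
	// 2:00 to 3:00 local time, so 2:30 AM does not exist on that day.
	from := time.Date(2019, 3, 10, 1, 0, 0, 0, loc)

	// Walk the schedule forward and inspect the results. A well-behaved
	// scheduler yields strictly increasing times that never land inside
	// the skipped hour.
	for i := 0; i < 5; i++ {
		next := expr.Next(from)
		fmt.Printf("%d: %s (UTC: %s)\n", i, next, next.UTC())
		from = next
	}
}
```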
Providing an update here with my notes. We have two options:

We can fix the gorhill/cronexpr library to handle DST properly. Sadly, the DST PR gorhill/cronexpr#17 fails some of our tests: in some cases it gets into infinite recursion, causing a stack overflow.

Alternatively, we can migrate to another maintained library. https://github.com/robfig/cron is a very reasonable choice: its handling of DST passed our tests, and the library is well maintained and commonly used. The downside of switching libraries is that cronexpr supports some cron expression extensions not supported by any other library I looked at, so we may risk introducing subtle compatibility changes.
My current inclination is to check whether robfig/cron would welcome contributions for the extensions.
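To make the comparison concrete, here is a companion sketch (again an assumption, not something from the thread) that evaluates the same hypothetical `30 2 * * *` schedule with robfig/cron v3 across the same boundary. The `CRON_TZ=` prefix is how v3 pins a schedule to a zone, here America/New_York, so the result doesn't depend on the host's local time zone.

```go
package main

import (
	"fmt"
	"time"

	"github.com/robfig/cron/v3"
)

func main() {
	// CRON_TZ pins the schedule to America/New_York; the expression
	// itself is the same assumed 2:30 AM daily schedule as above.
	sched, err := cron.ParseStandard("CRON_TZ=America/New_York 30 2 * * *")
	if err != nil {
		panic(err)
	}

	loc, err := time.LoadLocation("America/New_York")
	if err != nil {
		panic(err)
	}
	from := time.Date(2019, 3, 10, 1, 0, 0, 0, loc)

	// With sane DST handling, the run that would land in the skipped
	// 2:00-3:00 hour is not re-fired in a tight loop, and later days
	// fire at 2:30 local time as usual.
	for i := 0; i < 5; i++ {
		next := sched.Next(from)
		fmt.Printf("%d: %s (UTC: %s)\n", i, next, next.UTC())
		from = next
	}
}
```

Differences in output between the two libraries around the skipped hour are exactly the kind of behavior change a migration would need to account for.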
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v0.8.7 (21a2d93+CHANGES)
Operating system and Environment details
Red Hat Enterprise Linux Server release 7.6 (Maipo)
Issue
I had a periodic job scheduled during the DST rollover this past weekend which was repeatedly and rapidly evaluated at 6 AM UTC when it was scheduled for 2 AM local/ET. For reference, ET was UTC-5 before the change and UTC-4 after. The job ran many, many thousands of times before it was caught, and I believe many times that number of allocations never ended up getting placed (I posted about that in #4532). The flood eventually crippled the cluster, because Nomad was tracking so much that the OOM killer came out in force, and even without that, Nomad was mostly unresponsive as far as I can tell. I was able to stop the stage-a-restart-services job, but I also seemed to have to set node eligibility to false to get the allocations per node to drain from the thousands down to the usual dozen or so.

Reproduction steps
Running a job scheduled the same way across the DST change should reproduce this, but I honestly don't have time to do things like bring up another Nomad instance in a VM where I have the privileges to set the date.
Job file (if appropriate)
/home/fds/dsotm/FDSdsotm_misc/bin/restart_services is a short Python script that thankfully didn't actually do anything this time around.