Infinite recursion on list_windows.go #23984
Hi @albertofem-scopely and thanks for raising this issue with the detail included. This looks like a problem with the stats gathering here and something we should look into fixing. I'll mark this for roadmapping and also raise it internally.
I've given it a little thought, and I am not aware of a workaround for this currently. If I do think of something, I'll be sure to note it here.
Hi @albertofem-scopely! I did some testing on the recursive algorithm that @jrasell linked above. It looks like the stack size increases geometrically with the number of processes we have to examine, and we end up examining the same PID multiple times to build the tree. For example, I took the PID tree you posted and ran it through, and that got me 172 iterations for 21 processes, with 9 of those PIDs touched 10 or more times. And we have to do this for all the processes on the machine, not just the ones in question. So it's not "infinite", but on a busy machine it can be quite a lot. I'm working on figuring out a patch now, but at this point I feel pretty good about having eliminated anything weirdly Windows-specific (other than that we only use this code on Windows), so it should go quickly.
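For anyone following along, the shape of the problem is roughly the sketch below. This is illustrative only and assumes a simple parent-PID lookup map; the names and structure are not the actual code in list_windows.go.

```go
// Minimal sketch of the quadratic pattern (illustrative, not the Nomad
// source). To decide whether each PID belongs to the executor's tree we
// recursively walk its ancestor chain, so shared ancestors get re-examined
// once per descendant and the work grows much faster than the process count.
func isDescendant(parentOf map[int]int, root, pid int) bool {
	if pid == root {
		return true
	}
	ppid, ok := parentOf[pid]
	if !ok || ppid == pid {
		return false
	}
	return isDescendant(parentOf, root, ppid)
}

func treePIDsQuadratic(parentOf map[int]int, root int) []int {
	pids := []int{root}
	for pid := range parentOf {
		// Every call below re-walks the ancestor chain from scratch.
		if pid != root && isDescendant(parentOf, root, pid) {
			pids = append(pids, pid)
		}
	}
	return pids
}
```

Note that the outer loop still has to consider every PID on the host, not just the executor's descendants, which is consistent with the iteration counts above.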
I've got a draft PR up with a new algorithm that trades a little bit of memory (although not much) for being O(n) with the number of processes on the host: #24182. Need to finish up the improved test and then I'll mark this for review.
In #20619 we overhauled how we were gathering stats for Windows processes. Unlike on Linux, where we can ask for the processes in a cgroup, on Windows we have to make a single expensive syscall to get all the processes and then build the tree ourselves. Our algorithm to do so is recursive and quadratic in both steps and space with the number of processes on the host. For busy hosts this hits the stack limit and panics the Nomad client. We already build a map of parent PID to PID, so modify this to be a map of parent PID to slice of children and then traverse that tree only from the root we care about (the executor PID). This moves the allocations to the heap but makes the stats gathering linear in steps and space required. Fixes: #23984
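One possible shape of that traversal, sketched here with an explicit stack and illustrative names (assumptions for illustration, not the literal code in the PR):

```go
// process is an illustrative stand-in for one entry from the Windows
// process snapshot; only the PID and parent PID matter for building the tree.
type process struct {
	pid  int
	ppid int
}

// treePIDsLinear builds a parent PID -> children index in one pass, then
// walks only the subtree rooted at the executor PID. Each process is
// visited at most once, and the explicit stack keeps the traversal state
// on the heap rather than the goroutine call stack.
func treePIDsLinear(procs []process, root int) []int {
	children := make(map[int][]int, len(procs))
	for _, p := range procs {
		children[p.ppid] = append(children[p.ppid], p.pid)
	}

	seen := map[int]bool{root: true} // guard against PID-reuse cycles
	pids := []int{root}
	stack := []int{root}
	for len(stack) > 0 {
		pid := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		for _, child := range children[pid] {
			if seen[child] {
				continue
			}
			seen[child] = true
			pids = append(pids, child)
			stack = append(stack, child)
		}
	}
	return pids
}
```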
#24182 has been merged and will ship in Nomad 1.9.1 (with Enterprise backports as usual).
@tgross Thank you! I'll test this out as soon as 1.9.1 goes out. If this is indeed fixed we'll know within 24 hours, as the issue is still happening on a daily basis.
We just upgraded and I'm afraid this issue is still happening, albeit in a different place, not in
Nomad version
Nomad v1.8.3
BuildDate 2024-08-13T07:37:30Z
Revision 63b636e
Operating system and Environment details
OS Name: Microsoft Windows Server 2022 Datacenter
OS Version: 10.0.20348 N/A Build 20348
Issue
It seems that we are hitting an infinite recursion issue running Nomad on Windows (as a client). We have a large cluster of long-running services using raw_exec. When the services come up, everything seems fine and they can be stable for quite a few hours.
However, sometimes after a few hours, we start receiving a lot of these errors in the respective allocations:
I pulled the logs from a particular client in which one of these allocations failed, and I found this Go stack trace:
Our system is set to try restarting up to three times before reallocating. If it fails three times in a row, it triggers a new allocation. As you can see in the screenshot, even those new allocations can fail at first, but eventually, one of them sticks and things stabilize. The catch is, after a while, the service gets unstable again for the same reason, and the whole cycle starts over.
Reproduction steps
We don't have a synthetic project in which we can reliably reproduce this, as this is happening exclusively on our production workload.
Here are some guesses and some more information about what this process is actually doing.
The main process being executed in this Nomad allocation is a Node application that spawns a GitHub Actions runner, which is configured to only execute a particular kind of job: a Unity build. Unity is a game engine that, when building, can take a lot of resources and put the machine under a significant amount of stress. Although we have beefy machines that are way beyond what we have observed this process to take, this could explain part of the behaviour. Moreover, the actual Unity process runs in a Docker for Windows container.
Nomad runs as a Windows Service under the NT AUTHORITY\SYSTEM user on the Windows machine. This is an example of the process tree that our Nomad client spawns for each allocation:
Moreover, it looks like when this happens, only the Nomad executor dies but the underlying process is left alive, which becomes problematic as these processes live outside our Nomad cluster and take up resources on the machines.
Finally, we started to experience this issue after upgrading Nomad from version 1.7.7 to 1.8.3. We are considering downgrading because of this issue, but it would be nice to understand whether there is anything we can do to mitigate it.
If it helps, these are normal EC2 machines in the AWS Cloud.
Expected Result
The Nomad allocations don't crash with a Go panic and run normally.
Actual Result
The Nomad allocations crash with a Go panic, and the underlying spawned process is left alive, taking up resources on the machine.