Overwhelming the watchbot Logger #339

Open
sgillies opened this issue Jun 24, 2020 · 1 comment
@sgillies

At high log volume and small-ish CPU/memory reservations, our PXM worker command (Python, https://github.com/mapbox/pxm/blob/dev/cloudformation/pxm.template.js#L82) is, I believe, breaking the pipe to its parent watchbot worker. Logging less and increasing the stack's reservations help, but I'm interested in trying to solve the root problem.

When I say that we break the pipe, I mean that the PXM worker gets exit code 120 from its child process at https://github.com/mapbox/ecs-watchbot/blob/master/lib/worker.js#L18. Python sets this code only when it fails to flush stdout and stderr just before exiting. In the jobs where we see 120 exit codes, we also see log messages missing where we would expect them: the messages our worker command is sending to stderr are not making it through, and then we fail to flush the streams at the end. I do not understand at all how Python finds its stderr in a busted state.
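For anyone following along, the 120 status itself is easy to reproduce in isolation. Here is a minimal sketch (hypothetical, not the PXM worker code) that leaves data in Python's userspace stdout buffer, breaks the underlying descriptor, and lets interpreter shutdown fail the final flush:

```python
# repro_exit_120.py -- minimal sketch, not the PXM worker code.
import os
import sys

# No trailing newline, so the data stays in Python's userspace buffer
# (stdout is block-buffered when it is a pipe, line-buffered on a tty).
sys.stdout.write("buffered log line that never reaches the parent")

# Simulate the broken pipe by closing the underlying file descriptor.
os.close(sys.stdout.fileno())

# Nothing else to do: at shutdown Python tries to flush stdout/stderr,
# the flush fails on the closed descriptor, and the exit status is 120.
```

Running `python3 repro_exit_120.py; echo $?` should print a flush error on stderr followed by 120. What I can't reproduce is how the descriptor, or the pipeline behind it, gets into that state under watchbot.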

I've tried to reproduce this problem outside of a deployed ecs-watchbot stack, but the watchbot logger can keep up with logs as fast as a single-threaded Python program can write them. I might be able to break the logger if I ran everything in a resource-limited container and logged from multiple threads (as we are doing in PXM, sketched below), but before I do that I am wondering if there are known limits to the watchbot logger and if this is something we should try to fix at all.
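The shape of the reproduction I have in mind is roughly the following (a hypothetical stress sketch, not the PXM worker): several threads writing to stderr as fast as they can, run inside a CPU/memory-limited container so that watchbot's logger is reading the other end of the pipe.

```python
# stress_stderr.py -- hypothetical stress sketch, not the PXM worker.
import sys
import threading

PAYLOAD = "x" * 200  # arbitrary message size for the sketch

def spam(thread_id: int, lines: int = 100_000) -> None:
    # All threads share the locked sys.stderr text wrapper, so writes are
    # serialized in Python before they ever reach the pipe to the parent.
    for i in range(lines):
        sys.stderr.write(f"[thread {thread_id}] {i} {PAYLOAD}\n")

threads = [threading.Thread(target=spam, args=(t,)) for t in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```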

@rclark I hope you're doing well. Don't sweat this one, just letting you know your favorite pathological users have found another limit to computers 😄

@rclark
Contributor

rclark commented Jun 29, 2020

We've run up against similar challenges before when running things as Node.js child processes. However, in this case it sounds like we're probably doing that piece alright, but there's a pretty thick pile of Node code between your process and the container's stdout:

python -> parent node process -> combiner module (newline-splitting module -> writable stream) -> process.stdout

There's a lot of room in there for python to write faster than node keeps up, which could end up as a backup in python-land. It is kind of interesting that you say this happens during a pre-shutdown flush -- it sounds like there's backed-up data to flush, but I wonder if the pipeline in Node.js is getting shut down prematurely? Or maybe there's a limit to how long python is willing to wait to flush, and Node.js is still backed up when time runs out?
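To make the backup scenario concrete, here's a toy sketch (not watchbot's actual code) of a parent that stops reading: the child blocks in its write/flush as soon as the OS pipe buffer fills, and only makes progress again once the parent drains the pipe.

```python
# pipe_backpressure.py -- toy sketch, not watchbot's actual code.
import subprocess
import sys
import textwrap
import time

# The child tries to push ~1 MiB through stdout; once the OS pipe buffer
# (~64 KiB on Linux) is full and nobody is reading, it blocks in write/flush.
child = textwrap.dedent("""
    import sys
    sys.stdout.write("x" * (1 << 20))
    sys.stdout.flush()
    print("drained", file=sys.stderr)
""")

proc = subprocess.Popen([sys.executable, "-c", child], stdout=subprocess.PIPE)

# Simulate a slow or stalled downstream pipeline by not reading for a while.
time.sleep(2)
print("child still blocked:", proc.poll() is None)

# Draining the pipe unblocks the child, which then exits cleanly.
proc.stdout.read()
print("child exit code:", proc.wait())
```

If the parent tears the pipe down instead of draining it, the child's pending flush fails, which would line up with the exit code 120 described above.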

There's even the possibility that the way your container's logging is hooked up to CloudWatch Logs could cause some degree of slowdown. There are a bunch of places here where I really don't know all the implementation specifics.

For reference, the combiner module and the splitter module.
