Overwhelming the watchbot Logger #339

Open
sgillies opened this issue Jun 24, 2020 · 1 comment
@sgillies

At high log volume and small-ish CPU/memory reservations, our PXM worker command (Python, https://github.com/mapbox/pxm/blob/dev/cloudformation/pxm.template.js#L82) is, I believe, breaking the pipe to its parent watchbot worker. Logging less and increasing the stack's reservations help, but I'm interested in trying to solve the root problem.

When I say that we break the pipe, I mean that the PXM worker gets exit code 120 from its child process at https://github.com/mapbox/ecs-watchbot/blob/master/lib/worker.js#L18. Python sets this code only when it fails to flush stdout and stderr just before exiting. In the jobs where we see 120 exit codes, we also see log messages missing where we would expect them: the messages our worker command is sending to stderr are not making it through, and then we fail to flush the streams at the end. I do not understand at all how Python finds its stderr in a busted state.
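For anyone following along, the 120 status itself is easy to reproduce in isolation. Here is a minimal sketch (hypothetical, not the PXM worker code) that leaves data in Python's userspace stdout buffer, breaks the underlying descriptor, and lets interpreter shutdown fail the final flush:

```python
# repro_exit_120.py -- minimal sketch, not the PXM worker code.
import os
import sys

# No trailing newline, so the data stays in Python's userspace buffer
# (stdout is block-buffered when it is a pipe, line-buffered on a tty).
sys.stdout.write("buffered log line that never reaches the parent")

# Simulate the broken pipe by closing the underlying file descriptor.
os.close(sys.stdout.fileno())

# Nothing else to do: at shutdown Python tries to flush stdout/stderr,
# the flush fails on the closed descriptor, and the exit status is 120.
```

Running `python3 repro_exit_120.py; echo $?` should print a flush error on stderr followed by 120. What I can't reproduce is how the descriptor, or the pipeline behind it, gets into that state under watchbot.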

I've tried to reproduce this problem outside of a deployed ecs-watchbot stack, but the watchbot logger can keep up with logs as fast as a single-threaded Python program can write them. I might be able to break the logger if I ran everything in a resource-limited container and logged from multiple threads (as we are doing in PXM, sketched below), but before I do that I am wondering if there are known limits to the watchbot logger and if this is something we should try to fix at all.
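The shape of the reproduction I have in mind is roughly the following (a hypothetical stress sketch, not the PXM worker): several threads writing to stderr as fast as they can, run inside a CPU/memory-limited container so that watchbot's logger is reading the other end of the pipe.

```python
# stress_stderr.py -- hypothetical stress sketch, not the PXM worker.
import sys
import threading

PAYLOAD = "x" * 200  # arbitrary message size for the sketch

def spam(thread_id: int, lines: int = 100_000) -> None:
    # All threads share the locked sys.stderr text wrapper, so writes are
    # serialized in Python before they ever reach the pipe to the parent.
    for i in range(lines):
        sys.stderr.write(f"[thread {thread_id}] {i} {PAYLOAD}\n")

threads = [threading.Thread(target=spam, args=(t,)) for t in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```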

@rclark I hope you're doing well. Don't sweat this one, just letting you know your favorite pathological users have found another limit to computers 😄

@rclark
Contributor

rclark commented Jun 29, 2020

We've run up against similar challenges before when running things as Node.js child processes. However, in this case it sounds like we're probably doing that piece alright, but there's a pretty thick pile of Node code between your process and the container's stdout:

python -> parent node process -> combiner module (newline-splitting module -> writable stream) -> process.stdout

There's a lot of room in there for python to write faster than node keeps up, which could end up as a backup in python-land. It is kind of interesting that you say this happens during a pre-shutdown flush -- it sounds like there's backed-up data to flush, but I wonder if the pipeline in Node.js is getting shut down prematurely? Or maybe there's a limit to how long python is willing to wait to flush, and Node.js is still backed up when time runs out?
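To make the backup scenario concrete, here's a toy sketch (not watchbot's actual code) of a parent that stops reading: the child blocks in its write/flush as soon as the OS pipe buffer fills, and only makes progress again once the parent drains the pipe.

```python
# pipe_backpressure.py -- toy sketch, not watchbot's actual code.
import subprocess
import sys
import textwrap
import time

# The child tries to push ~1 MiB through stdout; once the OS pipe buffer
# (~64 KiB on Linux) is full and nobody is reading, it blocks in write/flush.
child = textwrap.dedent("""
    import sys
    sys.stdout.write("x" * (1 << 20))
    sys.stdout.flush()
    print("drained", file=sys.stderr)
""")

proc = subprocess.Popen([sys.executable, "-c", child], stdout=subprocess.PIPE)

# Simulate a slow or stalled downstream pipeline by not reading for a while.
time.sleep(2)
print("child still blocked:", proc.poll() is None)

# Draining the pipe unblocks the child, which then exits cleanly.
proc.stdout.read()
print("child exit code:", proc.wait())
```

If the parent tears the pipe down instead of draining it, the child's pending flush fails, which would line up with the exit code 120 described above.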

There's even the possibility that the way your container's logging is hooked up to CloudWatch Logs could cause some degree of slowdown. There are a bunch of places here where I really don't know all the implementation specifics.

For reference, the combiner module and the splitter module.
