Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pexpect stdout sampling race can cause event loss #1330

Closed
nitzmahone opened this issue Dec 5, 2023 · 0 comments · Fixed by #1331
Closed

pexpect stdout sampling race can cause event loss #1330

nitzmahone opened this issue Dec 5, 2023 · 0 comments · Fixed by #1331
Labels
needs_triage New item that needs to be triaged

Comments

@nitzmahone
Copy link
Member

nitzmahone commented Dec 5, 2023

EDIT: proposed fix at #1331

Runner's stdout sampling is subject to a child process shutdown race when pexpect is in the mix, arguably because it's misusing the pexpect logfile_read callback:

while child.isalive():
result_id = child.expect(password_patterns, timeout=self.config.pexpect_timeout, searchwindowsize=100)
password = password_values[result_id]
if password is not None:
child.sendline(password)
self.last_stdout_update = time.time()
if self.cancel_callback:
try:
self.canceled = self.cancel_callback()
except Exception as e:
stdout_handle.close()
stderr_handle.close()
# TODO: logger.exception('Could not check cancel callback - cancelling immediately')
# if isinstance(extra_update_fields, dict):
# extra_update_fields['job_explanation'] = "System error during job execution, check system logs"
raise CallbackError(f"Exception in Cancel Callback: {e}") from e
if self.config.job_timeout and not self.canceled and (time.time() - job_start) > self.config.job_timeout:
self.timed_out = True
# if isinstance(extra_update_fields, dict):
# extra_update_fields['job_explanation'] = "Job terminated due to timeout"
if self.canceled or self.timed_out or self.errored:
self.kill_container()
Runner.handle_termination(child.pid)
if self.config.idle_timeout and (time.time() - self.last_stdout_update) > self.config.idle_timeout:
self.kill_container()
Runner.handle_termination(child.pid)
self.timed_out = True

If the child shuts down between a timed-out child.expect() and the next loop iteration's child.isalive(), any output sent by the child in the meantime will be lost. A final stdout stream pump via child.expect() until EOF is reached is necessary to ensure all data is fed through the stdout interposer. This can result in missing events- usually playbook_on_stats. The problem is exacerbated by delays longer than the pexpect timeout (default 5s) near the end of playbook execution (eg, profiling callback aggregation).

@github-actions github-actions bot added the needs_triage New item that needs to be triaged label Dec 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs_triage New item that needs to be triaged
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant