Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix sporadic ansible-runner macOS bug and remove extra listener thread #23229

Merged
merged 1 commit into from
Oct 12, 2024

Conversation

Fryguy
Copy link
Member

@Fryguy Fryguy commented Oct 11, 2024

It was found that on macOS a sporadic failure would occur where the internal listener, while started, wouldn't actually start listening right away, causing the listener to miss the file we were waiting for.

The listen gem creates an internal thread, @run_thread, which on most target systems is where the actually listening is done. However, on macOS, @run_thread creates a second thread, @worker_thread, which does the actual listening. It's possible that although the listener is started, the @worker_thread hasn't actually started yet. This leaves a window where the target_path we are waiting on can actually be created before the @worker_thread is started and we "miss" the creation of the target_path. This commit ensures that we won't move on until that thread is ready, further ensuring we can't miss the creation of the target_path.

Ansible::Runner#wait_for was also creating an extra thread when starting the listener, but this thread is unnecessary as the listen gem creates its own internal thread under the covers.


@agrare Please review. I'd also like it if you could verify this on linux, since the removal of that extra thread could affect that path. I'm concerned that while macOS has this special case, there might be a different, but similar, special case on Linux we need to account for after calling listener.start.

I've also considering opening a bug in the upstream listen gem but would like your thoughts. IMO, listener.start shouldn't return until the underlying watching has actually started.

@Fryguy
Copy link
Member Author

Fryguy commented Oct 11, 2024

I added in the exception spec because of the comment in the old thread code about "raises an exception immediately". I believe you were hitting problems where the listener.start wasn't yet called from the thread, and thus the later listener.stop could get called first. However, if we move listener.start to the main thread, then that case can't happen because start will definitely complete and internally transition to "started" (internal threads being actually started notwithstanding).

@Fryguy Fryguy added the bug label Oct 11, 2024
@Fryguy Fryguy force-pushed the fix_ansible_runner_async_sporadic branch 2 times, most recently from 1834baa to f09580b Compare October 11, 2024 21:29
It was found that on macOS a sporadic failure would occur where the internal listener,
while started, wouldn't actually start listening right away, causing the listener to
miss the file we were waiting for.

The listen gem [creates an internal thread, `@run_thread`][1], which on most target systems
is where the actually listening is done. However, [on macOS, `@run_thread` creates a
second thread, `@worker_thread`][2], which does the actual listening. It's possible that
although the listener is started, the `@worker_thread` hasn't actually started yet.
This leaves a window where the target_path we are waiting on can actually be created
before the `@worker_thread` is started and we "miss" the creation of the target_path.
This commit ensures that we won't move on until that thread is ready, further ensuring
we can't miss the creation of the target_path.

Ansible::Runner#wait_for was also creating an extra thread when starting the listener,
but this thread is unnecessary as the listen gem creates its own internal thread under
the covers.

[1]: https://github.com/guard/listen/blob/f186b2fa159a2458f3ff7e8680c3a4fcbdc636d1/lib/listen/adapter/base.rb#L75
[2]: https://github.com/guard/listen/blob/f186b2fa159a2458f3ff7e8680c3a4fcbdc636d1/lib/listen/adapter/darwin.rb#L49
@Fryguy Fryguy force-pushed the fix_ansible_runner_async_sporadic branch from f09580b to 5ae61a1 Compare October 11, 2024 21:36
@miq-bot
Copy link
Member

miq-bot commented Oct 11, 2024

Checked commit Fryguy@5ae61a1 with ruby 3.1.5, rubocop 1.56.3, haml-lint 0.51.0, and yamllint
2 files checked, 0 offenses detected
Everything looks fine. 🏆

@agrare
Copy link
Member

agrare commented Oct 12, 2024

Oh yeah if Listener#start returns before the watch is started that is a bug in the gem IMO, that kind of defeats the purpose.
It sounds like they need to do something similar to our Concurrent::Event .set / .wait patter that we do in the EventCatcher::Runner.

@agrare
Copy link
Member

agrare commented Oct 12, 2024

Okay I've run this 10 times with the full suite in parallel (48 threads) on linux with no failures so pretty confident this will work

Copy link
Member

@agrare agrare left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So before just removing the thread I didn't want to understand why it was there originally.
In the first implementation of this we used INotify::Notifier and notifier.run was a blocking call, so when we moved to Listener we could have dropped the thread at that point.
LGTM

@agrare agrare merged commit 8f31201 into ManageIQ:master Oct 12, 2024
12 checks passed
@Fryguy Fryguy deleted the fix_ansible_runner_async_sporadic branch October 14, 2024 13:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants