Rare SendErrors in Certain Zero #188

Open
NicEastvillage opened this issue Jun 5, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@NicEastvillage
Contributor

SendErrors happen when one thread tries to send a message to another thread, and the other thread does not attempt to receive it. In #183 most SendErrors were resolved by

  • making the master thread receive the final assignments even if not interested
  • making workers stop immediately after receiving a terminate message
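For reference, the failure mode described above can be reproduced in isolation. This is a minimal sketch that assumes channels behaving like Rust's `std::sync::mpsc` (the project's actual channel type is not shown in this issue); with those, `send` returns `Err(SendError(msg))` once the receiving half has been dropped. The function name `send_to_stopped_worker` is hypothetical.

```rust
use std::sync::mpsc::{channel, SendError};
use std::thread;

/// Hypothetical helper: sends `msg` to a worker that has already shut down
/// and returns whatever the channel reports.
fn send_to_stopped_worker(msg: u32) -> Result<(), SendError<u32>> {
    let (tx, rx) = channel::<u32>();

    // The "worker" owns the receiver and exits right away, as if it had
    // already handled a terminate message.
    let worker = thread::spawn(move || drop(rx));
    worker.join().expect("worker thread panicked");

    // The receiving half is gone, so the send fails and the unsent
    // message is handed back inside the SendError.
    tx.send(msg)
}
```

Calling `send_to_stopped_worker(7)` yields an `Err` that carries the `7` back to the caller, so a sender that calls `unwrap()` on the result would panic at this point.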

However, rare SendErrors still happen. I believe this occurs when worker A sends a message to worker B after worker B has received the terminate message and shut down, while worker A has just started a task requiring communication with B and has not yet seen the terminate message.

To handle this, we might need two more passes of the token: a first pass in which each worker, on receiving the token, stops sending messages to other workers, and a second pass in which it receives any remaining incoming messages and then shuts down.
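The two-pass idea could look roughly like the sketch below. It is an illustration only, not the project's code: it assumes workers arranged in a ring over `std::sync::mpsc` channels, and the names `Msg`, `worker`, and `run_ring` are hypothetical. `Work(k)` stands in for ordinary inter-worker traffic, `StopSending` is the first token pass, and `Drain` is the second.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

/// Hypothetical message type; the real project's messages will differ.
enum Msg {
    Work(u32),   // ordinary traffic, forwarded with a decreasing counter
    StopSending, // token, first pass: stop sending ordinary messages
    Drain,       // token, second pass: empty the inbox and shut down
}

/// Runs one worker in a ring of `n`; returns how many of its sends failed.
fn worker(id: usize, n: usize, inbox: Receiver<Msg>, next: Sender<Msg>) -> usize {
    let mut may_send = true;
    let mut failed = 0;
    loop {
        let msg = match inbox.recv() {
            Ok(m) => m,
            Err(_) => return failed, // every sender is gone
        };
        match msg {
            Msg::Work(k) => {
                // Forward work only while sending is still allowed.
                if may_send && k > 0 && next.send(Msg::Work(k - 1)).is_err() {
                    failed += 1;
                }
            }
            Msg::StopSending if may_send => {
                // First pass: stop sending, then hand the token on.
                may_send = false;
                if next.send(Msg::StopSending).is_err() {
                    failed += 1;
                }
            }
            // Worker 0 sees StopSending a second time once the first pass
            // has gone around the ring; it then starts the second pass.
            Msg::StopSending | Msg::Drain => {
                while inbox.try_recv().is_ok() {} // discard leftovers
                // The last worker has nobody left to notify.
                if id + 1 < n && next.send(Msg::Drain).is_err() {
                    failed += 1;
                }
                return failed;
            }
        }
    }
}

/// Builds the ring, injects some work followed by the token, and returns
/// the total number of failed sends across all workers.
fn run_ring(n: usize) -> usize {
    assert!(n > 0, "need at least one worker");
    let (senders, receivers): (Vec<_>, Vec<_>) =
        (0..n).map(|_| channel::<Msg>()).unzip();
    let handles: Vec<_> = receivers
        .into_iter()
        .enumerate()
        .map(|(id, inbox)| {
            let next = senders[(id + 1) % n].clone();
            thread::spawn(move || worker(id, n, inbox, next))
        })
        .collect();
    // The "master" starts some work and then the first token pass.
    senders[0].send(Msg::Work(10)).expect("worker 0 is alive");
    senders[0].send(Msg::StopSending).expect("worker 0 is alive");
    drop(senders);
    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .sum()
}
```

In this sketch no worker drops its receiver until every worker has stopped sending ordinary messages, and during the second pass each worker only sends the token to a neighbour that has not shut down yet, so `run_ring(4)` reports zero failed sends. Whether the same ordering argument carries over to the real topology would need to be checked against the actual code.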

@NicEastvillage added the bug label Jun 5, 2023
@falkecarlsen
Member

Furthermore, we do not have a resend method, which I gather is what generates this error when running many concurrent benches:
thread 'main' panicked at 'failed to spawn thread: Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }', /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/thread/mod.rs:717:29

The full stderr trace on mexi3p3hp: https://gist.github.com/falkecarlsen/2aaee32933b2174dc3863f4982868edf

@NicEastvillage
Contributor Author

NicEastvillage commented Jun 6, 2023

That looks unrelated. I think you are simply spawning too many threads too fast and running out of some OS resource (no idea which). That means some workers will fail to send, because they cannot send to the missing workers.
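One way to make this kind of failure visible without a panic, sketched under assumptions: `std::thread::spawn` panics with "failed to spawn thread" when the OS refuses to create a thread, whereas `std::thread::Builder::spawn` returns an `io::Result` that the caller can inspect. The function `spawn_workers` and the worker body (which just returns its id) are hypothetical placeholders, not the project's code.

```rust
use std::io;
use std::thread::{self, JoinHandle};

/// Hypothetical helper: spawns up to `count` workers and stops at the first
/// OS-level failure instead of panicking. Returns the handles that were
/// created, plus the error if spawning stopped early.
fn spawn_workers(count: usize) -> (Vec<JoinHandle<usize>>, Option<io::Error>) {
    let mut handles = Vec::with_capacity(count);
    for id in 0..count {
        let builder = thread::Builder::new().name(format!("worker-{}", id));
        // Placeholder worker body: a real worker would run its task loop here.
        match builder.spawn(move || id) {
            Ok(handle) => handles.push(handle),
            // e.g. ErrorKind::WouldBlock when the OS is out of resources
            Err(err) => return (handles, Some(err)),
        }
    }
    (handles, None)
}
```

A caller could then decide to run with fewer workers, retry later, or abort with a clear message, rather than leaving the remaining workers to send to peers that were never created.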

@falkecarlsen
Member

Yes, perhaps unrelated, but it suspiciously showed up after incorporating the changes proposed in #183 and #188.
It might be a node problem; otherwise it smells like a bug in how we use threads, e.g. an async reader that never gets a message might be spawned indefinitely under some conditions.
We'll see; I'll investigate further.


2 participants