Race condition in clone pattern (?) #818

Open
mhier opened this issue Jun 10, 2020 · 2 comments

mhier commented Jun 10, 2020

The clone pattern describes how a late subscriber obtains an out-of-band snapshot. For this it uses a second, independent socket (on a different port). Since TCP does not guarantee ordering across separate connections, there is no (obvious) guarantee that the subscription is in effect before the snapshot is taken. If the snapshot is taken before the subscription is ready, an update published in between can be missed.

I have implemented a similar pattern, but I replaced the out-of-band snapshot communication with a different (non-ZeroMQ) protocol. When writing a test for this, I discovered that the race condition is actually quite likely. The test sends an update through the PUB-SUB channel immediately after the snapshot has been read. In my setup, this update is lost about 50% of the time.
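To make the ordering problem concrete, here is a minimal, deterministic sketch in plain Python. It does not use ZeroMQ at all: `Publisher`, `late_join` and the flag `subscription_ready_first` are hypothetical names made up for illustration, and the two orderings are forced by hand rather than left to network timing.

```python
from dataclasses import dataclass, field


@dataclass
class Publisher:
    """Toy stand-in for the server: current state plus the inboxes of ready subscribers."""
    state: dict = field(default_factory=dict)
    subscribers: list = field(default_factory=list)

    def snapshot(self):
        # Out-of-band snapshot: a copy of the state at this instant.
        return dict(self.state)

    def publish(self, key, value):
        # An update only reaches subscribers whose subscription is already in effect.
        self.state[key] = value
        for inbox in self.subscribers:
            inbox.append((key, value))


def late_join(subscription_ready_first):
    """Join late while one update is published; return the client's final view."""
    pub = Publisher(state={"x": 1})
    inbox = []
    if subscription_ready_first:
        pub.subscribers.append(inbox)  # subscription takes effect before the snapshot
        view = pub.snapshot()
        pub.publish("x", 2)
    else:
        view = pub.snapshot()          # snapshot is taken first
        pub.publish("x", 2)            # update goes out before the subscription is ready
        pub.subscribers.append(inbox)
    for key, value in inbox:           # apply whatever updates were actually delivered
        view[key] = value
    return view


good = late_join(subscription_ready_first=True)
bad = late_join(subscription_ready_first=False)
print(good, bad)
```

In the second ordering the client keeps the stale value until the next update arrives, which is the situation described above.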

The title of the example, "Reliable Pub-Sub (Clone Pattern)", suggests that such race conditions cannot occur. If there is in fact a guarantee that only holds when the out-of-band communication also goes through ZeroMQ, this should IMO be mentioned in the description. If no such guarantee can be given, I would recommend mentioning that as well.

There may be situations where such a loss of information is acceptable, but in others it is not. A simple example is a value that changes rarely and irregularly, and that happens to change at exactly the moment the late subscriber joins. The next update may come a week later, so the late subscriber sees an outdated value for a whole week. At the very least one needs to know about this potential issue in order to think of a workaround (can someone point me to the best option here, please?).

sappo (Member) commented Jun 15, 2020

Hi @mhier,
You are making very valid points here. It would be awesome if you could supply some code that shows the race condition(s).

We are currently working on a different solution for the late-subscriber problem, which borrows heavily from Kafka. The protocol and its reference implementation are called dafka (https://github.com/zeromq/dafka). The protocol is almost finished, but the user API is still missing some finishing touches.

mhier (Author) commented Jun 15, 2020

Thanks for confirming my assumptions :-) I am relatively new to ZeroMQ and hence could easily have overlooked something.

Unfortunately, I have no pure ZeroMQ code to reproduce the race condition (only code that uses our proprietary protocol for the out-of-band communication, and it has heavy dependencies). I can cook something up later, but I am under some time pressure right now, so please be a bit patient (it could be a few weeks...).

What I can offer you right now is proof-of-concept code that does not have the race condition:

https://pastebin.com/6UHEhqVB

It uses an XPUB server, which sends out an in-band snapshot as soon as it detects a new subscription. The code needs refining, though, since all clients receive the snapshot, not just the one that made the subscription. A possible refinement would be to use a special prefix for the snapshot: the client subscribes to it after making the main subscription and unsubscribes again once the snapshot has been received.
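The prefix idea can be sketched as a small plain-Python model. This is not the pastebin code and makes no ZeroMQ calls: `XPubModel`, the prefixes `update.` and `snapshot.`, and the synchronous delivery are all assumptions chosen only to show the intended message flow.

```python
UPDATE_PREFIX = "update."      # hypothetical topic prefix for regular updates
SNAPSHOT_PREFIX = "snapshot."  # hypothetical topic prefix for in-band snapshots


class XPubModel:
    """Toy model of a publisher that, like XPUB, gets to see subscriptions."""

    def __init__(self, state):
        self.state = dict(state)
        self.clients = {}  # client name -> (set of subscribed prefixes, inbox list)

    def connect(self, name):
        self.clients[name] = (set(), [])
        return self.clients[name][1]

    def subscribe(self, name, prefix):
        self.clients[name][0].add(prefix)
        if prefix == SNAPSHOT_PREFIX:
            # React to the subscription by sending the snapshot on the same channel.
            self._send(SNAPSHOT_PREFIX, dict(self.state))

    def unsubscribe(self, name, prefix):
        self.clients[name][0].discard(prefix)

    def publish(self, key, value):
        self.state[key] = value
        self._send(UPDATE_PREFIX, (key, value))

    def _send(self, topic, payload):
        # Prefix matching: deliver to every client subscribed to a matching prefix.
        for prefixes, inbox in self.clients.values():
            if any(topic.startswith(p) for p in prefixes):
                inbox.append((topic, payload))


server = XPubModel({"x": 1})
early_inbox = server.connect("early")
server.subscribe("early", UPDATE_PREFIX)

late_inbox = server.connect("late")
server.subscribe("late", UPDATE_PREFIX)      # main subscription first
server.subscribe("late", SNAPSHOT_PREFIX)    # triggers the in-band snapshot
server.unsubscribe("late", SNAPSHOT_PREFIX)  # delivery is synchronous in this model
server.publish("x", 2)                       # update right after the snapshot

late_view = {}
for topic, payload in late_inbox:
    if topic.startswith(SNAPSHOT_PREFIX):
        late_view = dict(payload)            # start from the snapshot
    else:
        key, value = payload
        late_view[key] = value               # then apply updates in arrival order
print(late_view, early_inbox)
```

Because the snapshot and the updates share one ordered channel in this model, the update published right after the snapshot is not lost, and the already-connected client never sees the snapshot message. A real XPUB server may also need to handle several clients subscribing to the snapshot prefix at the same time, which this sketch ignores.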
