Some clarifications about data persistence [HZG-257] #1510
Thanks for working to improve our documentation on this topic @th0masb - a few comments from me on this round
- **Speeding up single member restarts**:
** **Planned**: During a rolling restart each cluster member is restarted one by one for scenarios such as installing an operating system patch or new hardware. xref:maintain-cluster:rolling-upgrades.adoc[Rolling upgrades] are an example of a rolling restart.
** **Unplanned**: A member may crash or terminate unexpectedly at any time, using persistence allows faster recovery.
This also depends on configuration - if rebalancing is delayed, or the cluster is in PASSIVE state, it loads all data from disk without receiving any from other members
I was under the impression that, even if the member crashed while the cluster was in active state and rebalancing was not delayed, the persisted data could still be used to speed up the member rejoining the cluster. Aren't Merkle trees compared to reduce network traffic as much as possible?
Yes, that's correct. I'm saying that in addition to this "faster recovery" case, there are conditions where it recovers entirely using its own data, not just using it as a speedup mechanism.
- **Speeding up single member restarts**:
** **Planned**: During a rolling restart each cluster member is restarted one by one for scenarios such as installing an operating system patch or new hardware. xref:maintain-cluster:rolling-upgrades.adoc[Rolling upgrades] are an example of a rolling restart.
This is part of a cluster-wide shutdown; we do not want to create confusion between these topics imo
I found the existing wording confusing: how is a rolling restart a cluster-wide shutdown if only one member is stopped at any time? Is it mandatory to put the cluster into passive mode for a rolling restart? It seems like it would work fine if the cluster was still in active mode.
I suppose there are arguments to be made both ways - I can see why you see it as a series of single member restarts, but I can also see that it does eventually result in the entire cluster restarting, so it could be seen as a cluster-wide shutdown (restart). I'm not too fussed, we can let the docs team review this aspect and see if it makes sense to them 👍
@th0masb Looks like you're still debating the technical details here, so ignoring for now. Feel free to @ me when this is ready for editorial review, and please add the "backport to all versions" label if appropriate.
A few more comments from me, but mainly nits so approving in advance - thanks for raising this @th0masb and improving our docs 👍
@Rob-Hazelcast I think this is ready now. As for backporting, I am unsure of the policy for the docs, but since this is a clarification of existing functionality and the page content looks to be the same since 5.3, I would recommend backporting to 5.3.
Some clarifications that data persistence is there to augment in-memory partition backups and that planned cluster shutdowns must be done cluster-wide. A rolling restart is not the same as a cluster shutdown: it is a sequence of individual member restarts, so it is more appropriate to categorize it as a speed improvement than as a resiliency one.