
[RFC] Speed up bootstrap by maintaining a "delta set" table #145

Open
ieQu1 opened this issue Jun 21, 2023 · 1 comment

ieQu1 commented Jun 21, 2023

Currently replicants have to copy the entire contents of the tables when they reconnect, even after a short absence. With a large enough volume of data, this can hinder cluster recovery after a disaster or maintenance. Moreover, it makes it almost impossible to rebalance the load on the core nodes.

Initially we tried to solve this problem by persisting the transaction log, so that replicants recovering after a reconnect could replay it instead of going through the entire bootstrap procedure.

That approach proved to hurt performance too much to be practical. In addition, flapping client connections often generate delete -> add -> delete -> ... loops in the transaction log, which can make it larger than the table itself and calls the whole idea of replaying the transaction log into question.

Below I describe an alternative approach. Instead of trying to avoid bootstrap, we could speed it up.

  1. For each shard we create a so-called "delta set", consisting of N set-like tables (plain ets or plain rocksdb) that store records of the form {{Table, Key}, X}, where X is the value of a counter.
  2. Every M seconds we rotate the tables in the delta set in ring-buffer fashion; the contents of the oldest table are dropped. We also increment the counter X by 1.
  3. The mria_rlog_server process, as it processes intercepted transactions, writes each affected key to the current table of the delta set. Existing keys are simply overwritten.
  4. When a replicant connects to the core, it reports its current logical timestamp in the hello message.
  5. The core checks if the timestamp is covered by its delta set.
  6. If it is, instead of running the normal bootstrap server loop (https://github.com/emqx/mria/blob/main/src/mria_bootstrapper.erl#L192), it loops over the delta set keys and doesn't send the clear_table command (https://github.com/emqx/mria/blob/main/src/mria_bootstrapper.erl#L240), so the replicant preserves its local data.
  7. If the replicant's data is too old, the bootstrap server performs the normal loop.
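To make the rotation scheme concrete, here is a minimal sketch of the delta set as a ring buffer. This is illustrative pseudocode in Python, not Mria's actual API; the class and method names (`DeltaSet`, `record`, `rotate`, `covers`, `keys_since`) are all hypothetical:

```python
from collections import deque

class DeltaSet:
    """Hypothetical sketch of the proposed per-shard delta set: N set-like
    tables kept in a ring buffer, each entry mapping (table, key) to the
    logical timestamp X at which the key was last touched."""

    def __init__(self, n_tables):
        self.counter = 0  # the logical timestamp X
        self.ring = deque([{} for _ in range(n_tables)], maxlen=n_tables)

    def record(self, table, key):
        # Called for each key affected by an intercepted transaction;
        # existing keys are simply overwritten (step 3 above).
        self.ring[-1][(table, key)] = self.counter

    def rotate(self):
        # Every M seconds: open a fresh table (the deque evicts the
        # oldest one) and bump the logical timestamp (step 2 above).
        self.ring.append({})
        self.counter += 1

    def covers(self, replicant_ts):
        # The replicant's timestamp is covered iff it is no older than
        # the oldest table still held in the ring (step 5 above).
        return self.counter - replicant_ts < len(self.ring)

    def keys_since(self, replicant_ts):
        # Union of all keys the replicant may have missed (step 6 above).
        return {k for tab in self.ring for k, x in tab.items()
                if x >= replicant_ts}
```

The important property is that the delta set is bounded: it holds at most N rotations' worth of dirty keys, regardless of how often a flapping client rewrites the same key.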

Pitfalls:

  1. Clock skews. Possible solution: make sure to replay keys starting from well before the replicant's timestamp.
  2. Care should be taken to avoid "skipping" over part of the delta set as tables get rotated while the bootstrap is running. Perhaps this could be prevented by making sure the X counter doesn't change during bootstrap and only increments by 1 when jumping to the next table.
  3. When the core node itself restarts, or the rlog_server process restarts, it can skip certain keys. In this situation replaying the delta set would lead to inconsistent results, so the best course of action is to simply drop it. This means any replicant that connects to a freshly restarted core node has to bootstrap from scratch.
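The core-side decision (steps 5–7, including the fallback from pitfall 3) could be sketched as follows. Again this is hypothetical Python pseudocode, not Mria's bootstrapper; all parameter names are invented for illustration:

```python
def bootstrap(replicant_ts, counter, n_tables, delta_keys, send, full_iter):
    """Hypothetical core-side bootstrap decision: replay only the delta
    set when it covers the replicant's timestamp, otherwise fall back to
    the normal full-table bootstrap loop."""
    if counter - replicant_ts < n_tables:
        # Delta replay: no clear_table is sent, so the replicant keeps
        # its local data and only the dirty keys are re-transmitted.
        for table, key in delta_keys(replicant_ts):
            send(("write", table, key))
    else:
        # Replicant too far behind (or the delta set was dropped after a
        # core restart): normal bootstrap from scratch.
        send(("clear_table",))
        for table, key in full_iter():
            send(("write", table, key))
```

Note that dropping the delta set after a core restart (pitfall 3) degrades gracefully here: a dropped delta set simply never covers any replicant timestamp, so every reconnecting replicant takes the full-bootstrap branch.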

thalesmg commented Sep 6, 2023

I assume one needed step is to periodically update the replicants. Maybe, each time the delta set tables rotate, the cores could broadcast the new logical timestamp to the replicants?

Also, I guess clock skews come into play here in the sense that, although the timestamps would be just counters, each core node could diverge on what the current timestamp is. That is, each core might be at a different "time". So replicants might need to keep track of the current timestamp per core rather than just per table.
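The per-core bookkeeping suggested above might look something like this on the replicant side. A minimal sketch, assuming broadcasts carry the core's node name and its new counter value (all names are hypothetical, not Mria API):

```python
# Replicant-side state: one logical timestamp per core node, since each
# core rotates its delta set independently.
core_timestamps: dict[str, int] = {}

def on_rotation_broadcast(core_node: str, new_ts: int) -> None:
    # Cores broadcast their new logical timestamp after each rotation;
    # ignore stale or reordered broadcasts.
    core_timestamps[core_node] = max(core_timestamps.get(core_node, 0), new_ts)

def hello_timestamp(core_node: str) -> int:
    # Timestamp to present in the hello message when reconnecting to this
    # particular core; 0 (unknown core) would force a full bootstrap.
    return core_timestamps.get(core_node, 0)
```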
