Skip to content
This repository has been archived by the owner on Nov 25, 2024. It is now read-only.
Kegsay edited this page Jun 1, 2020 · 4 revisions

A whole lot about holes in Federation

Following on from the state resolution v2 blog post we did, the Dendrite team is here to try to explain another mystical aspect of Federation: holes.

Background context

Events in a room form a Directed Acyclic Graph (DAG), where each event "points" to the previous event before it.

 A  ----- B  ----- C
 []      [A]      [B]  prev_events

Events can also branch out and form forks, which can then later be merged:

              .--- D --.
              |   [B]  |
              |        |
 A  ----- B  -+--- C --+--- E
 []      [A]      [B]      [C,D] prev_events

When a server receives an event, it probably has the prev_events and therefore it can perform all the calculations it needs to. However, if it doesn't have them it can request them via /get_missing_events:

                A ---- B ---- C ---- D ---- E
Server 1 has?   ✓      ✓                    ✓
Server 2 has?   ✓      ✓      ✓      ✓      ✓

Server 1 has A,B and then receives E. It needs to request C,D:

POST /_matrix/federation/v1/get_missing_events/%21SomeRoom%3Amatrix.org HTTP/1.1
Content-Type: application/json

{
  "limit": 10,
  "min_depth": 0,
  "earliest_events": [
    "$event_B:example.org"
  ],
  "latest_events": [
    "$event_E:example.org"
  ]
}

NB: Be aware of the misleading key names. /get_missing_events does not return events "between" the two events given, but instead walks the DAG, backwards from latest_events and ignoring earliest_events. This distinction is important for DAGs like this:

          .----- D ---- E ---- F   Server 2
A --- B --|
          `----- C                 Server 1

Server 1 has A,B,C and then receives F
Server 2 has A,B,D,E,F

Server 1 needs to request D,E

POST /_matrix/federation/v1/get_missing_events/%21SomeRoom%3Amatrix.org HTTP/1.1
Content-Type: application/json

{
  "limit": 10,
  "min_depth": 0,
  "earliest_events": [
    "$event_C:example.org"
  ],
  "latest_events": [
    "$event_F:example.org"
  ]
}

If /get_missing_events returned events "between" the two events, it would not return D,E as they are not between C,F in the DAG.

This API works well when the server is missing a few events, but sometimes the server is missing a lot of events. When this happens, the server may decide to give up trying to call /get_missing_events over and over and instead a hole is born.

What's a hole and why should I care?

A hole refers to the "hole" in the DAG that is formed when a server has missed many events from Federation and decides to give up backfilling via /get_missing_events e.g. the server has been offline for a long time.

These DAGs can have many different shapes: (Server previously had A,B,C and wants to now store X,Y,Z) - we'll refer to these events throughout.

 A         A            A
 |        / \           |
 B       B   D          B
 |       |   |          |
 C       C   E          C
 |           |         / \ 
...         ...      ... ...
 |           |         \ /
 X           X          X
 |           |          |
 Y           Y          Y
 |           |          |
 Z           Z          Z
Linear  Early fork    Fork

Servers need to care about holes so they can perform auth checks and verify the room state for all the messages it wants to persist. This means it needs to know the state of the room before X to work out if X is allowed, and so on for Y and Z. Clients need to care about holes so they can show the gap on their UI and backfill via /messages requests.

How servers should handle holes

Servers need to perform auth checks to work out if the event being processed is "allowed" according to the rules of the room. These checks are done on a snapshot of the room state at a particular point in time: just before the event being processed is added to the DAG.

A bit on state terminology

A frequent gotcha is the distinction between before/after/at when referring to the "state of the room" for a particular event. The state "at" an event is an ambigous term as it could refer to before or after the event has been applied. Ideally, no code or specification should refer to the state "at" an event. The terms "before" and "after" are also somewhat suggestive and unfortunately the intuition is wrong: the state after B is not always the state before C.

Consider the following graphs, one linear and one fork:

Before A ___            ________ Before B
After A ____A     A   B ________ After B - no state from A involved, B applied.
Before B ___|      \ /__________ Before C - state resolution over A,B applied, C not involved.
After B ____B       C___________ After C - state resolution over A,B applied, C applied.
Before C ___|
After C ____C

For the linear DAG, the state after B is the same as the state before C. However, for the fork DAG, the state after B has no knowledge of A and hence isn't included. The state before C is the result of the state resolution algorithm being applied over the two branches A,B and hence will result in different state.

How to not work out the state before X

  • We want to persist X,Y,Z
  • You need to know the state before X.
  • Get the state before X (e.g via /state_ids)

This is intuitive and simple but unfortunately is wrong because it means that we are entirely trusting the room state in /state_ids. Consider the example:

TODO: explain an attack

We need to perform state resolution to resolve this. So instead, server implementations should do the following...

How to work out state before X

  • We want to persist X,Y,Z
  • You need to know the state before X. Steps:
  • Get the state AFTER X's prev_events. If the prev_event was W, this means getting /state_ids for W then applying W itself.
  • Apply your current known state after event C.
  • Apply state resolution over it.
  • You now have the state before X.
 A         A            A
 |        / \           |
 B       B   D          B
 |       |   |          |
 C--.    C   E          C----.
 |  |    |   |         / \   |
... |    |  ...      ... ... |
 |  |    |   |         \ /   |
 X--`     `--X          X----`
 |           |          |
 Y           Y          Y
 |           |          |
 Z           Z          Z

The state that gets mixed in to work out the state before X.
Woah there, what?!

Yep, we mix in what we previously knew about the room into the state resolution algorithm, and rely on the algorithm to come to a sensible conclusion. Effectively, we're pretending that C is a prev_event to X, even when it isn't! This fixes the attack whereby:

TODO: Explain how this fixes the attack

Rejected events

TODO

  • It can still be referenced in the DAG by other servers
  • But also, just because the event is rejected doesn't mean its useless
  • (eg the prev events might still be useful for ordering purposes, and for figuring out state of any events pointing at it)
Clone this wiki locally