
wakuv2: invalid messages not stored in Waku Archive #3004

Open · 3 tasks

jm-clius opened this issue Dec 6, 2022 · 3 comments

Comments


jm-clius commented Dec 6, 2022

Problem

The Waku Archive (backend for the historical message store) reports that some messages are not stored due to invalid timestamps. There are two types of errors reported:

1. Invalid messages: timestamp too old

These are messages whose timestamp is more than 20 seconds in the past at the moment the message is inserted into the archive. They typically occur in occasional spikes, often correlating with periods of high network activity.

An example of this type of Waku Archive error over 24 hours:

[chart omitted]

2. Invalid messages: future timestamp

These are messages that have a timestamp more than 20 seconds in the future from the perspective of the store/archive node. These occur more regularly than the "timestamp too old" errors, but with less pronounced spikes and a lower total number of errors. These also seem less correlated to network activity rates.

An example of this type of Waku Archive error over 24 hours:

[chart omitted]

Impact

Medium (?) - depends on priority of fault tolerance in the store/archive mechanism.

These messages won't be stored in the archive instance that reports the error. Although they are likely stored in some of the other archive instances (depending on the cause), there is no synchronisation/fault tolerance between the different instances, so the messages will not be retrievable from the specific archive instance where the error was reported.

I estimate that losses are of the same order as those incurred by a node restart (restarts are currently scheduled every 6 hours).
It is unclear if certain types of messages are more affected than others. This error is reported indiscriminately, but invalid timestamps may be more likely to occur in some messages.

Possible causes

Although not properly investigated yet, we have some ideas about possible causes:

1. Invalid messages: timestamp too old

  • SQLite bottleneck: These could occur due to the single-threaded SQLite read/write Archive implementation. Even after some optimisations, some queries still take up to a couple of seconds to complete. Furthermore, the store query rate is very high during periods of high network activity (>500 per minute per archive instance). It is possible that SQLite blocks on reads for long enough that pending inserts exceed the 20-second validity window and the messages are discarded.
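If the SQLite bottleneck hypothesis holds, one mitigation worth investigating is write-ahead logging plus a busy timeout, so slow read queries don't starve the writer. This is an illustrative Python sketch, not nwaku's Nim archive code, and the table/column names are made up, not nwaku's actual schema:

```python
import sqlite3
import tempfile
import time

# WAL mode lets the single writer proceed while readers hold snapshots;
# busy_timeout makes a blocked statement retry instead of failing at once.
db_path = tempfile.mktemp(suffix=".db")  # WAL requires a file-backed database
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA busy_timeout=5000")  # wait up to 5 s for a lock

# Illustrative schema, not nwaku's actual one.
conn.execute(
    "CREATE TABLE IF NOT EXISTS message "
    "(id INTEGER PRIMARY KEY, senderTimestamp INTEGER, payload BLOB)"
)
conn.execute(
    "INSERT INTO message (senderTimestamp, payload) VALUES (?, ?)",
    (time.time_ns(), b"example"),
)
conn.commit()
```

This would only relieve read/write contention inside a single instance; it does not address the broader single-threaded design or provide fault tolerance.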

2. Invalid messages: timestamp in future

  • NTP drift: Perhaps some clients are not synced with NTP and have drifted into the future?
  • Invalidly timestamped messages: It is possible that (perhaps only a subset/certain type of) messages are timestamped incorrectly due to a bug.

Possible solutions

Mitigation

I can think of the following. There may be more, depending on what causes are uncovered in an investigation:

  • nwaku: increase tolerance for timestamp validity. Although possible in the short term, this is not a scalable future solution and will leave the archive more vulnerable to attacks.
  • Desktop: perhaps a warning when clients are not synced to NTP? Not sure if any checks are already in place.
  • nwaku/Desktop: further query optimisations - perhaps the number and complexity of queries can still be reduced?
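For the NTP-sync warning idea, a client could periodically compare its clock against an SNTP server. A minimal sketch follows; the helper names are hypothetical, and pool.ntp.org is just a public NTP pool used for illustration, not necessarily what Status would use:

```python
import socket
import struct
import time

NTP_EPOCH_OFFSET = 2_208_988_800  # seconds between 1900-01-01 and 1970-01-01

def parse_sntp_transmit_time(packet: bytes) -> float:
    """Extract the server's transmit timestamp (Unix seconds) from a 48-byte SNTP response."""
    secs, frac = struct.unpack("!II", packet[40:48])
    return secs - NTP_EPOCH_OFFSET + frac / 2**32

def clock_offset(host: str = "pool.ntp.org", timeout: float = 2.0) -> float:
    """Rough local-clock offset (seconds) versus an NTP server; needs network access."""
    request = b"\x1b" + 47 * b"\x00"  # LI=0, VN=3, Mode=3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(request, (host, 123))
        packet, _ = sock.recvfrom(48)
    return parse_sntp_transmit_time(packet) - time.time()
```

A client could warn the user (or refuse to timestamp outgoing messages) when the absolute offset exceeds some fraction of the archive's 20-second tolerance.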

Fault-tolerant store/PostgreSQL

The more appropriate solution (which is required to deal with restarts as well) is to use a mechanism to allow for some individual archive instance failures without affecting overall history integrity. Since we're running into SQLite's limitations, the already roadmapped PostgreSQL implementation, with built-in redundancy and parallelisation, becomes more urgent.

This will not address issues arising from timestamp bugs/NTP drift (if such exist).

Some more questions

  • The impact/urgency is unclear: I think the PostgreSQL implementation should be part of the next Status Mobile Milestone for 10 Feb 2023. Are any of the mitigating actions necessary before then?
  • Are there any existing time verification mechanisms in the client(s), such as verifying NTP sync?
  • Are there any scripts/tests running on status.prod that would artificially inflate the number of queries? Is the current query rate expected?
jm-clius (Author) commented Dec 6, 2022

Message from @cammellos:

[screenshot of message]

fryorcraken commented:

> Message from @cammellos:
>
> [screenshot of message]

Great to see. Let's be sure we track once the fix is released and clients update so we can review the charts.

jm-clius (Author) commented Dec 8, 2022

[chart omitted]

Other than the occasional spikes, there has also been a constant stream of invalid messages (too old) since 06:30 UTC this morning.
