Problem
The Waku Archive (backend for the historical message store) reports that some messages are not stored due to invalid timestamps. There are two types of errors reported:
1. Invalid messages: timestamp too old
These are messages that have a timestamp older than 20 seconds at the time of inserting the message into the archive. These typically occur in occasional spikes, often correlating with periods of high network activity.
An example of this type of Waku Archive Error over 24 hours:
2. Invalid messages: future timestamp
These are messages that have a timestamp more than 20 seconds in the future from the perspective of the store/archive node. These occur more regularly than the "timestamp too old" errors, but with less pronounced spikes and a lower total number of errors. These also seem less correlated to network activity rates.
An example of this type of Waku Archive Error over 24 hours:
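Both error types follow from the same 20-second validity window around the archive node's clock. A minimal sketch of that rule in Python (nwaku itself is written in Nim; the function and constant names here are illustrative, not nwaku's actual API):

```python
# Hedged sketch of the timestamp validity rule described above: a message is
# rejected when its timestamp is more than 20 s in the past or the future
# relative to the archive node's clock. Names are illustrative only.
MAX_DRIFT_NS = 20 * 1_000_000_000  # 20 s tolerance, in nanoseconds

def classify(msg_timestamp_ns: int, now_ns: int) -> str:
    """Return 'ok', 'too_old', or 'future' for a message timestamp."""
    delta = now_ns - msg_timestamp_ns
    if delta > MAX_DRIFT_NS:
        return "too_old"   # "invalid messages: timestamp too old"
    if delta < -MAX_DRIFT_NS:
        return "future"    # "invalid messages: future timestamp"
    return "ok"

now = 1_000 * 1_000_000_000
print(classify(now - 25_000_000_000, now))  # too_old
print(classify(now + 25_000_000_000, now))  # future
print(classify(now - 5_000_000_000, now))   # ok
```

Note that the check is symmetric: a slow insert path produces "too old" errors, while a fast (drifted-ahead) client clock produces "future" errors.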
Impact
Medium (?) - depends on priority of fault tolerance in the store/archive mechanism.
These messages won't be stored in the archive instance that reports the error. Although these messages are likely stored in some of the other archive instances (depending on the cause), there is no synchronisation/fault tolerance between the different instances, so the message will not be retrievable from the specific archive where the error was reported.
I estimate that losses are in the same order as those incurred due to a node restart (these are currently scheduled for every 6 hours).
It is unclear if certain types of messages are more affected than others. This error is reported indiscriminately, but invalid timestamps may be more likely to occur in some messages.
Possible causes
Although not properly investigated yet, we have some ideas about possible causes:
1. Invalid messages: timestamp too old
SQLite bottleneck: These could occur due to the single-threaded SQLite read/write Archive implementation. Even after some optimisations, some queries still take up to a couple of seconds to complete. Furthermore, the store query rate is very high during periods of high network activity (>500 per minute per archive instance). It is possible that SQLite blocks on reads for long enough that pending inserts are delayed beyond the 20-second validity window, at which point the messages are discarded.
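The suspected contention can be reproduced in isolation. A hedged stand-alone sketch in Python's sqlite3 (the store itself is Nim; this only demonstrates the underlying SQLite behaviour in the default rollback-journal mode, not nwaku's actual code): a long-lived read transaction holds a shared lock, and a concurrent insert cannot commit until the reader finishes.

```python
# Demonstrates SQLite read/write contention: while a deferred transaction
# holds a SHARED lock via an open read, a concurrent INSERT cannot acquire
# the EXCLUSIVE lock it needs to commit and fails with "database is locked"
# once its busy timeout expires. Assumes the default rollback-journal mode.
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "archive.db")
# isolation_level=None -> autocommit; transactions are managed explicitly.
reader = sqlite3.connect(path, isolation_level=None)
writer = sqlite3.connect(path, timeout=0.2, isolation_level=None)  # give up after 0.2 s, not 20 s

reader.execute("CREATE TABLE messages (ts INTEGER, payload BLOB)")
reader.execute("BEGIN")                           # deferred transaction
reader.execute("SELECT COUNT(*) FROM messages")   # acquires and holds a SHARED lock

blocked = False
try:
    writer.execute("INSERT INTO messages VALUES (0, x'00')")
except sqlite3.OperationalError:  # "database is locked"
    blocked = True
print("insert blocked:", blocked)

reader.execute("COMMIT")  # releasing the read lock would let the insert proceed
```

In the real archive the writer would retry for up to the busy timeout, so a read that blocks for longer than the validity window leaves the insert stale by the time it lands.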
2. Invalid messages: timestamp in future
NTP drift: Perhaps some clients are not synced with NTP and have drifted into the future?
Invalidly timestamped messages: It is possible that (perhaps only a subset/certain type of) messages are timestamped incorrectly due to a bug.
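If NTP drift is the cause, it is measurable: the standard SNTP clock-offset calculation (RFC 4330) estimates how far a local clock has drifted from a reference server. A hedged sketch of just that arithmetic, with made-up example timestamps:

```python
# Standard SNTP clock-offset estimate (RFC 4330). T1/T4 are client
# send/receive times on the local clock; T2/T3 are server receive/send
# times on the server clock. A negative offset means the local clock
# is ahead of the server.
def ntp_offset(t1: float, t2: float, t3: float, t4: float) -> float:
    """Estimated local clock offset in seconds."""
    return ((t2 - t1) + (t3 - t4)) / 2.0

# Illustrative values: local clock 30 s fast, ~0.1 s network delay each way.
offset = ntp_offset(t1=1000.0, t2=970.1, t3=970.1, t4=1000.2)
print(round(offset, 1))  # -30.0: local clock is ~30 s ahead
```

A client drifted 30 seconds ahead like this would timestamp messages that a correctly synced archive sees as more than 20 seconds in the future, triggering exactly the "future timestamp" rejection described above.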
Possible solutions
Mitigation
I can think of the following. There may be more, depending on what causes are uncovered in an investigation:
nwaku: increase tolerance for timestamp validity. Although possible in the short term, this is not a scalable future solution and will leave the archive more vulnerable to attacks.
Desktop: perhaps a warning when clients are not synced to NTP? Not sure if any checks are already in place.
nwaku/Desktop: further query optimisations - perhaps the number and complexity of queries can still be reduced?
Fault-tolerant store/PostgreSQL
The more appropriate solution (which is required to deal with restarts as well) is to use a mechanism to allow for some individual archive instance failures without affecting overall history integrity. Since we're running into SQLite's limitations, the already roadmapped PostgreSQL implementation, with built-in redundancy and parallelisation, becomes more urgent.
This will not address issues arising from timestamp bugs/NTP drift (if such exist).
Some more questions
The impact/urgency is unclear: I think the PostgreSQL implementation should be part of the next Status Mobile Milestone for 10 Feb 2023. Are any of the mitigating actions necessary before then?
Are there any existing time verification mechanisms in the client(s), such as verifying NTP sync?
Are there any scripts/tests running on status.prod that would artificially inflate the number of queries? Is the current query rate expected?