
wakuv2: invalid messages not stored in Waku Archive #3004

Open · 3 tasks

jm-clius opened this issue Dec 6, 2022 · 3 comments

Comments


jm-clius commented Dec 6, 2022

Problem

The Waku Archive (backend for the historical message store) reports that some messages are not stored due to invalid timestamps. There are two types of errors reported:

1. Invalid messages: timestamp too old

These are messages whose timestamp is more than 20 seconds in the past at the moment the message is inserted into the archive. They typically occur in occasional spikes, often correlating with periods of high network activity.

An example of this type of Waku Archive error over 24 hours:

[chart omitted]

2. Invalid messages: future timestamp

These are messages that have a timestamp more than 20 seconds in the future from the perspective of the store/archive node. These occur more regularly than the "timestamp too old" errors, but with less pronounced spikes and a lower total number of errors. These also seem less correlated to network activity rates.

An example of this type of Waku Archive error over 24 hours:

[chart omitted]

Impact

Medium (?) - depends on priority of fault tolerance in the store/archive mechanism.

These messages won't be stored in the archive instance that reports the error. Although they are likely stored in some of the other archive instances (depending on the cause), there is no synchronisation/fault tolerance between the different instances, so the messages will not be retrievable from the specific archive instance where the error was reported.

I estimate that losses are of the same order as those incurred by a node restart (restarts are currently scheduled every 6 hours).
It is unclear if certain types of messages are more affected than others. This error is reported indiscriminately, but invalid timestamps may be more likely to occur in some messages.

Possible causes

Although not properly investigated yet, we have some ideas about possible causes:

1. Invalid messages: timestamp too old

  • SQLite bottleneck: These could occur due to the single-threaded SQLite read/write Archive implementation. Even after some optimisations, some queries still take up to a couple of seconds to complete. Furthermore, the store query rate is very high during periods of high network activity (>500 per minute per archive instance). It is possible that SQLite blocks on reads for long enough that pending inserts exceed the 20-second validity window and the messages are discarded.
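If the SQLite bottleneck hypothesis holds, one mitigation worth investigating is write-ahead logging plus a busy timeout, so slow read queries don't starve the writer. This is an illustrative Python sketch, not nwaku's Nim archive code, and the table/column names are made up, not nwaku's actual schema:

```python
import sqlite3
import tempfile
import time

# WAL mode lets the single writer proceed while readers hold snapshots;
# busy_timeout makes a blocked statement retry instead of failing at once.
db_path = tempfile.mktemp(suffix=".db")  # WAL requires a file-backed database
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA busy_timeout=5000")  # wait up to 5 s for a lock

# Illustrative schema, not nwaku's actual one.
conn.execute(
    "CREATE TABLE IF NOT EXISTS message "
    "(id INTEGER PRIMARY KEY, senderTimestamp INTEGER, payload BLOB)"
)
conn.execute(
    "INSERT INTO message (senderTimestamp, payload) VALUES (?, ?)",
    (time.time_ns(), b"example"),
)
conn.commit()
```

This would only relieve read/write contention inside a single instance; it does not address the broader single-threaded design or provide fault tolerance.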

2. Invalid messages: timestamp in future

  • NTP drift: Perhaps some clients are not synced with NTP and have drifted into the future?
  • Invalidly timestamped messages: It is possible that (perhaps only a subset/certain type of) messages are timestamped incorrectly due to a bug.

Possible solutions

Mitigation

I can think of the following. There may be more, depending on what causes are uncovered in an investigation:

  • nwaku: increase tolerance for timestamp validity. Although possible in the short term, this is not a scalable future solution and will leave the archive more vulnerable to attacks.
  • Desktop: perhaps a warning when clients are not synced to NTP? Not sure if any checks are already in place.
  • nwaku/Desktop: further query optimisations - perhaps the number and complexity of queries can still be reduced?
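For the NTP-sync warning idea, a client could periodically compare its clock against an SNTP server. A minimal sketch follows; the helper names are hypothetical, and pool.ntp.org is just a public NTP pool used for illustration, not necessarily what Status would use:

```python
import socket
import struct
import time

NTP_EPOCH_OFFSET = 2_208_988_800  # seconds between 1900-01-01 and 1970-01-01

def parse_sntp_transmit_time(packet: bytes) -> float:
    """Extract the server's transmit timestamp (Unix seconds) from a 48-byte SNTP response."""
    secs, frac = struct.unpack("!II", packet[40:48])
    return secs - NTP_EPOCH_OFFSET + frac / 2**32

def clock_offset(host: str = "pool.ntp.org", timeout: float = 2.0) -> float:
    """Rough local-clock offset (seconds) versus an NTP server; needs network access."""
    request = b"\x1b" + 47 * b"\x00"  # LI=0, VN=3, Mode=3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(request, (host, 123))
        packet, _ = sock.recvfrom(48)
    return parse_sntp_transmit_time(packet) - time.time()
```

A client could warn the user (or refuse to timestamp outgoing messages) when the absolute offset exceeds some fraction of the archive's 20-second tolerance.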

Fault-tolerant store/PostgreSQL

The more appropriate solution (which is required to deal with restarts as well) is to use a mechanism to allow for some individual archive instance failures without affecting overall history integrity. Since we're running into SQLite's limitations, the already roadmapped PostgreSQL implementation, with built-in redundancy and parallelisation, becomes more urgent.

This will not address issues arising from timestamp bugs/NTP drift (if such exist).

Some more questions

  • The impact/urgency is unclear: I think the PostgreSQL implementation should be part of the next Status Mobile Milestone for 10 Feb 2023. Are any of the mitigating actions necessary before then?
  • Are there any existing time verification mechanisms in the client(s), such as verifying NTP sync?
  • Are there any scripts/tests running on status.prod that would artificially inflate the number of queries? Is the current query rate expected?
jm-clius (Author) commented Dec 6, 2022

Message from @cammellos:

[screenshot of message]

fryorcraken commented:

> Message from @cammellos:
>
> [screenshot of message]

Great to see. Let's be sure we track once the fix is released and clients update so we can review the charts.

jm-clius (Author) commented Dec 8, 2022

[chart omitted]

Other than the occasional spikes, there has also been a constant stream of invalid messages (too old) since 06:30 UTC this morning.
