Telemetry core: "Cannot find ID for node with shard/connectionId of ConnId(3)/ShardNodeId(217030)" #501

Open
jsdw opened this issue Sep 30, 2022 · 1 comment

Comments

Collaborator

jsdw commented Sep 30, 2022

At some point recently, telemetry.polkadot.io went down with lots of errors like:

2022-09-30 10:33:26,536 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(5)/ShardNodeId(174701)
2022-09-30 10:33:26,538 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(1)/ShardNodeId(217267)
2022-09-30 10:33:26,905 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(1)/ShardNodeId(217346)
2022-09-30 10:33:27,001 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(5)/ShardNodeId(174702)
2022-09-30 10:33:27,001 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(5)/ShardNodeId(174702)
2022-09-30 10:33:27,070 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(2)/ShardNodeId(217363)
2022-09-30 10:33:27,070 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(2)/ShardNodeId(217363)
2022-09-30 10:33:27,202 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(2)/ShardNodeId(217364)
2022-09-30 10:33:27,204 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(2)/ShardNodeId(217364)
2022-09-30 10:33:27,834 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(5)/ShardNodeId(174703)
2022-09-30 10:33:27,834 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(5)/ShardNodeId(174703)
2022-09-30 10:33:28,577 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(5)/ShardNodeId(174704)
2022-09-30 10:33:28,577 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(5)/ShardNodeId(174704)
2022-09-30 10:33:28,680 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(3)/ShardNodeId(217030)
2022-09-30 10:33:29,421 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(3)/ShardNodeId(216564)
2022-09-30 10:33:29,458 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(3)/ShardNodeId(217031)
2022-09-30 10:33:29,458 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(3)/ShardNodeId(217031)
^C

Restarting the telemetry-core pod didn't help.
Restarting the shards made things work again.

These errors imply that shards were sending information about nodes that the core knew nothing about.

Is there a chance that the core was restarted at some point (perhaps due to running out of memory or whatnot) and the shards didn't handle this properly by re-sending their node information?
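To make this first hypothesis concrete, here is a minimal sketch of how such an error could arise. It is a toy model, not the actual `telemetry_core` code: `CoreState`, `NodeId`, `add_node` and `update_node` are hypothetical names, and `ConnId` / `ShardNodeId` are stand-ins modelled on the IDs in the log lines above. It assumes the core keeps a map keyed by connection ID and shard-local node ID.

```rust
use std::collections::HashMap;

/// Hypothetical stand-ins for the IDs that appear in the log lines.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ConnId(pub u64);

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ShardNodeId(pub u64);

/// Hypothetical global node ID assigned by the core.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NodeId(pub u64);

/// Toy model of the core's bookkeeping (an assumption, not the real type).
#[derive(Default)]
pub struct CoreState {
    next_id: u64,
    nodes: HashMap<(ConnId, ShardNodeId), NodeId>,
}

impl CoreState {
    /// A shard announces a node; the core assigns it a global ID.
    pub fn add_node(&mut self, conn: ConnId, local: ShardNodeId) -> NodeId {
        let id = NodeId(self.next_id);
        self.next_id += 1;
        self.nodes.insert((conn, local), id);
        id
    }

    /// A shard sends an update for a node it believes the core already knows.
    /// If the core has lost its state, the lookup fails.
    pub fn update_node(&self, conn: ConnId, local: ShardNodeId) -> Result<NodeId, String> {
        self.nodes.get(&(conn, local)).copied().ok_or_else(|| {
            format!(
                "Cannot find ID for node with shard/connectionId of {:?}/{:?}",
                conn, local
            )
        })
    }
}
```

In this model, replacing the `CoreState` with a fresh `CoreState::default()` (simulating a core restart) while a shard keeps sending updates for nodes it announced earlier makes every `update_node` call fail until the shard announces those nodes again.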

Alternatively, is it possible that the connection between core and shards faltered and the core didn't properly clean up its internal state when this happened? (Offhand, I can't see anything that would drop all of a shard's nodes in the core when that shard's connection is lost.)
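For the second hypothesis, the kind of cleanup in question could look roughly like the sketch below. Again, this is an illustration under assumptions rather than the real implementation: `NodeTable`, `add_node`, `remove_connection` and `len` are hypothetical names, and the node payload is reduced to a `String`.

```rust
use std::collections::HashMap;

/// Hypothetical stand-ins for the IDs that appear in the log lines.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ConnId(pub u64);

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ShardNodeId(pub u64);

/// Toy node table keyed by (connection, shard-local node ID).
#[derive(Default)]
pub struct NodeTable {
    nodes: HashMap<(ConnId, ShardNodeId), String>,
}

impl NodeTable {
    /// Record a node announced over the given shard connection.
    pub fn add_node(&mut self, conn: ConnId, local: ShardNodeId, name: &str) {
        self.nodes.insert((conn, local), name.to_owned());
    }

    /// Drop every node that arrived over `conn`, e.g. when that shard
    /// connection is lost. Returns how many entries were removed.
    pub fn remove_connection(&mut self, conn: ConnId) -> usize {
        let before = self.nodes.len();
        self.nodes.retain(|(c, _), _| *c != conn);
        before - self.nodes.len()
    }

    /// Number of nodes currently tracked.
    pub fn len(&self) -> usize {
        self.nodes.len()
    }
}
```

Calling `remove_connection` from whatever code path observes a dropped shard connection would keep stale entries from lingering; whether the real core needs such a hook, or already has one elsewhere, is exactly the open question here.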

The latter is also something that's a little harder to test locally (we'll have tested restarting shards and core plenty). Perhaps #497 also arose as a result of some connection issue like this that led to duplicates not being cleaned up?

Collaborator Author

jsdw commented Oct 5, 2022

This might be resolved by #504
