Telemetry core: "Cannot find ID for node with shard/connectionId of ConnId(3)/ShardNodeId(217030)" #501

jsdw · 2022-09-30T11:01:40Z

At some point recently, telemetry.polkadot.io went down with lots of errors like:

2022-09-30 10:33:26,536 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(5)/ShardNodeId(174701)
2022-09-30 10:33:26,538 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(1)/ShardNodeId(217267)
2022-09-30 10:33:26,905 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(1)/ShardNodeId(217346)
2022-09-30 10:33:27,001 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(5)/ShardNodeId(174702)
2022-09-30 10:33:27,001 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(5)/ShardNodeId(174702)
2022-09-30 10:33:27,070 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(2)/ShardNodeId(217363)
2022-09-30 10:33:27,070 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(2)/ShardNodeId(217363)
2022-09-30 10:33:27,202 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(2)/ShardNodeId(217364)
2022-09-30 10:33:27,204 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(2)/ShardNodeId(217364)
2022-09-30 10:33:27,834 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(5)/ShardNodeId(174703)
2022-09-30 10:33:27,834 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(5)/ShardNodeId(174703)
2022-09-30 10:33:28,577 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(5)/ShardNodeId(174704)
2022-09-30 10:33:28,577 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(5)/ShardNodeId(174704)
2022-09-30 10:33:28,680 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(3)/ShardNodeId(217030)
2022-09-30 10:33:29,421 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(3)/ShardNodeId(216564)
2022-09-30 10:33:29,458 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(3)/ShardNodeId(217031)
2022-09-30 10:33:29,458 ERROR [telemetry_core::aggregator::inner_loop] Cannot find ID for node with shard/connectionId of ConnId(3)/ShardNodeId(217031)
^C

Restarting the telemetry-core pod didn't help.
Restarting the shards made things work again.

These errors imply that shards were sending information about nodes that the core knew nothing about.
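To make the failure mode concrete, here is a minimal sketch of the kind of lookup that would produce this error. It assumes the core keeps a map from (connection ID, shard-local node ID) to a global node ID. The `NodeIds` type and its methods are hypothetical and are not taken from the telemetry_core source; only the error text mirrors the logs above.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the IDs that appear in the log lines.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct ConnId(pub u64);

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct ShardNodeId(pub u64);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NodeId(pub u64);

/// Assumed shape of the core's bookkeeping: one global ID per
/// (shard connection, shard-local node ID) pair.
#[derive(Default)]
pub struct NodeIds {
    next: u64,
    by_shard: HashMap<(ConnId, ShardNodeId), NodeId>,
}

impl NodeIds {
    /// Called when a shard announces a node. Returns the existing
    /// global ID if the pair is already known.
    pub fn add(&mut self, conn: ConnId, local: ShardNodeId) -> NodeId {
        if let Some(id) = self.by_shard.get(&(conn, local)) {
            return *id;
        }
        let id = NodeId(self.next);
        self.next += 1;
        self.by_shard.insert((conn, local), id);
        id
    }

    /// Called for every later update about a node. Fails if the core
    /// never saw (or has forgotten) the announcement for this pair.
    pub fn lookup(&self, conn: ConnId, local: ShardNodeId) -> Result<NodeId, String> {
        self.by_shard.get(&(conn, local)).copied().ok_or_else(|| {
            format!(
                "Cannot find ID for node with shard/connectionId of {:?}/{:?}",
                conn, local
            )
        })
    }
}
```

Under this assumption, any update that arrives for a pair the core has no entry for fails in the way shown in the logs, whether the entry was never added or was dropped at some point.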

Is there a chance that the core was restarted at some point (perhaps due to being out of memory or whatnot) and the shards didn't properly handle this and send new node information?
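If that first hypothesis holds, one possible remedy is for each shard to replay its known nodes whenever its connection to the core is re-established. The sketch below is only an illustration of that idea: `ShardState`, `ToCore`, and `on_core_reconnected` are hypothetical names, not the actual shard implementation.

```rust
use std::collections::BTreeMap;

// Hypothetical shard-local node ID, as in the log lines.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct ShardNodeId(pub u64);

/// Hypothetical message a shard sends to the core.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum ToCore {
    AddNode { local_id: ShardNodeId, name: String },
}

/// Assumed shard-side state: every node currently connected to this shard.
#[derive(Default)]
pub struct ShardState {
    nodes: BTreeMap<ShardNodeId, String>,
}

impl ShardState {
    /// A node connects to the shard: remember it and announce it to the core.
    pub fn node_connected(&mut self, local_id: ShardNodeId, name: &str) -> ToCore {
        self.nodes.insert(local_id, name.to_string());
        ToCore::AddNode { local_id, name: name.to_string() }
    }

    /// The core connection came back: re-announce every known node so that
    /// a freshly restarted core can rebuild its ID mapping.
    pub fn on_core_reconnected(&self) -> Vec<ToCore> {
        self.nodes
            .iter()
            .map(|(id, name)| ToCore::AddNode { local_id: *id, name: name.clone() })
            .collect()
    }
}
```

Without a replay step of this kind, a restarted core would only receive updates for nodes it has never been told about, which matches the errors above.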

Alternatively, is it possible that the connection between core and shards faltered and the core didn't properly clean up its internal state when this happened? (Offhand, I can't see anything that would drop all of the nodes in the core when a shard connection was lost.)
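For the second hypothesis, the cleanup in question could look something like the sketch below: when a shard connection goes away, the core drops every node that was registered through it. This assumes nodes are keyed by (connection ID, shard-local node ID); `CoreState` and `on_shard_disconnected` are hypothetical names and not the real aggregator code.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the IDs that appear in the log lines.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct ConnId(pub u64);

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct ShardNodeId(pub u64);

/// Assumed core-side state: node details keyed by where they came from.
#[derive(Default)]
pub struct CoreState {
    nodes: HashMap<(ConnId, ShardNodeId), String>,
}

impl CoreState {
    /// A shard announced a node over the given connection.
    pub fn add_node(&mut self, conn: ConnId, local: ShardNodeId, name: &str) {
        self.nodes.insert((conn, local), name.to_string());
    }

    /// A shard connection was lost: remove every node that arrived over it
    /// and report how many were dropped.
    pub fn on_shard_disconnected(&mut self, conn: ConnId) -> usize {
        let before = self.nodes.len();
        self.nodes.retain(|key, _| key.0 != conn);
        before - self.nodes.len()
    }

    /// Number of nodes the core currently knows about.
    pub fn node_count(&self) -> usize {
        self.nodes.len()
    }
}
```

If a step like this is missing or is skipped on some disconnect paths, stale entries would linger in the core after a connection drops, which could also explain leftover duplicates.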

The latter is also a little harder to test locally (we'll have tested restarting shards and core plenty). Perhaps #497 also arose as a result of some connection issue like this that led to duplicates not being cleaned up?

jsdw · 2022-10-05T11:38:13Z

This might be resolved by #504
