This repository has been archived by the owner on Feb 15, 2022. It is now read-only.

Investigate ServiceMap Prepper resilience #514

Open
wrijeff opened this issue Apr 14, 2021 · 0 comments
Labels
maintenance Chores that need to be done


Related to #479 and opendistro-for-elasticsearch/trace-analytics#32.

Typically, service map records are emitted in pairs: destination (client) and target (server). An ES cluster got into a bad state where only half of each pair was received, which caused the front-end JS code to error. We're asking the front-end to add null checks, but we should also check whether there's anything we can do on our end to improve resilience. First thoughts:

  • Confirm that failed writes are being retried by the ES sink
  • Potentially have the ServiceMap prepper re-send records that might have already been sent.
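For the first bullet, a bounded retry with exponential backoff is the usual shape for resilient sink writes. The sketch below is illustrative only; `RetryingWriter`, `writeWithRetry`, and the parameter names are assumptions, not the actual Data Prepper sink API.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of bounded retry around a bulk write; names are
// illustrative, not the real ES sink implementation.
public final class RetryingWriter {
    public static <T> boolean writeWithRetry(Consumer<List<T>> sink,
                                             List<T> batch,
                                             int maxAttempts,
                                             long initialBackoffMillis) {
        long backoff = initialBackoffMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                sink.accept(batch);   // e.g. a bulk write to the ES cluster
                return true;
            } catch (RuntimeException e) {
                if (attempt == maxAttempts) {
                    return false;     // caller can log or dead-letter the batch
                }
                try {
                    Thread.sleep(backoff);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return false;
                }
                backoff *= 2;         // exponential backoff between attempts
            }
        }
        return false;
    }
}
```

Confirming whether the real sink already does something equivalent (and whether failures are surfaced or silently dropped) is the actionable part of that bullet.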

For the ServiceMap changes, the current logic is:

  1. After a set interval, find relationships between nodes in memory
  2. Before sending the relationship record to the ES sink, first check if that record has already been sent previously
    • This is to prevent "duplicate" records from being sent every few minutes
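The duplicate check in step 2 can be pictured as a set of already-sent relationship keys. This is a minimal sketch of the idea only; `RelationshipDeduper`, the key format, and the method names are assumptions, not the actual prepper code.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of the duplicate-suppression step described above;
// names and the key encoding are assumptions, not the real implementation.
public final class RelationshipDeduper {
    private final Set<String> previouslySent = new HashSet<>();

    // Returns true if this relationship has not been emitted in an
    // earlier interval and should be sent to the ES sink.
    public boolean shouldSend(String source, String destination, String traceGroup) {
        String key = source + "->" + destination + "|" + traceGroup;
        return previouslySent.add(key);  // add() returns false for duplicates
    }

    // Clearing the state each interval would re-send every relationship,
    // trading extra ES writes for resilience to lost records.
    public void reset() {
        previouslySent.clear();
    }
}
```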

It might make sense to remove the duplicate-record check and continuously send service map records. Yes, this increases ES writes, but if we can assume only a few hundred records are sent every 3 minutes, that seems a decent fallback to fill in missing service map gaps.
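For scale, a back-of-envelope estimate of the extra write load under the assumed "few hundred records every 3 minutes" figure (the helper below is purely illustrative):

```java
// Rough estimate of continuous re-send load; 300 records per 180-second
// interval works out to under 2 writes per second, which is modest for ES.
public final class WriteLoadEstimate {
    public static double writesPerSecond(int recordsPerInterval, int intervalSeconds) {
        return (double) recordsPerInterval / intervalSeconds;
    }
}
```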
