This repository has been archived by the owner on Feb 15, 2022. It is now read-only.

Investigate ServiceMap Prepper resilience #514

Open
wrijeff opened this issue Apr 14, 2021 · 0 comments
Labels
maintenance Chores that need to be done


Related to #479 and opendistro-for-elasticsearch/trace-analytics#32.

Typically, service map records are emitted in pairs: destination (client) and target (server). An ES cluster got into a bad state where only half of each pair was received, which caused the front-end JS code to error. We're asking the front-end to add null checks, but we should also check whether there's anything we can do on our end to improve resilience. First thoughts:

  • Confirm that failed writes are being retried by the ES sink
  • Potentially have the ServiceMap prepper re-send records that might have already been sent.
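For the first bullet, a bounded retry with exponential backoff is the usual shape for resilient sink writes. The sketch below is illustrative only; `RetryingWriter`, `writeWithRetry`, and the parameter names are assumptions, not the actual Data Prepper sink API.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of bounded retry around a bulk write; names are
// illustrative, not the real ES sink implementation.
public final class RetryingWriter {
    public static <T> boolean writeWithRetry(Consumer<List<T>> sink,
                                             List<T> batch,
                                             int maxAttempts,
                                             long initialBackoffMillis) {
        long backoff = initialBackoffMillis;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                sink.accept(batch);   // e.g. a bulk write to the ES cluster
                return true;
            } catch (RuntimeException e) {
                if (attempt == maxAttempts) {
                    return false;     // caller can log or dead-letter the batch
                }
                try {
                    Thread.sleep(backoff);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return false;
                }
                backoff *= 2;         // exponential backoff between attempts
            }
        }
        return false;
    }
}
```

Confirming whether the real sink already does something equivalent (and whether failures are surfaced or silently dropped) is the actionable part of that bullet.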

For the ServiceMap changes, the current logic is:

  1. After a set interval, find relationships between nodes in memory
  2. Before sending the relationship record to the ES sink, first check if that record has already been sent previously
    • This is to prevent "duplicate" records from being sent every few minutes
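The duplicate check in step 2 can be pictured as a set of already-sent relationship keys. This is a minimal sketch of the idea only; `RelationshipDeduper`, the key format, and the method names are assumptions, not the actual prepper code.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of the duplicate-suppression step described above;
// names and the key encoding are assumptions, not the real implementation.
public final class RelationshipDeduper {
    private final Set<String> previouslySent = new HashSet<>();

    // Returns true if this relationship has not been emitted in an
    // earlier interval and should be sent to the ES sink.
    public boolean shouldSend(String source, String destination, String traceGroup) {
        String key = source + "->" + destination + "|" + traceGroup;
        return previouslySent.add(key);  // add() returns false for duplicates
    }

    // Clearing the state each interval would re-send every relationship,
    // trading extra ES writes for resilience to lost records.
    public void reset() {
        previouslySent.clear();
    }
}
```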

It might make sense to remove the duplicate-record check and continuously send service map records. Yes, this increases ES writes, but if we can assume only a few hundred records are sent every 3 minutes, that seems a decent fallback to fill in missing service map gaps.
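For scale, a back-of-envelope estimate of the extra write load under the assumed "few hundred records every 3 minutes" figure (the helper below is purely illustrative):

```java
// Rough estimate of continuous re-send load; 300 records per 180-second
// interval works out to under 2 writes per second, which is modest for ES.
public final class WriteLoadEstimate {
    public static double writesPerSecond(int recordsPerInterval, int intervalSeconds) {
        return (double) recordsPerInterval / intervalSeconds;
    }
}
```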
