
[APM] Top dependencies request sometimes fails when searching outside of the boost window #178979

Open
ablnk opened this issue Mar 19, 2024 · 19 comments · Fixed by #182828
Labels
apm:dependencies-ui apm:performance APM UI - Performance Work apm:serverless bug Fixes for quality problems that affect the customer experience Team:obs-ux-infra_services Observability Infrastructure & Services User Experience Team



ablnk commented Mar 19, 2024

Version:
Serverless project v8.14.0

Description:
The `GET internal/apm/dependencies/top_dependencies` request fails with status code 502 and returns `backend closed connection` when searching for top dependencies outside of the boost window.

Preconditions:
I reproduced the issue with 102 dependencies and 761 services.

Steps to reproduce:

  1. Go to Applications - Dependencies.
  2. Filter data by last 30 days.

Expected behavior:
Dependencies available within the last 30 days are returned.

@ablnk ablnk added bug Fixes for quality problems that affect the customer experience Team:APM All issues that need APM UI Team support labels Mar 19, 2024
@elasticmachine (Contributor)

Pinging @elastic/apm-ui (Team:APM)

@smith smith added apm:dependencies-ui apm:serverless Team:obs-ux-infra_services Observability Infrastructure & Services User Experience Team labels Mar 29, 2024
@elasticmachine (Contributor)

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

@smith smith added needs-refinement A reason and acceptance criteria need to be defined for this issue and removed Team:APM All issues that need APM UI Team support labels Mar 29, 2024
@chrisdistasio

@smith can we have someone assigned to this to do some additional investigation to determine whether this is directly related to an issue with searching outside the boost window?

Is it possible to quantify the number of services and dependencies that appears to be the threshold for causing the issue?

Trying to get a better understanding of the severity as it relates to the boost window.

In QA the dependencies themselves come back pretty quickly, but the sparklines are slower to load (though they eventually do). I acknowledge the number of services and dependencies is far fewer than what was tested; again, trying to determine where the threshold is.

@crespocarlos crespocarlos self-assigned this May 2, 2024
@crespocarlos (Contributor)

I've managed to reproduce the same problem in QA with ~100 dependencies.


The changes planned as part of #178491 might solve this problem. I suspect that the histogram aggregation is slowing down the query.

@neptunian, the second option described in your comment could be a more robust solution to prevent this from happening. As part of #178491 it could be worth checking whether the problem described in this ticket is solved as well.

@dgieselaar (Member)

@crespocarlos with regard to the bucketing problem, it could be worth trying out ES|QL here; bucketing is much more relaxed there. Although it's probably easier to separate the date histogram buckets from the single search request.

However, if the bucket limit is the issue, you'd get an error describing it as such. It won't take down an Elasticsearch node, at least not in ES. This might be a different issue. Do we know if APM data is enabled for this cluster and where it goes/how we can identify it (e.g. by some label)?

@neptunian (Contributor)

> Do we know if APM data is enabled for this cluster and where it goes/how we can identify it (e.g. by some label)?

I think this is necessary for determining what's causing the error. I'm not sure fewer buckets will solve it, as querying large time ranges outside this "boost" window might still take too long, if that's the problem. I've asked in the Slack channel about having APM data for the cluster.

@dgieselaar (Member)

@neptunian found it; the issue is twofold:

  • the ES request to get sample spans times out after 30s
  • the request is then retried 3 times and eventually times out (it shouldn't; I have raised this with Core/Platform)

I have spoken to @crespocarlos about this. I would recommend doing a simple request to get the total number of hits and then, based on that, calculating a sample rate that returns statistically significant results, using the random_sampler agg if that sample rate is < 0.5. You will potentially lose the long tail of results, but the alternative is a request that times out.
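Dario's two-step suggestion could be sketched like this (a minimal TypeScript sketch, not the actual Kibana implementation; `decideSampling` and `targetSampleSize` are hypothetical names):

```typescript
// Hypothetical sketch of the approach described above: derive a
// random_sampler probability from the total hit count, and only
// sample when we'd keep fewer than half the documents.
interface SamplingDecision {
  probability: number; // value for random_sampler's `probability`
  useRandomSampler: boolean;
}

function decideSampling(
  totalHits: number,
  targetSampleSize = 100_000 // assumed knob, not a value from this issue
): SamplingDecision {
  const probability = Math.min(1, targetSampleSize / Math.max(totalHits, 1));
  // Per the comment above: only use the random_sampler agg when the
  // sample rate is < 0.5; otherwise query the full data set.
  return { probability, useRandomSampler: probability < 0.5 };
}
```

With the 594,537,153-hit example from the PR below, this yields a probability of roughly 0.000168; for small data sets the probability clamps to 1 and sampling is skipped.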


crespocarlos commented May 3, 2024

@dgieselaar how many hits could we consider as the cutoff for using the random_sampler agg? Nvm, I understand now what you meant.

@crespocarlos (Contributor)

There's an example of what Dario described above in `get_log_categories`. We can follow the same idea.


crespocarlos commented May 27, 2024

@chrisdistasio, following up on @paulb-elastic's comment: I was wondering if you have something in mind to help users understand potential data loss due to the random_sampler aggregation usage.

I just want to highlight that the changes in #182828 might affect the following (depending on the amount of data and date range):

  • APM > Dependencies
  • APM > Dependencies > Overview (Upstream Services section) - less likely
  • APM > Services > Overview (Dependencies tab) - less likely

@chrisdistasio

We have an analog for this someplace in services (IIRC). I'm trying to locate it in the UI. I would like to use consistent language if we can.

crespocarlos added a commit that referenced this issue Jun 4, 2024
Fixes [#178979](#178979)

## Summary

This PR changes the `get_exit_span_samples` query to use the
`random_sampler` aggregation, to limit the number of documents used in
the query and avoid the 502 errors seen in serverless.

**IMPORTANT** ❗
The change impacts other places and may lead to the loss of the long
tail of results.


### UI

The dependencies page will show a badge to inform users when the data is
being sampled.

**default**


https://github.com/elastic/kibana/assets/2767137/ea13031d-8ba1-48bb-a2e4-992eabfa90dd

**sampled data**


https://github.com/elastic/kibana/assets/2767137/6811c293-c2a1-42fd-bd38-b91e084e8d21


### How to test

The following can be tested on
`https://keepserverless-qa-oblt-b4ba07.kb.eu-west-1.aws.qa.elastic.cloud`

Document count for a 30-day range: `594537153`
```
GET traces-apm*,apm*/_search
{
  "size": 0,
  "track_total_hits": true,
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "processor.event": [
              "span"
            ]
          }
        }
      ],
      "must": [
        {
          "bool": {
            "filter": [
              {
                "exists": {
                  "field": "span.destination.service.resource"
                }
              },
              {
                "range": {
                  "@timestamp": {
                    "gte": 1712587750933,
                    "lte": 1715179750933,
                    "format": "epoch_millis"
                  }
                }
              },
              {
                "bool": {
                  "must_not": [
                    {
                      "terms": {
                        "agent.name": [
                          "js-base",
                          "rum-js",
                          "opentelemetry/webjs"
                        ]
                      }
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  }
}
```

A sample rate is calculated based on the doc count, e.g.:
`100000/594537153 = 0.000168198067178`

`0.000168198067178` is the sampling probability passed to the
`random_sampler` aggregation.
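As a quick sanity check of that arithmetic (a throwaway sketch, not project code; the target size of 100000 comes from the example above):

```typescript
// Reproduces the sample-rate calculation quoted above.
const targetSampleSize = 100_000; // target sample size from the example
const totalHits = 594_537_153;    // 30-day document count from this PR
const probability = targetSampleSize / totalHits;
// probability ≈ 0.000168198067178, the value passed to random_sampler
```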

```

GET traces-apm*,apm*/_search
{
  "track_total_hits": false,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "processor.event": [
              "span"
            ]
          }
        }
      ],
      "must": [
        {
          "bool": {
            "filter": [
              {
                "exists": {
                  "field": "span.destination.service.resource"
                }
              },
              {
                "range": {
                  "@timestamp": {
                    "gte": 1712587750933,
                    "lte": 1715179750933,
                    "format": "epoch_millis"
                  }
                }
              },
              {
                "bool": {
                  "must_not": [
                    {
                      "terms": {
                        "agent.name": [
                          "js-base",
                          "rum-js",
                          "opentelemetry/webjs"
                        ]
                      }
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "sampling": {
      "random_sampler": {
        "probability": 0.000168198067178,
        "seed": 815613888
      },
      "aggs": {
        "connections": {
          "composite": {
            "size": 10000,
            "sources": [
              {
                "dependencyName": {
                  "terms": {
                    "field": "span.destination.service.resource"
                  }
                }
              },
              {
                "eventOutcome": {
                  "terms": {
                    "field": "event.outcome"
                  }
                }
              }
            ]
          },
          "aggs": {
            "sample": {
              "top_metrics": {
                "size": 1,
                "metrics": [
                  {
                    "field": "span.type"
                  },
                  {
                    "field": "span.subtype"
                  },
                  {
                    "field": "span.id"
                  }
                ],
                "sort": [
                  {
                    "@timestamp": "asc"
                  }
                ]
              }
            }
          }
        }
      }
    }
  }
}
```

- It's hard to create an environment with such a data volume. We can use
the query above in
`https://keepserverless-qa-oblt-b4ba07.kb.eu-west-1.aws.qa.elastic.cloud/`,
change the date ranges, and validate whether the main query works.

### Alternatively

- Start Kibana pointing to an oblt cluster (non-serverless)
- Navigate to APM > Dependencies
- Try different time ranges

### For reviewers

This change affects
- APM > Dependencies
- APM > Dependencies > Overview (Upstream Services section)
- APM > Services > Overview (Dependencies tab)
- Assistant's `get_apm_downstream_dependencies` function

---------

Co-authored-by: Kibana Machine <[email protected]>
rohanxz pushed a commit to honeyn303/kibana that referenced this issue Jun 4, 2024
@crespocarlos (Contributor)

I've tested the fix in QA with a 30-day range.


@crespocarlos crespocarlos reopened this Jun 11, 2024

crespocarlos commented Jun 11, 2024

I'm reopening this because I'm seeing intermittent circuit-breaker errors. Perhaps the random sampler probability needs to be adjusted.

@crespocarlos (Contributor)

It seems like the errors are caused by a transform: https://elastic.slack.com/archives/C05UT5PP1EF/p1718023213609569


ablnk commented Oct 2, 2024

I'm reopening the issue because it is still reproducible, even within the boost window.

@ablnk ablnk reopened this Oct 2, 2024
@smith smith added the apm:performance APM UI - Performance Work label Oct 2, 2024

dgieselaar commented Oct 3, 2024

@crespocarlos I think we should just use ES|QL; it's way faster here. Hold me honest, I think they're equivalent:

**ES|QL request (2.5s)**
POST _query?
{
"query": """
  FROM metrics-apm*
  | STATS
      MAX(agent.name),
      MAX(span.type),
      MAX(span.subtype),
      failure_rate = COUNT(CASE(event.outcome == "failure", 1, NULL)) / COUNT(CASE(event.outcome == "success" OR event.outcome == "failure", 1, NULL))
      BY timestamp = BUCKET(@timestamp, 43200 seconds), service.name, span.destination.service.resource
  | STATS VALUES(timestamp) BY service.name, span.destination.service.resource
  | LIMIT 10000
  """,
"filter": {
  "bool": {
    "filter": [
      {
        "terms": {
          "processor.event": [
            "metric"
          ]
        }
      },
      {
        "bool": {
          "filter": [
            {
              "bool": {
                "filter": [
                  {
                    "term": {
                      "metricset.name": "service_destination"
                    }
                  }
                ],
                "must_not": [
                  {
                    "terms": {
                      "metricset.interval": [
                        "10m",
                        "60m"
                      ]
                    }
                  }
                ]
              }
            }
          ]
        }
      },
      {
        "bool": {
          "must_not": [
            {
              "terms": {
                "_tier": []
              }
            }
          ]
        }
      }
    ],
    "must": [
      {
        "bool": {
          "filter": [
            {
              "bool": {
                "filter": [
                  {
                    "term": {
                      "metricset.name": "service_destination"
                    }
                  }
                ],
                "must_not": [
                  {
                    "terms": {
                      "metricset.interval": [
                        "10m",
                        "60m"
                      ]
                    }
                  }
                ]
              }
            },
            {
              "range": {
                "@timestamp": {
                  "gte": "now-7d",
                  "lte": "now",
                  "format": "epoch_millis"
                }
              }
            },
            {
              "bool": {
                "must_not": [
                  {
                    "terms": {
                      "agent.name": [
                        "js-base",
                        "rum-js",
                        "opentelemetry/webjs",
                        "otlp/webjs"
                      ]
                    }
                  }
                ]
              }
            }
          ]
        }
      }
    ]
  }
}
}
**`_search` request (11s)**
POST metrics-apm*/_search?request_cache=false
{
  "track_total_hits": true,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "processor.event": [
              "metric"
            ]
          }
        },
        {
          "bool": {
            "filter": [
              {
                "bool": {
                  "filter": [
                    {
                      "term": {
                        "metricset.name": "service_destination"
                      }
                    }
                  ],
                  "must_not": [
                    {
                      "terms": {
                        "metricset.interval": [
                          "10m",
                          "60m"
                        ]
                      }
                    }
                  ]
                }
              }
            ]
          }
        },
        {
          "bool": {
            "must_not": [
              {
                "terms": {
                  "_tier": []
                }
              }
            ]
          }
        }
      ],
      "must": [
        {
          "bool": {
            "filter": [
              {
                "bool": {
                  "filter": [
                    {
                      "term": {
                        "metricset.name": "service_destination"
                      }
                    }
                  ],
                  "must_not": [
                    {
                      "terms": {
                        "metricset.interval": [
                          "10m",
                          "60m"
                        ]
                      }
                    }
                  ]
                }
              },
              {
                "range": {
                  "@timestamp": {
                    "gte": "now-7d",
                    "lte": "now",
                    "format": "epoch_millis"
                  }
                }
              },
              {
                "bool": {
                  "must_not": [
                    {
                      "terms": {
                        "agent.name": [
                          "js-base",
                          "rum-js",
                          "opentelemetry/webjs",
                          "otlp/webjs"
                        ]
                      }
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "connections": {
      "composite": {
        "size": 1500,
        "sources": [
          {
            "serviceName": {
              "terms": {
                "field": "service.name"
              }
            }
          },
          {
            "dependencyName": {
              "terms": {
                "field": "span.destination.service.resource"
              }
            }
          }
        ]
      },
      "aggs": {
        "sample": {
          "top_metrics": {
            "size": 1,
            "metrics": [
              {
                "field": "service.environment"
              },
              {
                "field": "agent.name"
              },
              {
                "field": "span.type"
              },
              {
                "field": "span.subtype"
              }
            ],
            "sort": {
              "@timestamp": "desc"
            }
          }
        },
        "total_latency_sum": {
          "sum": {
            "field": "span.destination.service.response_time.sum.us"
          }
        },
        "total_latency_count": {
          "sum": {
            "field": "span.destination.service.response_time.count"
          }
        },
        "timeseries": {
          "date_histogram": {
            "field": "@timestamp",
            "fixed_interval": "43200s",
            "extended_bounds": {
              "min": "now-7d",
              "max": "now"
            }
          },
          "aggs": {
            "latency_sum": {
              "sum": {
                "field": "span.destination.service.response_time.sum.us"
              }
            },
            "count": {
              "sum": {
                "field": "span.destination.service.response_time.count"
              }
            },
            "event.outcome": {
              "terms": {
                "field": "event.outcome"
              },
              "aggs": {
                "count": {
                  "sum": {
                    "field": "span.destination.service.response_time.count"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

@crespocarlos (Contributor)

> I'm reopening the issue because it is still reproducible, even within the boost window.

@ablnk which env did you use to reproduce the problem?


dgieselaar commented Oct 3, 2024

I forgot the statistics 🤦 I updated the query with the failure rate. I cannot do the latency stats because of a type mismatch, but I've added the failure-rate stats (the type mismatch should be fixed as soon as ES|QL supports union types).

Edit: it works with type casting:

```
FROM metrics-apm*
    | STATS
        agent.name = MAX(agent.name),
        span.type = MAX(span.type),
        span.subtype = MAX(span.subtype),
        avg_latency = SUM(span.destination.service.response_time.sum.us::long) / SUM(span.destination.service.response_time.count::long),
        failure_rate = COUNT(CASE(event.outcome == "failure", 1, NULL)) / COUNT(CASE(event.outcome == "success" OR event.outcome == "failure", 1, NULL))
        BY timestamp = BUCKET(@timestamp, 43200 seconds), service.name, span.destination.service.resource
    | STATS
        timeseries = VALUES(timestamp),
        span.subtype = MAX(span.subtype),
        span.type = MAX(span.type),
        agent.name = MAX(agent.name)
        BY service.name, span.destination.service.resource
    | LIMIT 10000
```
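For illustration, a route handler adopting this ES|QL approach might assemble the request body roughly as follows (a hedged sketch: `buildTopDependenciesEsql` is a hypothetical helper, not Kibana's actual code, and only the failure-rate variant of the query is assembled; sending it, e.g. via `POST _query`, is left to the caller's ES client):

```typescript
// Hypothetical builder for an ES|QL request like the one above.
interface EsqlRequest {
  query: string;
  filter: Record<string, unknown>;
}

function buildTopDependenciesEsql(
  start: string,
  end: string,
  bucketSeconds = 43200 // half-day buckets, as in the queries above
): EsqlRequest {
  const query = [
    'FROM metrics-apm*',
    '| STATS failure_rate = COUNT(CASE(event.outcome == "failure", 1, NULL))',
    '    / COUNT(CASE(event.outcome == "success" OR event.outcome == "failure", 1, NULL))',
    `    BY timestamp = BUCKET(@timestamp, ${bucketSeconds} seconds), service.name, span.destination.service.resource`,
    '| STATS timeseries = VALUES(timestamp) BY service.name, span.destination.service.resource',
    '| LIMIT 10000',
  ].join('\n');

  // The full bool filter from the request above would go here; only the
  // time range is shown to keep the sketch short.
  return {
    query,
    filter: { range: { '@timestamp': { gte: start, lte: end } } },
  };
}
```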


ablnk commented Oct 4, 2024

@crespocarlos keep-serverless-qa
