[APM] Top dependencies request sometimes fails when searching outside of the boost window #178979
Comments
Pinging @elastic/apm-ui (Team:APM) |
Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services) |
@smith can we have someone assigned to this to do some additional investigation to determine whether this is directly related to some issue searching outside the boost window? Is it possible to quantify the number of services and dependencies that appears to be the threshold for causing the issue? Trying to get a better understanding of the severity as it relates to the boost window. In QA the dependencies themselves come back pretty quickly; the sparklines are slower to load, but eventually do. I acknowledge the number of services and dependencies is far fewer than what was tested. Again, trying to determine where the threshold is. |
I've managed to reproduce the same problem in QA with ~100 dependencies. It could be that the changes that will be done as part of #178491 solve this problem. I suspect that the histogram aggregation is slowing down the query. @neptunian, the second option described in your comment could be a more robust solution to prevent this from happening. As part of #178491 it could be worth checking whether the problem described in this ticket is solved as well. |
@crespocarlos with regards to the bucketing problem, could be worth trying out ES|QL here - bucketing is much more relaxed there. Although it's probably easier to separate the date histo buckets from the single search request. However, if the bucket limit is the issue, you'd get an error describing it as such. It won't take down an Elasticsearch node, at least not in ES. This might be a different issue. Do we know if APM data is enabled for this cluster and where it goes/how we can identify it (e.g. by some label)? |
I think this is necessary in determining what's causing the error. I'm not sure fewer buckets will solve it, as querying large time ranges outside this "boost" window might still take too long, if that's the problem. I've asked in the Slack channel about having APM data for the cluster. |
@neptunian found it, the issue is twofold:
I have spoken to @crespocarlos about this, I would recommend to do a simple request to get the total amount of hits, and then based on that, calculate a sample rate that returns statistically significant results, and use the random_sampler agg if that sample rate is < 0.5. You will potentially lose the long tail of results but the alternative is a request that times out. |
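The recommendation above can be sketched roughly as follows. This is a minimal illustration, not the code that was merged: the function name, the return shape, and the default target of 100,000 documents are assumptions (the target is borrowed from the `100000/594537153` example in the PR description further down). The 0.5 cut-off follows the comment above; to my knowledge `random_sampler` only accepts a probability greater than 0 and less than 0.5, or exactly 1.

```typescript
// Hypothetical helper: decides whether to sample, based on the total hit count
// returned by a cheap `track_total_hits` request.
interface SamplingDecision {
  sampled: boolean; // true when random_sampler should actually drop documents
  probability: number; // value to pass as `random_sampler.probability`
}

function getSamplingDecision(
  totalHits: number,
  targetSampleSize: number = 100000 // assumed target, taken from the PR example
): SamplingDecision {
  if (totalHits <= 0) {
    // Nothing to sample; probability 1 means "use every document".
    return { sampled: false, probability: 1 };
  }

  const rate = targetSampleSize / totalHits;

  // Only sample when the rate is below 0.5, as suggested in the comment above.
  if (rate < 0.5) {
    return { sampled: true, probability: rate };
  }

  return { sampled: false, probability: 1 };
}
```

With the document count quoted in the PR description, `getSamplingDecision(594537153)` yields a probability of roughly 0.000168, while a small index of, say, 150,000 documents is left unsampled.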
@dgieselaar |
An example of what Dario described above in get_log_categories. We can follow this same idea. |
@chrisdistasio, following up on @paulb-elastic comment. I was wondering if you have something in mind to help users understand eventual data loss due to the random-sampler aggregation usage. I just want to highlight that the changes in #182828 might affect (depending on the amount of data + date range)
|
we have an analog for this someplace in services (IIRC). I'm trying to locate it in the UI. I would like to use consistent language if we can. |
Fixes [#178979](#178979)

## Summary

This PR changes the `get_exit_span_samples` query to use the `random_sampler` aggregation, to limit the number of documents used in the query and avoid the 502 errors seen in serverless.

**IMPORTANT** ❗ The change impacts other places and may lead to a potential loss in the long tail of results.

### UI

The dependencies page will show a badge to inform users when the data is being sampled.

**default**

https://github.com/elastic/kibana/assets/2767137/ea13031d-8ba1-48bb-a2e4-992eabfa90dd

**sampled data**

https://github.com/elastic/kibana/assets/2767137/6811c293-c2a1-42fd-bd38-b91e084e8d21

### How to test

The following can be tested on `https://keepserverless-qa-oblt-b4ba07.kb.eu-west-1.aws.qa.elastic.cloud`

Document count for a 30-day range: `594537153`

```
GET traces-apm*,apm*/_search
{
  "size": 0,
  "track_total_hits": true,
  "query": {
    "bool": {
      "filter": [
        { "terms": { "processor.event": ["span"] } }
      ],
      "must": [
        {
          "bool": {
            "filter": [
              { "exists": { "field": "span.destination.service.resource" } },
              {
                "range": {
                  "@timestamp": {
                    "gte": 1712587750933,
                    "lte": 1715179750933,
                    "format": "epoch_millis"
                  }
                }
              },
              {
                "bool": {
                  "must_not": [
                    { "terms": { "agent.name": ["js-base", "rum-js", "opentelemetry/webjs"] } }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  }
}
```

A sample rate is calculated based on the document count, e.g. `100000/594537153 = 0.000168198067178`. `0.000168198067178` is the sampling probability passed to the `random_sampler` aggregation.

```
GET traces-apm*,apm*/_search
{
  "track_total_hits": false,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "terms": { "processor.event": ["span"] } }
      ],
      "must": [
        {
          "bool": {
            "filter": [
              { "exists": { "field": "span.destination.service.resource" } },
              {
                "range": {
                  "@timestamp": {
                    "gte": 1712587750933,
                    "lte": 1715179750933,
                    "format": "epoch_millis"
                  }
                }
              },
              {
                "bool": {
                  "must_not": [
                    { "terms": { "agent.name": ["js-base", "rum-js", "opentelemetry/webjs"] } }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "sampling": {
      "random_sampler": { "probability": 0.000168198067178, "seed": 815613888 },
      "aggs": {
        "connections": {
          "composite": {
            "size": 10000,
            "sources": [
              { "dependencyName": { "terms": { "field": "span.destination.service.resource" } } },
              { "eventOutcome": { "terms": { "field": "event.outcome" } } }
            ]
          },
          "aggs": {
            "sample": {
              "top_metrics": {
                "size": 1,
                "metrics": [
                  { "field": "span.type" },
                  { "field": "span.subtype" },
                  { "field": "span.id" }
                ],
                "sort": [
                  { "@timestamp": "asc" }
                ]
              }
            }
          }
        }
      }
    }
  }
}
```

- It's hard to create an environment with such a data volume. We can use the query above in `https://keepserverless-qa-oblt-b4ba07.kb.eu-west-1.aws.qa.elastic.cloud/`, change the date ranges, and validate whether the main query works.

### Alternatively

- Start Kibana pointing to an oblt cluster (non-serverless)
- Navigate to APM > Dependencies
- Try different time ranges

### For reviewers

This change affects

- APM > Dependencies
- APM > Dependencies > Overview (Upstream Services section)
- APM > Services > Overview (Dependencies tab)
- Assistant's `get_apm_downstream_dependencies` function

---------

Co-authored-by: Kibana Machine <[email protected]>
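For readers who want to see how the second request differs from an unsampled one, here is a small sketch of wrapping an existing `aggs` object in a `random_sampler` aggregation. The function name `withRandomSampler` and the `sampling` wrapper key are illustrative assumptions (the key mirrors the request above); this is not necessarily how the PR implements it. It always wraps, using probability 1 when no sampling is wanted, so that the response has the same shape in both cases. That relies on `random_sampler` accepting a probability of exactly 1, which is my understanding of the Elasticsearch documentation.

```typescript
type Aggs = Record<string, unknown>;

// Hypothetical helper: nests `aggs` under a `sampling` random_sampler aggregation.
function withRandomSampler(aggs: Aggs, probability: number, seed: number): Aggs {
  // As far as I know, random_sampler accepts 0 < probability < 0.5, or exactly 1.
  const valid = probability === 1 || (probability > 0 && probability < 0.5);
  if (!valid) {
    throw new RangeError(
      `probability must be > 0 and < 0.5, or exactly 1 (got ${probability})`
    );
  }

  return {
    sampling: {
      // A fixed seed keeps the sampled subset stable across repeated requests.
      random_sampler: { probability, seed },
      aggs,
    },
  };
}

// Usage: wrap a composite aggregation like the one in the request above.
const sampledAggs = withRandomSampler(
  {
    connections: {
      composite: {
        size: 10000,
        sources: [
          { dependencyName: { terms: { field: "span.destination.service.resource" } } },
          { eventOutcome: { terms: { field: "event.outcome" } } },
        ],
      },
    },
  },
  0.000168198067178,
  815613888
);
```

The results are then read from `aggregations.sampling.connections` rather than `aggregations.connections`.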
I'm reopening this because I'm seeing intermittent circuit breaker errors. Perhaps the random sampler probability needs to be adjusted. |
It seems like the errors are caused by a transform https://elastic.slack.com/archives/C05UT5PP1EF/p1718023213609569 |
I'm reopening the issue because it is still reproducible, even within the boost window. |
@crespocarlos I think we should just use ES|QL, it's way faster here. Keep me honest, I think they're equivalent:

ES|QL request (2.5s)

POST _query?
{
"query": """
FROM metrics-apm*
| STATS MAX(agent.name), MAX(span.type), MAX(span.subtype),
    failure_rate = COUNT(CASE(event.outcome == "failure", 1, NULL)) / COUNT(CASE(event.outcome == "success" OR event.outcome == "failure", 1, NULL))
  BY timestamp = BUCKET(@timestamp, 43200 seconds), service.name, span.destination.service.resource
| STATS VALUES(timestamp) BY service.name, span.destination.service.resource
| LIMIT 10000
""",
"filter": {
"bool": {
"filter": [
{
"terms": {
"processor.event": [
"metric"
]
}
},
{
"bool": {
"filter": [
{
"bool": {
"filter": [
{
"term": {
"metricset.name": "service_destination"
}
}
],
"must_not": [
{
"terms": {
"metricset.interval": [
"10m",
"60m"
]
}
}
]
}
}
]
}
},
{
"bool": {
"must_not": [
{
"terms": {
"_tier": []
}
}
]
}
}
],
"must": [
{
"bool": {
"filter": [
{
"bool": {
"filter": [
{
"term": {
"metricset.name": "service_destination"
}
}
],
"must_not": [
{
"terms": {
"metricset.interval": [
"10m",
"60m"
]
}
}
]
}
},
{
"range": {
"@timestamp": {
"gte": "now-7d",
"lte": "now",
"format": "epoch_millis"
}
}
},
{
"bool": {
"must_not": [
{
"terms": {
"agent.name": [
"js-base",
"rum-js",
"opentelemetry/webjs",
"otlp/webjs"
]
}
}
]
}
}
]
}
}
]
}
}
}

_search request (11s)

POST metrics-apm*/_search?request_cache=false
{
"track_total_hits": true,
"size": 0,
"query": {
"bool": {
"filter": [
{
"terms": {
"processor.event": [
"metric"
]
}
},
{
"bool": {
"filter": [
{
"bool": {
"filter": [
{
"term": {
"metricset.name": "service_destination"
}
}
],
"must_not": [
{
"terms": {
"metricset.interval": [
"10m",
"60m"
]
}
}
]
}
}
]
}
},
{
"bool": {
"must_not": [
{
"terms": {
"_tier": []
}
}
]
}
}
],
"must": [
{
"bool": {
"filter": [
{
"bool": {
"filter": [
{
"term": {
"metricset.name": "service_destination"
}
}
],
"must_not": [
{
"terms": {
"metricset.interval": [
"10m",
"60m"
]
}
}
]
}
},
{
"range": {
"@timestamp": {
"gte": "now-7d",
"lte": "now",
"format": "epoch_millis"
}
}
},
{
"bool": {
"must_not": [
{
"terms": {
"agent.name": [
"js-base",
"rum-js",
"opentelemetry/webjs",
"otlp/webjs"
]
}
}
]
}
}
]
}
}
]
}
},
"aggs": {
"connections": {
"composite": {
"size": 1500,
"sources": [
{
"serviceName": {
"terms": {
"field": "service.name"
}
}
},
{
"dependencyName": {
"terms": {
"field": "span.destination.service.resource"
}
}
}
]
},
"aggs": {
"sample": {
"top_metrics": {
"size": 1,
"metrics": [
{
"field": "service.environment"
},
{
"field": "agent.name"
},
{
"field": "span.type"
},
{
"field": "span.subtype"
}
],
"sort": {
"@timestamp": "desc"
}
}
},
"total_latency_sum": {
"sum": {
"field": "span.destination.service.response_time.sum.us"
}
},
"total_latency_count": {
"sum": {
"field": "span.destination.service.response_time.count"
}
},
"timeseries": {
"date_histogram": {
"field": "@timestamp",
"fixed_interval": "43200s",
"extended_bounds": {
"min": "now-7d",
"max": "now"
}
},
"aggs": {
"latency_sum": {
"sum": {
"field": "span.destination.service.response_time.sum.us"
}
},
"count": {
"sum": {
"field": "span.destination.service.response_time.count"
}
},
"event.outcome": {
"terms": {
"field": "event.outcome"
},
"aggs": {
"count": {
"sum": {
"field": "span.destination.service.response_time.count"
}
}
}
}
}
}
}
}
}
} |
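To make the comparison between the two requests easier to follow, here is a hedged sketch of the statistic both are aiming at: the failure rate is the number of failed calls divided by the number of calls with a known outcome (success or failure), which is what the `failure_rate` expression in the ES|QL query expresses. The type and function names are hypothetical. One thing that may be worth double-checking in the ES|QL version: as far as I understand the ES|QL documentation, dividing two integer values yields an integer, so the division of two `COUNT` results may need a cast to a floating-point type.

```typescript
// Hypothetical shape: per-bucket outcome counts, e.g. derived from the
// `event.outcome` terms buckets in the _search request above.
interface OutcomeCounts {
  success: number;
  failure: number;
}

// Returns failure / (success + failure), or null when no outcome is known.
function failureRate(counts: OutcomeCounts): number | null {
  const known = counts.success + counts.failure;
  if (known === 0) {
    return null; // avoid dividing by zero for empty buckets
  }
  return counts.failure / known;
}
```

For example, a bucket with 90 successful and 10 failed calls gives a failure rate of 0.1, and an empty bucket gives `null` rather than `NaN`.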
@ablnk which env did you use to reproduce the problem? |
I forgot the statistics 🤦 I updated the query with the failure rate, I cannot do the latency stats because of a type mismatch but I've added the failure rate stats (the type mismatch should be fixed as soon as ES|QL supports union types) edit: works by type casting:
|
Version:
Serverless project v 8.14.0
Description:
The `GET internal/apm/dependencies/top_dependencies` request fails with status code 502 and returns `backend closed connection` when searching for top dependencies outside of the boost window.
Preconditions:
I reproduced the issue having 102 dependencies and 761 services.
Steps to reproduce:
Expected behavior:
Dependencies available within 30 days returned.