
[APM] Top dependencies request sometimes fails when searching outside of the boost window #178979

Open
ablnk opened this issue Mar 19, 2024 · 19 comments · Fixed by #182828
Labels
apm:dependencies-ui apm:performance APM UI - Performance Work apm:serverless bug Fixes for quality problems that affect the customer experience Team:obs-ux-infra_services Observability Infrastructure & Services User Experience Team



ablnk commented Mar 19, 2024

Version:
Serverless project v8.14.0

Description:
The `GET internal/apm/dependencies/top_dependencies` request fails with status code 502 and returns `backend closed connection` when searching for top dependencies outside of the boost window.

Preconditions:
I reproduced the issue with 102 dependencies and 761 services.

Steps to reproduce:

  1. Go to Applications - Dependencies.
  2. Filter data by last 30 days.

Expected behavior:
Dependencies available within the last 30 days are returned.

@ablnk ablnk added bug Fixes for quality problems that affect the customer experience Team:APM All issues that need APM UI Team support labels Mar 19, 2024
@elasticmachine (Contributor)

Pinging @elastic/apm-ui (Team:APM)

@smith smith added apm:dependencies-ui apm:serverless Team:obs-ux-infra_services Observability Infrastructure & Services User Experience Team labels Mar 29, 2024
@elasticmachine (Contributor)

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

@smith smith added needs-refinement A reason and acceptance criteria need to be defined for this issue and removed Team:APM All issues that need APM UI Team support labels Mar 29, 2024
@chrisdistasio

@smith can we have someone assigned to this to do some additional investigation to determine whether this is directly related to an issue with searching outside the boost window?

Is it possible to quantify the number of services and dependencies that appears to be the threshold for causing the issue?

Trying to get a better understanding of the severity as it relates to the boost window.

In QA the dependencies themselves come back pretty quickly, but the sparklines are slower to load (though they eventually do). I acknowledge the number of services and dependencies is far fewer than what was tested; again, trying to determine where the threshold is.

@crespocarlos crespocarlos self-assigned this May 2, 2024
@crespocarlos (Contributor)

I've managed to reproduce the same problem in QA with ~100 dependencies.


The changes planned as part of #178491 might solve this problem. I suspect that the histogram aggregation is slowing down the query.

@neptunian, the second option described in your comment could be a more robust solution to prevent this from happening. As part of #178491 it could be worth checking whether the problem described in this ticket is solved as well.

@dgieselaar (Member)

@crespocarlos with regard to the bucketing problem, it could be worth trying out ES|QL here; bucketing is much more relaxed there. Although it's probably easier to separate the date histogram buckets from the single search request.

However, if the bucket limit is the issue, you'd get an error describing it as such. It won't take down an Elasticsearch node, at least not in ES. This might be a different issue. Do we know if APM data is enabled for this cluster and where it goes/how we can identify it (e.g. by some label)?

@neptunian (Contributor)

> Do we know if APM data is enabled for this cluster and where it goes/how we can identify it (e.g. by some label)?

I think this is necessary for determining what's causing the error. I'm not sure fewer buckets will solve it, as querying large time ranges outside this "boost" window might still take too long, if that's the problem. I've asked in the Slack channel about having APM data for the cluster.

@dgieselaar (Member)

@neptunian found it; the issue is twofold:

  • the ES request to get sample spans times out after 30s
  • the request is then retried 3 times and eventually times out (it shouldn't; I have raised this with Core/Platform)

I have spoken to @crespocarlos about this. I would recommend doing a simple request to get the total number of hits and then, based on that, calculating a sample rate that returns statistically significant results, using the random_sampler agg if that sample rate is < 0.5. You will potentially lose the long tail of results, but the alternative is a request that times out.
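Dario's two-step suggestion could be sketched like this (a minimal TypeScript sketch, not the actual Kibana implementation; `decideSampling` and `targetSampleSize` are hypothetical names):

```typescript
// Hypothetical sketch of the approach described above: derive a
// random_sampler probability from the total hit count, and only
// sample when we'd keep fewer than half the documents.
interface SamplingDecision {
  probability: number; // value for random_sampler's `probability`
  useRandomSampler: boolean;
}

function decideSampling(
  totalHits: number,
  targetSampleSize = 100_000 // assumed knob, not a value from this issue
): SamplingDecision {
  const probability = Math.min(1, targetSampleSize / Math.max(totalHits, 1));
  // Per the comment above: only use the random_sampler agg when the
  // sample rate is < 0.5; otherwise query the full data set.
  return { probability, useRandomSampler: probability < 0.5 };
}
```

With the 594,537,153-hit example from the PR below, this yields a probability of roughly 0.000168; for small data sets the probability clamps to 1 and sampling is skipped.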


crespocarlos commented May 3, 2024

@dgieselaar how many hits could we consider as the cutoff for using the random_sampler agg? Nvm, I understand now what you meant.

@crespocarlos (Contributor)

There's an example of what Dario described above in `get_log_categories`. We can follow the same idea.


crespocarlos commented May 27, 2024

@chrisdistasio, following up on @paulb-elastic's comment: I was wondering if you have something in mind to help users understand potential data loss due to the random_sampler aggregation usage.

I just want to highlight that the changes in #182828 might affect the following (depending on the amount of data and date range):

  • APM > Dependencies
  • APM > Dependencies > Overview (Upstream Services section) - less likely
  • APM > Services > Overview (Dependencies tab) - less likely

@chrisdistasio

We have an analog for this someplace in services (IIRC). I'm trying to locate it in the UI. I would like to use consistent language if we can.

crespocarlos added a commit that referenced this issue Jun 4, 2024
Fixes [#178979](#178979)

## Summary

This PR changes the `get_exit_span_samples` query to use the
`random_sampler` aggregation, to limit the number of documents used in
the query and avoid the 502 errors seen in serverless.

**IMPORTANT** ❗
The change impacts other places and may lead to the loss of the long
tail of results.


### UI

The dependencies page will show a badge to inform users when the data is
being sampled.

**default**


https://github.com/elastic/kibana/assets/2767137/ea13031d-8ba1-48bb-a2e4-992eabfa90dd

**sampled data**


https://github.com/elastic/kibana/assets/2767137/6811c293-c2a1-42fd-bd38-b91e084e8d21


### How to test

The following can be tested on
`https://keepserverless-qa-oblt-b4ba07.kb.eu-west-1.aws.qa.elastic.cloud`

Document count for a 30-day range: `594537153`
```
GET traces-apm*,apm*/_search
{
  "size": 0,
  "track_total_hits": true,
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "processor.event": [
              "span"
            ]
          }
        }
      ],
      "must": [
        {
          "bool": {
            "filter": [
              {
                "exists": {
                  "field": "span.destination.service.resource"
                }
              },
              {
                "range": {
                  "@timestamp": {
                    "gte": 1712587750933,
                    "lte": 1715179750933,
                    "format": "epoch_millis"
                  }
                }
              },
              {
                "bool": {
                  "must_not": [
                    {
                      "terms": {
                        "agent.name": [
                          "js-base",
                          "rum-js",
                          "opentelemetry/webjs"
                        ]
                      }
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  }
}
```

A sample rate is calculated based on the doc count, e.g.:
`100000/594537153 = 0.000168198067178`

`0.000168198067178` is the sampling probability passed to the
`random_sampler` aggregation.
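As a quick sanity check of that arithmetic (a throwaway sketch, not project code; the target size of 100000 comes from the example above):

```typescript
// Reproduces the sample-rate calculation quoted above.
const targetSampleSize = 100_000; // target sample size from the example
const totalHits = 594_537_153;    // 30-day document count from this PR
const probability = targetSampleSize / totalHits;
// probability ≈ 0.000168198067178, the value passed to random_sampler
```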

```

GET traces-apm*,apm*/_search
{
  "track_total_hits": false,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "processor.event": [
              "span"
            ]
          }
        }
      ],
      "must": [
        {
          "bool": {
            "filter": [
              {
                "exists": {
                  "field": "span.destination.service.resource"
                }
              },
              {
                "range": {
                  "@timestamp": {
                    "gte": 1712587750933,
                    "lte": 1715179750933,
                    "format": "epoch_millis"
                  }
                }
              },
              {
                "bool": {
                  "must_not": [
                    {
                      "terms": {
                        "agent.name": [
                          "js-base",
                          "rum-js",
                          "opentelemetry/webjs"
                        ]
                      }
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "sampling": {
      "random_sampler": {
        "probability": 0.000168198067178,
        "seed": 815613888
      },
      "aggs": {
        "connections": {
          "composite": {
            "size": 10000,
            "sources": [
              {
                "dependencyName": {
                  "terms": {
                    "field": "span.destination.service.resource"
                  }
                }
              },
              {
                "eventOutcome": {
                  "terms": {
                    "field": "event.outcome"
                  }
                }
              }
            ]
          },
          "aggs": {
            "sample": {
              "top_metrics": {
                "size": 1,
                "metrics": [
                  {
                    "field": "span.type"
                  },
                  {
                    "field": "span.subtype"
                  },
                  {
                    "field": "span.id"
                  }
                ],
                "sort": [
                  {
                    "@timestamp": "asc"
                  }
                ]
              }
            }
          }
        }
      }
    }
  }
}
```

- It's hard to create an environment with such a data volume. We can use
the query above in
`https://keepserverless-qa-oblt-b4ba07.kb.eu-west-1.aws.qa.elastic.cloud/`,
change the date ranges, and validate whether the main query works.

### Alternatively

- Start Kibana pointing to an oblt cluster (non-serverless)
- Navigate to APM > Dependencies
- Try different time ranges

### For reviewers

This change affects
- APM > Dependencies
- APM > Dependencies > Overview (Upstream Services section)
- APM > Services > Overview (Dependencies tab)
- Assistant's `get_apm_downstream_dependencies` function

---------

Co-authored-by: Kibana Machine <[email protected]>
rohanxz pushed a commit to honeyn303/kibana that referenced this issue Jun 4, 2024
@crespocarlos (Contributor)

I've tested the fix in QA with a 30-day range.


@crespocarlos crespocarlos reopened this Jun 11, 2024

crespocarlos commented Jun 11, 2024

I'm reopening this because I'm seeing intermittent circuit-breaker errors. Perhaps the random sampler probability needs to be adjusted.

@crespocarlos (Contributor)

It seems like the errors are caused by a transform: https://elastic.slack.com/archives/C05UT5PP1EF/p1718023213609569


ablnk commented Oct 2, 2024

I'm reopening the issue because it is still reproducible, even within the boost window.

@ablnk ablnk reopened this Oct 2, 2024
@smith smith added the apm:performance APM UI - Performance Work label Oct 2, 2024

dgieselaar commented Oct 3, 2024

@crespocarlos I think we should just use ES|QL; it's way faster here. Hold me honest, I think they're equivalent:

**ES|QL request (2.5s)**
POST _query?
{
"query": """
  FROM metrics-apm*
  | STATS
      MAX(agent.name),
      MAX(span.type),
      MAX(span.subtype),
      failure_rate = COUNT(CASE(event.outcome == "failure", 1, NULL)) / COUNT(CASE(event.outcome == "success" OR event.outcome == "failure", 1, NULL))
      BY timestamp = BUCKET(@timestamp, 43200 seconds), service.name, span.destination.service.resource
  | STATS VALUES(timestamp) BY service.name, span.destination.service.resource
  | LIMIT 10000
  """,
"filter": {
  "bool": {
    "filter": [
      {
        "terms": {
          "processor.event": [
            "metric"
          ]
        }
      },
      {
        "bool": {
          "filter": [
            {
              "bool": {
                "filter": [
                  {
                    "term": {
                      "metricset.name": "service_destination"
                    }
                  }
                ],
                "must_not": [
                  {
                    "terms": {
                      "metricset.interval": [
                        "10m",
                        "60m"
                      ]
                    }
                  }
                ]
              }
            }
          ]
        }
      },
      {
        "bool": {
          "must_not": [
            {
              "terms": {
                "_tier": []
              }
            }
          ]
        }
      }
    ],
    "must": [
      {
        "bool": {
          "filter": [
            {
              "bool": {
                "filter": [
                  {
                    "term": {
                      "metricset.name": "service_destination"
                    }
                  }
                ],
                "must_not": [
                  {
                    "terms": {
                      "metricset.interval": [
                        "10m",
                        "60m"
                      ]
                    }
                  }
                ]
              }
            },
            {
              "range": {
                "@timestamp": {
                  "gte": "now-7d",
                  "lte": "now",
                  "format": "epoch_millis"
                }
              }
            },
            {
              "bool": {
                "must_not": [
                  {
                    "terms": {
                      "agent.name": [
                        "js-base",
                        "rum-js",
                        "opentelemetry/webjs",
                        "otlp/webjs"
                      ]
                    }
                  }
                ]
              }
            }
          ]
        }
      }
    ]
  }
}
}
**`_search` request (11s)**
POST metrics-apm*/_search?request_cache=false
{
  "track_total_hits": true,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "processor.event": [
              "metric"
            ]
          }
        },
        {
          "bool": {
            "filter": [
              {
                "bool": {
                  "filter": [
                    {
                      "term": {
                        "metricset.name": "service_destination"
                      }
                    }
                  ],
                  "must_not": [
                    {
                      "terms": {
                        "metricset.interval": [
                          "10m",
                          "60m"
                        ]
                      }
                    }
                  ]
                }
              }
            ]
          }
        },
        {
          "bool": {
            "must_not": [
              {
                "terms": {
                  "_tier": []
                }
              }
            ]
          }
        }
      ],
      "must": [
        {
          "bool": {
            "filter": [
              {
                "bool": {
                  "filter": [
                    {
                      "term": {
                        "metricset.name": "service_destination"
                      }
                    }
                  ],
                  "must_not": [
                    {
                      "terms": {
                        "metricset.interval": [
                          "10m",
                          "60m"
                        ]
                      }
                    }
                  ]
                }
              },
              {
                "range": {
                  "@timestamp": {
                    "gte": "now-7d",
                    "lte": "now",
                    "format": "epoch_millis"
                  }
                }
              },
              {
                "bool": {
                  "must_not": [
                    {
                      "terms": {
                        "agent.name": [
                          "js-base",
                          "rum-js",
                          "opentelemetry/webjs",
                          "otlp/webjs"
                        ]
                      }
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "connections": {
      "composite": {
        "size": 1500,
        "sources": [
          {
            "serviceName": {
              "terms": {
                "field": "service.name"
              }
            }
          },
          {
            "dependencyName": {
              "terms": {
                "field": "span.destination.service.resource"
              }
            }
          }
        ]
      },
      "aggs": {
        "sample": {
          "top_metrics": {
            "size": 1,
            "metrics": [
              {
                "field": "service.environment"
              },
              {
                "field": "agent.name"
              },
              {
                "field": "span.type"
              },
              {
                "field": "span.subtype"
              }
            ],
            "sort": {
              "@timestamp": "desc"
            }
          }
        },
        "total_latency_sum": {
          "sum": {
            "field": "span.destination.service.response_time.sum.us"
          }
        },
        "total_latency_count": {
          "sum": {
            "field": "span.destination.service.response_time.count"
          }
        },
        "timeseries": {
          "date_histogram": {
            "field": "@timestamp",
            "fixed_interval": "43200s",
            "extended_bounds": {
              "min": "now-7d",
              "max": "now"
            }
          },
          "aggs": {
            "latency_sum": {
              "sum": {
                "field": "span.destination.service.response_time.sum.us"
              }
            },
            "count": {
              "sum": {
                "field": "span.destination.service.response_time.count"
              }
            },
            "event.outcome": {
              "terms": {
                "field": "event.outcome"
              },
              "aggs": {
                "count": {
                  "sum": {
                    "field": "span.destination.service.response_time.count"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

@crespocarlos (Contributor)

> I'm reopening the issue because it is still reproducible, even within the boost window.

@ablnk which env did you use to reproduce the problem?


dgieselaar commented Oct 3, 2024

I forgot the statistics 🤦 I updated the query with the failure rate. I cannot do the latency stats because of a type mismatch, but I've added the failure-rate stats (the type mismatch should be fixed as soon as ES|QL supports union types).

Edit: it works with type casting:

```
FROM metrics-apm*
    | STATS
        agent.name = MAX(agent.name),
        span.type = MAX(span.type),
        span.subtype = MAX(span.subtype),
        avg_latency = SUM(span.destination.service.response_time.sum.us::long) / SUM(span.destination.service.response_time.count::long),
        failure_rate = COUNT(CASE(event.outcome == "failure", 1, NULL)) / COUNT(CASE(event.outcome == "success" OR event.outcome == "failure", 1, NULL))
        BY timestamp = BUCKET(@timestamp, 43200 seconds), service.name, span.destination.service.resource
    | STATS
        timeseries = VALUES(timestamp),
        span.subtype = MAX(span.subtype),
        span.type = MAX(span.type),
        agent.name = MAX(agent.name)
        BY service.name, span.destination.service.resource
    | LIMIT 10000
```
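For illustration, a route handler adopting this ES|QL approach might assemble the request body roughly as follows (a hedged sketch: `buildTopDependenciesEsql` is a hypothetical helper, not Kibana's actual code, and only the failure-rate variant of the query is assembled; sending it, e.g. via `POST _query`, is left to the caller's ES client):

```typescript
// Hypothetical builder for an ES|QL request like the one above.
interface EsqlRequest {
  query: string;
  filter: Record<string, unknown>;
}

function buildTopDependenciesEsql(
  start: string,
  end: string,
  bucketSeconds = 43200 // half-day buckets, as in the queries above
): EsqlRequest {
  const query = [
    'FROM metrics-apm*',
    '| STATS failure_rate = COUNT(CASE(event.outcome == "failure", 1, NULL))',
    '    / COUNT(CASE(event.outcome == "success" OR event.outcome == "failure", 1, NULL))',
    `    BY timestamp = BUCKET(@timestamp, ${bucketSeconds} seconds), service.name, span.destination.service.resource`,
    '| STATS timeseries = VALUES(timestamp) BY service.name, span.destination.service.resource',
    '| LIMIT 10000',
  ].join('\n');

  // The full bool filter from the request above would go here; only the
  // time range is shown to keep the sketch short.
  return {
    query,
    filter: { range: { '@timestamp': { gte: start, lte: end } } },
  };
}
```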


ablnk commented Oct 4, 2024

@crespocarlos keep-serverless-qa
