
Helm chart for Kubernetes metrics quickstart #562

Closed · 7 tasks done
jaronoff97 opened this issue Dec 12, 2022 · 31 comments

Comments

@jaronoff97 (Contributor) commented Dec 12, 2022

Many Prometheus and Kubernetes users are familiar with the kube-prometheus-stack chart, which aims to quickly set up and manage a Prometheus and Grafana installation that collects most of the available Kubernetes metrics. It achieves this using the Prometheus Operator and the ServiceMonitor and PodMonitor custom resources, which configure a user's Prometheus scrape config. We have the ability to do the same using the OpenTelemetry Operator and the Target Allocator. To provide an easy and familiar migration path for existing (or new) Prometheus and Kubernetes users, I created the kube-otel-stack chart, which installs a pre-configured collector and target allocator that dynamically discover ServiceMonitor and PodMonitor custom resources to scrape various Kubernetes metrics. You can see below some of the metrics this collector is scraping.
[Screenshot: a sample of the Kubernetes metrics scraped by the collector]
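For context, here is a minimal sketch of the kind of OpenTelemetryCollector resource the chart renders (field names follow the operator's v1alpha1 CRD; the resource name, exporter, and backend endpoint below are illustrative placeholders, not the chart's actual output):

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: kube-otel-stack-metrics            # illustrative name
spec:
  mode: statefulset                         # the target allocator requires statefulset mode
  targetAllocator:
    enabled: true
    prometheusCR:
      enabled: true                         # discover ServiceMonitor/PodMonitor objects and distribute their targets
  config: |
    receivers:
      prometheus:
        config:
          scrape_configs: []                # populated dynamically via the target allocator
    exporters:
      otlp:
        endpoint: ingest.example.com:4317   # placeholder backend
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [otlp]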

This has since become a requested feature across the OTel Slack from what I can tell, as I've DM'ed this chart to at least 3 different people at this point. I was wondering whether it would be welcome for me to clean up this slightly opinionated Helm chart, make it more generic, and donate it to the repository.

Other options considered

  • Add a new preset to the existing collector chart
    • I decided not to do this for two reasons:
      • My chart utilizes / requires the target allocator's CRD discovery functionality, which in turn requires the Operator to run. The CRD functionality of this chart is also one of its biggest benefits as it allows users coming from an existing Prometheus installation to easily migrate.
      • Even without using the CRD functionality, the scrape configs required are very long and may not work for all users without some tweaking
  • Finish "Add ability to create target allocator from helm chart" #336 and then add the functionality as a preset

TODO

[Task checklist: 7 tasks, all completed]

jaronoff97 changed the title from "Helm chart for kubernetes metrics quickstart" to "Helm chart for Kubernetes metrics quickstart" on Dec 12, 2022
@austinlparker (Member)

I've seen O(tens) of requests for this on the OpenTelemetry slack channels. Having it in the community would be great, as we could promote its adoption more widely.

@TylerHelmuth (Member) commented Dec 13, 2022

I am certainly interested in this if users are. A couple of questions:

  1. @jaronoff97 if this chart was accepted, would you be available as a CodeOwner for the chart?
  2. Is there anything specific to Lightstep that would need to be stripped out, or can the entire chart be taken verbatim?
  3. Is the chart testable via chart-testing?
  4. What has the upkeep of the chart been like? Is it relatively stable (except for operator bumps)?

@jaronoff97 (Contributor, Author)

Thanks for your questions :)

  1. Yes, happy to be a codeowner for it.
  2. Yes, I would generalize anything that is Lightstep-specific in the PR I would make to the repo.
  3. I'm not sure how chart-testing works (I've never used it before). I think as long as we could install the operator as part of the testing flow, it should be fine?
  4. Relatively stable, occasionally there's a small change here and there. I'd imagine we'd get some more requests as more people use this, but it shouldn't be changing too drastically
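On point 3, chart-testing is driven by a small config file, so something roughly like the sketch below might work; the keys follow the ct documentation, but the repository names and paths here are only illustrative, and the operator (plus cert-manager) would still need to be installed in the CI cluster before ct install runs:

# ct.yaml -- rough sketch only; adjust to the repo's actual layout
target-branch: main
chart-dirs:
  - charts
chart-repos:
  - jetstack=https://charts.jetstack.io                                        # cert-manager dependency
  - open-telemetry=https://open-telemetry.github.io/opentelemetry-helm-charts
helm-extra-args: "--timeout 600s"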

@povilasv (Contributor)

I really like this idea, but I have a question - is there a plan to move away from kube-state-metrics, node-exporter, etc. in favour of OTel Collector native receivers (k8sclusterreceiver and hostmetrics)?

I think in general we should strive to collect all the Prometheus metrics from k8s components, but without using any Prometheus ecosystem components, relying instead on the Collector's native features :)
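For illustration, a minimal sketch of what the Collector-native equivalent could look like (receiver and scraper names follow the collector-contrib components; the exporter endpoint is a placeholder, and in practice hostmetrics would run in a daemonset collector while k8s_cluster runs as a single replica):

receivers:
  k8s_cluster:                  # cluster-level metrics, roughly replacing kube-state-metrics
    collection_interval: 30s
  hostmetrics:                  # node-level metrics, roughly replacing node-exporter
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      filesystem:
      network:
exporters:
  otlp:
    endpoint: ingest.example.com:4317   # placeholder backend
service:
  pipelines:
    metrics:
      receivers: [k8s_cluster, hostmetrics]
      exporters: [otlp]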

@TylerHelmuth (Member)

@jaronoff97 I'm also curious if your chart handles the installation of the operator and the OpenTelemetryCollector object as discussed here: #69

@jcdauchy-moodys commented Jan 16, 2023

I have been using this chart for 3 weeks; it works out of the box, but it will need to be improved (of course). It brings almost the same functionality as the Prometheus Operator with the kube-prometheus-stack chart. It is much more lightweight, as you only deploy "agents" to scrape your logs/metrics/traces. I am using it to send metrics to AWS AMP (managed Prometheus).

Here are the main issues I have encountered so far:

  • if I use the "statefulset" deployment mode and one of the availability zones goes down (1 of 3), I lose the scraping of 1/3 of the targets :(
  • I did not know which version of the Prometheus CRDs to install; could it be documented which version is supported by the targetAllocator?

Thanks for the good work.

@jaronoff97 (Contributor, Author)

updates/context setting: @TylerHelmuth I still want to donate this if that's still okay. I've validated with a few other people that this would be a great thing for the community to have. The only blocker for this work is figuring out whether we can install the operator in the same chart, which would make for a better experience. My team is going to be investigating this.

@TylerHelmuth (Member)

@jaronoff97 sounds good. @open-telemetry/helm-approvers please add your thoughts.

@Allex1 (Contributor) commented Mar 13, 2023

I approve. Thanks @jaronoff97

@dmitryax (Member) commented Mar 14, 2023

I don't think I agree that we need another chart for this. I'd rather go with adding the TA option to the collector chart.

Also, why do we promote using Prometheus for scraping kubernetes/kubelet metrics instead of using specialized collector receivers that collect metrics compliant with OTel semantic conventions without additional transformations?

@Allex1 (Contributor) commented Mar 14, 2023

I think this would provide a bridge for existing kube-prometheus-stack (kps) users who otherwise would not care to switch (afaik Prometheus is still used in ~99.x% of Kubernetes deployments for cluster monitoring).
Reusing the existing Prometheus-Operator objects would smooth out that migration.

@TylerHelmuth (Member) commented Mar 14, 2023

I also see value in a "transition" chart. Long term (like long, long term), I think the need for a chart like this diminishes, but for users today who have extensive Prometheus setups and want to try out OTel or start transitioning to OTel, I think this chart fits their needs.

@dmitryax (Member)

Ok, I'm not blocking it. If most @open-telemetry/helm-approvers think it's a good addition, let's add it.

@dmitryax (Member)

The name should somehow reflect the Prometheus bridge/transition; kube-otel-stack doesn't seem right to me.

@TylerHelmuth (Member) commented Mar 14, 2023

Could also be cool to include somewhere how to grab the same telemetry using the collector and its components.

@povilasv (Contributor)

I'm not sure how this transition chart would work. Should we assume that the user has installed kube-prometheus-stack and we try to somehow migrate from that to this chart?

I was thinking of having kube-otel-stack initially work like kube-prometheus-stack, collecting metrics using Prometheus, and then slowly refactoring it to use native OpenTelemetry Collector receivers and functionality.

@Allex1 (Contributor) commented Mar 15, 2023

"I'm not sure how this transition chart would work. Should we assume that the user has installed kube-prometheus-stack and we try to somehow migrate from that to this chart?"

We should probably assume that the majority of admins scrape their k8s API endpoints with Prometheus via prometheus-operator objects like ServiceMonitor/PodMonitor, which we can reuse with this stack.
As such a user, I would initially have both Prometheus and the OTel Collector scraping this data and compare the results/setup complexity before making any decision.

@austinlparker (Member)

I would also see this as a 'transition' chart, but the migration path to me is something like...

kube-prometheus-stack -> kube-otel-stack -> opentelemetry-operator

In the (admittedly, kinda far?) future, I can see the operator using native OpenTelemetry components and monitoring CRDs to perform the same basic functions as this stack, but in the short-to-medium term, having this in the org will give us a pat answer for "how should I monitor k8s with OpenTelemetry?"

@austinlparker (Member)

Hi, quick bump on this issue - one pretty common piece of feedback we got at KubeCon EU was the number of people who didn't know the operator existed. I believe getting this chart brought in would help a lot with that, as we could then signpost it from the docs as the "how to get started with Kubernetes" path.

@TylerHelmuth (Member)

@dmitryax is there anything else we're waiting on before accepting PRs adding this chart?

@jaronoff97 (Contributor, Author)

@TylerHelmuth I think this issue is still a blocker. I'm going to run some tests right now to track this down and solve it.

@jaronoff97 (Contributor, Author)

Okay, after a little mish-mashing of things... I was able to get a chart that installs cert-manager (a requirement of the operator), the operator, and a collector together in a single chart. The problem is that it doesn't all install at once, for a few reasons.

Option where we install cert-manager with the chart

TL;DR there are some race conditions and annoyances here

First installation

In order for the first installation of the chart to work, you need to set the operator's admission webhook to false. This is because Helm installs resources in a particular order (here), and if you attempt to install cert-manager and the operator simultaneously with the webhook enabled, you get the following error:

Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest: [unable to recognize "": no matches for kind "Certificate" in version "cert-manager.io/v1", unable to recognize "": no matches for kind "Issuer" in version "cert-manager.io/v1"]

This is fine, because we can just initially disable the webhook on otel-operator installation so the otel-operator can come up healthy after the CRDs for cert-manager are installed.
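A rough sketch of what those first-pass values could look like; the key names below assume the opentelemetry-operator subchart's admissionWebhooks section and cert-manager's installCRDs flag, so treat them as illustrative rather than exact:

# values.yaml for the first install -- webhook disabled so the operator can come up
# before cert-manager's CRDs and webhook exist (key names are assumptions)
opentelemetry-operator:
  admissionWebhooks:
    create: false
cert-manager:
  enabled: true
  installCRDs: true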

Second installation

Now we have to re-enable the webhook; applying that again will get you another fun group of errors.

⎨ 11:46:28⎬ ⎨ ⛵️kind-kind : kind-kind⎬ ⎨ ...opentelemetry-helm-charts/charts/kube-otel-stack⎬ ⎨  same-chart-operator-install ✘ ✭⎬
⫸ helm install kube-otel-stack . -f values.yaml
Error: INSTALLATION FAILED: Internal error occurred: failed calling webhook "mopentelemetrycollector.kb.io": failed to call webhook: Post "https://opentelemetry-operator-webhook-service.default.svc:443/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector?timeout=10s": dial tcp 10.96.94.177:443: connect: connection refused

⎨ ✘⎬ ⎨ 11:46:43⎬ ⎨ ⛵️kind-kind : kind-kind⎬ ⎨ ...opentelemetry-helm-charts/charts/kube-otel-stack⎬ ⎨  same-chart-operator-install ✘ ✭⎬
⫸ helm upgrade kube-otel-stack . -f values.yaml
Error: UPGRADE FAILED: failed to create resource: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://kube-otel-stack-cert-manager-webhook.default.svc:443/mutate?timeout=10s": dial tcp 10.96.176.233:443: connect: connection refused

These are due to the webhook-serving pods not yet being ready when the webhooks are called.

Third installation

After waiting maybe ten seconds, instead of being impatient like me... you are able to successfully install the chart in its entirety:

⫸ helm upgrade kube-otel-stack . -f values.yaml --install
Release "kube-otel-stack" has been upgraded. Happy Helming!
NAME: kube-otel-stack
LAST DEPLOYED: Mon Apr 24 11:56:05 2023
NAMESPACE: default
STATUS: deployed
REVISION: 3

Option where we assume cert-manager is pre-installed

Given most clusters will already have cert-manager installed, here's what the installation process would look like...

A bit smoother, but still the same webhook race condition at the end

First installation

⫸ helm upgrade kube-otel-stack . -f values.yaml -n kube-otel-stack --create-namespace --install
Release "kube-otel-stack" does not exist. Installing it now.
Error: Internal error occurred: failed calling webhook "mopentelemetrycollector.kb.io": failed to call webhook: Post "https://opentelemetry-operator-webhook-service.kube-otel-stack.svc:443/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector?timeout=10s": dial tcp 10.96.102.5:443: connect: connection refused

Trying again after a few seconds...

⫸ helm upgrade kube-otel-stack . -f values.yaml -n kube-otel-stack --create-namespace --install
Release "kube-otel-stack" has been upgraded. Happy Helming!
NAME: kube-otel-stack
LAST DEPLOYED: Mon Apr 24 12:00:13 2023
NAMESPACE: kube-otel-stack
STATUS: deployed
REVISION: 2

Proposed remediations

  • It's possible that the operator is responding "ready" too quickly, which would cause this issue (kubernetes issue). If we were to modify the operator's readiness probe on installation we may be able to fix this.
    • Thought: this is probably the "correct" thing to do, but it's unclear to me if this will permanently fix the problem
  • Setting the failurePolicy on the MutatingWebhookConfiguration object to Ignore could also solve this on first install
    • Thought: this is potentially dangerous as the mutating webhook for setting defaults could silently fail going forward. Upon testing this theory by setting the following:
opentelemetry-operator:
  admissionWebhooks:
    failurePolicy: 'Ignore'

The operator and collector installed together successfully! An end user using this chart could just as easily enable the mutating webhook post-install as well, but that's not an ideal experience IMO.

I would love to hear thoughts on this, and see if there's anything I missed in my findings here. cc @open-telemetry/helm-maintainers

@TylerHelmuth (Member)

For cert-manager, my preference would be to copy whatever pattern kube-prometheus-stack is using. If we can't install cert-manager as part of the chart install, that will at least follow our existing pattern for the operator, although there is an issue open about that friction: #550

"Setting the failurePolicy on the MutatingWebhookConfiguration object to Ignore"

When I investigated this a while ago this is the solution I stumbled upon and I believe it is the solution that kube-prometheus-stack uses.

@jaronoff97 (Contributor, Author)

Looking into what kube-prometheus-stack does right now.

@jaronoff97 (Contributor, Author)

It looks like it's configurable (obviously). Its default behavior is empty and enabled, which means the policy is going to be set to Ignore, so I think that seems reasonable for us to do.

They also recommend pre-installing cert-manager on a cluster to use these webhooks.
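Roughly what that looks like in kube-prometheus-stack's values, abridged from memory (check the upstream chart for the authoritative defaults):

prometheusOperator:
  admissionWebhooks:
    enabled: true
    failurePolicy: ""    # empty default; combined with the cert patch job this effectively behaves like Ignore on first install
    patch:
      enabled: true      # job that generates/patches the webhook certificates after install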

@TylerHelmuth (Member)

Seeing as the chart is trying to follow the same pattern for its value proposition, I think it makes sense to follow the same technical patterns as well.

@jaronoff97 (Contributor, Author)

Agreed. I can work on it this week and next week to match those expectations. I'll include some docs about these decisions as well.

@ferrucc-io

Is this something someone is still working on? Given how complex the whole ecosystem was for me to grasp starting out, what would make the most sense from my perspective is to have some way to add presets to the OpenTelemetry Operator.

IMO, if someone wants to plug OTel into their cluster, most likely they'll want the ability to get:

  • Traces
  • Pod metrics

It would be ideal if the default setup of the operator easily allowed you to get a setup like the one Honeycomb suggests in their getting started guide.
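For what it's worth, the existing opentelemetry-collector chart's presets already get partway there; a rough sketch (preset names follow that chart's values, and an exporter for your backend still has to be configured separately):

# values.yaml for the opentelemetry-collector chart -- sketch only
mode: daemonset
presets:
  kubernetesAttributes:
    enabled: true        # enrich telemetry with pod metadata
  kubeletMetrics:
    enabled: true        # pod/container metrics from the kubelet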

@jaronoff97 (Contributor, Author)

@ferrucc-io yes I'm still working on this, I've had a whole slew of other priorities that keep taking precedence.

@jaronoff97 (Contributor, Author)

Hello all! All of the PRs required to get the core functionality for the chart have been merged.

My team and I will be testing this chart thoroughly (it's already been tested a lot!) and adding lots of documentation in the coming weeks/months. If you have any issues with the chart, please open a new issue in the repo, tagging it with chart:kube-stack. Thank you!
