OpenTelemetryCollector fails if prometheus-operator is installed before opentelemetry-operator #1700

Open
andyjansson opened this issue May 2, 2023 · 9 comments
Labels: needs-info, question (Further information is requested)

Comments

@andyjansson

When applying the OpenTelemetryCollector template:

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: simplest
spec:
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:

    exporters:
      logging:

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: []
          exporters: [logging]

If prometheus-operator is installed before opentelemetry-operator, the apply fails with the following error:

Error from server (InternalError): error when creating ".\opentelemetry-collector.yaml": Internal error occurred: failed calling webhook "mopentelemetrycollector.kb.io": failed to call webhook: Post "https://opentelemetry-operator-webhook-service.opentelemetry-operator-system.svc:443/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector?timeout=10s": context deadline exceeded

If opentelemetry-operator is installed before prometheus-operator, everything is fine.

I'm running kind 0.18 on Windows 10 with podman 4.4.4.

@jaronoff97
Contributor

I don't think I understand how the prometheus-operator plays into this; however, I recently wrote a thread in the helm-charts repo detailing this exact scenario.

Could you detail where the prometheus-operator comes into play here? If not, try setting the failurePolicy on the webhook to Ignore, which should resolve your issue.
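For reference, a hedged sketch of that workaround when the operator was installed via the open-telemetry/opentelemetry-operator Helm chart (the admissionWebhooks.failurePolicy value, release name, and namespace below are assumptions about your setup):

# Reinstall/upgrade the operator with the webhook failure policy relaxed to Ignore
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm upgrade --install opentelemetry-operator open-telemetry/opentelemetry-operator \
  -n opentelemetry-operator-system \
  --set admissionWebhooks.failurePolicy=Ignore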

@jaronoff97 added the question and needs-info labels on May 2, 2023
@tstringer-fn

I'm also seeing this same behavior. For what it's worth, setting the failure policy to ignore isn't really ideal either, but I understand if it's the only known workaround right now. It looks like the mutating webhook has a fairly short list of mutations, so probably best to "manually mutate" and ensure the fields exist prior to creating.

Some additional diagnostics here as well. Oddly enough, it doesn't even look like any traffic is getting to the otel operator during these times. There are no opentelemetrycollector-resource logger messages there, as there are when there is successful mutation and validation. My original theory was that the mutating webhook does a Prometheus operation that times out or something to that effect, but that doesn't seem to be the case looking at the code.
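In case it helps, a hedged sketch of those checks (the deployment and container names below are assumptions and may differ depending on how the operator was installed):

# The webhook service name and namespace come from the error message above
kubectl -n opentelemetry-operator-system get endpoints opentelemetry-operator-webhook-service

# Deployment/container names assume the default manifest install; adjust for Helm installs
kubectl -n opentelemetry-operator-system logs deploy/opentelemetry-operator-controller-manager -c manager --tail=100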

@jaronoff97
Contributor

Yeah, I'm not aware of any Prometheus operations that would block this. Until we have some more details here, I'm not sure I can debug this further right now.

@obrus-corcentric

Same here, installing:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm upgrade --install prometheus -n prometheus --create-namespace prometheus-community/kube-prometheus-stack

causes an error while installing the opentelemetry-operator:

Error: UPGRADE FAILED: failed to create resource: Internal error occurred: failed calling webhook "mopentelemetrycollector.kb.io": failed to call webhook: Post "https://opentelemetry-operator-webhook-service.opentelemetry-operator-system.svc:443/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector?timeout=10s": dial tcp 10.96.42.55:443: connect: connection refused

@jaronoff97
Contributor

I tried reproducing this last year and was unable to. Are there any operator logs you can share? Otherwise, I've seen this issue when cert-manager isn't installed, when operator versions are mismatched, or when a firewall is blocking connections.
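A rough sketch of how one might check the first two of those (namespaces and resource names below are assumptions; adjust to your install):

# Confirm cert-manager is running; the operator's webhook serving certificate depends on it
kubectl -n cert-manager get pods

# Confirm the webhook certificate was actually issued (namespace is an assumption)
kubectl -n opentelemetry-operator-system get certificate,issuer

# Check the operator image tag against the chart/CRD versions you installed
kubectl -n opentelemetry-operator-system get deploy -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[*].image}{"\n"}{end}'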

@SG127421

SG127421 commented Feb 18, 2024

Unfortunately, I hit the same behavior.

client.go:414:
Error: failed to create resource: Internal error occurred: failed calling webhook "mopentelemetrycollector.kb.io": failed to call webhook: Post "https://opentelemetry-operator-webhook.opentelemetry-operator-system.svc:443/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector?timeout=10s": no endpoints available for service "opentelemetry-operator-webhook"

The cluster is AKS, with nothing in it but kube-prometheus-stack and the otel operator + collector. On my local k3d cluster it just works; on AKS, not at all.

I've finally found a solution (mentioned somewhere earlier):

opentelemetry-operator:
  admissionWebhooks:
    failurePolicy: 'Ignore'
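A hedged way to confirm the policy actually landed on the webhook (the configuration name varies by install method, so this just lists them all):

# List mutating webhook configurations with their failure policies
kubectl get mutatingwebhookconfiguration -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.webhooks[*].failurePolicy}{"\n"}{end}'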

But let's be honest; this thing should not behave like this

@jaronoff97
Contributor

@SG127421 one would usually see this error when the operator is not available or when you attempt to install a collector at the same time as the operator. There is a race in Kubernetes whose timing creates this problem. I have documented this extensively here.
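A hedged sketch of one way to avoid that race when scripting an install (the deployment name below is the default from the operator manifests and is an assumption for other install methods):

# Wait for the operator to be ready before applying any OpenTelemetryCollector resources
kubectl -n opentelemetry-operator-system wait --for=condition=Available \
  deployment/opentelemetry-operator-controller-manager --timeout=120s
kubectl apply -f opentelemetry-collector.yaml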

@SG127421

> @SG127421 one would usually see this error when the operator is not available or when you attempt to install a collector at the same time as the operator. There is a race in Kubernetes whose timing creates this problem.

It might be something else. During my trials I removed the operator dependency from my Helm chart and installed the operator manually, so it was present in the cluster before anything else was deployed. Then I deployed the chart with just a collector and my app. The result was the same failure as described. Then I gave up and applied the failurePolicy workaround as mentioned. This problem is specific to AKS; I cannot replicate it anywhere else.

@jaronoff97
Contributor

Hm... very odd. I'm sorry you experienced that issue. I don't have easy access to AKS, unfortunately, and it would be challenging for me to replicate as a result. This may imply something is wrong with the AKS interaction with cert-manager. Would you mind opening a separate issue containing the steps you used to reproduce that?
