OpenTelemetryCollector fails if prometheus-operator is installed before opentelemetry-operator #1700

Open
andyjansson opened this issue May 2, 2023 · 9 comments
Labels: needs-info, question (Further information is requested)

Comments

@andyjansson

When applying the OpenTelemetryCollector template:

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: simplest
spec:
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:
    processors:

    exporters:
      logging:

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: []
          exporters: [logging]

If prometheus-operator is installed before opentelemetry-operator, the apply fails with the following error:

Error from server (InternalError): error when creating ".\opentelemetry-collector.yaml": Internal error occurred: failed calling webhook "mopentelemetrycollector.kb.io": failed to call webhook: Post "https://opentelemetry-operator-webhook-service.opentelemetry-operator-system.svc:443/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector?timeout=10s": context deadline exceeded

If opentelemetry-operator is installed before prometheus-operator, everything is fine.

I'm running kind 0.18 on Windows 10 with podman 4.4.4.

@jaronoff97
Contributor

I don't think I understand how the prometheus-operator plays into this; however, I recently wrote a thread in the helm-charts repo detailing this exact scenario.

Could you detail where the prometheus-operator comes into play here? If not, try setting the failurePolicy on the webhook to Ignore, which should resolve your issue.
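For reference, a hedged sketch of that workaround when the operator was installed via the open-telemetry/opentelemetry-operator Helm chart (the admissionWebhooks.failurePolicy value, release name, and namespace below are assumptions about your setup):

# Reinstall/upgrade the operator with the webhook failure policy relaxed to Ignore
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm upgrade --install opentelemetry-operator open-telemetry/opentelemetry-operator \
  -n opentelemetry-operator-system \
  --set admissionWebhooks.failurePolicy=Ignore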

@jaronoff97 added the question and needs-info labels on May 2, 2023
@tstringer-fn

I'm also seeing this same behavior. For what it's worth, setting the failure policy to ignore isn't really ideal either, but I understand if it's the only known workaround right now. It looks like the mutating webhook has a fairly short list of mutations, so probably best to "manually mutate" and ensure the fields exist prior to creating.

Some additional diagnostics here as well. Oddly enough, it doesn't even look like any traffic is getting to the otel operator during these times. There are no opentelemetrycollector-resource logger messages there, as there are when there is successful mutation and validation. My original theory was that the mutating webhook does a Prometheus operation that times out or something to that effect, but that doesn't seem to be the case looking at the code.
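In case it helps, a hedged sketch of those checks (the deployment and container names below are assumptions and may differ depending on how the operator was installed):

# The webhook service name and namespace come from the error message above
kubectl -n opentelemetry-operator-system get endpoints opentelemetry-operator-webhook-service

# Deployment/container names assume the default manifest install; adjust for Helm installs
kubectl -n opentelemetry-operator-system logs deploy/opentelemetry-operator-controller-manager -c manager --tail=100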

@jaronoff97
Contributor

Yeah, I'm not aware of any Prometheus operations that would block this. Until we have some more details here, I'm not sure I can debug this further right now.

@obrus-corcentric

Same here, installing:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm upgrade --install prometheus -n prometheus --create-namespace prometheus-community/kube-prometheus-stack

causes an error while installing the opentelemetry-operator:

Error: UPGRADE FAILED: failed to create resource: Internal error occurred: failed calling webhook "mopentelemetrycollector.kb.io": failed to call webhook: Post "https://opentelemetry-operator-webhook-service.opentelemetry-operator-system.svc:443/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector?timeout=10s": dial tcp 10.96.42.55:443: connect: connection refused

@jaronoff97
Contributor

I tried reproducing this last year and was unable to. Are there any operator logs you can share? Otherwise, I've seen this issue when cert-manager isn't installed, when operator versions are mismatched, or when a firewall is blocking connections.
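A rough sketch of how one might check the first two of those (namespaces and resource names below are assumptions; adjust to your install):

# Confirm cert-manager is running; the operator's webhook serving certificate depends on it
kubectl -n cert-manager get pods

# Confirm the webhook certificate was actually issued (namespace is an assumption)
kubectl -n opentelemetry-operator-system get certificate,issuer

# Check the operator image tag against the chart/CRD versions you installed
kubectl -n opentelemetry-operator-system get deploy -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[*].image}{"\n"}{end}'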

@SG127421

SG127421 commented Feb 18, 2024

Unfortunately, I hit the same behavior.

client.go:414:
Error: failed to create resource: Internal error occurred: failed calling webhook "mopentelemetrycollector.kb.io": failed to call webhook: Post "https://opentelemetry-operator-webhook.opentelemetry-operator-system.svc:443/mutate-opentelemetry-io-v1alpha1-opentelemetrycollector?timeout=10s": no endpoints available for service "opentelemetry-operator-webhook"

The cluster is AKS, with nothing in it but kube-prometheus-stack and the otel operator + collector. On my local k3d cluster it just works; on AKS, not at all.

I've finally found a solution (mentioned somewhere earlier):

opentelemetry-operator:
  admissionWebhooks:
    failurePolicy: 'Ignore'
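A hedged way to confirm the policy actually landed on the webhook (the configuration name varies by install method, so this just lists them all):

# List mutating webhook configurations with their failure policies
kubectl get mutatingwebhookconfiguration -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.webhooks[*].failurePolicy}{"\n"}{end}'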

But let's be honest; this thing should not behave like this

@jaronoff97
Contributor

@SG127421 one would usually see this error when the operator is not available or when you attempt to install a collector at the same time as the operator. There is a race in Kubernetes whose timing creates this problem. I have documented this extensively here.
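A hedged sketch of one way to avoid that race when scripting an install (the deployment name below is the default from the operator manifests and is an assumption for other install methods):

# Wait for the operator to be ready before applying any OpenTelemetryCollector resources
kubectl -n opentelemetry-operator-system wait --for=condition=Available \
  deployment/opentelemetry-operator-controller-manager --timeout=120s
kubectl apply -f opentelemetry-collector.yaml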

@SG127421

> @SG127421 one would usually see this error when the operator is not available or when you attempt to install a collector at the same time as the operator. There is a race in Kubernetes whose timing creates this problem.

It might be something else. During my trials I removed the operator dependency from my Helm chart and installed the operator manually, so it was present in the cluster before anything else was deployed. Then I deployed the chart with just a collector and my app. The result was the same failure as described. Then I gave up and applied the failurePolicy workaround as mentioned. This problem is specific to AKS; I cannot replicate it anywhere else.

@jaronoff97
Contributor

Hm... very odd. I'm sorry you experienced that issue. I don't have easy access to AKS, unfortunately, and it would be challenging for me to replicate as a result. This may imply something is wrong with the AKS interaction with cert-manager. Would you mind opening a separate issue containing the steps you used to reproduce that?
