
Enough console output from the Trino worker container can cause an OOM in the vector container #685

Open
zaultooz opened this issue Dec 10, 2024 · 2 comments

zaultooz commented Dec 10, 2024

Affected Stackable version

24.3

Affected Trino version

414

Current and expected behavior

I have observed that, in some cases, a lot of console output from the Trino worker container can cause the Vector container to crash with an OOM and thereby restart the whole pod.

I expect the Trino worker to be able to run without restarting.

Possible solution

It might be nice to change the default settings for the Vector container so that it has a bit more memory to work with, or to add an easier way to configure its memory.

I have added a podOverrides section to the TrinoCluster to increase the Vector container's memory settings and try to resolve the issue.
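
Roughly like this (a minimal sketch of the idea; the 256Mi value is just an example and everything besides workers.podOverrides is abbreviated):

# Sketch only: bump the memory of the vector sidecar on the worker role via podOverrides.
# The rest of the TrinoCluster spec is omitted.
spec:
  workers:
    podOverrides:
      spec:
        containers:
        - name: vector
          resources:
            limits:
              memory: 256Mi
            requests:
              memory: 256Mi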

Additional context

I don't know how critical this is, but I thought I should report it anyway. In case you don't find it useful, feel free to close the bug.

Environment

Client Version: v1.29.10
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.10

Would you like to work on fixing this bug?

maybe

@siegfriedweber (Member) commented

I was not able to reproduce this bug. The Vector container has a memory limit of 128 Mi and even under heavy load, only up to 61 Mi were used.

I also tested the file rollover and pruning and found no problem:

total 10044
lrwxrwxrwx 1 stackable stackable      35 Dec 10 10:48 server.airlift.json -> server.airlift.json-20241210.104821
-rw-r--r-- 1 stackable stackable 5242569 Dec 10 10:48 server.airlift.json-20241210.104755
-rw-r--r-- 1 stackable stackable 5034693 Dec 10 10:48 server.airlift.json-20241210.104821

total 5160
lrwxrwxrwx 1 stackable stackable      35 Dec 10 10:48 server.airlift.json -> server.airlift.json-20241210.104846
-rw-r--r-- 1 stackable stackable 5242754 Dec 10 10:48 server.airlift.json-20241210.104821
-rw-r--r-- 1 stackable stackable   36384 Dec 10 10:48 server.airlift.json-20241210.104846

Can you please provide your Trino cluster spec and also the podOverride which fixed it?

@zaultooz (Author) commented

Name:             gbif-trino-worker-default-5
Namespace:        a_namespace
Priority:         0
Node:             a_node
Start Time:       Tue, 10 Dec 2024 12:51:51 +0100
Labels:           app.kubernetes.io/component=worker
                  app.kubernetes.io/instance=gbif-trino
                  app.kubernetes.io/managed-by=trino.stackable.tech_trinocluster
                  app.kubernetes.io/name=trino
                  app.kubernetes.io/role-group=default
                  app.kubernetes.io/version=414-414.1-stackable24.3.0
                  applicationId=trino
                  stackable.tech/vendor=Stackable
                  statefulset.kubernetes.io/pod-name=gbif-trino-worker-default-5
Annotations:      some_annotations
Status:           Running
IP:               an_ip
IPs:
  IP:           an_ip
Controlled By:  StatefulSet/gbif-trino-worker-default
Init Containers:
  ...
Containers:
  trino:
    Command:
      /bin/bash
      -x
      -euo
      pipefail
      -c
    Args: ...
    State:          Running
      Started:      Tue, 10 Dec 2024 12:52:25 +0100
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     6
      memory:  16Gi
    Requests:
      cpu:      1
      memory:   16Gi
    Liveness:   tcp-socket :https delay=30s timeout=1s period=10s #success=1 #failure=3
    Readiness:  tcp-socket :https delay=10s timeout=1s period=10s #success=1 #failure=5
    Environment: ...
    Mounts:
      /gbif/geocode-layers from gbif-geocode-layers (rw)
      /stackable/client_tls from client-tls (rw)
      /stackable/config from config (rw)
      /stackable/config/catalog from catalog (rw)
      /stackable/config/catalog/hive/hdfs-config from hive-hdfs (rw)
      /stackable/config/catalog/iceberg/hdfs-config from iceberg-hdfs (rw)
      /stackable/data from data (rw)
      /stackable/internal_tls from internal-tls (rw)
      /stackable/log from log (rw)
      /stackable/mount_internal_tls from internal-tls-mount (rw)
      /stackable/mount_server_tls from server-tls-mount (rw)
      /stackable/rwconfig from rwconfig (rw)
      /stackable/server_tls from server-tls (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m8flz (ro)
  vector:
    Command:
      /bin/bash
      -x
      -euo
      pipefail
      -c
    Args:
      # Vector will ignore SIGTERM (as PID != 1) and must be shut down by writing a shutdown trigger file
      vector --config /stackable/config/vector.yaml & vector_pid=$!
      if [ ! -f "/stackable/log/_vector/shutdown" ]; then
        mkdir -p /stackable/log/_vector && inotifywait -qq --event create /stackable/log/_vector; fi
      sleep 1
      kill $vector_pid

    State:          Running
      Started:      Tue, 10 Dec 2024 14:26:45 +0100
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Tue, 10 Dec 2024 14:22:19 +0100
      Finished:     Tue, 10 Dec 2024 14:26:33 +0100
    Ready:          True
    Restart Count:  5
    Limits:
      cpu:     500m
      memory:  128Mi
    Requests:
      cpu:     250m
      memory:  128Mi
    Environment:
      VECTOR_LOG:  info
    Mounts:
      /stackable/config from config (rw)
      /stackable/log from log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m8flz (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       True
  ContainersReady             True
  PodScheduled                True
Volumes:
  ...
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                  From     Message
  ----     ------   ----                 ----     -------
  Normal   Pulled   34m                  kubelet  Successfully pulled image "repo/trino:414.1-stackable24.3.0" in 33ms (33ms including waiting)
  Normal   Pulled   27m                  kubelet  Successfully pulled image "repo/trino:414.1-stackable24.3.0" in 40ms (40ms including waiting)
  Normal   Pulled   6m44s (x2 over 38m)  kubelet  Successfully pulled image "repo/trino:414.1-stackable24.3.0" in 32ms (32ms including waiting)
  Warning  BackOff  2m29s (x4 over 34m)  kubelet  Back-off restarting failed container vector in pod gbif-trino-worker-default-5
  Normal   Pulling  2m18s (x6 over 96m)  kubelet  Pulling image "repo/trino:414.1-stackable24.3.0"
  Normal   Created  2m18s (x6 over 96m)  kubelet  Created container vector
  Normal   Started  2m18s (x6 over 96m)  kubelet  Started container vector
  Normal   Pulled   2m18s                kubelet  Successfully pulled image "repo/trino:414.1-stackable24.3.0" in 34ms (34ms including waiting)

This is a cut-down pod description (I removed the init containers, IPs, addresses and such) where the restarts can be seen. From the output I can see that I set the Trino version incorrectly in the bug report.

We use an extended version tag for the image because we repackage your image with a plugin that contains our custom UDFs.

Here is the snippet for the podOverrides I have done:

  workers:
    podOverrides:
{{- if .Values.yunikorn.enabled }}
      metadata:
        labels:
{{- include "gbif-chart-lib.yunikornLabels" . | nindent 10 }}
{{- end }}
      spec:
        containers:
{{- if .Values.geocodeLayer.enabled }}
        - name: trino
          volumeMounts:
          - name: gbif-geocode-layers
            mountPath: /gbif/geocode-layers
{{- end }}
        - name: vector
          resources:
            limits:
              # I have replaced the template variable with the value we substitute.
              memory: 256Mi
            requests:
              memory: 256Mi

I haven't tested the podOverride in the environment with the error yet, as it is currently in use for testing something else.
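
For reference, with geocodeLayer.enabled set, the spec part of the override above should render to roughly the following plain podOverrides (the yunikorn labels from the gbif-chart-lib.yunikornLabels helper are omitted here):

  workers:
    podOverrides:
      spec:
        containers:
        - name: trino
          volumeMounts:
          - name: gbif-geocode-layers
            mountPath: /gbif/geocode-layers
        - name: vector
          resources:
            limits:
              memory: 256Mi
            requests:
              memory: 256Mi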
