Error: No nodes replied within time constraint #12

Open
rrrnld opened this issue Sep 4, 2024 · 1 comment

Comments

@rrrnld

rrrnld commented Sep 4, 2024

We're self-hosting Saleor and running into issues with our celery deployment, where the worker appears to get stuck after a while. We're deploying to k8s and run celery workers like this:

```shell
celery -A saleor --app=saleor.celeryconf:app worker --loglevel=info --beat
```

This is taken from the config that was removed here: saleor/saleor#13777

I can see that the worker processes are running. It's also the command this repo uses to deploy Saleor:

```yaml
containers:
  - name: "{{ $fullName }}-celery"
    {{- if .Values.image.imageName }}
    image: "{{ lower .Values.image.imageName }}"
    {{- else }}
    image: "{{ lower .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
    {{- end }}
    imagePullPolicy: {{ .Values.image.pullPolicy }}
    env:
      {{- range .Values.global.env }}
      - name: {{ .name }}
        value: {{ .value | quote }}
      {{- end }}
      - name: ALLOWED_HOSTS
        value: {{ .Values.global.allowedHosts }}
      - name: ALLOWED_CLIENT_HOSTS
        value: {{ .Values.global.allowedHosts }}
    envFrom:
      - secretRef:
          name: {{ include "saleor-helm.fullname" . }}
    args:
      - celery
      - --app=saleor
      - --app=saleor.celeryconf:app
      - worker
      - --loglevel=INFO
      - --beat
```

Is this the correct way to run it? I'm asking because `celery -A saleor --app=saleor.celeryconf:app` is redundant, for example: `-A` and `--app` set the same option. Also, shelling into the container and trying to inspect the worker via `celery -A saleor --app=saleor.celeryconf:app inspect active` or `celery -A saleor --app=saleor.celeryconf:app status` both fail, and the lifetime check in this repo does not seem to be working at all:

```
Error: No nodes replied within time constraint
```
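One thing worth ruling out first is the reply deadline: `celery inspect` waits only about a second for replies by default, so a slow or busy worker can trigger exactly this error. A sketch of the same checks with a longer `--timeout` (only the deduplicated `--app` flag is assumed, everything else is as above):

```shell
# Inside the worker container. -A and --app are the same flag, so one suffices.
# --timeout raises the reply deadline (default ~1s) to rule out slow replies.
celery --app=saleor.celeryconf:app inspect ping --timeout 10
celery --app=saleor.celeryconf:app inspect active --timeout 10
```

If these still return no nodes with a generous timeout, the worker is likely genuinely unresponsive rather than just slow to reply.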

Any idea what might be wrong with our health checks / lifetime checks?
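For context, the kind of exec-based liveness probe such a lifetime check typically performs could be sketched like this (a hypothetical sketch, not the chart's actual probe; the `-d` destination and all timing values are assumptions):

```yaml
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      # Ping only this pod's worker node, with a generous reply deadline.
      - celery --app=saleor.celeryconf:app inspect ping -d "celery@$HOSTNAME" --timeout 10
  initialDelaySeconds: 60
  periodSeconds: 120
  timeoutSeconds: 30
```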

@JannikZed
Contributor

@rrrnld we honestly haven't used the Helm chart with the most recent Saleor versions, as we moved to the cloud deployment, but it did work before. So I currently don't have the capacity to test that again, but we will most likely try the self-hosted deployment again in the future.
We added these liveness checks to make really sure that the workers are alive and that the Redis connection is still active, and that used to work fine. What does the stuck state look like on your end?
