Invalid file chunk number errors #169

Open
tejavegesna opened this issue Apr 5, 2021 · 3 comments
Labels
bug (Something isn't working)

Comments

@tejavegesna
Contributor

tejavegesna commented Apr 5, 2021

Summary

Uploaded file chunks are being rerouted to different pods, which gives us invalid chunk number errors.
If we have autoscaling, does this mean that each of those 10 chunks gets sent to a different pod?

What is the current bug behavior?

When you upload files to WIPP, the file is cut into 1 MB chunks rather than being sent as one continuous stream.
So for a 10 MB file, we send ten 1 MB chunks to the backend.
Nginx is routing different chunks of the same file to different pods (see the ingress sketch below).
The attached logs show invalid flow chunk errors as a result.
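For context, each chunk is sent as its own HTTP request, so with plain round-robin load balancing the ingress dispatches every chunk independently and chunks of one upload can land on different replicas. Below is a hypothetical sketch of cookie-based session affinity on an NGINX ingress, which would pin one client's requests to a single pod; the resource name, host, service, and port are placeholders, and this is not confirmed as the fix for this issue.

```yaml
# Hypothetical ingress for wipp-backend; name, host, service, and port are placeholders.
# The annotations enable cookie-based session affinity in the NGINX ingress controller,
# so all chunk requests from one client session are routed to the same pod.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: wipp-backend
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "wipp-backend-affinity"
spec:
  rules:
    - host: wipp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: wipp-backend
                port:
                  number: 8080
```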

What is the expected correct behavior?

There should be no chunk errors when the file conversion happens.

Steps to reproduce

Upload 1500 images (cc @Nicholas-Schaub)
5 backend replicas running, scaled using the Horizontal Pod Autoscaler (HPA) (see the HPA sketch below)
Min pods: 1, max pods: 5
CPU requests: 1
CPU limits: 2
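For reference, a minimal sketch of an HPA matching the settings above; the HPA/deployment names and the CPU utilization target are assumptions (the CPU requests/limits of 1 and 2 would live on the container spec, not on the HPA).

```yaml
# Minimal sketch of the autoscaling setup described above.
# The HPA/deployment names and the CPU utilization target are assumptions.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: wipp-backend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: wipp-backend
  minReplicas: 1                         # "Min pods: 1"
  maxReplicas: 5                         # "Max pods: 5"
  targetCPUUtilizationPercentage: 80     # assumed threshold
```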

Relevant screenshots and/or logs
pod1 was running initially; pod2 and pod3 were started by the autoscaling activity:
pod1.txt
pod2.txt
pod3.txt

Environment info

labshare/wipp-backend:3.0.0-generic

Possible fixes

Not sure yet.

cc: @Nicholas-Schaub

@tejavegesna added the bug label on Apr 5, 2021
@MyleneSimon
Collaborator

Hi @tejavegesna, is it still happening after the 2 additional pods have been running for a while? Or does it also happen when you hard-code the number of replicas (as opposed to autoscaling)?

Looking at the timestamps in the logs, I am wondering if it might be an issue with the readiness of the pod/app (since we don't have a readiness probe for wipp-backend, the pod might be marked ready before the app actually is). Some of the chunks would then have been sent to pods 2 and 3 right when the autoscaling kicked in (and before they were actually ready to receive the chunks), and that would mess up the whole chunk registration and image conversion.
Not saying this is the only issue here, but I just wanted to check that first, if you get a chance to test it.
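For reference, a minimal sketch of what a readiness probe on the wipp-backend container could look like; the health endpoint path, port, and timings are assumptions (e.g. a Spring Boot actuator-style endpoint, if one is exposed).

```yaml
# Sketch of a readiness probe for the wipp-backend container spec.
# Path, port, and timings are assumptions and would need to match the app.
readinessProbe:
  httpGet:
    path: /actuator/health   # assumed health endpoint
    port: 8080               # assumed container port
  initialDelaySeconds: 30    # give the app time to start
  periodSeconds: 10
  failureThreshold: 3
```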

@tejavegesna
Contributor Author

@MyleneSimon We saw the issue when pods were autoscaled using the Horizontal Pod Autoscaler.

And yes, it even happens when the number of pods is static: we tried with 2 pods and no autoscaler involved.

@MyleneSimon
Collaborator

@tejavegesna thanks for testing. I checked the backend chunk upload code and there is a ConcurrentMap there that I am afraid might not be playing well with pod replication... I will investigate a bit more to make sure this is the issue here. In the meantime, can you go back to 1 replica and do some scaling with the ome.converter.threads value instead? (A sketch of what that could look like is below.)
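For illustration, a sketch of that interim setup; the deployment name and the way ome.converter.threads is injected (shown here as an environment variable) are assumptions.

```yaml
# Sketch of the suggested interim setup: one backend replica with a higher
# ome.converter.threads value. How the property is actually injected is an
# assumption (shown here as an environment variable).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wipp-backend
spec:
  replicas: 1                 # single replica while the chunk issue is investigated
  selector:
    matchLabels:
      app: wipp-backend
  template:
    metadata:
      labels:
        app: wipp-backend
    spec:
      containers:
        - name: wipp-backend
          image: labshare/wipp-backend:3.0.0-generic
          env:
            - name: OME_CONVERTER_THREADS   # maps to ome.converter.threads (assumed)
              value: "4"                    # example value only
```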
