Invalid file chunk number errors #169

Open
tejavegesna opened this issue Apr 5, 2021 · 3 comments
Labels
bug (Something isn't working)

Comments

@tejavegesna
Contributor

tejavegesna commented Apr 5, 2021

Summary

Uploaded file chunks are being rerouted to different pods, which gives us invalid chunk number errors.
If we have autoscaling, does this mean that each of those 10 chunks gets sent to a different pod?

What is the current bug behavior?

When you upload files to WIPP, the file is cut into 1 MB chunks rather than being sent as one continuous stream.
So for a 10 MB file, we send ten 1 MB chunks to the backend.
Nginx is routing different chunks of the same file to different pods (see the ingress sketch below).
The attached logs show invalid flow chunk errors as a result.
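For context, each chunk is sent as its own HTTP request, so with plain round-robin load balancing the ingress dispatches every chunk independently and chunks of one upload can land on different replicas. Below is a hypothetical sketch of cookie-based session affinity on an NGINX ingress, which would pin one client's requests to a single pod; the resource name, host, service, and port are placeholders, and this is not confirmed as the fix for this issue.

```yaml
# Hypothetical ingress for wipp-backend; name, host, service, and port are placeholders.
# The annotations enable cookie-based session affinity in the NGINX ingress controller,
# so all chunk requests from one client session are routed to the same pod.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: wipp-backend
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "wipp-backend-affinity"
spec:
  rules:
    - host: wipp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: wipp-backend
                port:
                  number: 8080
```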

What is the expected correct behavior?

There should be no chunk errors when the file conversion happens.

Steps to reproduce

Upload 1500 images (cc @Nicholas-Schaub)
5 backend replicas running, scaled using the Horizontal Pod Autoscaler (HPA) (see the HPA sketch below)
Min pods: 1, max pods: 5
CPU requests: 1
CPU limits: 2
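For reference, a minimal sketch of an HPA matching the settings above; the HPA/deployment names and the CPU utilization target are assumptions (the CPU requests/limits of 1 and 2 would live on the container spec, not on the HPA).

```yaml
# Minimal sketch of the autoscaling setup described above.
# The HPA/deployment names and the CPU utilization target are assumptions.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: wipp-backend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: wipp-backend
  minReplicas: 1                         # "Min pods: 1"
  maxReplicas: 5                         # "Max pods: 5"
  targetCPUUtilizationPercentage: 80     # assumed threshold
```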

Relevant screenshots and/or logs
pod1 was running initially; pod2 and pod3 were started by the autoscaling activity:
pod1.txt
pod2.txt
pod3.txt

Environment info

labshare/wipp-backend:3.0.0-generic

Possible fixes

Not sure yet.

cc: @Nicholas-Schaub

@tejavegesna added the bug label on Apr 5, 2021
@MyleneSimon
Collaborator

Hi @tejavegesna, is it still happening after the 2 additional pods have been running for a while? Or does it also happen when you hard-code the number of replicas (as opposed to autoscaling)?

Looking at the timestamps in the logs, I am wondering if it might be an issue with the readiness of the pod/app (since we don't have a readiness probe for wipp-backend, the pod might be marked ready before the app actually is). Some of the chunks would then have been sent to pods 2 and 3 right when the autoscaling kicked in (and before they were actually ready to receive the chunks), and that would mess up the whole chunk registration and image conversion.
Not saying this is the only issue here, but I just wanted to check that first, if you get a chance to test it.
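For reference, a minimal sketch of what a readiness probe on the wipp-backend container could look like; the health endpoint path, port, and timings are assumptions (e.g. a Spring Boot actuator-style endpoint, if one is exposed).

```yaml
# Sketch of a readiness probe for the wipp-backend container spec.
# Path, port, and timings are assumptions and would need to match the app.
readinessProbe:
  httpGet:
    path: /actuator/health   # assumed health endpoint
    port: 8080               # assumed container port
  initialDelaySeconds: 30    # give the app time to start
  periodSeconds: 10
  failureThreshold: 3
```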

@tejavegesna
Contributor Author

@MyleneSimon We saw the issue when pods were autoscaled using the Horizontal Pod Autoscaler.

And yes, it even happens when the number of pods is static: we tried with 2 pods and no autoscaler involved.

@MyleneSimon
Collaborator

@tejavegesna thanks for testing. I checked the backend chunk upload code and there is a ConcurrentMap there that I am afraid might not be playing well with pod replication... I will investigate a bit more to make sure this is the issue here. In the meantime, can you go back to 1 replica and do some scaling with the ome.converter.threads value instead? (A sketch of what that could look like is below.)
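For illustration, a sketch of that interim setup; the deployment name and the way ome.converter.threads is injected (shown here as an environment variable) are assumptions.

```yaml
# Sketch of the suggested interim setup: one backend replica with a higher
# ome.converter.threads value. How the property is actually injected is an
# assumption (shown here as an environment variable).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wipp-backend
spec:
  replicas: 1                 # single replica while the chunk issue is investigated
  selector:
    matchLabels:
      app: wipp-backend
  template:
    metadata:
      labels:
        app: wipp-backend
    spec:
      containers:
        - name: wipp-backend
          image: labshare/wipp-backend:3.0.0-generic
          env:
            - name: OME_CONVERTER_THREADS   # maps to ome.converter.threads (assumed)
              value: "4"                    # example value only
```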
