[BUG] - Submitting multiple builds can fill queue causing frontend to stop working #598
Comments
Update: it worked itself back into an unstuck state after about an hour or so. Now I can log into conda-store and use the UI to VIEW envs.
@kcpevey I cannot repro on Also, I'm confused by this:
Which URL are you accessing? There's no Create button. I've tried editing envs via the new UI (front page) and via
I'll have to wait until someone with access to the logs can look into this.
This was an incompatibility between the conda-store package releases. I think we had the new conda-store release 2023.9.2 and an older conda-store jlab release. At any rate, we've upgraded to the suggested versions here nebari-dev/nebari#1948 and this is now resolved.
Reopening since we've seen this again after updating to the recommended versions nebari-dev/nebari#1948 (comment)
@nkaretnikov @kcpevey and I were able to reproduce this error. When the same user (or multiple users) submits multiple builds at roughly the same time, the conda-store QueuePool limit (5 builds) is reached. We can tell that this is the case by inspecting the conda-store-server logs on the backend.
When this happens, the front end becomes completely blocked and no user interaction is permitted. Once the builds in the queue are processed, the front end becomes usable again.
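To make the mechanism above concrete, here is a minimal stdlib-only sketch of a bounded pool being exhausted. It is not conda-store or SQLAlchemy code: `TinyPool`, `simulate`, and the timeouts are hypothetical names and values chosen for illustration, and only the pool size of 5 comes from the report above.

```python
import queue
import threading

POOL_SIZE = 5  # the limit mentioned above; everything else here is illustrative


class TinyPool:
    """Hypothetical stand-in for a bounded connection pool (not SQLAlchemy)."""

    def __init__(self, size):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")

    def checkout(self, timeout):
        # Waits for a free connection; raises queue.Empty if none frees up in time.
        return self._free.get(timeout=timeout)

    def checkin(self, conn):
        self._free.put(conn)


def simulate():
    pool = TinyPool(POOL_SIZE)
    all_holding = threading.Barrier(POOL_SIZE + 1)  # 5 builds + the main thread
    builds_done = threading.Event()

    def build():
        conn = pool.checkout(timeout=1.0)
        all_holding.wait()   # signal that this build now holds a connection
        builds_done.wait()   # keep holding it while the "build" runs
        pool.checkin(conn)

    workers = [threading.Thread(target=build) for _ in range(POOL_SIZE)]
    for worker in workers:
        worker.start()
    all_holding.wait()  # every connection is now checked out

    # A "front end" request arriving now cannot get a connection.
    try:
        pool.checkout(timeout=0.1)
        blocked_during_builds = False
    except queue.Empty:
        blocked_during_builds = True

    builds_done.set()
    for worker in workers:
        worker.join()

    # Once the builds return their connections, requests succeed again.
    conn = pool.checkout(timeout=0.1)
    pool.checkin(conn)
    return blocked_during_builds, conn is not None
```

In this model the request fails fast because of the short timeout. With a long timeout it would simply hang until a build finishes, which matches the "spinning" front end described in this issue.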
This issue has to do with SQLAlchemy, Sessions, and FastAPI (threads/async/await). I have spent a long time trying to figure out issues around this... I don't understand it.
We talked about this today and @nkaretnikov will be writing an integration test to hopefully reproduce this issue. I think that this issue will only surface when using non-sqlite databases as a backend.
Summary of the call:
P.S. There's another issue related to parallel builds that we might run into. Chris said conda/mamba write stuff into the same directory and this might be a race condition if two builds are launched at the same time. One build could be writing a package archive while another build's download overwrites it, leaving the archive corrupted.
We should perhaps be using something like Flower to monitor the Celery tasks (https://flower.readthedocs.io/). Also, we are using Redis as some sort of broker, right @costrouc?
Can we just set num_builds=1 for now and bypass this issue by disabling parallel builds?
From today's meeting:
Started working on this. UPD: Hmm, I still cannot repro on 6d66e77 when running in Docker. I'll see what version nebari uses and try that one.
Adding a tad more context
For this item to be completed and (potentially) merged, the change is needed, along with the relevant tests.
This comment was marked as outdated.
Should be fixed by #622. Added a test, which used to fail and now passes. The issue was due to not closing Sessions properly, so we can keep using QueuePool. If you're interested to know how I arrived at this solution, see my hidden comment above. It contains a lot of information and a lot of wrong assumptions, so read from the end. I'm keeping it around for future reference.
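For readers who want to see why an unclosed Session matters with QueuePool, here is a minimal sketch. It is not the actual change from #622: the function names, the in-memory SQLite URL, and the pool settings (one connection, no overflow, a 0.2 s timeout) are assumptions chosen so the effect shows up quickly.

```python
from sqlalchemy import create_engine, text
from sqlalchemy.exc import TimeoutError as PoolTimeoutError
from sqlalchemy.orm import Session
from sqlalchemy.pool import QueuePool


def make_engine():
    # Illustrative settings: a single pooled connection and a short timeout.
    return create_engine(
        "sqlite://",
        poolclass=QueuePool,
        pool_size=1,
        max_overflow=0,
        pool_timeout=0.2,
    )


def leaky_then_query():
    engine = make_engine()
    leaked = Session(engine)
    leaked.execute(text("SELECT 1"))  # checks out the only connection
    # The session is neither closed nor committed, so it keeps the connection.
    try:
        with Session(engine) as other:
            other.execute(text("SELECT 1"))  # waits on the pool, then times out
        return "ok"
    except PoolTimeoutError:
        return "pool exhausted"
    finally:
        leaked.close()
        engine.dispose()


def closed_then_query():
    engine = make_engine()
    # The context manager closes the session on exit, even if an error occurs,
    # which returns the connection to the pool.
    with Session(engine) as first:
        first.execute(text("SELECT 1"))
    with Session(engine) as second:
        value = second.execute(text("SELECT 1")).scalar_one()
    engine.dispose()
    return value
```

With the leaked session, the second query cannot obtain a connection and fails with the pool timeout. When each session is closed before the next one starts, the same single-connection pool is enough.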
Describe the bug
I'll describe this issue as I've seen it through the Nebari jlab conda-store UI, but I believe the issue is likely the backend.
Open an existing env in the UI, go to edit, make a minor change, click Create. Normally upon clicking Create, you'd be redirected back to the "non-edit" UI screen, but now clicking the button doesn't appear to do anything.
Now attempt to go to the /conda-store/admin page - it will not load anything, just spin. Now shut down your server, and log back in. Going to the conda-store UI and attempting to log in will have no effect, you just get the spinning logo in the browser tab. The rest of nebari remains operational, but there is no way to get conda-store out of this state immediately.
This happened to me last night and when I came back this morning it had worked itself into an operational state again. I went through the above process this morning on one nebari deployment and the same thing happened. Then I went through the above process on a different nebari deployment and the same thing happened.
I will also note that I tried to Delete environments and saw similar behavior but I'm not sure if that was because it was already in a broken state or if that also is causing something similar to happen.
Expected behavior
I expect to be able to Edit and Save environments.
How to Reproduce the problem?
Reproducer explained above.
Output
No response
Versions and dependencies used.
No response
Anything else?
No response