Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] - Submitting multiple builds can fill queue causing frontend to be stop working #598

Closed
kcpevey opened this issue Sep 26, 2023 · 16 comments · Fixed by #622
Closed
Labels
impact: high 🟥 This issue affects most of the conda-store users or is a critical issue type: bug 🐛 Something isn't working

Comments

@kcpevey
Copy link
Contributor

kcpevey commented Sep 26, 2023

Describe the bug

I'll describe this issue as I've seen it through the Nebari jlab conda-store UI, but I believe the issue is likely the backend.

Open an existing env in the UI, go to edit, make a minor change, click Create. Normally upon clicking Create, you'd be redirected back to the "non-edit" UI screen, but now clicking the button doesn't appear to do anything.

Now attempt to go to the /conda-store/admin page - it will not load anything, just spin.

Now shutdown your server, and log back in. Going to the conda-store UI and attempting to log in will have no effect, you just get the spinning logo in the browser tab. The rest of nebari remains operational, but there is no way to get conda-store out of this state immediately.

image

This happened to me last night and when I came back this morning it had worked itself into an operational state again. I went through the above process this morning on one nebari deployment and the same thing happened. Then I went through the above process on a different nebari deployment and the same thing happened.

I will also note that I tried to Delete environments and saw similar behavior but I'm not sure if that was because it was already in a broken state or if that also is causing something similar to happen.

Expected behavior

I expect to be able to Edit and Save environments.

How to Reproduce the problem?

Reproducer explained above.

Output

No response

Versions and dependencies used.

No response

Anything else?

No response

@kcpevey kcpevey added type: bug 🐛 Something isn't working impact: high 🟥 This issue affects most of the conda-store users or is a critical issue labels Sep 26, 2023
@kcpevey
Copy link
Contributor Author

kcpevey commented Sep 26, 2023

Update: it worked itself back into unstuck state after about an hour or so - Now I can log into conda-store and use the UI to VIEW envs

@nkaretnikov
Copy link
Contributor

@kcpevey I cannot repro on 64552a28e649d7305ec971332cfb739286769ec9, tag: 2023.9.2.

Also, I'm confused by this:

Open an existing env in the UI, go to edit, make a minor change, click Create. Normally upon clicking Create, you'd be redirected back to the "non-edit" UI screen, but now clicking the button doesn't appear to do anything.

Which URL are you accessing? There's no Create button. I've tried editing envs via the new UI (front page) and via http://localhost:5000/conda-store/admin. Cannot repro either way.

@kcpevey
Copy link
Contributor Author

kcpevey commented Sep 26, 2023

I'll have to wait until someone with access to the logs can look into this.

@kcpevey
Copy link
Contributor Author

kcpevey commented Sep 26, 2023

This was an incompatibility between the conda-store packages releases. I think we had the new conda-store release 2023.9.2 and older conda-store jlab release. At any rate, we've upgraded to the suggested versions here nebari-dev/nebari#1948 and this is now resolved.

@kcpevey kcpevey closed this as completed Sep 26, 2023
@github-project-automation github-project-automation bot moved this from New 🚦 to Done 💪🏾 in conda-store 🐍 Sep 26, 2023
@kcpevey kcpevey reopened this Sep 29, 2023
@github-project-automation github-project-automation bot moved this from Done 💪🏾 to New 🚦 in conda-store 🐍 Sep 29, 2023
@kcpevey
Copy link
Contributor Author

kcpevey commented Sep 29, 2023

Reopening since we've seen this again after updating to the recommended versions nebari-dev/nebari#1948 (comment)

@iameskild
Copy link
Contributor

@nkaretnikov @kcpevey and I were able to reproduce this error. When the same (or multiple users) submit multiple builds at roughly the same time, the conda-store QueuePool limit (5 build) is reached. We can tell that this is the case by inspecting the logs on the backend:

conda-store-server logs
search=voila&exact=true&distinct_on=version&page=1&order=desc&sort_by=version HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/anyio/streams/memory.py", line 98, in receive
    return self.receive_nowait()
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/anyio/streams/memory.py", line 93, in receive_nowait
    raise WouldBlock
anyio.WouldBlock

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/starlette/middleware/base.py", line 78, in call_next
    message = await recv_stream.receive()
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/anyio/streams/memory.py", line 118, in receive
    raise EndOfStream
anyio.EndOfStream

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/fastapi/applications.py", line 292, in __call__
    await super().__call__(scope, receive, send)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/starlette/middleware/base.py", line 108, in __call__
    response = await self.dispatch_func(request, call_next)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/conda_store_server/server/app.py", line 235, in conda_store_middleware
    response = await call_next(request)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/starlette/middleware/base.py", line 84, in call_next
    raise app_exc
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/starlette/middleware/base.py", line 70, in coro
    await self.app(scope, receive_or_disconnect, send_no_error)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/starlette/middleware/sessions.py", line 86, in __call__
    await self.app(scope, receive, send_wrapper)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/fastapi/routing.py", line 273, in app
    raw_response = await run_endpoint_function(
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/fastapi/routing.py", line 190, in run_endpoint_function
    return await dependant.call(**values)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/conda_store_server/server/views/api.py", line 819, in api_list_packages
    return paginated_api_response(
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/conda_store_server/server/views/api.py", line 98, in paginated_api_response
    count = query.count()
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 3176, in count
    return self._from_self(col).enable_eagerloads(False).scalar()
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2893, in scalar
    ret = self.one()
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2870, in one
    return self._iter().one()
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2916, in _iter
    result = self.session.execute(
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1713, in execute
    conn = self._connection_for_bind(bind)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1552, in _connection_for_bind
    return self._transaction._connection_for_bind(
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 747, in _connection_for_bind
    conn = bind.connect()
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3315, in connect
    return self._connection_cls(self, close_with_result=close_with_result)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 96, in __init__
    else engine.raw_connection()
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3394, in raw_connection
    return self._wrap_pool_connect(self.pool.connect, _connection)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3361, in _wrap_pool_connect
    return fn()
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 327, in connect
    return _ConnectionFairy._checkout(self)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 894, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 493, in checkout
    rec = pool._do_get()
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/sqlalchemy/pool/impl.py", line 134, in _do_get
    raise exc.TimeoutError(
sqlalchemy.exc.TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30.00 (Background on this error at: https://sqlalche.me/e/14/3o7r)

When this happens, the front end becomes gets completely blocked and no user interaction is permitted. Once the builds in the queue are processed, the front end becomes usable again.

@iameskild iameskild changed the title [BUG] - Editing and saving environment leaves conda-store in broken state [BUG] - Submitting multiple builds can fill queue causing frontend to be stop working Oct 3, 2023
@costrouc
Copy link
Member

costrouc commented Oct 4, 2023

This issue has to do with SQLAlchemy, Sessions, and FastAPI (threads/async/await). I have spent a long time trying to figure issues around this... I don't understand it

@costrouc
Copy link
Member

costrouc commented Oct 4, 2023

We talked about this today and @nkaretnikov will be writing an integration test to hopefully reproduce this issue. I think that this issue will only surface when using non-sqlite databases as a backend.

@nkaretnikov
Copy link
Contributor

nkaretnikov commented Oct 4, 2023

Summary of the call:

  • Write this as an integration test, by calling the API. This will be tested with postgres/docker on CI (see integration-test-conda-store-server).
  • Use this issue to repro
  • Look into the number of DB connections we're making. Clicking a button 4 times shouldn't lock up Postgres.
  • Some things to look at:
    • Code where QueuePool is used (try a different pool (NullPool)?, pool might need to be bigger)
    • Code where get_db is defined (could be something like an issue with sharing database session between threads)
    • How db session object is used
    • FastAPI and ThreadPoolExecutor
    • etc.

P.S. There's another issue related to parallel builds that we might run into. Chris said conda/mamba write stuff into the same directory and this might be a race condition if two builds are launched at the same time. One package could be writing an archive that will be overwritten by another download, which will get corrupted.

@trallard
Copy link
Collaborator

trallard commented Oct 5, 2023

We should perhaps be using something like flower to monitor the celery tasks Flower — Flower 2.0.0 documentation https://flower.readthedocs.io/

Also we are using Redis as some sort of broker right @costrouc?
If so the worker might be polling the database for new tasks causing the IO to raise (a lot) and slowing down/freezing the web app.
It might be more efficient using a proper broker like RabbitMQ - that way the queues will reside in memory and will stop polling the database.
In such case even if the broker gets clobbered it should not bring the whole app down.

@dharhas
Copy link
Member

dharhas commented Oct 10, 2023

Can we just set num_builds=1 for now and bypass this issue by disabling parallel builds.

@pavithraes
Copy link
Member

From today's meeting:

  • We can make a short-term fix to unblock release by changing QueuePool to NullPool
  • Long term, we need to look into the core issue here - @costrouc will open a new issue about this with historical context

@nkaretnikov
Copy link
Contributor

nkaretnikov commented Oct 11, 2023

Started working on this.

UPD: Hmm, I still cannot repro on 6d66e77 when running in docker. I'll see what version nebari uses and try that one.

@trallard trallard moved this from New 🚦 to In Progress 🏗 in conda-store 🐍 Oct 11, 2023
@trallard
Copy link
Collaborator

Adding a tad more context

We can make a short-term fix to unblock release by changing QueuePool to NullPool

For this item to be completed and merged (potentially) the change is needed + adding the relevant tests

@nkaretnikov

This comment was marked as outdated.

nkaretnikov pushed a commit to nkaretnikov/conda-store that referenced this issue Oct 13, 2023
nkaretnikov pushed a commit to nkaretnikov/conda-store that referenced this issue Oct 15, 2023
nkaretnikov pushed a commit to nkaretnikov/conda-store that referenced this issue Oct 15, 2023
nkaretnikov pushed a commit to nkaretnikov/conda-store that referenced this issue Oct 15, 2023
@nkaretnikov
Copy link
Contributor

nkaretnikov commented Oct 15, 2023

Should be fixed by #622. Added a test, which used to fail and now passes. The issue was due to not closing Sessions properly, so we can keep using QueuePool.

If you're interested to know how I arrived at this solution, see my hidden comment above. It contains a lot of information and a lot of wrong assumptions, so read from the end. I'm keeping it around for future reference.

costrouc pushed a commit that referenced this issue Oct 16, 2023
@github-project-automation github-project-automation bot moved this from In Progress 🏗 to Done 💪🏾 in conda-store 🐍 Oct 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
impact: high 🟥 This issue affects most of the conda-store users or is a critical issue type: bug 🐛 Something isn't working
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

7 participants