[BUG] - Submitting multiple builds can fill queue causing frontend to be stop working #598

kcpevey · 2023-09-26T13:54:20Z

Describe the bug

I'll describe this issue as I've seen it through the Nebari jlab conda-store UI, but I believe the issue is likely the backend.

Open an existing env in the UI, go to edit, make a minor change, click Create. Normally upon clicking Create, you'd be redirected back to the "non-edit" UI screen, but now clicking the button doesn't appear to do anything.

Now attempt to go to the /conda-store/admin page - it will not load anything, just spin.

Now shutdown your server, and log back in. Going to the conda-store UI and attempting to log in will have no effect, you just get the spinning logo in the browser tab. The rest of nebari remains operational, but there is no way to get conda-store out of this state immediately.

This happened to me last night and when I came back this morning it had worked itself into an operational state again. I went through the above process this morning on one nebari deployment and the same thing happened. Then I went through the above process on a different nebari deployment and the same thing happened.

I will also note that I tried to Delete environments and saw similar behavior but I'm not sure if that was because it was already in a broken state or if that also is causing something similar to happen.

Expected behavior

I expect to be able to Edit and Save environments.

How to Reproduce the problem?

Reproducer explained above.

Output

No response

Versions and dependencies used.

No response

Anything else?

No response

The text was updated successfully, but these errors were encountered:

kcpevey · 2023-09-26T14:40:13Z

Update: it worked itself back into unstuck state after about an hour or so - Now I can log into conda-store and use the UI to VIEW envs

nkaretnikov · 2023-09-26T14:41:44Z

@kcpevey I cannot repro on 64552a28e649d7305ec971332cfb739286769ec9, tag: 2023.9.2.

Also, I'm confused by this:

Open an existing env in the UI, go to edit, make a minor change, click Create. Normally upon clicking Create, you'd be redirected back to the "non-edit" UI screen, but now clicking the button doesn't appear to do anything.

Which URL are you accessing? There's no Create button. I've tried editing envs via the new UI (front page) and via http://localhost:5000/conda-store/admin. Cannot repro either way.

kcpevey · 2023-09-26T14:46:39Z

I'll have to wait until someone with access to the logs can look into this.

kcpevey · 2023-09-26T18:53:12Z

This was an incompatibility between the conda-store packages releases. I think we had the new conda-store release 2023.9.2 and older conda-store jlab release. At any rate, we've upgraded to the suggested versions here nebari-dev/nebari#1948 and this is now resolved.

kcpevey · 2023-09-29T11:37:20Z

Reopening since we've seen this again after updating to the recommended versions nebari-dev/nebari#1948 (comment)

iameskild · 2023-10-03T17:06:51Z

@nkaretnikov @kcpevey and I were able to reproduce this error. When the same (or multiple users) submit multiple builds at roughly the same time, the conda-store QueuePool limit (5 build) is reached. We can tell that this is the case by inspecting the logs on the backend:

conda-store-server logs

search=voila&exact=true&distinct_on=version&page=1&order=desc&sort_by=version HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/anyio/streams/memory.py", line 98, in receive
    return self.receive_nowait()
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/anyio/streams/memory.py", line 93, in receive_nowait
    raise WouldBlock
anyio.WouldBlock

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/starlette/middleware/base.py", line 78, in call_next
    message = await recv_stream.receive()
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/anyio/streams/memory.py", line 118, in receive
    raise EndOfStream
anyio.EndOfStream

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/fastapi/applications.py", line 292, in __call__
    await super().__call__(scope, receive, send)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/starlette/middleware/base.py", line 108, in __call__
    response = await self.dispatch_func(request, call_next)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/conda_store_server/server/app.py", line 235, in conda_store_middleware
    response = await call_next(request)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/starlette/middleware/base.py", line 84, in call_next
    raise app_exc
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/starlette/middleware/base.py", line 70, in coro
    await self.app(scope, receive_or_disconnect, send_no_error)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/starlette/middleware/sessions.py", line 86, in __call__
    await self.app(scope, receive, send_wrapper)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/fastapi/routing.py", line 273, in app
    raw_response = await run_endpoint_function(
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/fastapi/routing.py", line 190, in run_endpoint_function
    return await dependant.call(**values)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/conda_store_server/server/views/api.py", line 819, in api_list_packages
    return paginated_api_response(
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/conda_store_server/server/views/api.py", line 98, in paginated_api_response
    count = query.count()
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 3176, in count
    return self._from_self(col).enable_eagerloads(False).scalar()
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2893, in scalar
    ret = self.one()
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2870, in one
    return self._iter().one()
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2916, in _iter
    result = self.session.execute(
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1713, in execute
    conn = self._connection_for_bind(bind)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1552, in _connection_for_bind
    return self._transaction._connection_for_bind(
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 747, in _connection_for_bind
    conn = bind.connect()
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3315, in connect
    return self._connection_cls(self, close_with_result=close_with_result)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 96, in __init__
    else engine.raw_connection()
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3394, in raw_connection
    return self._wrap_pool_connect(self.pool.connect, _connection)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3361, in _wrap_pool_connect
    return fn()
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 327, in connect
    return _ConnectionFairy._checkout(self)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 894, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 493, in checkout
    rec = pool._do_get()
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/sqlalchemy/pool/impl.py", line 134, in _do_get
    raise exc.TimeoutError(
sqlalchemy.exc.TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30.00 (Background on this error at: https://sqlalche.me/e/14/3o7r)

When this happens, the front end becomes gets completely blocked and no user interaction is permitted. Once the builds in the queue are processed, the front end becomes usable again.

costrouc · 2023-10-04T15:43:51Z

This issue has to do with SQLAlchemy, Sessions, and FastAPI (threads/async/await). I have spent a long time trying to figure issues around this... I don't understand it

costrouc · 2023-10-04T16:02:02Z

We talked about this today and @nkaretnikov will be writing an integration test to hopefully reproduce this issue. I think that this issue will only surface when using non-sqlite databases as a backend.

nkaretnikov · 2023-10-04T16:09:57Z

Summary of the call:

Write this as an integration test, by calling the API. This will be tested with postgres/docker on CI (see integration-test-conda-store-server).
Use this issue to repro
Look into the number of DB connections we're making. Clicking a button 4 times shouldn't lock up Postgres.
Some things to look at:
- Code where QueuePool is used (try a different pool (NullPool)?, pool might need to be bigger)
- Code where get_db is defined (could be something like an issue with sharing database session between threads)
- How db session object is used
- FastAPI and ThreadPoolExecutor
- etc.

P.S. There's another issue related to parallel builds that we might run into. Chris said conda/mamba write stuff into the same directory and this might be a race condition if two builds are launched at the same time. One package could be writing an archive that will be overwritten by another download, which will get corrupted.

trallard · 2023-10-05T19:23:58Z

We should perhaps be using something like flower to monitor the celery tasks Flower — Flower 2.0.0 documentation https://flower.readthedocs.io/

Also we are using Redis as some sort of broker right @costrouc?
If so the worker might be polling the database for new tasks causing the IO to raise (a lot) and slowing down/freezing the web app.
It might be more efficient using a proper broker like RabbitMQ - that way the queues will reside in memory and will stop polling the database.
In such case even if the broker gets clobbered it should not bring the whole app down.

dharhas · 2023-10-10T14:01:42Z

Can we just set num_builds=1 for now and bypass this issue by disabling parallel builds.

pavithraes · 2023-10-10T16:41:47Z

From today's meeting:

We can make a short-term fix to unblock release by changing QueuePool to NullPool
Long term, we need to look into the core issue here - @costrouc will open a new issue about this with historical context

nkaretnikov · 2023-10-11T07:14:01Z

Started working on this.

UPD: Hmm, I still cannot repro on 6d66e77 when running in docker. I'll see what version nebari uses and try that one.

trallard · 2023-10-11T12:03:39Z

Adding a tad more context

We can make a short-term fix to unblock release by changing QueuePool to NullPool

For this item to be completed and merged (potentially) the change is needed + adding the relevant tests

Fixes conda-incubator#598.

nkaretnikov · 2023-10-15T14:08:33Z

Should be fixed by #622. Added a test, which used to fail and now passes. The issue was due to not closing Sessions properly, so we can keep using QueuePool.

If you're interested to know how I arrived at this solution, see my hidden comment above. It contains a lot of information and a lot of wrong assumptions, so read from the end. I'm keeping it around for future reference.

Fixes #598.

kcpevey added type: bug 🐛 Something isn't working impact: high 🟥 This issue affects most of the conda-store users or is a critical issue labels Sep 26, 2023

github-project-automation bot added this to conda-store 🐍 Sep 26, 2023

github-project-automation bot moved this to New 🚦 in conda-store 🐍 Sep 26, 2023

kcpevey closed this as completed Sep 26, 2023

github-project-automation bot moved this from New 🚦 to Done 💪🏾 in conda-store 🐍 Sep 26, 2023

kcpevey reopened this Sep 29, 2023

github-project-automation bot moved this from Done 💪🏾 to New 🚦 in conda-store 🐍 Sep 29, 2023

iameskild changed the title ~~[BUG] - Editing and saving environment leaves conda-store in broken state~~ [BUG] - Submitting multiple builds can fill queue causing frontend to be stop working Oct 3, 2023

nkaretnikov mentioned this issue Oct 4, 2023

Add a test for concurrent builds #567

Closed

pavithraes mentioned this issue Oct 5, 2023

[MAINT] - Integrate latest conda-store versions nebari-dev/nebari#1948

Closed

trallard moved this from New 🚦 to In Progress 🏗 in conda-store 🐍 Oct 11, 2023

This comment was marked as outdated.

Sign in to view

nkaretnikov pushed a commit to nkaretnikov/conda-store that referenced this issue Oct 13, 2023

Bug fix: do not share Session objects

866395a

Fixes conda-incubator#598.

nkaretnikov mentioned this issue Oct 13, 2023

Bug fix: use context manager for Sessions #622

Merged

nkaretnikov pushed a commit to nkaretnikov/conda-store that referenced this issue Oct 15, 2023

Bug fix: use context manager for Sessions

cf7ba93

Fixes conda-incubator#598.

nkaretnikov pushed a commit to nkaretnikov/conda-store that referenced this issue Oct 15, 2023

Bug fix: use context manager for Sessions

e5b30cf

Fixes conda-incubator#598.

nkaretnikov pushed a commit to nkaretnikov/conda-store that referenced this issue Oct 15, 2023

Bug fix: use context manager for Sessions

9053e86

Fixes conda-incubator#598.

costrouc closed this as completed in #622 Oct 16, 2023

costrouc pushed a commit that referenced this issue Oct 16, 2023

Bug fix: use context manager for Sessions (#622)

cabbc0e

Fixes #598.

github-project-automation bot moved this from In Progress 🏗 to Done 💪🏾 in conda-store 🐍 Oct 16, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] - Submitting multiple builds can fill queue causing frontend to be stop working #598

[BUG] - Submitting multiple builds can fill queue causing frontend to be stop working #598

kcpevey commented Sep 26, 2023 •

edited

Loading

kcpevey commented Sep 26, 2023

nkaretnikov commented Sep 26, 2023

kcpevey commented Sep 26, 2023

kcpevey commented Sep 26, 2023

kcpevey commented Sep 29, 2023

iameskild commented Oct 3, 2023

costrouc commented Oct 4, 2023

costrouc commented Oct 4, 2023

nkaretnikov commented Oct 4, 2023 •

edited

Loading

trallard commented Oct 5, 2023

dharhas commented Oct 10, 2023 •

edited

Loading

pavithraes commented Oct 10, 2023

nkaretnikov commented Oct 11, 2023 •

edited

Loading

trallard commented Oct 11, 2023

This comment was marked as outdated.

nkaretnikov commented Oct 15, 2023 •

edited

Loading

[BUG] - Submitting multiple builds can fill queue causing frontend to be stop working #598

[BUG] - Submitting multiple builds can fill queue causing frontend to be stop working #598

Comments

kcpevey commented Sep 26, 2023 • edited Loading

Describe the bug

Expected behavior

How to Reproduce the problem?

Output

Versions and dependencies used.

Anything else?

kcpevey commented Sep 26, 2023

nkaretnikov commented Sep 26, 2023

kcpevey commented Sep 26, 2023

kcpevey commented Sep 26, 2023

kcpevey commented Sep 29, 2023

iameskild commented Oct 3, 2023

costrouc commented Oct 4, 2023

costrouc commented Oct 4, 2023

nkaretnikov commented Oct 4, 2023 • edited Loading

trallard commented Oct 5, 2023

dharhas commented Oct 10, 2023 • edited Loading

pavithraes commented Oct 10, 2023

nkaretnikov commented Oct 11, 2023 • edited Loading

trallard commented Oct 11, 2023

This comment was marked as outdated.

nkaretnikov commented Oct 15, 2023 • edited Loading

kcpevey commented Sep 26, 2023 •

edited

Loading

nkaretnikov commented Oct 4, 2023 •

edited

Loading

dharhas commented Oct 10, 2023 •

edited

Loading

nkaretnikov commented Oct 11, 2023 •

edited

Loading

nkaretnikov commented Oct 15, 2023 •

edited

Loading