Missing tests - recovery tests #1766
Comments
For the admin port health check(#2092), we'd also need to test:
With |
Dropping an idea for my own future reference here: To allow breaking / unbreaking the pgrst <-> pg connection, we can create an individual symlink to the pg socket for each test-case - and then rename that accordingly. If renamed to something else this will break the connection. This will allow us to keep the PG server up - should be a lot faster than starting and stopping all the time. And it will not prevent us from running the io tests in parallel down the road. |
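A minimal sketch of the symlink idea above, using `System.Directory` from the `directory` package. The `SocketLink` type, the helper names, and the `.broken` suffix are made up for illustration, and it assumes PostgREST is pointed at the link rather than at the real socket. Note that a later comment in this thread reports that moving the socket file does not break connections that are already established, so this would presumably only affect new connection attempts.

```haskell
import System.Directory
  ( createFileLink
  , doesPathExist
  , removeFile
  , renamePath
  )

-- | Hypothetical per-test handle: the path PostgREST would be configured
-- with, and the path the link is parked at while the connection is "broken".
data SocketLink = SocketLink
  { linkPath   :: FilePath
  , parkedPath :: FilePath
  }

-- | Create a fresh symlink to the real pg socket for one test case.
newSocketLink :: FilePath -> FilePath -> IO SocketLink
newSocketLink realSocket link = do
  createFileLink realSocket link
  pure (SocketLink link (link ++ ".broken"))

-- | Rename the link away so new connection attempts cannot find the socket.
breakLink :: SocketLink -> IO ()
breakLink l = renamePath (linkPath l) (parkedPath l)

-- | Rename the link back so new connection attempts can succeed again.
unbreakLink :: SocketLink -> IO ()
unbreakLink l = renamePath (parkedPath l) (linkPath l)

-- | True while the link is parked, i.e. nothing exists at the configured path.
isBroken :: SocketLink -> IO Bool
isBroken l = not <$> doesPathExist (linkPath l)

-- | Test teardown: remove the link (expects it to be in the unbroken state).
removeSocketLink :: SocketLink -> IO ()
removeSocketLink l = removeFile (linkPath l)
```

Since every test case gets its own link, breaking one does not touch the others, which is what would keep parallel runs possible.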
At startup, when the schema cache load fails(could be because a This is done here: postgrest/src/PostgREST/Workers.hs Lines 93 to 94 in e13d912
It is not simple to remove the autoreload of the schema cache, because the connection can be lost in the connection worker (which can be due to a DNS error, a "too many clients" error from pg, etc.). If we remove the autoreload we might have a regression like #1685. Somehow joining the schema cache loading process with the connection worker and retrying it with exponential backoff (instead of an endless loop) could be a solution. |
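To make the "exponential backoff instead of an endless loop" idea concrete, here is a small sketch using only `base`. The names `backoffDelays` and `retryWithBackoff` are hypothetical and this is not how PostgREST's worker is implemented; the action is a plain `IO (Either e a)` standing in for "load the schema cache".

```haskell
import Control.Concurrent (threadDelay)

-- | Delays in microseconds: start at `base`, double each time, never exceed
-- `cap`. Capping inside the iteration keeps the values from overflowing.
backoffDelays :: Int -> Int -> [Int]
backoffDelays base cap = iterate (\d -> min cap (d * 2)) (min cap base)

-- | Run `action` until it succeeds, `maxAttempts` is reached, or the list of
-- delays runs out. Returns the last result when giving up.
retryWithBackoff :: Int -> [Int] -> IO (Either e a) -> IO (Either e a)
retryWithBackoff maxAttempts delays action = go maxAttempts delays
  where
    go n ds = do
      result <- action
      case (result, ds) of
        (Right _, _) -> pure result
        (Left _, d : rest)
          | n > 1 -> threadDelay d >> go (n - 1) rest
        _ -> pure result
```

For example, `take 5 (backoffDelays 1000000 4000000)` gives waits of 1s, 2s, 4s, 4s, 4s. Bounding the attempts means the caller gets the failure back and can decide what to do, rather than looping forever.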
Optional: A recovery test for EMFILE could be added as well: #2158 (comment) |
Metrics (#2129) are required to test #1094 and #2364 (pool protection). Related: PostgREST/postgrest-docs#557 |
We also need tests for when we abort the recovery procedure. Namely postgrest/src/PostgREST/AppState.hs Lines 451 to 475 in 07cb0b5
Pending refactor
The recovery logic can now live entirely inside AppState, which would make it more understandable/manageable. So right now we pass postgrest/src/PostgREST/App.hs Lines 102 to 103 in 54b9a0b
And then we activate postgrest/src/PostgREST/App.hs Lines 126 to 130 in 54b9a0b
Instead, this logic could be inside postgrest/src/PostgREST/AppState.hs Lines 142 to 148 in 07cb0b5
We'd just have to catch SQL exceptions that we map to 503 there. Like: postgrest/src/PostgREST/Error.hs Line 408 in 54b9a0b
I hesitate to refactor this right now, since we don't have tests. |
Looking at the libpq haskell lib, it has a reset function that says:
That makes me wonder if we could have the connection recovery off-core. Say in a The ideal interface for us would expose a
useRetry :: Pool -> IO () -> IO () -> Session a -> IO (Either WrappedUsageError a)
useRetry pool retryAction successAction sess = Hasql.use pool -- ...
On On
This seems generally useful outside of PostgREST to me. @robx WDYT? The Having a wait time(which we could equate to This would be very interesting bc it's somewhat similar to pgbouncer pause/resume. Later on it could be used as a way to scale to multiple databases(#2798). |
Sorry, have been offline for a bit and missed this. Catching up these days.
Regarding the concrete question about I don't think that's going to be particularly useful for us. All it does is close the underlying connection and open a new one. This would only save allocating a new connection object (and the pool management overhead); but that should be insignificant compared to actually establishing a new connection to the postgres server.
I think I like the idea of a generic
But then why not leave running |
Yes, for example when the password changes upstream (retrying is no use) - then the user would have to edit the database connection string anyway. We also have some extra conditions on checkIsFatal for stopping retrying. Hm, maybe
Hm, yeah. I think that could work too. The interface was just an idea. In case it helps, I've documented the recovery process here. |
@robx Thinking more about it, we can have a much simpler interface. Just:
useRetry :: Pool -> Session a -> IO (Either WrappedUsageError a)
useRetry pool sess = Hasql.use pool -- ...
No postgrest/src/PostgREST/AppState.hs Lines 142 to 148 in 078c6ec
The resulting
No
That would also be unnecessary since we can also act on the So really the main goal is to have |
Also, I was thinking we should have this timeout be equal to the acquisition timeout, but maybe it can be another configurable timeout. I see that HikariCP has an initializationFailTimeout, which is similar to what we want to do here. |
The simplest initial test case I think would be having a Then we would cover other cases like a socket error as Wolfgang mentioned above. |
Hm, forgot about one thing. So say we lose the connection and at this time the user runs migrations on the db, event trigger notifications won't fire for us. So maybe:
useRetry :: Pool -> Session a -> IO (Either WrappedUsageError (a, Bool))
The |
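A rough sketch of what a `useRetry` with a bounded wait and a `Bool` in the result could look like. The explanation of the `Bool` is cut off above, so this assumes it means "at least one retry happened", which the caller could take as a hint to reload the schema cache. To stay runnable without a database, `Hasql.use pool` is replaced by a plain function argument, and the budget, pause, and `retryable` parameters are inventions of this sketch, not part of the proposed signature.

```haskell
import Control.Concurrent (threadDelay)
import Data.IORef (newIORef, readIORef, writeIORef)

-- | Generic stand-in for the proposed `useRetry`.
useRetry
  :: Int                        -- total wait budget, in microseconds
  -> Int                        -- pause between attempts, in microseconds
  -> (e -> Bool)                -- is this error worth retrying?
  -> (sess -> IO (Either e a))  -- stand-in for `Hasql.use pool`
  -> sess
  -> IO (Either e (a, Bool))
useRetry budget pause retryable use sess = go budget False
  where
    go remaining retried = do
      result <- use sess
      case result of
        Right a -> pure (Right (a, retried))
        Left err
          | retryable err && pause > 0 && remaining >= pause -> do
              threadDelay pause
              go (remaining - pause) True
          | otherwise -> pure (Left err)

-- | Usage example: a fake "session" that fails twice, then succeeds.
demo :: IO (Either String (Int, Bool))
demo = do
  failuresLeft <- newIORef (2 :: Int)
  let fakeUse () = do
        n <- readIORef failuresLeft
        if n > 0
          then writeIORef failuresLeft (n - 1) >> pure (Left "connection lost")
          else pure (Right 42)
  useRetry 5000 1000 (== "connection lost") fakeUse ()
```

Here `demo` returns `Right (42, True)`: the value arrives after two retries and the flag records that the connection was lost in between. A non-retryable error (the password-changed case mentioned earlier) is returned immediately.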
Related to the above, I just tried moving the socket file:
And it does not break the connection if it's already established, doing The listener doesn't fail either. So it looks like if we want to add io tests for this, we also need to wait for the pool lifetime (which looks prone to CI errors) or else find another way to immediately break connections. |
We need a recovery test for only breaking a LISTEN connection too. Related to #3572. |
Currently recovery tests are done manually; it'd be great to have them as automated tests.
These are the main scenarios:
(the connection recovery worker is referred to as just "worker")
1. postgrest started with a pg connection, then pg becomes unavailable
{"details":"no connection to the server\n","message":"Database client error. Retrying the connection."}
ALTER ROLE postgrest_test_authenticator SET pgrst.db_schemas = 'public';
and try a GET /public_consumers, which should give a 404 if the in-db config isn't re-read.
2. unavailable pg, postgrest started
503 {"message":"Database connection lost. Retrying the connection."}
Connection refused. This must be because of the mvarConnectionStatus MVar, it doesn't happen on 1 though.
3. SIGUSR1 - NOTIFY reload schema
refIsWorkerOn, this can be confirmed by doing several SIGUSR1 and just noting one "Attempting to reconnect to the database in 1 seconds..." message. If refIsWorkerOn is removed, there will be several "Attempting to reconnect to the database in 1 seconds..." messages.
listener recovers, e.g. doing a NOTIFY 'reload cache/load config' should work after recovery.
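The behaviour that scenario 3 checks, several SIGUSR1 signals producing only one reconnect message, can be modelled with an atomic flag. The sketch below uses only `base`; `startWorkerOnce` and `demo` are hypothetical names, and while the flag plays the role of refIsWorkerOn, this is an illustration rather than the code in PostgREST.

```haskell
import Control.Concurrent (forkIO, newEmptyMVar, putMVar, takeMVar)
import Control.Exception (finally)
import Control.Monad (void)
import Data.IORef (IORef, atomicModifyIORef', atomicWriteIORef, newIORef)

-- | Start `worker` in a new thread unless one is already running.
-- Returns True if this call started the worker.
startWorkerOnce :: IORef Bool -> IO () -> IO Bool
startWorkerOnce isWorkerOn worker = do
  -- Atomically set the flag and learn whether it was already set.
  alreadyOn <- atomicModifyIORef' isWorkerOn (\on -> (True, on))
  if alreadyOn
    then pure False
    else do
      -- Clear the flag when the worker ends, even if it throws.
      void (forkIO (worker `finally` atomicWriteIORef isWorkerOn False))
      pure True

-- | Usage example: three "signals" arrive while the worker is still busy.
demo :: IO [Bool]
demo = do
  isWorkerOn <- newIORef False
  release <- newEmptyMVar
  started <- mapM (const (startWorkerOnce isWorkerOn (takeMVar release))) [1 :: Int, 2, 3]
  putMVar release ()  -- let the single worker finish
  pure started
```

`demo` returns `[True, False, False]`: only the first signal starts a worker. An automated test for this scenario could assert the same thing from the outside by counting the reconnect messages in the log.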