IO tests depend too heavily on timing / sleeping #3424
Maybe we can separate those tests (…)
This seems like a related issue in CI: https://github.com/PostgREST/postgrest/actions/runs/8788477073/job/24116006025?pr=3427. We certainly want to run them in CI, so separating them wouldn't help, I guess.
Another random CI failure: https://github.com/PostgREST/postgrest/actions/runs/8842033759/job/24280038657?pr=3453
Hm, perhaps some of the pool timeout errors could be solved by using the new metrics, specifically https://postgrest.org/en/latest/references/observability.html#pgrst-db-pool-timeouts-total
Additionally, I think the acquisition timeout can be less than a second (https://hackage.haskell.org/package/hasql-pool-1.1/docs/Hasql-Pool-Config.html#v:acquisitionTimeout), but I'm not sure our config allows that right now.
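As a rough sketch of how an IO test could use that counter instead of sleeping: the helper below reads the documented pgrst_db_pool_timeouts_total value from the admin /metrics endpoint. The admin address and the use of the requests library are assumptions about the test setup, not code from the repository.

```python
import requests  # assumed to be available in the IO test environment

ADMIN_URL = "http://localhost:3001"  # hypothetical admin server address

def pool_timeouts_total():
    "Read the pgrst_db_pool_timeouts_total counter from the admin /metrics endpoint."
    resp = requests.get(f"{ADMIN_URL}/metrics")
    resp.raise_for_status()
    for line in resp.text.splitlines():
        # the Prometheus '# HELP' / '# TYPE' comment lines don't match this prefix
        if line.startswith("pgrst_db_pool_timeouts_total"):
            return float(line.rsplit(" ", 1)[1])
    return 0.0
```

A test could record this value before triggering the scenario and assert that it increased afterwards, rather than waiting a fixed amount of time.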
I've also noticed that sometimes the IO tests get stuck: https://github.com/PostgREST/postgrest/actions/runs/8857677646/job/24325423347?pr=3383. I had that happen just one time locally after #3229. That should be solved by using metrics, I think.
The IO test got stuck again: https://github.com/PostgREST/postgrest/actions/runs/8857677646/job/24325423347?pr=3383. We do some waiting when reading the logs (postgrest/test/io/postgrest.py, lines 60 to 70 in 7e5fd31).
But it's capped at 1 second, so it's weird that it gets stuck. Edit 1: https://stackoverflow.com/questions/24640079/process-stdout-readline-hangs-how-to-use-it-properly Edit 2: Flushing seems to have worked (#3465). This looks to be the test that fails: lines 1361 to 1389 in 7e5fd31.
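For reference, the usual way around a blocking readline on the child process (the pattern the Stack Overflow answer above describes) is to drain stdout in a background thread and poll a queue with a timeout. This is only a sketch of that pattern, not the actual code in test/io/postgrest.py, and the command line in the usage comment is illustrative:

```python
import queue
import subprocess
import threading

def start_with_log_queue(cmd):
    "Start a process and return it together with a queue fed by a background reader thread."
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, bufsize=1
    )
    log_lines = queue.Queue()

    def drain():
        for line in proc.stdout:  # readline only ever blocks inside this thread
            log_lines.put(line)

    threading.Thread(target=drain, daemon=True).start()
    return proc, log_lines

# Usage: poll with a timeout instead of a blocking readline.
# proc, logs = start_with_log_queue(["postgrest", "test.config"])
# try:
#     line = logs.get(timeout=1)
# except queue.Empty:
#     line = None  # no output within one second, but nothing hangs
```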
I looked into this a bit. It seems like the main challenge for the IO tests is that it's impossible to determine whether PostgREST is currently in the process of reloading the config / schema cache, or has already finished doing so and failed. In other words: we only return either 200 or 503 from /ready. I think it would be helpful if we returned more fine-grained status codes.
This should allow us to replace the following with calls to some variation of that endpoint:

```python
def sleep_until_postgrest_scache_reload():
    "Sleep until schema cache reload"
    time.sleep(0.3)


def sleep_until_postgrest_config_reload():
    "Sleep until config reload"
    time.sleep(0.2)


def sleep_until_postgrest_full_reload():
    "Sleep until schema cache plus config reload"
    time.sleep(0.3)
```

It would also be better for observability, I guess?
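As a rough illustration of what the replacement could look like: a helper could poll the admin /ready endpoint until it reports the expected state, instead of sleeping for a fixed time. The fine-grained expected_status values are part of the proposal above, not existing behaviour, and the admin address is an assumption about the test setup.

```python
import time
import requests  # assumed to be available in the IO test environment

ADMIN_URL = "http://localhost:3001"  # hypothetical admin server address

def wait_until_ready(expected_status=200, timeout=5, interval=0.05):
    "Poll the admin /ready endpoint until it returns the expected status or the timeout expires."
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"{ADMIN_URL}/ready").status_code == expected_status:
                return
        except requests.ConnectionError:
            pass  # admin server not reachable yet
        time.sleep(interval)
    raise TimeoutError("PostgREST did not reach the expected state in time")
```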
@wolfgangwalther Great idea! I agree.
This is happening on almost every commit I merge to main now; they just get stuck entirely somewhere in the IO tests. @steve-chavez, it would be great if you could look at implementing my proposal above. I am not really familiar with those parts of the code.
@wolfgangwalther Looking into this now. The Admin API has another bug too (a connection drop is not detected), so I will fix that first.
Looking more closely at the above proposal: 500 will only happen when the API socket is unreachable. For all other cases we always retry and the status will be 503. This also implies that the … This change turned out to be a bit more complex than I thought.
I'm not sure we should also use 503 for config reloading. During a config reload we're operational and serving requests normally, unlike schema cache reloading (which makes requests wait) and connection reloading (which makes requests fail). Perhaps we can use another 2xx for config reloading?
To avoid the breaking change, and also the problem of responding with 503 while we're operational, how about adding a custom health check endpoint? There we could have the more fine-grained behavior as-is. I'm thinking about a … @wolfgangwalther WDYT?
Do we not have any kind of back-off there? Are we really doing an endless loop of immediately retrying again?
503 is considered a "temporary" problem, so I would argue that this is actually a bugfix for /live. Also, did we really promise "/live fails with exactly 503", or is the promise our API gives rather "/live gives you 4xx or 5xx if something is wrong"? I doubt that anyone is using the live endpoint with an exact match on the error code. So I don't think we need to treat this as a breaking change.
Right, we should not try to put other stuff into the ready endpoint that is clearly not related.
At this stage it might be easier to just add better log parsing to the IO tests, because all of this information (config reload started, config reload finished, etc.) is available through the log, I think. That should still allow us to remove the quite random "wait for X seconds" cases I mentioned above. TL;DR: I think we don't need to change /ready and/or /live for the purpose of the IO tests. If you still want to do the improvement regarding "schema cache reload in progress" - cool.
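A minimal sketch of such a log-based wait, assuming the test has access to the process log lines through a queue (for example as sketched earlier in this thread); the message fragment passed in would have to match whatever PostgREST actually logs:

```python
import queue
import time

def wait_for_log(logs, fragment, timeout=5):
    "Consume queued log lines until one contains the given fragment or the timeout expires."
    deadline = time.time() + timeout
    while True:
        remaining = deadline - time.time()
        if remaining <= 0:
            break
        try:
            line = logs.get(timeout=remaining)
        except queue.Empty:
            break
        if fragment in line:
            return line
    raise TimeoutError(f"did not see {fragment!r} in the log within {timeout}s")
```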
Yes, we do have a backoff.
Agree.
Cool. Will finish that on #3490.
Parsing the log is still prone to failure, though; implementing a …
So check this out: Kubernetes has individual health checks. Since the admin server already has …
Hm, but that's only because k8s has different components which work independently of each other. That's not the case for us. It's not that Kubernetes could consume separate ready endpoints in any way; you'd still need only one ready endpoint to actually make use of it.
What kinds of failures do you have in mind? |
https://kubernetes.io/docs/reference/using-api/health-checks/#individual-health-checks I don't see why we cannot adopt that and have …
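Purely to illustrate the Kubernetes-style idea (none of these paths exist in PostgREST today, they are hypothetical), per-concern checks on the admin server could be consumed independently by the IO tests:

```python
import requests  # assumed to be available in the IO test environment

ADMIN_URL = "http://localhost:3001"  # hypothetical admin server address

# Hypothetical per-concern readiness checks, mirroring the /readyz/<check> style of k8s.
def connection_ready():
    return requests.get(f"{ADMIN_URL}/ready/connection").status_code == 200

def schema_cache_ready():
    return requests.get(f"{ADMIN_URL}/ready/schema-cache").status_code == 200

def config_ready():
    return requests.get(f"{ADMIN_URL}/ready/config").status_code == 200
```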
The error message can change, and reading the log can hang (#3465; that PR didn't solve it, it still happens: https://github.com/PostgREST/postgrest/actions/runs/8972240697/job/24639793119).
Are you sure the backoff works, for example, when the schema cache query fails due to a timeout like in the … I can reproduce the hanging IO tests with this test case, like this:
This immediately hangs forever. When I then add a …
Why are the timestamps not increasing here? This doesn't look like a backoff. It doesn't explain the hanging test yet; just an observation so far.
This indicates that there is no backoff involved and instead the reload of the schema cache is retried 10x per second: postgrest/src/PostgREST/AppState.hs, lines 169 to 175 in f9e9740
This does not match my observation, however, because I had 33 attempts to connect with the exact same timestamp (second precision). So either the display of the timestamp is broken, or this is repeated much more often than just 10x per second. I assume the 10x per second debounce only affects the "happy path" inside …
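For comparison, this is roughly what a capped exponential backoff around the schema cache query would look like. It's a conceptual sketch in Python of the general pattern, not the Haskell code in AppState.hs:

```python
import time

def load_with_backoff(load_once, max_delay=32.0):
    "Retry a failing load with exponentially growing waits, capped at max_delay seconds."
    delay = 1.0
    while True:
        try:
            return load_once()
        except Exception as err:
            print(f"load failed ({err}), retrying in {delay}s")
            time.sleep(delay)
            delay = min(delay * 2, max_delay)
```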
Merged that on #3490. I tried replacing the … Overall this part of the code is too hard to refactor/modify because we don't have tests (#1766; my latest attempt at making progress was on #1766 (comment)). It has effectively been at the "make it work" stage for many years now.
Right, the backoff is only for the connection recovery (postgrest/src/PostgREST/AppState.hs, lines 365 to 398 in 1b584f7).
I remember I kept it that way because the schema cache concept is tied to connections too (they have their own cache, #2620). But it could be improved. If we really want to improve this part we need the testing utilities (#1766); otherwise any attempt at refactoring will cause regressions.
I can reproduce the above by just doing … Additionally, since the connection recovery process is so tied to the schema cache, I think I can use the same backoff for both queries (the pg version query for the connection, likewise for the schema cache). Otherwise, if we had a dedicated backoff for the schema cache query, we wouldn't know whether it failed because the connection is down or because there was a statement timeout.
There is another failure of the IO tests after the latest commit to main (changelog): https://github.com/PostgREST/postgrest/actions/runs/9045664975/job/24855643429. This seems like just a simple timeout when starting up PostgREST in that test. Quite random, though.
From the log lines, that failure looks like the request was hitting the API server instead of the Admin server.
Because:
So maybe an error in the Python code? Maybe the …
The log says:
So surely quite random. I assume … More interesting is the fact that PostgREST doesn't seem to fail when both ports are set to the same value. When I try this locally, I always get …
In fact... I can reproduce this:
This only works when setting the host to …
And then both are listening on different IPv4/IPv6 addresses:
Both return different results... But since we do …
localhost:3001/metrics now includes: `pgrst_schema_cache_loads_total{status="FAIL"} 352.0` and `pgrst_schema_cache_loads_total{status="SUCCESS"} 3.0`. This allows testing the failure case on #3424 (comment).
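A sketch of how an IO test could wait on those counters instead of sleeping; get_metrics is assumed to return the body of the admin /metrics endpoint as text, and only the metric names are taken from the output above:

```python
import re
import time

def metric_value(metrics_text, name, label):
    'Extract a labeled counter such as pgrst_schema_cache_loads_total{status="FAIL"}.'
    match = re.search(re.escape(f"{name}{{{label}}}") + r" ([0-9.]+)", metrics_text)
    return float(match.group(1)) if match else 0.0

def wait_until_counter_increases(get_metrics, name, label, baseline, timeout=5):
    "Poll the metrics text until the counter exceeds the given baseline."
    deadline = time.time() + timeout
    while time.time() < deadline:
        if metric_value(get_metrics(), name, label) > baseline:
            return
        time.sleep(0.05)
    raise TimeoutError(f"{name}{{{label}}} did not increase in time")

# Usage (illustrative):
# before = metric_value(get_metrics(), "pgrst_schema_cache_loads_total", 'status="FAIL"')
# ...trigger a failing schema cache reload...
# wait_until_counter_increases(get_metrics, "pgrst_schema_cache_loads_total",
#                              'status="FAIL"', before)
```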
Since 6be5906 I get CI failures for IO tests on almost every push to main.
Had some IO test failures in CI again today: https://github.com/PostgREST/postgrest/actions/runs/11491870772/job/31984992549
I just ran the IO tests on a heavily overloaded system and had 32 tests failing because of some timing-related issues. Those pass just fine without load. This is annoying, because I can't continue working on PostgREST while I have other heavy tasks running (rebuilding the world for nixpkgs...). This is also potentially an issue on much slower systems.