
Collation upgrade safety #89

Open
reivilibre opened this issue Jan 20, 2025 · 2 comments

Comments

@reivilibre

Context

When the system version of glibc (the GNU C library) is upgraded, this can effectively lead to Postgres' indices becoming corrupt.

Specifically, new versions of glibc can change the ordering rules in collations.

Postgres indices rely on these ordering rules being consistent (e.g. a b-tree index stores index rows in sorted order). If the ordering rules change, the index starts to give wrong results: for example, unique indices may begin to allow duplicate rows, because a search in the index might not find the row that was already there.

(This is not just an academic risk; I have had bug reports from people where this has occurred. As a result, some software that I work on professionally refuses to use any collation other than C, since the sorting rules for C never change.)
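To make the ordering difference concrete, here is a quick sketch using the shell's `sort` rather than Postgres itself: the C collation compares raw bytes, so all uppercase ASCII letters sort before lowercase, while locale-aware collations typically interleave them (and those locale rules are what glibc upgrades can change).

```shell
# Byte-wise (C) ordering: 'A' (0x41) sorts before 'b' (0x62).
printf 'banana\nApple\ncherry\n' | LC_ALL=C sort
# Apple
# banana
# cherry

# A locale-aware collation (if en_US.UTF-8 is installed) typically
# orders case-insensitively, and its rules can change between glibc
# versions -- which is exactly what invalidates on-disk indices.
printf 'banana\nApple\ncherry\n' | LC_ALL=en_US.UTF-8 sort
```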

Reindexing

The safe and correct way to prevent this corruption is to rebuild (reindex) all the indices whenever the glibc version changes, and it's nice to see that docker-pgautoupgrade already does this when upgrading the Postgres version.
(In containerised builds of Postgres, I imagine the glibc version will only change when the Postgres version does, so this seems reasonable.)

However, it appears that docker-pgautoupgrade will allow application connections whilst the rebuild is taking place (though please do correct me if I'm wrong, and advertise this safety aspect in the readme :D). So, essentially, there is a window of opportunity for the application to cause corruption before the index has been rebuilt.

I guess many users would prefer the reduced downtime of the current approach, but blocking application connections until the rebuilds have finished seems like the safest option. I could also imagine users opting in to the more 'dangerous' approach if they really wanted it.

I'm curious to know the perspectives on this issue; for me this is always a prickly part of operating Postgres.

@andyundso
Member

connections being open while the reindex happens: this is correct. it is mostly convenience right now; since running the reindex operation also requires a running Postgres server, it is easier to just start the main process. not sure how complicated it will be to start a Postgres server that is only bound to the socket. and we likely also want a flag to opt out of this behaviour, because otherwise I can see complaints.

you specifically talked about glibc changes during the upgrade, but I am wondering how this is in general. The biggest issue for us is that we would not know when glibc has been upgraded. I would assume the upstream Postgres image pins it for the duration of a major version, but I am not sure about this; maybe the glibc version changes between minor versions. How can we track this?
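One conceivable way to track this (a hypothetical sketch only; the `glibc_needs_reindex` helper and the marker-file location are my invention, not anything docker-pgautoupgrade does today) is to record the glibc version in the data directory and compare it on every startup:

```shell
#!/bin/sh
# Sketch: persist the glibc version next to the data directory and
# compare it on startup; a mismatch means a reindex may be needed.
glibc_needs_reindex() {
  # $1 = data directory
  marker="$1/glibc_version"
  # getconf reports e.g. "glibc 2.36" on glibc systems; fall back on musl.
  current=$(getconf GNU_LIBC_VERSION 2>/dev/null || echo "unknown")
  if [ ! -f "$marker" ]; then
    echo "$current" > "$marker"
    return 1   # first run: nothing to compare against
  fi
  stored=$(cat "$marker")
  if [ "$stored" != "$current" ]; then
    echo "$current" > "$marker"
    return 0   # version changed: reindex recommended
  fi
  return 1     # unchanged: no reindex needed
}
```

This would catch glibc changes between image rebuilds regardless of whether the Postgres major version changed, though it cannot detect the "one shot" host-mismatch case mentioned below.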

Then there is the potential, for people who use the "one shot" mode, that the glibc version in the container mismatches the one on their host.

is this the same issue on Alpine with musl?

@justinclift
Member

the safe and right solution to prevent this corruption is to rebuild (reindex) all the indices when upgrading glibc version.

As @andyundso points out, if this is something we can detect then we'll have options. 😄

not sure how complicated it will be to start a Postgres server that is only bound to the socket

That's doable by launching the postgres binary directly with the --single option, like we do here a few times to extract collation (etc) info:

ENCODING=$(echo 'SHOW SERVER_ENCODING' | "${OLDPATH}/bin/postgres" --single -D "${OLD}" "${POSTGRES_DB}" | grep 'server_encoding = "' | cut -d '"' -f 2)

It's not as fully featured as the full database server, but it'd do the job if needed.
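Another option, if single-user mode turns out to be too limited, might be to start the regular server with TCP disabled so that only local Unix-socket connections are possible while the reindex runs. This is a sketch under assumptions (the `PGBIN`/`PGDATA` variables and the socket directory are placeholders for illustration, not anything in the current scripts):

```shell
# Hypothetical: a full server that applications cannot reach over TCP.
# An empty listen_addresses disables all TCP listening; connections are
# then only possible via the Unix socket in the given directory.
"$PGBIN/postgres" -D "$PGDATA" \
  -c listen_addresses='' \
  -c unix_socket_directories='/tmp/pg_maint_sock'
```

Maintenance commands could then be issued with `psql -h /tmp/pg_maint_sock` before restarting the server with its normal configuration.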

The problem is that for large databases reindexing can take quite some time (even hours). Doing the reindexing while the database is online and available gives much better availability than using a single-connection-PostgreSQL to do it.
