
Collation upgrade safety #89

Open
reivilibre opened this issue Jan 20, 2025 · 2 comments

Comments

@reivilibre

Context

When the system version of glibc (the GNU C library) is upgraded, this can effectively lead to Postgres' indices becoming corrupt.

Specifically, new versions of glibc can change the ordering rules in collations.

Postgres indices rely on these ordering rules being consistent (e.g. a b-tree index stores index rows in sorted order). If the ordering rules change, the index starts to give wrong results: for example, unique indices may begin to allow duplicate rows, because a search in the index might not find the row that was already there.

(This is not just an academic risk; I have had bug reports from people where this has occurred. As a result, some software that I work on professionally refuses to use any collation other than C, since the sorting rules for C never change.)
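To make the ordering difference concrete, here is a quick sketch using the shell's `sort` rather than Postgres itself: the C collation compares raw bytes, so all uppercase ASCII letters sort before lowercase, while locale-aware collations typically interleave them (and those locale rules are what glibc upgrades can change).

```shell
# Byte-wise (C) ordering: 'A' (0x41) sorts before 'b' (0x62).
printf 'banana\nApple\ncherry\n' | LC_ALL=C sort
# Apple
# banana
# cherry

# A locale-aware collation (if en_US.UTF-8 is installed) typically
# orders case-insensitively, and its rules can change between glibc
# versions -- which is exactly what invalidates on-disk indices.
printf 'banana\nApple\ncherry\n' | LC_ALL=en_US.UTF-8 sort
```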

Reindexing

The safe and correct way to prevent this corruption is to rebuild (reindex) all the indices whenever the glibc version changes, and it's nice to see that docker-pgautoupgrade already does this when upgrading the Postgres version.
(In containerised builds of Postgres, I imagine the glibc version will only change when the Postgres version does, so this seems reasonable.)

However, it appears that docker-pgautoupgrade will allow application connections whilst the rebuild is taking place (though please do correct me if I'm wrong, and advertise this safety aspect in the readme :D). So, essentially, there is a window of opportunity for the application to cause corruption before the index has been rebuilt.

I guess many users would prefer the reduced downtime of the current approach, but blocking application connections until the rebuilds have finished seems like the safest option. I could also imagine users opting in to the more 'dangerous' approach if they really wanted it.

I'm curious to know the perspectives on this issue; for me this is always a prickly part of operating Postgres.

@andyundso
Member

connections being open while the reindex happens: this is correct. it is mostly convenience right now; since running the reindex operation also requires a running Postgres server, it is easier to just start the main process. not sure how complicated it will be to start a Postgres server that is only bound to the socket. and we likely also want a flag to opt out of this behaviour, because otherwise I can see complaints.

you specifically talked about glibc changes during the upgrade, but I am wondering how this is in general. The biggest issue for us is that we would not know when glibc has been upgraded. I would assume the upstream Postgres image pins it for the duration of a major version, but I am not sure about this; maybe the glibc version changes between minor versions. How can we track this?
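One conceivable way to track this (a hypothetical sketch only; the `glibc_needs_reindex` helper and the marker-file location are my invention, not anything docker-pgautoupgrade does today) is to record the glibc version in the data directory and compare it on every startup:

```shell
#!/bin/sh
# Sketch: persist the glibc version next to the data directory and
# compare it on startup; a mismatch means a reindex may be needed.
glibc_needs_reindex() {
  # $1 = data directory
  marker="$1/glibc_version"
  # getconf reports e.g. "glibc 2.36" on glibc systems; fall back on musl.
  current=$(getconf GNU_LIBC_VERSION 2>/dev/null || echo "unknown")
  if [ ! -f "$marker" ]; then
    echo "$current" > "$marker"
    return 1   # first run: nothing to compare against
  fi
  stored=$(cat "$marker")
  if [ "$stored" != "$current" ]; then
    echo "$current" > "$marker"
    return 0   # version changed: reindex recommended
  fi
  return 1     # unchanged: no reindex needed
}
```

This would catch glibc changes between image rebuilds regardless of whether the Postgres major version changed, though it cannot detect the "one shot" host-mismatch case mentioned below.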

Then there is the potential, for people who use the "one shot" mode, that the glibc version in the container mismatches the one on their host.

is this the same issue on Alpine with musl?

@justinclift
Member

the safe and right solution to prevent this corruption is to rebuild (reindex) all the indices when upgrading glibc version.

As @andyundso points out, if this is something we can detect then we'll have options. 😄

not sure how complicated it will be to start a Postgres server that is only bound to the socket

That's doable by launching the postgres binary directly with the --single option, like we do here a few times to extract collation (etc) info:

ENCODING=$(echo 'SHOW SERVER_ENCODING' | "${OLDPATH}/bin/postgres" --single -D "${OLD}" "${POSTGRES_DB}" | grep 'server_encoding = "' | cut -d '"' -f 2)

It's not as fully featured as the full database server, but it'd do the job if needed.
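Another option, if single-user mode turns out to be too limited, might be to start the regular server with TCP disabled so that only local Unix-socket connections are possible while the reindex runs. This is a sketch under assumptions (the `PGBIN`/`PGDATA` variables and the socket directory are placeholders for illustration, not anything in the current scripts):

```shell
# Hypothetical: a full server that applications cannot reach over TCP.
# An empty listen_addresses disables all TCP listening; connections are
# then only possible via the Unix socket in the given directory.
"$PGBIN/postgres" -D "$PGDATA" \
  -c listen_addresses='' \
  -c unix_socket_directories='/tmp/pg_maint_sock'
```

Maintenance commands could then be issued with `psql -h /tmp/pg_maint_sock` before restarting the server with its normal configuration.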

The problem is that for large databases reindexing can take quite some time (even hours). Doing the reindexing while the database is online and available gives much better availability than using a single-connection-PostgreSQL to do it.
