[Bug]: Segfault Under Continuous Load, Followed By Effective Halt Of DB #196

Open
Eli-Airis opened this issue Jan 22, 2025 · 4 comments
Labels
bug, community, pgvectorscale

Comments

@Eli-Airis

Eli-Airis commented Jan 22, 2025

What happened?

My goal was to create a StreamingDiskANN index of 10,000,000,000 records.
To this end, I was running my DB under heavy INSERT load (see reproduction steps below) for many hours.
All 32 cores of the machine were at 85%+ CPU usage.

In the morning, I found the machine almost idle, with every core at 20% usage at most.
All the client processes I had running on the same machine were still up, but at 0% CPU and in the sleeping state.
About 4 DB processes were running and showed an "INSERT" state in htop.

I was able to connect to the DB via psql and could query other tables, but not my main table. Whenever I tried to query my main table, the connection was lost.
Those psql query attempts triggered the following messages in the DB's log:

2025-01-22 09:57:50.152 UTC [194477] FATAL:  the database system is not yet accepting connections
2025-01-22 09:57:50.152 UTC [194477] DETAIL:  Consistent recovery state has not been yet reached.

I am running the timescale/timescaledb-ha:pg16.4-ts2.17.1 docker image (ID 7f9533ca34d7), since 2.17.2 is unusable for me due to #193.

pgvectorscale extension affected

No response

PostgreSQL version used

16.4

What operating system did you use?

Ubuntu 22 x64

What installation method did you use?

Docker

What platform did you run on?

Google Cloud Platform (GCP)

Relevant log output and stack trace

2025-01-22 09:53:05.951 UTC [19] LOG:  checkpoint starting: time
2025-01-22 09:55:39.338 UTC [1] LOG:  server process (PID 182319) was terminated by signal 11: Segmentation fault
2025-01-22 09:55:39.338 UTC [1] DETAIL:  Failed process was running: SELECT (class_embedding <=> (select class_embedding from basic_objects where id=228703)) as cosine_dist, (class_embedding <-> (select class_embedding from basic_objects where id=228703)) as l2_dist, *
        FROM basic_objects ORDER BY cosine_dist LIMIT 100 offset 0;
2025-01-22 09:55:39.338 UTC [1] LOG:  terminating any other active server processes
2025-01-22 09:55:39.368 UTC [182421] FATAL:  the database system is in recovery mode
2025-01-22 09:55:39.957 UTC [182423] FATAL:  the database system is in recovery mode
2025-01-22 09:55:39.957 UTC [182424] FATAL:  the database system is in recovery mode
2025-01-22 09:55:39.959 UTC [182422] FATAL:  the database system is in recovery mode
2025-01-22 09:55:39.964 UTC [182426] FATAL:  the database system is in recovery mode
2025-01-22 09:55:39.972 UTC [182427] FATAL:  the database system is in recovery mode
2025-01-22 09:55:39.976 UTC [182428] FATAL:  the database system is in recovery mode
2025-01-22 09:55:39.993 UTC [182429] FATAL:  the database system is in recovery mode
2025-01-22 09:55:39.998 UTC [182425] FATAL:  the database system is in recovery mode
2025-01-22 09:55:40.012 UTC [182431] FATAL:  the database system is in recovery mode

How can we reproduce the bug?

Use 24 parallel processes, each running 2 async workers uploading data from the payload I shared in #193 using multi-inserts of batches of 1000 records.
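
The load shape described above can be sketched roughly as follows. This is only an illustration of the concurrency and batching structure, not the actual client: the table and column names are taken from the query in the log, while the helper names (`batched`, `build_multi_insert`, `placeholder_send`) and the `$n` placeholder style are assumptions. `placeholder_send` does not talk to a database; it would need to be replaced with a real driver call.

```python
import asyncio
from itertools import islice
from multiprocessing import Process

PROCESSES = 24           # parallel processes, as in the report
WORKERS_PER_PROCESS = 2  # async workers per process, as in the report
BATCH_SIZE = 1000        # records per multi-insert, as in the report


def batched(rows, size=BATCH_SIZE):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk


def build_multi_insert(table, columns, n_rows):
    """Build one multi-row INSERT with $1..$N placeholders (assumed style)."""
    width = len(columns)
    groups = []
    for r in range(n_rows):
        params = ", ".join(f"${r * width + c + 1}" for c in range(width))
        groups.append(f"({params})")
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES {', '.join(groups)}"


async def worker(rows, send_batch):
    """Send all rows batch by batch; return the number of batches sent."""
    sent = 0
    for batch in batched(rows):
        await send_batch(batch)
        sent += 1
    return sent


async def run_workers(rows_per_worker, send_batch):
    """Run one async worker per row share and collect their batch counts."""
    return await asyncio.gather(
        *(worker(rows, send_batch) for rows in rows_per_worker)
    )


async def placeholder_send(batch):
    """Hypothetical stand-in: builds the statement but does not execute it."""
    build_multi_insert("basic_objects", ["id", "class_embedding"], len(batch))
    await asyncio.sleep(0)  # yield control, as a real network call would


def process_main(rows_per_worker):
    asyncio.run(run_workers(rows_per_worker, placeholder_send))


if __name__ == "__main__":
    # Tiny synthetic payload so the sketch runs without the real data set.
    rows = [(i, [0.0, 0.0, 0.0]) for i in range(2 * BATCH_SIZE)]
    shares = [rows[w::WORKERS_PER_PROCESS] for w in range(WORKERS_PER_PROCESS)]
    procs = [Process(target=process_main, args=(shares,)) for _ in range(PROCESSES)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

With these numbers there are up to 48 concurrent connections, each sending 1000-row statements back to back, which is the sustained INSERT pressure the report refers to.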

Are you going to work on the bugfix?

🆘 No, could someone else please work on the bugfix?

Eli-Airis added the bug, community, and pgvectorscale labels on Jan 22, 2025
@Eli-Airis
Author

The relevant part of the output of docker logs is attached:

docker-short.zip

@Eli-Airis
Author

Clarification: To closely reproduce my flow, add noise to the inserted vectors (I only have <300,000 true vectors).
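
One way to approximate this is to perturb each base vector before inserting it. The sketch below is an assumption about how that could be done: the report does not say which noise distribution or magnitude was used, so the Gaussian noise and the `sigma` default here are purely illustrative, and `add_noise` is a hypothetical helper name.

```python
import random


def add_noise(vector, sigma=0.01, rng=random):
    """Return a copy of `vector` with independent Gaussian noise per component.

    `sigma` is an assumed noise scale; the original report does not specify one.
    """
    return [x + rng.gauss(0.0, sigma) for x in vector]


if __name__ == "__main__":
    rng = random.Random(42)          # seeded only to make the demo repeatable
    base = [0.1, 0.2, 0.3]           # stand-in for one of the true vectors
    variants = [add_noise(base, rng=rng) for _ in range(3)]
    for v in variants:
        print(v)
```

Repeating this over the fewer than 300,000 true vectors yields an arbitrarily large set of distinct vectors to insert.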

@Eli-Airis
Author

Also, my disk is very far from full, so disk space is not the issue. df -h gives:

Filesystem      Size  Used Avail Use% Mounted on
/dev/root        29T  305G   29T   2% /

I provisioned a huge disk.

@Eli-Airis
Author

Note that the segfault itself occurred when I tried to query the DB. I don't see any indication of a bad state in the logs before that, despite the near-idle state I found the DB in.
