[Bug]: Segfault Under Continuous Load, Followed By Effective Halt Of DB #196

Eli-Airis · 2025-01-22T11:25:57Z

What happened?

My goal was to create a StreamingDiskANN index of 10,000,000,000 records.
To this end, I was running my DB under heavy INSERT load (see reproduction steps below) for many hours.
All 32 cores of the machine were at 85%+ CPU usage.

In the morning, I found the machine almost idle: all cores were at 20% usage at most.
All the client processes I had running on the same machine were still up, but at 0% CPU and in sleeping state.
~4 DB processes were running, and showed "INSERT" state in htop.

I was able to connect to the DB via psql and could query other tables, but not my main table. When I tried to query my main table, the connection was lost.
Those psql query attempts triggered the following messages in the DB's log:

2025-01-22 09:57:50.152 UTC [194477] FATAL:  the database system is not yet accepting connections
2025-01-22 09:57:50.152 UTC [194477] DETAIL:  Consistent recovery state has not been yet reached.

I am running the timescale/timescaledb-ha:pg16.4-ts2.17.1 docker image (ID 7f9533ca34d7), since 2.17.2 is unusable for me due to #193.

pgvectorscale extension affected

No response

PostgreSQL version used

16.4

What operating system did you use?

Ubuntu 22 x64

What installation method did you use?

Docker

What platform did you run on?

Google Cloud Platform (GCP)

Relevant log output and stack trace

2025-01-22 09:53:05.951 UTC [19] LOG:  checkpoint starting: time
2025-01-22 09:55:39.338 UTC [1] LOG:  server process (PID 182319) was terminated by signal 11: Segmentation fault
2025-01-22 09:55:39.338 UTC [1] DETAIL:  Failed process was running: SELECT (class_embedding <=> (select class_embedding from basic_objects where id=228703)) as cosine_dist, (class_embedding <-> (select class_embedding from basic_objects where id=228703)) as l2_dist, *
        FROM basic_objects ORDER BY cosine_dist LIMIT 100 offset 0;
2025-01-22 09:55:39.338 UTC [1] LOG:  terminating any other active server processes
2025-01-22 09:55:39.368 UTC [182421] FATAL:  the database system is in recovery mode
2025-01-22 09:55:39.957 UTC [182423] FATAL:  the database system is in recovery mode
2025-01-22 09:55:39.957 UTC [182424] FATAL:  the database system is in recovery mode
2025-01-22 09:55:39.959 UTC [182422] FATAL:  the database system is in recovery mode
2025-01-22 09:55:39.964 UTC [182426] FATAL:  the database system is in recovery mode
2025-01-22 09:55:39.972 UTC [182427] FATAL:  the database system is in recovery mode
2025-01-22 09:55:39.976 UTC [182428] FATAL:  the database system is in recovery mode
2025-01-22 09:55:39.993 UTC [182429] FATAL:  the database system is in recovery mode
2025-01-22 09:55:39.998 UTC [182425] FATAL:  the database system is in recovery mode
2025-01-22 09:55:40.012 UTC [182431] FATAL:  the database system is in recovery mode

How can we reproduce the bug?

Use 24 parallel processes, each running 2 async workers that upload data from the payload I shared in #193, using multi-row INSERTs in batches of 1000 records.
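
For anyone trying to reproduce this, the load pattern above could be approximated with something like the sketch below. It is not my actual client. The table and column names (basic_objects, id, class_embedding) are taken from the query in the log, but the two-column row layout is an assumption, and `execute` is a hypothetical async callable that wraps whichever PostgreSQL driver you use. The placeholders use PostgreSQL's `$n` syntax, and the driver may need a cast or codec for the vector column. Run this once per process, 24 times in parallel.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, Iterator, List, Optional, Tuple

# Hypothetical row layout: (id, vector as a text literal such as "[0.1,0.2]").
Row = Tuple[int, str]
# Hypothetical driver hook: takes the SQL text and a flat parameter list.
Execute = Callable[[str, List[object]], Awaitable[None]]

BATCH_SIZE = 1000         # from the report
WORKERS_PER_PROCESS = 2   # from the report
NUM_PROCESSES = 24        # from the report; start this script that many times


def batches(rows: Iterable[Row], size: int = BATCH_SIZE) -> Iterator[List[Row]]:
    """Group rows into lists of at most `size` rows."""
    batch: List[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def multi_insert_sql(n_rows: int) -> str:
    """Build one multi-row INSERT with $1..$(2*n_rows) placeholders."""
    values = ", ".join(f"(${2 * i + 1}, ${2 * i + 2})" for i in range(n_rows))
    return f"INSERT INTO basic_objects (id, class_embedding) VALUES {values}"


async def worker(queue: "asyncio.Queue[Optional[List[Row]]]", execute: Execute) -> int:
    """Send batches from the queue until a None sentinel arrives."""
    sent = 0
    while True:
        batch = await queue.get()
        if batch is None:
            return sent
        params: List[object] = [value for row in batch for value in row]
        await execute(multi_insert_sql(len(batch)), params)
        sent += len(batch)


async def run_process(rows: Iterable[Row], execute: Execute,
                      workers: int = WORKERS_PER_PROCESS) -> int:
    """Feed all rows to `workers` concurrent workers; return rows sent."""
    queue: "asyncio.Queue[Optional[List[Row]]]" = asyncio.Queue(maxsize=workers * 2)
    tasks = [asyncio.create_task(worker(queue, execute)) for _ in range(workers)]
    for batch in batches(rows):
        await queue.put(batch)
    for _ in tasks:
        await queue.put(None)  # one sentinel per worker
    return sum(await asyncio.gather(*tasks))
```

The small bounded queue keeps both workers busy without loading the whole payload into memory, which matters when the run lasts for many hours.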

Are you going to work on the bugfix?

🆘 No, could someone else please work on the bugfix?

Eli-Airis · 2025-01-22T11:34:29Z

The relevant part of the docker logs output is attached:

docker-short.zip

Eli-Airis · 2025-01-22T11:37:49Z

Clarification: To closely reproduce my flow, add noise to the inserted vectors (I only have <300,000 true vectors).
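
A rough sketch of what adding noise to a small set of true vectors could look like is below. The Gaussian distribution, the sigma value, and the function names are assumptions made for illustration, not the exact procedure used in my run.

```python
import random
from typing import Iterator, List, Optional, Sequence


def add_noise(vector: Sequence[float], sigma: float = 0.01,
              rng: Optional[random.Random] = None) -> List[float]:
    """Return a copy of `vector` with independent Gaussian noise on each component."""
    rng = rng or random.Random()
    return [x + rng.gauss(0.0, sigma) for x in vector]


def synthetic_vectors(base: Sequence[Sequence[float]], count: int,
                      sigma: float = 0.01, seed: int = 0) -> Iterator[List[float]]:
    """Cycle through the true vectors and yield `count` noisy variants."""
    rng = random.Random(seed)
    for i in range(count):
        yield add_noise(base[i % len(base)], sigma, rng)
```

Cycling through the true vectors with a seeded generator makes a run repeatable while still producing many distinct rows from fewer than 300,000 originals.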

Eli-Airis · 2025-01-22T11:54:51Z

Also: my disk is very far from full. This is not the issue. df -h gives:

Filesystem      Size  Used Avail Use% Mounted on
/dev/root        29T  305G   29T   2% /

I provisioned a huge disk.

Eli-Airis · 2025-01-22T12:00:06Z

Note that the segfault itself was triggered when I tried to query the DB. I don't see any indication of a bad state before that, despite the state I found the DB in.

Eli-Airis added the bug, community, and pgvectorscale labels on Jan 22, 2025
