
Optimize slow query that uses a high amount of temporary disk space to find relations #191

Open

michalc wants to merge 1 commit into main from optimize-get-relations

Conversation


@michalc commented on Jan 18, 2025

Resolves dbt-labs/dbt-adapters#657

Problem

The macro `postgres_get_relations` in relations.sql was extremely slow and used an extremely high amount of temporary disk space on a system with high numbers of schemas, tables, and dependencies between database objects (rows in `pg_depend`): slow to the point of not completing within 50 minutes, and using more than 160GB of disk space (at which point PostgreSQL ran out of disk space and aborted the query).

Solution

The solution here optimises the query so that it runs in ~500ms on my system. It does this by drawing heavily on the definition of `information_schema.view_table_usage` (a sketch of the resulting shape follows the list below), and specifically by:

  • Stripping out CTEs, which can be optimisation blockers, often by causing their results to be materialised to disk (especially in older PostgreSQL versions, but I suspect in recent ones too in some cases).
  • Removing unnecessary filtering on `relkind`: going via `pg_rewrite` (or rather, the equivalent row in `pg_depend`) is equivalent to that filtering.
  • Avoiding sequential scans on any table by structuring the joins/where clause to leverage indexes, especially on `pg_depend`.
  • Removing the unnecessary filtering-out of system catalog tables from dependents (they are already excluded by the remaining filters on referenced tables).
  • Not having `select distinct ... from pg_depend` in the innards of the query, and instead having a single top-level `select distinct` - on my system this saved over 45 seconds.
  • Excluding relations that depend on themselves by comparing `oid`s rather than the names of tables and schemas.
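
To make the bullets above concrete, here is a minimal sketch of the resulting shape, inspired by `information_schema.view_table_usage` rather than copied from this PR's diff. The aliases and the final schema filter are assumptions for illustration (the actual macro filters on the project's schemas):

```sql
-- Hedged sketch, not the exact macro in this PR: one flat join from views,
-- through pg_rewrite and pg_depend, to the tables they reference,
-- de-duplicated once at the top level.
select distinct
    dependent_namespace.nspname as dependent_schema,
    dependent_class.relname as dependent_name,
    referenced_namespace.nspname as referenced_schema,
    referenced_class.relname as referenced_name
from pg_class as dependent_class
inner join pg_namespace as dependent_namespace
    on dependent_namespace.oid = dependent_class.relnamespace
inner join pg_rewrite
    on pg_rewrite.ev_class = dependent_class.oid      -- the view's rewrite rule
inner join pg_depend
    on pg_depend.classid = 'pg_rewrite'::regclass     -- oids can repeat between
    and pg_depend.objid = pg_rewrite.oid              -- catalogs, so always
    and pg_depend.refclassid = 'pg_class'::regclass   -- filter on (ref)classid
inner join pg_class as referenced_class
    on referenced_class.oid = pg_depend.refobjid
    and referenced_class.oid != dependent_class.oid   -- exclude self-relations by oid
inner join pg_namespace as referenced_namespace
    on referenced_namespace.oid = referenced_class.relnamespace
where referenced_namespace.nspname not in ('pg_catalog', 'information_schema');
```

Note that no `relkind` filter is needed in this shape: only relations that have rewrite rules appear via `pg_rewrite`, which is the point of the second bullet.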

I suspect this is also more robust: as I understand it, oids can be repeated between system catalogs, so filtering on `classid` and `refclassid` is required when querying `pg_depend` (and I think this also means indexes are better leveraged).
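
As an illustration of why that matters (assumed for illustration, not part of this PR's diff), an `objid` value is only meaningful together with the catalog it belongs to, which `pg_depend` records in `classid`/`refclassid`:

```sql
-- Hedged illustration: list which catalog pairs pg_depend links, showing
-- that objid/refobjid values come from many different system catalogs.
select
    classid::regclass as dependent_catalog,
    refclassid::regclass as referenced_catalog,
    count(*) as n
from pg_depend
group by 1, 2
order by n desc;
```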

Comparing calls to `explain`, the change reduces the largest "rows" estimate from 5,284,141,410,595,979 (over five quadrillion) to 219, and the actual run time from never completing within 50 minutes (because the query used all of the 160GB of available disk space) to completing in ~500ms.
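
For reference, a comparison like this can be reproduced by wrapping each version of the query in `explain` (illustrative only; the stand-in query below is not the relations query itself):

```sql
-- Hedged example: "rows=..." in the output is the planner's estimate,
-- "actual time=..." is the measured run time of each plan node.
explain (analyze, buffers)
select distinct ev_class::regclass
from pg_rewrite;  -- stand-in; substitute the old or new relations query here
```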

It also has some style/naming changes:

  • Uses a `distinct` at the top level rather than a `group by`, for clarity (performance seemed the same in my case) - the two equivalent forms are sketched after this list.
  • Flips the definitions of "referenced" and "dependent" in the query to match both the definitions in `pg_depend` and the code at https://github.com/dbt-labs/dbt-postgres/blob/05f0337d6b05c9c68617e41c0b5bca9c2a733783/dbt/adapters/postgres/impl.py#L113 (`for dep_schema, dep_name, refed_schema, refed_name in table:`).
  • Re-orders the joins into what I think is a slightly clearer order that "flows" from the views, through the linking table (`pg_depend`), to the tables referenced in the views.
  • Lowers the abstraction/indirection levels in naming/aliases, using names closer to the PostgreSQL catalog tables - this made the query easier to write and understand, and so I suspect easier to change in future (I found I had to keep the PostgreSQL definitions in mind more than the output of the query when making changes).
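
For the first point above, these are the two equivalent de-duplication forms, shown on a trivial stand-in table rather than the PR's query:

```sql
-- Equivalent results; the PR prefers `select distinct` for clarity.
select distinct nspname from pg_namespace;
select nspname from pg_namespace group by nspname;
```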

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX

@michalc michalc requested a review from a team as a code owner January 18, 2025 08:31
@cla-bot cla-bot bot added the cla:yes label Jan 18, 2025
@michalc michalc force-pushed the optimize-get-relations branch 2 times, most recently from f730567 to c099b27 on January 18, 2025 08:41
@michalc michalc changed the title from "Optimize macro postgres_get_relations for speed and disk space usage" to "Optimize slow query that uses a high amount of temporary disk space to find relations" on Jan 18, 2025
@michalc michalc force-pushed the optimize-get-relations branch 6 times, most recently from a9b1d1f to 289d225 on January 20, 2025 07:27
Optimize slow query that uses a high amount of temporary disk space to find relations

Resolves https://github.com/dbt-labs/dbt-postgres/issues/189

Development

Successfully merging this pull request may close these issues.

[Bug] Extremely slow query to determine relations that uses all disk space (160GB)