Optimize slow query that uses a high amount of temporary disk space to find relations #191
resolves dbt-labs/dbt-adapters#657
Problem
The macro postgres_get_relations in relations.sql was extremely slow and used an extremely high amount of temporary disk space on a system with large numbers of schemas, tables, and dependencies between database objects (rows in pg_depend): slow to the point of not completing within 50 minutes and using more than 160GB of disk space, at which point PostgreSQL ran out of disk space and aborted the query.
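For context, the scale involved can be gauged directly from the system catalogs. A quick check of this sort (illustrative only, not part of this PR) shows how much data the macro's query has to work through:

```sql
-- Rough measure of how many dependency rows, relations, and schemas
-- exist on a given database.
select
    (select count(*) from pg_depend)    as dependency_rows,
    (select count(*) from pg_class)     as relations,
    (select count(*) from pg_namespace) as schemas;
```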
Solution
The solution here optimises the query so that it runs in ~500ms on my system. It does this by being heavily inspired by the definition of information_schema.view_table_usage, and specifically by avoiding a `select distinct ... from pg_depend` in the innards of the query and instead having a top-level `select distinct` - on my system this saved over 45 seconds.

I suspect this is also more robust: I think oids can be repeated between system tables, so when querying pg_depend, filtering on classid and refclassid is required (and I think this also means indexes are better leveraged).
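As a rough illustration of the shape this leads to (a sketch in the spirit of information_schema.view_table_usage, not the actual macro SQL in this PR; the aliases and output column names here are made up):

```sql
-- Single top-level "select distinct", with pg_depend filtered on
-- classid/refclassid so that only rewrite-rule -> relation dependencies
-- are considered (and the catalog indexes on pg_depend can be used).
select distinct
    dependent_ns.nspname  as dependent_schema,
    dependent.relname     as dependent_name,
    referenced_ns.nspname as referenced_schema,
    referenced.relname    as referenced_name
from pg_depend dep
join pg_rewrite rew             on rew.oid = dep.objid
join pg_class dependent         on dependent.oid = rew.ev_class
join pg_class referenced        on referenced.oid = dep.refobjid
join pg_namespace dependent_ns  on dependent_ns.oid = dependent.relnamespace
join pg_namespace referenced_ns on referenced_ns.oid = referenced.relnamespace
where dep.classid = 'pg_rewrite'::regclass
  and dep.refclassid = 'pg_class'::regclass
  and dep.deptype = 'n'
  and dependent.oid != referenced.oid;
```

In this shape the deduplication happens once, over an already-filtered join, rather than inside a subquery over all of pg_depend.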
Comparing calls to `explain`, it reduces the largest "rows" value from 5,284,141,410,595,979 (over five quadrillion) to 219, and the actual run time from never completing within 50 minutes (because it used all of the 160GB available) to completing in ~500ms.
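This sort of comparison can be reproduced with `explain` / `explain analyze`; for example (illustrative only, not the PR's exact SQL):

```sql
-- "explain" reports the planner's estimated "rows" for each plan node;
-- adding "analyze" also runs the query and reports actual timings, which
-- is one way to compare the old and new queries.
explain (analyze, buffers)
select distinct dep.refobjid
from pg_depend dep
where dep.classid = 'pg_rewrite'::regclass
  and dep.refclassid = 'pg_class'::regclass;
```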
It also has some style/naming changes:
- `distinct` on the top level rather than a `group by`, for clarity (performance seemed the same in my case).
- [referenced code: dbt-postgres/dbt/adapters/postgres/impl.py, line 113 in 05f0337]
Checklist