Dbt setup #4011 (Draft)

zschira wants to merge 14 commits into main
Conversation

@zschira (Member) commented Jan 13, 2025

Overview

This PR sets up a dbt project within the PUDL repo that will be used for data testing. Details on setup and usage can be found in the README. This PR also includes several data validations that have been converted to dbt tests; the tests converted so far are all of the vcerare asset checks and the FERC fuel-by-plant cost per MMBTU range checks.

Remaining work

  • Get dependencies sorted with dbt 1.9 to enable Python 3.12
  • Convert the eia923 boiler fuel aggregation vs. historical checks
  • Decide how/where to store the DuckDB file

Closes #3997, #3971

@zschira zschira requested a review from zaneselvans January 13, 2025 19:30
@zschira zschira marked this pull request as draft January 13, 2025 19:31
@zschira zschira requested a review from e-belfer January 13, 2025 19:44
Member

Should this file actually be included in the repo?

Member

Organizationally it feels like it would make more sense to have the dbt directory be a top level directory in the repo, rather than down in the src/pudl directory, since it's not really PUDL source code. And it also shouldn't be under tests where the current validations are, since it will no longer be managed by pytest.

Having it at the top level would mean it's obvious we're using dbt on cursory inspection of the repo, and would keep dbt's special config files and hierarchy from being nested under some other unrelated collection of stuff.

pyproject.toml (Outdated)
-requires-python = ">=3.12,<3.13"
+requires-python = ">=3.10,<3.13"
Member

Now that we can install newer versions of Dagster that don't have the dbt dependency issues we were running into (even if the Dagster ETL doesn't yet work because of the bug I filed), can I go ahead and update the dependencies and revert to Python 3.12, then wait to merge until the Dagster bugs are fixed?

Comment on lines -206 to +219
- importlib-metadata=8.5.0=pyha770c72_1
- importlib-metadata=6.10.0=pyha770c72_0
Member

For some reason dbt-semantic-interfaces is stuck between 6 and 7.

Comment on lines -212 to +225
- isodate=0.7.2=pyhd8ed1ab_1
- isodate=0.6.1=pyhd8ed1ab_0
Member

dbt-common is stuck between 0.6 and 0.7.

Comment on lines +10 to +14
tables:
- name: out_eia923__boiler_fuel
- name: out_eia923__monthly_boiler_fuel
- name: out_ferc1__yearly_steam_plants_fuel_by_plant_sched402
- name: out_vcerare__hourly_available_capacity_factor
Member

In a data warehouse with hundreds of tables, would this file be created and managed by hand? Or would there be some rule-based way to generate it, or parts of it, along the lines of what we're doing with the Pandera schema checks right now? For example, the not_null tests here are a second place where that restriction is specified -- it's already present in our table metadata, which seems like a recipe for the two getting out of sync.

Or in the case of row counts, is there a clean, non-manual way to update the row counts to reflect whatever the currently observed counts are? Especially if we're trying to regenerate expected row counts for each individual year, filling it all in manually could be pretty tedious and error prone. We've moved toward specifying per-year row counts on the newer assets so that they work transparently in either the fast or full ETL cases, and the asset checks don't need to be aware of which kind of job they're being run in, which seems both more specific and more robust.
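For illustration, rule-based generation of the not_null tests could look something like the sketch below; the metadata structure, output path, and the "autogenerated" tag are placeholders, not existing PUDL or dbt-project details.

import yaml  # PyYAML

# Stand-in for whatever we extract from fields.py / the resource metadata.
TABLE_METADATA = {
    "out_vcerare__hourly_available_capacity_factor": {
        "non_null_columns": ["county_id_fips", "datetime_utc"],
    },
}

def build_source_schema(metadata: dict) -> dict:
    """Build a dbt sources entry with a not_null test for each constrained column."""
    tables = []
    for table_name, meta in metadata.items():
        columns = [
            {
                "name": col,
                # dbt >= 1.8 spelling; older versions use "tests:".
                "data_tests": [
                    {"not_null": {"config": {"tags": ["autogenerated"]}}}
                ],
            }
            for col in meta["non_null_columns"]
        ]
        tables.append({"name": table_name, "columns": columns})
    return {"version": 2, "sources": [{"name": "pudl", "tables": tables}]}

with open("schema.yml", "w") as f:
    yaml.safe_dump(build_source_schema(TABLE_METADATA), f, sort_keys=False)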

Member

Looks like the "X column is not null" checks are currently defined in fields.py under the field constraints. Is that what you're thinking about?

I think it would be nice to have auto-generated tests like the non-null tests & row counts defined alongside manually added tests. Then all the tests will be defined in one place, except for the tests that we need to write custom Python code for.

That seems pretty doable - YAML is easy to work with, and dbt lets us tag tests, so we could easily tag all the auto-generated tests so our generation scripts know to replace them but leave the manually-added tests alone.
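To make that concrete, a hypothetical refresh step (reusing the same "autogenerated" tag and schema layout assumed in the sketch above) might look like:

def is_autogenerated(test) -> bool:
    """True for dict-style tests carrying the 'autogenerated' tag."""
    if not isinstance(test, dict):
        return False  # bare entries like "unique" were presumably added by hand
    cfg = next(iter(test.values())) or {}
    return "autogenerated" in cfg.get("config", {}).get("tags", [])

def refresh_schema(schema: dict, generated: dict) -> dict:
    """Replace tagged tests with freshly generated ones; keep manual tests as-is.

    `generated` maps column name -> list of newly generated test dicts.
    """
    for source in schema.get("sources", []):
        for table in source.get("tables", []):
            for column in table.get("columns", []):
                manual = [
                    t for t in column.get("data_tests", []) if not is_autogenerated(t)
                ]
                column["data_tests"] = manual + generated.get(column["name"], [])
    return schema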

Member

In addition to the field-specific constraints, I think we automatically add NOT NULL check constraints to the PK fields when we construct the SQLite database -- but more generally I'm just saying that we need to get all of these generated tests integrated, non-duplicatively, into the dbt tests somehow.

Member

If we do end up needing to define these intermediate tables, it seems like we would want some kind of clear naming convention for them?

Comment on lines +21 to +22
- dbt_expectations.expect_compound_columns_to_be_unique:
column_list: ["county_id_fips", "datetime_utc"]
Member

Could be generated based on the PK that's defined for every table?
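For illustration, a hypothetical helper that emits this table-level test from a table's primary key (same assumptions as the generation sketches above) could be as small as:

def pk_uniqueness_test(primary_key: list[str]) -> dict:
    """Emit the dbt_expectations compound-uniqueness test for a table's PK."""
    return {
        "dbt_expectations.expect_compound_columns_to_be_unique": {
            "column_list": primary_key,
            "config": {"tags": ["autogenerated"]},
        }
    }

# pk_uniqueness_test(["county_id_fips", "datetime_utc"]) reproduces the
# hand-written test shown above.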

Comment on lines +16 to +20
- dbt_expectations.expect_table_row_count_to_equal:
value: |
{%- if target.name == "etl-fast" -%} 27287400
{%- else -%} 136437000
{%- endif -%}
Member

Is there a clean way to specify the expected row counts for each year of data (or some other meaningful subset) within a table, as we've started doing for the newer assets in Dagster asset checks, so we don't have to differentiate between fast and full validations, and can identify where the changes are?
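One possible approach, sketched here with illustrative paths and column names rather than anything actually in this PR: snapshot the currently observed per-year counts out of the DuckDB file into a dbt seed, and have a test compare live counts against that baseline instead of hard-coding separate fast and full totals.

import csv
import duckdb

TABLE = "out_vcerare__hourly_available_capacity_factor"

# Record the per-year row counts currently observed in the warehouse.
con = duckdb.connect("pudl.duckdb", read_only=True)
observed = con.execute(
    f"SELECT EXTRACT(year FROM datetime_utc) AS report_year, COUNT(*) AS row_count "
    f"FROM {TABLE} GROUP BY 1 ORDER BY 1"
).fetchall()

# Write them out as a seed that a dbt test could join against.
with open("seeds/expected_row_counts.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["table_name", "report_year", "row_count"])
    writer.writerows([TABLE, int(year), count] for year, count in observed)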

Comment on lines +60 to +62
- dbt_expectations.expect_column_quantile_values_to_be_between:
quantile: 0.05
min_value: 1.5
Member

I'm guessing these are not using the weighted quantiles?
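For reference, the dbt_expectations macro computes a plain quantile over the column, while the existing pandas-based range checks apparently weight each record (e.g. by fuel consumed). A generic weighted-quantile sketch, not the exact PUDL implementation, with illustrative column names in the usage comment:

import numpy as np

def weighted_quantile(values, weights, q):
    """Interpolated quantile where each observation counts `weights` times."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    # Midpoint convention: each observation's mass is centered on its value.
    cum = np.cumsum(weights) - 0.5 * weights
    cum /= weights.sum()
    return float(np.interp(q, cum, values))

# e.g. assert weighted_quantile(cost_per_mmbtu, fuel_consumed_mmbtu, 0.05) >= 1.5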

Labels: None yet
Projects: Status: In review

Successfully merging this pull request may close these issues:
  • Port 2-3 logically complicated validations to dbt

3 participants