-
-
Notifications
You must be signed in to change notification settings - Fork 118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dbt setup #4011
base: main
Are you sure you want to change the base?
Conversation
src/pudl/dbt/.user.yml
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this file actually be included in the repo?
src/pudl/dbt/.gitignore
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Organizationally it feels like it would make more sense to have the dbt
directory be a top level directory in the repo, rather than down in the src/pudl
directory, since it's not really PUDL source code. And it also shouldn't be under tests
where the current validations are, since it will no longer be managed by pytest
.
Having it at the top level would mean that it's obvious we're using dbt
on cursory inspection of the repo, and weould keep dbt
's special config files and hierarchy from being nested under some other unrelated collection of stuff.
pyproject.toml
Outdated
requires-python = ">=3.12,<3.13" | ||
requires-python = ">=3.10,<3.13" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we can install newer versions of dagster
that don't have the dbt
dependency issues we were running into (even if the dagster ETL doesn't yet work because of the bug I filed) can I go ahead and update the dependencies and revert to using python 3.12, and then wait to merge until the Dagster bugs are fixed?
- importlib-metadata=8.5.0=pyha770c72_1 | ||
- importlib-metadata=6.10.0=pyha770c72_0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason dbt-semantic-interfaces
is stuck between 6 and 7.
- isodate=0.7.2=pyhd8ed1ab_1 | ||
- isodate=0.6.1=pyhd8ed1ab_0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dbt-common
is stuck between 0.6 and 0.7.
tables: | ||
- name: out_eia923__boiler_fuel | ||
- name: out_eia923__monthly_boiler_fuel | ||
- name: out_ferc1__yearly_steam_plants_fuel_by_plant_sched402 | ||
- name: out_vcerare__hourly_available_capacity_factor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a data warehouse with hundreds of tables, would this file be created and managed by hand? Or would there be some rule-based way to generate it, or parts of it, along the lines of what we're doing with the Pandera schema checks right now? For example, the not_null
tests here are a 2nd place that that restriction is being specified -- it's already present in our table metadata, which seems like recipe for them getting out of sync.
Or in the case of row counts, is there a clean, non-manual way to update the row counts to reflect whatever the currently observed counts are? Especially if we're trying to regenerate expected row counts for each individual year, filling it all in manually could be pretty tedious and error prone. We've moved toward specifying per-year row counts on the newer assets so that they work transparently in either the fast or full ETL cases, and the asset checks don't need to be aware of which kind of job they're being run in, which seems both more specific and more robust.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the "X column is not null" checks are currently defined in fields.py
under the field constraints, is that what you're thinking about?
I think it would be nice to have auto-generated tests like the non-null tests & row counts defined alongside manually added tests. Then all the tests will be defined in one place, except for the tests that we need to write custom Python code for.
That seems pretty doable - YAML is easy to work with, and dbt lets us tag tests, so we could easily tag all the auto-generated tests so our generation scripts know to replace them but leave the manually-added tests alone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to the field specific constraints I think we automatically add NOT NULL
check constraints to the PK fields when we construct the SQLite database -- but more generally I'm just saying that we need to get all of these generated tests integrated non-duplicatively into the dbt
tests somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do end up needing to define these intermediate tables it seems like we would want to have some kind of clear naming convention for them?
- dbt_expectations.expect_compound_columns_to_be_unique: | ||
column_list: ["county_id_fips", "datetime_utc"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be generated based on the PK that's defined for every table?
- dbt_expectations.expect_table_row_count_to_equal: | ||
value: | | ||
{%- if target.name == "etl-fast" -%} 27287400 | ||
{%- else -%} 136437000 | ||
{%- endif -%} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a clean way to specify the expected row counts for each year of data (or some other meaningful subset) within a table, as we've started doing for the newer assets in Dagster asset checks, so we don't have to differentiate between fast and full validations, and can identify where the changes are?
- dbt_expectations.expect_column_quantile_values_to_be_between: | ||
quantile: 0.05 | ||
min_value: 1.5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing these are not using the weighted quantiles?
Overview
This PR sets up a
dbt
project within the PUDL repo that will be used for data testing. Details on setup and usage can all be found in theReadme.md
. This PR also includes several data validations that have been converted todbt
tests. The tests currently converted are allvcerare
asset_check
s and FERC fuel by plant cost per mmbtu range checks.Remaining work
closes #3997 #3971