-
-
Notifications
You must be signed in to change notification settings - Fork 118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restructure data dictionary docs and add a class for etl_groups #2336
base: main
Are you sure you want to change the base?
Conversation
…p types not just one long alphabetical list of tables. Took out the table fields for now, want to add them back in later
…d forth between version with columns and version with just table descriptions
… Add etl_group description to the docs, fix the harvest_test.py tests so that etl_groups are accurately represented. Add etl_group to the docstrings in the classes module.
…rat for Package.to_rst()
"outputs": { | ||
"title": "Output Tables", | ||
"description": "Blah", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left this as blah because we still haven't really incorporated this yet and there's a lot there. Also eventually, maybe output is too big a category and we'll do something like eia_output
or something like that....idk. But @cmgosnell if you have any edits to any of the descriptions here lmk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should explain here that these tables are not in the database but we use some of the same schema / typing information for dataframes that we derive from database tables.
In the near term we're going to need to restructure a lot of this using something other than ETL group, which is really just an ad-hoc grouping of ETL tasks, based on where various tables were getting processed in our old ETL structure, which is completely replaced by Dagster.
…nt pudl_db_fields from cleanup_rsts so it gets removed
You don't have to pull the branch and build the docs to see the new output. The documentation for PRs is automatically build by Read The Docs, and you can go view them by clicking on the "details" link in the checks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One design issue I see here is that you're introducing coupling between the Package
, ETLGroup
, and Resource
classes. Package should know how to produce RST representing a Package. ETLGroup should know how to produce RST representing an ETLGroup, and Resource should know how to output RST representing a Resource.
These output routines can be nested and have some parameters, but keep the logic for each class within that class.
# Iterates over ETLGroups within package, calling ETLGroup.to_rst()
package.to_rst(by="etl_group")
# Iterates over Resources within package, calling Resource.to_rst()
package.to_rst(by="resource")
And rather than passing in a random Jinja template to Package, it could know which template to use internally, based on which kind of RST output it's generating (by ETLGroup vs. by Resource)
I'm not entirely sure what about these changes is breaking the ETL, but my guess is it has to do with the places where you've put an ETLGroup class in a dictionary of parameters, rather than a dictionary that can be used to construct ETLGroup.
@@ -5,7 +5,7 @@ Data Dictionaries | |||
|
|||
.. toctree:: | |||
:caption: Data Processed & Cleaned by PUDL | |||
:maxdepth: 1 | |||
:maxdepth: 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we maintain a direct link to pudl_db_fields
in the Data Dictionaries
section of the navigation sidebar? right now it seems like the only way to reach it is by navigating through the ETL Group page. If someone knows they want to go do Ctrl-F
it should be easy to get there.
@@ -1,11 +1,24 @@ | |||
:orphan: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would just removing this get it back in the navigation sidebar?
------------------------------------------------------------------------------- | ||
{{ resource.name }} | ||
------------------------------------------------------------------------------- | ||
|
||
{{ resource.description | wordwrap(78) if resource.description else | ||
'No table description available.' }} | ||
`Browse or query this table in Datasette. <https://data.catalyst.coop/pudl/{{ resource.name }}>`__ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The removal of this part of the resource output means that the Resource.to_rst()
is no longer a representation of a Resource as RST. This seems bad organizationally, and couples Resource output with the Package class unnecessarily and confusingly.
Instead I think you could make output of the resource level description text + resource schema conditional with a a parameter, keeping all of the resource output logic inside the resource class, but still allowing other code to get at only the portion of that available output that they want based on what arguments they pass in.
{{ resource.name }} | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
{{ resource.description | wordwrap(78) if resource.description else | ||
'No table description available.' }} | ||
`Browse or query this table in Datasette. <https://data.catalyst.coop/pudl/{{ resource.name }}>`__ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resource-specific output probably does not belong in the package class or RST template.
{% for group, resources in package.get_resource_by_etl_group().items() %} | ||
{{ package.get_etl_group()[group].title }} | ||
------------------------------------------------------------------------------- | ||
|
||
{{ package.get_etl_group()[group].description }} | ||
|
||
{% for resource in resources %} | ||
- :ref:`{{ resource.name }}` | ||
|
||
{% endfor %} | ||
{% endfor %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to be able out output all of the ETLGroup
s as RST, it seems like creating an ETLGroup.to_rst()
method would be cleaner. It could encapsulate all of the logic of producing the RST for the ETLGroup
-- whatever text belongs at the top of the ETLGroup
+ iterating through all of the Resources that are part of the ETLGroup and having each of them output themselves with Resource.to_rst()
.
"eia860": { | ||
"title": "EIA 860", | ||
"description": """Tables derrived from the EIA Form 860. See our | ||
:doc:`../data_sources/eia860` page for more information.""", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Embedding hard-coded relative paths to particular RST documents within the string descriptions is going to be fragile. It means there's only one location in the documentation that this description can be used to refer to that other page. If it's an absolute document path instead it would work from anywhere (so long as we don't change the structure of the documentation files), but ideally this should probably be a separate piece of metadata that's attached to the data source, and each ETLGroup
class could have an attribute that stores that information separate from the text description, and it could be integrated into the Jinja template used to output the RST.
But since the etl group organization is probably going to impacted or replaced by the use of Dagster to organize the ETL, I'm not sure it makes sense to spend time on that.
def get_resource_by_etl_group(self) -> dict[str, list[Resource]]: | ||
"""blah.""" | ||
resource_dict = {} | ||
for etl_group in ETL_GROUPS.keys(): | ||
resource_dict[etl_group] = [ | ||
resource | ||
for resource in self.resources | ||
if resource.etl_group == ETLGroup.from_id(etl_group) | ||
] | ||
return resource_dict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic seems like it belongs in the ETLGroup
not the Package
. Then a Package could look up all its own ETL Groups, each ETL Group could look up its constituent resources.
@@ -35,6 +35,7 @@ def _assert_frame_equal(a: pd.DataFrame, b: pd.DataFrame, **kwargs: Any) -> None | |||
], | |||
"primary_key": ["i", "j"], | |||
}, | |||
"etl_group": ETLGroup.from_id("static_pudl"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you may want a dictionary representation of the class here, not an instantiated class itself.
@@ -338,6 +347,7 @@ def test_resource_with_only_key_fields_harvests() -> None: | |||
d["schema"]["fields"] = [ | |||
{"name": name, "type": FIELD_DTYPES[name]} for name in d["schema"]["fields"] | |||
] | |||
d["etl_group"] = ETLGroup.from_id(d["etl_group"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And again here, do you really want the instantiated Pydantic model? Or a dictionary that can be used to instantiate it when unpacked?
for name in included_sources: | ||
source = DataSource.from_id(name) | ||
source_resources = [res for res in package.resources if res.etl_group == name] | ||
source_resources = [ | ||
res for res in package.resources if res.etl_group == ETLGroup.from_id(name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you're comparing two instances of ETLGroup
using ==
. Are you sure that does what you want? How is equality defined for these classes? Do you really want to check whether the ETLGroup names (which are strings) are the same?
Thanks for the comments @zaneselvans!
My intention is to create a sub-resource table metadata grouping structure that we can build off when we implement the Dagster branch. Which aspects of this PR do you think are worth investing time in between now and then? I could see us changing the name from etl_group to something else down the road if that type of categorization becomes obsolete. |
I agree that with hundreds of tables across all of the databases we're producing now, and the prospect of ~100 tables even just in the main PUDL DB once the Dagster refactor is done, some kind of categorization and grouping of the tables is definitely required for users to navigate the available data. I think what those categories / groupings should look like was the topic of this discussion. I think those conceptual / organizational groupings should probably be pretty distinct from the software / data pipeline, and using our ad-hoc ETL Groups for that purpose here entangles structural ETL stuff with documentation / semantic categories in a way that I'm not sure is helpful. The ETL groups were really just added as a way to address groups of tables that needed to be treated similarly in the old ETL. It was a little hack. I think there are probably multiple dimensions that we're going to want to use to organize the new universe of tables:
I suspect that we're going to end up with a large number of tables that aren't meant for direct use by most kinds of users: tables from early in the pipeline, highly normalized tables, skinny tables that contain imputed values or particular analytical outputs, maybe crosswalks that allow joins between tables that were hard to work with like FERC1+EIA. These are the building blocks that can be assembled by us or others comfortable with databases and Python to build new things, or track and debug existing processes. We probably won't want to expose all of these tables in the documentation (or at least, not all in the same part of the documentation) Then we'll also probably have a smaller number of wide tables with tons of denormalized data and imputed / estimated / aggregated values in them, that most people use for analysis, selecting the subset of columns they actually need for whatever purpose they have in mind. So I think a useful thing in terms of producing table-level documentation would probably be to allow the generation of documentation based on an arbitrary group of tables (maybe a Datasette TemplatesI think getting some kind of database-level descriptions integrated into the top level https://data.catalyst.coop page with a custom Datasette Jinja template would be helpful for directing people to the right outputs (especially with the XBRL vs. DBF vs. PUDL versions of FERC 1 data) I think getting the existing table-level descriptions onto the individual database pages on Datasette would be helpful, e.g. on https://data.catalyst.coop/pudl have a table description under the table name, and before the list of columns. I think this would mean a custom Datasette Jinja template. |
PR Overview
This PR adds layers to the Data Dictionary documentation (and metadata) so that it's easier for users to see what data are available to them. The best way to view the changes from this PR is to pull this branch, run the docs build, and view the Data Dictionaries page.
We've had a series of conversations about how we structure, present, and view the final data products:
In #2289, Jesse suggested that we organize the PUDL tables by suffix. In looking at what kind of categorization we already had, I noticed the
etl_group
flag. This is part of theclass Resource
definition and spans the following categories:Previously, this was just a string Literal that was defined in the type hints. I turned it into a class of its own,
class ETLGroup
and a dictionary in constantsETL_GROUPS
so that it would mirror the clear scope and definition of other, similar categories like contributors or licenses. Once I added definitions to all of theetl_groups
, I pulled them into the metadata, using them to organize the flow of the Data Dictionaries Page. Instead of landing on a page with all the tables and all the columns in alphabetical order, you land on a page that shows you a list of tables byetl_group
with a definition of eachetl_group
. When you click on a specific table it brings you to the page with the table and column definitions.My goal with these changes was to improve user experience while maintaining searchability / abundance. I wanted users to be able to ctr F through the page with allllll the column names if they wanted to see where else certain types of data pop up, but I also wanted them to be able to find something specific without having to wade through tons of text and tables.
Ideally I would like to have a similar constant/class structure to include definitions for the SQL dbs themselves. This way we can pull and even higher-level description of what's what into the metadata and Datasette. However, I acknowledge that we can only make so many changes before the big Dagster integration.
Ultimately, I could also see us creating three-ish levels of categorization here: 1) by database, 2) by data source (this is already in the metadata) 3) by table type (etl_group metadata I just made).
PR Checklist
dev
).