Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restructure data dictionary docs and add a class for etl_groups #2336

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
23 changes: 19 additions & 4 deletions docs/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@

import pkg_resources

from pudl.metadata.classes import CodeMetadata, DataSource, Package
from pudl.metadata.classes import CodeMetadata, DataSource, ETLGroup, Package
from pudl.metadata.codes import CODE_METADATA
from pudl.metadata.resources import RESOURCE_METADATA

Expand Down Expand Up @@ -136,18 +136,32 @@ def data_dictionary_metadata_to_rst(app):
# Sort fields within each resource by name:
for resource in package.resources:
resource.schema.fields = sorted(resource.schema.fields, key=lambda x: x.name)
package.to_rst(docs_dir=DOCS_DIR, path=DOCS_DIR / "data_dictionaries/pudl_db.rst")
package.to_rst(
docs_dir=DOCS_DIR,
path=DOCS_DIR / "data_dictionaries/pudl_db.rst",
template="etl_group.rst.jinja",
)
package.to_rst(
docs_dir=DOCS_DIR,
path=DOCS_DIR / "data_dictionaries/pudl_db_fields.rst",
template="package.rst.jinja",
)


def data_sources_metadata_to_rst(app):
"""Export data source metadata to RST for inclusion in the documentation."""
print("Exporting data source metadata to RST.")
included_sources = ["eia860", "eia923", "ferc1", "epacems"]
package = Package.from_resource_ids()
extra_etl_groups = {"eia860": ["entity_eia"], "ferc1": ["glue"]}
extra_etl_groups = {
"eia860": [ETLGroup.from_id("entity_eia")],
"ferc1": [ETLGroup.from_id("glue")],
}
for name in included_sources:
source = DataSource.from_id(name)
source_resources = [res for res in package.resources if res.etl_group == name]
source_resources = [
res for res in package.resources if res.etl_group == ETLGroup.from_id(name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you're comparing two instances of ETLGroup using ==. Are you sure that does what you want? How is equality defined for these classes? Do you really want to check whether the ETLGroup names (which are strings) are the same?

]
extra_resources = None
if name in extra_etl_groups:
# get resources for this source from extra etl groups
Expand Down Expand Up @@ -183,6 +197,7 @@ def static_dfs_to_rst(app):
def cleanup_rsts(app, exception):
"""Remove generated RST files when the build is finished."""
(DOCS_DIR / "data_dictionaries/pudl_db.rst").unlink()
(DOCS_DIR / "data_dictionaries/pudl_db_fields.rst").unlink()
(DOCS_DIR / "data_dictionaries/codes_and_labels.rst").unlink()
(DOCS_DIR / "data_sources/eia860.rst").unlink()
(DOCS_DIR / "data_sources/eia923.rst").unlink()
Expand Down
2 changes: 1 addition & 1 deletion docs/data_dictionaries/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ Data Dictionaries

.. toctree::
:caption: Data Processed & Cleaned by PUDL
:maxdepth: 1
:maxdepth: 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we maintain a direct link to pudl_db_fields in the Data Dictionaries section of the navigation sidebar? right now it seems like the only way to reach it is by navigating through the ETL Group page. If someone knows they want to go do Ctrl-F it should be easy to get there.


pudl_db

Expand Down
19 changes: 19 additions & 0 deletions docs/templates/etl_group.rst.jinja
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
===============================================================================
PUDL Data Dictionary
===============================================================================

The following data tables have been cleaned and transformed by our ETL process.
They're seperated into categories based on their original source and the type of
content they contain.

{% for group, resources in package.get_resource_by_etl_group().items() %}
{{ package.get_etl_group()[group].title }}
-------------------------------------------------------------------------------

{{ package.get_etl_group()[group].description }}

{% for resource in resources %}
- :ref:`{{ resource.name }}`

{% endfor %}
{% endfor %}
Comment on lines +9 to +19
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to be able out output all of the ETLGroups as RST, it seems like creating an ETLGroup.to_rst() method would be cleaner. It could encapsulate all of the logic of producing the RST for the ETLGroup -- whatever text belongs at the top of the ETLGroup + iterating through all of the Resources that are part of the ETLGroup and having each of them output themselves with Resource.to_rst().

15 changes: 14 additions & 1 deletion docs/templates/package.rst.jinja
Original file line number Diff line number Diff line change
@@ -1,11 +1,24 @@
:orphan:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would just removing this get it back in the navigation sidebar?


===============================================================================
PUDL Data Dictionary
===============================================================================

The following data tables have been cleaned and transformed by our ETL process.

{% for resource in package.resources %}
{% for group, resources in package.get_resource_by_etl_group().items() %}
{% for resource in resources %}

.. _{{ resource.name }}:

{{ resource.name }}
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

{{ resource.description | wordwrap(78) if resource.description else
'No table description available.' }}
`Browse or query this table in Datasette. <https://data.catalyst.coop/pudl/{{ resource.name }}>`__

Comment on lines +14 to +20
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resource-specific output probably does not belong in the package class or RST template.

{% include 'resource.rst.jinja' %}

{% endfor %}
{% endfor %}
9 changes: 1 addition & 8 deletions docs/templates/resource.rst.jinja
Original file line number Diff line number Diff line change
@@ -1,11 +1,3 @@
-------------------------------------------------------------------------------
{{ resource.name }}
-------------------------------------------------------------------------------

{{ resource.description | wordwrap(78) if resource.description else
'No table description available.' }}
`Browse or query this table in Datasette. <https://data.catalyst.coop/pudl/{{ resource.name }}>`__

Comment on lines -1 to -8
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The removal of this part of the resource output means that the Resource.to_rst() is no longer a representation of a Resource as RST. This seems bad organizationally, and couples Resource output with the Package class unnecessarily and confusingly.

Instead I think you could make output of the resource level description text + resource schema conditional with a a parameter, keeping all of the resource output logic inside the resource class, but still allowing other code to get at only the portion of that available output that they want based on what arguments they pass in.

.. list-table::
:widths: auto
:header-rows: 1
Expand All @@ -17,4 +9,5 @@
* - {{ field.name }}
- {{ field.type }}
- {{ field.description if field.description else "N/A" }}

{%- endfor %}
78 changes: 52 additions & 26 deletions src/pudl/metadata/classes.py
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@
from pudl.metadata.constants import (
CONSTRAINT_DTYPES,
CONTRIBUTORS,
ETL_GROUPS,
FIELD_DTYPES_PANDAS,
FIELD_DTYPES_PYARROW,
FIELD_DTYPES_SQL,
Expand Down Expand Up @@ -1021,6 +1022,24 @@ def from_id(cls, x: str) -> "DataSource":
return cls(**cls.dict_from_id(x))


class ETLGroup(Base):
"""Blah."""

# name: SnakeCase
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this commented out?

This seems like it might be a natural place to define what the valid values of ETL Group are?

title: String = None
description: String = None

@staticmethod
def dict_from_id(x: str) -> dict:
"""Construct dictionary from PUDL identifier."""
return copy.deepcopy(ETL_GROUPS[x])

@classmethod
def from_id(cls, x: str) -> "ETLGroup":
"""Construct Source by source name in the metadata."""
return cls(**cls.dict_from_id(x))


class ResourceHarvest(Base):
"""Resource harvest parameters (`resource.harvest`)."""

Expand All @@ -1046,7 +1065,8 @@ class Resource(Base):
>>> fields = [{'name': 'x', 'type': 'year'}, {'name': 'y', 'type': 'string'}]
>>> fkeys = [{'fields': ['x', 'y'], 'reference': {'resource': 'b', 'fields': ['x', 'y']}}]
>>> schema = {'fields': fields, 'primary_key': ['x'], 'foreign_keys': fkeys}
>>> resource = Resource(name='a', schema=schema)
>>> etl_group = ETLGroup(title='a', description='a')
>>> resource = Resource(name='a', schema=schema, etl_group=etl_group)
>>> table = resource.to_sql()
>>> table.columns.x
Column('x', Integer(), ForeignKey('b.x'), CheckConstraint(...), table=<a>, primary_key=True, nullable=False)
Expand All @@ -1065,7 +1085,8 @@ class Resource(Base):
>>> resource = Resource(**{
... 'name': 'a',
... 'harvest': {'harvest': True},
... 'schema': {'fields': fields, 'primary_key': ['id']}
... 'schema': {'fields': fields, 'primary_key': ['id']},
... 'etl_group': ETLGroup(title='a', description='b')
... })
>>> dfs = {
... 'a': pd.DataFrame({'id': [1, 1, 2, 2], 'x': [1, 1, 2, 2]}),
Expand Down Expand Up @@ -1143,7 +1164,8 @@ class Resource(Base):
>>> fields = [{'name': 'report_year', 'type': 'year'}]
>>> resource = Resource(**{
... 'name': 'table', 'harvest': {'harvest': True},
... 'schema': {'fields': fields, 'primary_key': ['report_year']}
... 'schema': {'fields': fields, 'primary_key': ['report_year']},
... 'etl_group': ETLGroup(title='a', description='b')
... })
>>> df = pd.DataFrame({'report_date': ['2000-02-02', '2000-03-03']})
>>> resource.format_df(df)
Expand Down Expand Up @@ -1182,23 +1204,7 @@ class Resource(Base):
"ppe",
"eia_bulk_elec",
] = None
etl_group: Literal[
"eia860",
"eia861",
"eia923",
"entity_eia",
"epacems",
"ferc1",
"ferc1_disabled",
"ferc714",
"glue",
"outputs",
"static_ferc1",
"static_eia",
"static_eia_disabled",
"eia_bulk_elec",
"static_pudl",
] = None
etl_group: ETLGroup

_check_unique = _validator(
"contributors", "keywords", "licenses", "sources", fn=_check_unique
Expand Down Expand Up @@ -1279,6 +1285,7 @@ def dict_from_id(x: str) -> dict: # noqa: C901
# Delete foreign key rules
if "foreign_key_rules" in schema:
del schema["foreign_key_rules"]
obj["etl_group"] = ETLGroup.from_id(obj["etl_group"])

# Add encoders to columns as appropriate, based on FKs.
# Foreign key relationships determine the set of codes to use
Expand Down Expand Up @@ -1379,7 +1386,8 @@ def match_primary_key(self, names: Iterable[str]) -> dict[str, str] | None:
Examples:
>>> fields = [{'name': 'x_year', 'type': 'year'}]
>>> schema = {'fields': fields, 'primary_key': ['x_year']}
>>> resource = Resource(name='r', schema=schema)
>>> etl_group = ETLGroup(title='r', description='r')
>>> resource = Resource(name='r', schema=schema, etl_group=etl_group)

By default, when :attr:`harvest` .`harvest=False`,
exact matches are required.
Expand Down Expand Up @@ -1691,8 +1699,9 @@ class Package(Base):
>>> fields = [{'name': 'x', 'type': 'year'}, {'name': 'y', 'type': 'string'}]
>>> fkey = {'fields': ['x', 'y'], 'reference': {'resource': 'b', 'fields': ['x', 'y']}}
>>> schema = {'fields': fields, 'primary_key': ['x'], 'foreign_keys': [fkey]}
>>> a = Resource(name='a', schema=schema)
>>> b = Resource(name='b', schema=Schema(fields=fields, primary_key=['x']))
>>> etl_group = ETLGroup(title='a', description='b')
>>> a = Resource(name='a', schema=schema, etl_group=etl_group)
>>> b = Resource(name='b', schema=Schema(fields=fields, primary_key=['x']), etl_group=etl_group)
>>> Package(name='ab', resources=[a, b])
Traceback (most recent call last):
ValidationError: ...
Expand Down Expand Up @@ -1792,17 +1801,34 @@ def from_resource_ids( # noqa: C901
i = len(resources)
if len(names) > i:
resources += [Resource.dict_from_id(x) for x in names[i:]]

return cls(name="pudl", resources=resources)

def get_resource(self, name: str) -> Resource:
"""Return the resource with the given name if it is in the Package."""
names = [resource.name for resource in self.resources]
return self.resources[names.index(name)]

def to_rst(self, docs_dir: DirectoryPath, path: str) -> None:
def get_etl_group(self) -> dict[str, ETLGroup]:
"""blah."""
etl_group_dict = {}
for etl_group in ETL_GROUPS.keys():
etl_group_dict[etl_group] = ETLGroup.from_id(etl_group)
return etl_group_dict

def get_resource_by_etl_group(self) -> dict[str, list[Resource]]:
"""blah."""
resource_dict = {}
for etl_group in ETL_GROUPS.keys():
resource_dict[etl_group] = [
resource
for resource in self.resources
if resource.etl_group == ETLGroup.from_id(etl_group)
]
return resource_dict
Comment on lines +1818 to +1827
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic seems like it belongs in the ETLGroup not the Package. Then a Package could look up all its own ETL Groups, each ETL Group could look up its constituent resources.


def to_rst(self, docs_dir: DirectoryPath, path: str, template: str) -> None:
"""Output to an RST file."""
template = _get_jinja_environment(docs_dir).get_template("package.rst.jinja")
template = _get_jinja_environment(docs_dir).get_template(template)
rendered = template.render(package=self)
if path:
Path(path).write_text(rendered)
Expand Down
87 changes: 87 additions & 0 deletions src/pudl/metadata/constants.py
Original file line number Diff line number Diff line change
Expand Up @@ -153,6 +153,93 @@
}
"""PUDL Contributors for attribution."""

ETL_GROUPS: dict[str, dict[str, str]] = {
"entity_eia": {
"title": "EIA Entity Tables",
"description": """EIA entity tables combine information reported in multiple EIA
tables into a single source of truth. "Entities" (boilers, generators, plants and
utilities) are referred to repeatedly throughout EIA resulting in data duplication and
human-error. For example, one year a plant's latitude is ``55.339722`` and
another year it's ``55.339725`` or in 860 a plant is called ``Barry`` but in 923 it's
once referred to as ``Bary``. We use a "harvesting" process to extract this static (not
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"harvesting" (aka entity resolution)
"rehome" (aka normalize)

changing on an annual basis) information, determine the true value, and rehome it in our
entity tables. This reduces the amount of data we have to store and provides users with
a master list of all entitiy types and their characteristics that are reported to a
given source. Read more about our harvesting process in :mod:`pudl.transform.eia`.
""",
},
"static_eia": {
"title": "EIA Static Tables",
"description": """Static EIA tables are like EIA entity tables in that they pull
from multiple EIA tables, but their purpose is to elucidate encoded language. This is
where acromyns are connected to their full spelling and descriptions. These tables did
not originate as raw EIA tables, rather they are created by us to link commonly used
codes to important descriptors.""",
},
"static_eia_disabled": {
"title": "EIA Static Tables (disabled)",
"description": "Disabled tables are no longer included in the PUDL DB.",
},
"eia860": {
"title": "EIA 860",
"description": """Tables derrived from the EIA Form 860. See our
:doc:`../data_sources/eia860` page for more information.""",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Embedding hard-coded relative paths to particular RST documents within the string descriptions is going to be fragile. It means there's only one location in the documentation that this description can be used to refer to that other page. If it's an absolute document path instead it would work from anywhere (so long as we don't change the structure of the documentation files), but ideally this should probably be a separate piece of metadata that's attached to the data source, and each ETLGroup class could have an attribute that stores that information separate from the text description, and it could be integrated into the Jinja template used to output the RST.

But since the etl group organization is probably going to impacted or replaced by the use of Dagster to organize the ETL, I'm not sure it makes sense to spend time on that.

},
"eia861": {
"title": "EIA 861",
"description": "Tables derrived from the EIA Form 861.",
},
"eia923": {
"title": "EIA 923",
"description": """Tables derrived from the EIA Form 923. See our
:doc:`../data_sources/eia923` page for more information.""",
},
"eia_bulk_elec": {
"title": "EIA Bulk Electricity Tables",
"description": "Blah",
},
"epacems": {
"title": "EPA CEMS",
"description": """Tables derrived from the EPA CEMS data. See our
:doc:`../data_sources/epacems` page for more information""",
},
"static_ferc1": {
"title": "FERC Form 1 Static Tables",
"description": """Static FERC Form 1 tables elucidate encoded language. This is
where acromyns are connected to their full spelling and descriptions. These tables did
not originate as raw FERC Form 1 tables, rather they are created by us to link commonly
used codes to important descriptors.""",
},
"ferc1": {
"title": "FERC Form 1",
"description": """Tables derrived from the FERC Form 1 tables. See our
:doc:`../data_sources/ferc1` page for more information""",
},
"ferc1_disabled": {
"title": "FERC Form 1 disabled",
"description": "Disabled tables are no longer included in the PUDL DB.",
},
"ferc714": {
"title": "FERC Form 714",
"description": "Tables derrived from the FERC Form 714 tables",
},
"glue": {
"title": "Glue Tables",
"description": "Tables connecting information from multiple sources",
},
"outputs": {
"title": "Output Tables",
"description": "Blah",
Comment on lines +230 to +232
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left this as blah because we still haven't really incorporated this yet and there's a lot there. Also eventually, maybe output is too big a category and we'll do something like eia_output or something like that....idk. But @cmgosnell if you have any edits to any of the descriptions here lmk.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should explain here that these tables are not in the database but we use some of the same schema / typing information for dataframes that we derive from database tables.

In the near term we're going to need to restructure a lot of this using something other than ETL group, which is really just an ad-hoc grouping of ETL tasks, based on where various tables were getting processed in our old ETL structure, which is completely replaced by Dagster.

},
"static_pudl": {
"title": "Static PUDL Tables",
"description": """Static PUDL tables elucidate encoded language. This is
where acromyns are connected to their full spelling and descriptions. They're created
to link commonly used codes to important descriptors.""",
Comment on lines +236 to +238
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This description isn't really helping me understand.

There's really only one static data table -- the political jurisdictions -- which is information we compiled by hand since it was showing up in a lot of different places and we wanted it in the DB.

The datasources is a little DB metadata table that explains what went into creating this particular iteration of the DB.

},
}
"""Table categorization by ETL group."""

KEYWORDS: dict[str, list[str]] = {
"electricity": [
"electricity",
Expand Down
Loading