-
-
Notifications
You must be signed in to change notification settings - Fork 118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restructure data dictionary docs and add a class for etl_groups #2336
base: main
Are you sure you want to change the base?
Changes from 5 commits
26aa741
ed394ed
01a670d
868a67a
79e07b7
295886a
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,7 +5,7 @@ Data Dictionaries | |
|
||
.. toctree:: | ||
:caption: Data Processed & Cleaned by PUDL | ||
:maxdepth: 1 | ||
:maxdepth: 2 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can we maintain a direct link to |
||
|
||
pudl_db | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,11 +1,24 @@ | ||
:orphan: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Would just removing this get it back in the navigation sidebar? |
||
|
||
=============================================================================== | ||
PUDL Data Dictionary | ||
=============================================================================== | ||
|
||
The following data tables have been cleaned and transformed by our ETL process. | ||
|
||
{% for resource in package.resources %} | ||
{% for group, resources in package.get_resource_by_etl_group().items() %} | ||
{% for resource in resources %} | ||
|
||
.. _{{ resource.name }}: | ||
|
||
{{ resource.name }} | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
{{ resource.description | wordwrap(78) if resource.description else | ||
'No table description available.' }} | ||
`Browse or query this table in Datasette. <https://data.catalyst.coop/pudl/{{ resource.name }}>`__ | ||
|
||
Comment on lines
+14
to
+20
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Resource-specific output probably does not belong in the package class or RST template. |
||
{% include 'resource.rst.jinja' %} | ||
|
||
{% endfor %} | ||
{% endfor %} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,11 +1,3 @@ | ||
------------------------------------------------------------------------------- | ||
{{ resource.name }} | ||
------------------------------------------------------------------------------- | ||
|
||
{{ resource.description | wordwrap(78) if resource.description else | ||
'No table description available.' }} | ||
`Browse or query this table in Datasette. <https://data.catalyst.coop/pudl/{{ resource.name }}>`__ | ||
|
||
Comment on lines
-1
to
-8
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The removal of this part of the resource output means that the Instead I think you could make output of the resource level description text + resource schema conditional with a a parameter, keeping all of the resource output logic inside the resource class, but still allowing other code to get at only the portion of that available output that they want based on what arguments they pass in. |
||
.. list-table:: | ||
:widths: auto | ||
:header-rows: 1 | ||
|
@@ -17,4 +9,5 @@ | |
* - {{ field.name }} | ||
- {{ field.type }} | ||
- {{ field.description if field.description else "N/A" }} | ||
|
||
{%- endfor %} |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -22,6 +22,7 @@ | |
from pudl.metadata.constants import ( | ||
CONSTRAINT_DTYPES, | ||
CONTRIBUTORS, | ||
ETL_GROUPS, | ||
FIELD_DTYPES_PANDAS, | ||
FIELD_DTYPES_PYARROW, | ||
FIELD_DTYPES_SQL, | ||
|
@@ -1021,6 +1022,24 @@ def from_id(cls, x: str) -> "DataSource": | |
return cls(**cls.dict_from_id(x)) | ||
|
||
|
||
class ETLGroup(Base): | ||
"""Blah.""" | ||
|
||
# name: SnakeCase | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why is this commented out? This seems like it might be a natural place to define what the valid values of ETL Group are? |
||
title: String = None | ||
description: String = None | ||
|
||
@staticmethod | ||
def dict_from_id(x: str) -> dict: | ||
"""Construct dictionary from PUDL identifier.""" | ||
return copy.deepcopy(ETL_GROUPS[x]) | ||
|
||
@classmethod | ||
def from_id(cls, x: str) -> "ETLGroup": | ||
"""Construct Source by source name in the metadata.""" | ||
return cls(**cls.dict_from_id(x)) | ||
|
||
|
||
class ResourceHarvest(Base): | ||
"""Resource harvest parameters (`resource.harvest`).""" | ||
|
||
|
@@ -1046,7 +1065,8 @@ class Resource(Base): | |
>>> fields = [{'name': 'x', 'type': 'year'}, {'name': 'y', 'type': 'string'}] | ||
>>> fkeys = [{'fields': ['x', 'y'], 'reference': {'resource': 'b', 'fields': ['x', 'y']}}] | ||
>>> schema = {'fields': fields, 'primary_key': ['x'], 'foreign_keys': fkeys} | ||
>>> resource = Resource(name='a', schema=schema) | ||
>>> etl_group = ETLGroup(title='a', description='a') | ||
>>> resource = Resource(name='a', schema=schema, etl_group=etl_group) | ||
>>> table = resource.to_sql() | ||
>>> table.columns.x | ||
Column('x', Integer(), ForeignKey('b.x'), CheckConstraint(...), table=<a>, primary_key=True, nullable=False) | ||
|
@@ -1065,7 +1085,8 @@ class Resource(Base): | |
>>> resource = Resource(**{ | ||
... 'name': 'a', | ||
... 'harvest': {'harvest': True}, | ||
... 'schema': {'fields': fields, 'primary_key': ['id']} | ||
... 'schema': {'fields': fields, 'primary_key': ['id']}, | ||
... 'etl_group': ETLGroup(title='a', description='b') | ||
... }) | ||
>>> dfs = { | ||
... 'a': pd.DataFrame({'id': [1, 1, 2, 2], 'x': [1, 1, 2, 2]}), | ||
|
@@ -1143,7 +1164,8 @@ class Resource(Base): | |
>>> fields = [{'name': 'report_year', 'type': 'year'}] | ||
>>> resource = Resource(**{ | ||
... 'name': 'table', 'harvest': {'harvest': True}, | ||
... 'schema': {'fields': fields, 'primary_key': ['report_year']} | ||
... 'schema': {'fields': fields, 'primary_key': ['report_year']}, | ||
... 'etl_group': ETLGroup(title='a', description='b') | ||
... }) | ||
>>> df = pd.DataFrame({'report_date': ['2000-02-02', '2000-03-03']}) | ||
>>> resource.format_df(df) | ||
|
@@ -1182,23 +1204,7 @@ class Resource(Base): | |
"ppe", | ||
"eia_bulk_elec", | ||
] = None | ||
etl_group: Literal[ | ||
"eia860", | ||
"eia861", | ||
"eia923", | ||
"entity_eia", | ||
"epacems", | ||
"ferc1", | ||
"ferc1_disabled", | ||
"ferc714", | ||
"glue", | ||
"outputs", | ||
"static_ferc1", | ||
"static_eia", | ||
"static_eia_disabled", | ||
"eia_bulk_elec", | ||
"static_pudl", | ||
] = None | ||
etl_group: ETLGroup | ||
|
||
_check_unique = _validator( | ||
"contributors", "keywords", "licenses", "sources", fn=_check_unique | ||
|
@@ -1279,6 +1285,7 @@ def dict_from_id(x: str) -> dict: # noqa: C901 | |
# Delete foreign key rules | ||
if "foreign_key_rules" in schema: | ||
del schema["foreign_key_rules"] | ||
obj["etl_group"] = ETLGroup.from_id(obj["etl_group"]) | ||
|
||
# Add encoders to columns as appropriate, based on FKs. | ||
# Foreign key relationships determine the set of codes to use | ||
|
@@ -1379,7 +1386,8 @@ def match_primary_key(self, names: Iterable[str]) -> dict[str, str] | None: | |
Examples: | ||
>>> fields = [{'name': 'x_year', 'type': 'year'}] | ||
>>> schema = {'fields': fields, 'primary_key': ['x_year']} | ||
>>> resource = Resource(name='r', schema=schema) | ||
>>> etl_group = ETLGroup(title='r', description='r') | ||
>>> resource = Resource(name='r', schema=schema, etl_group=etl_group) | ||
|
||
By default, when :attr:`harvest` .`harvest=False`, | ||
exact matches are required. | ||
|
@@ -1691,8 +1699,9 @@ class Package(Base): | |
>>> fields = [{'name': 'x', 'type': 'year'}, {'name': 'y', 'type': 'string'}] | ||
>>> fkey = {'fields': ['x', 'y'], 'reference': {'resource': 'b', 'fields': ['x', 'y']}} | ||
>>> schema = {'fields': fields, 'primary_key': ['x'], 'foreign_keys': [fkey]} | ||
>>> a = Resource(name='a', schema=schema) | ||
>>> b = Resource(name='b', schema=Schema(fields=fields, primary_key=['x'])) | ||
>>> etl_group = ETLGroup(title='a', description='b') | ||
>>> a = Resource(name='a', schema=schema, etl_group=etl_group) | ||
>>> b = Resource(name='b', schema=Schema(fields=fields, primary_key=['x']), etl_group=etl_group) | ||
>>> Package(name='ab', resources=[a, b]) | ||
Traceback (most recent call last): | ||
ValidationError: ... | ||
|
@@ -1792,17 +1801,34 @@ def from_resource_ids( # noqa: C901 | |
i = len(resources) | ||
if len(names) > i: | ||
resources += [Resource.dict_from_id(x) for x in names[i:]] | ||
|
||
return cls(name="pudl", resources=resources) | ||
|
||
def get_resource(self, name: str) -> Resource: | ||
"""Return the resource with the given name if it is in the Package.""" | ||
names = [resource.name for resource in self.resources] | ||
return self.resources[names.index(name)] | ||
|
||
def to_rst(self, docs_dir: DirectoryPath, path: str) -> None: | ||
def get_etl_group(self) -> dict[str, ETLGroup]: | ||
"""blah.""" | ||
etl_group_dict = {} | ||
for etl_group in ETL_GROUPS.keys(): | ||
etl_group_dict[etl_group] = ETLGroup.from_id(etl_group) | ||
return etl_group_dict | ||
|
||
def get_resource_by_etl_group(self) -> dict[str, list[Resource]]: | ||
"""blah.""" | ||
resource_dict = {} | ||
for etl_group in ETL_GROUPS.keys(): | ||
resource_dict[etl_group] = [ | ||
resource | ||
for resource in self.resources | ||
if resource.etl_group == ETLGroup.from_id(etl_group) | ||
] | ||
return resource_dict | ||
Comment on lines
+1818
to
+1827
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This logic seems like it belongs in the |
||
|
||
def to_rst(self, docs_dir: DirectoryPath, path: str, template: str) -> None: | ||
"""Output to an RST file.""" | ||
template = _get_jinja_environment(docs_dir).get_template("package.rst.jinja") | ||
template = _get_jinja_environment(docs_dir).get_template(template) | ||
rendered = template.render(package=self) | ||
if path: | ||
Path(path).write_text(rendered) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -153,6 +153,93 @@ | |
} | ||
"""PUDL Contributors for attribution.""" | ||
|
||
ETL_GROUPS: dict[str, dict[str, str]] = { | ||
"entity_eia": { | ||
"title": "EIA Entity Tables", | ||
"description": """EIA entity tables combine information reported in multiple EIA | ||
tables into a single source of truth. "Entities" (boilers, generators, plants and | ||
utilities) are referred to repeatedly throughout EIA resulting in data duplication and | ||
human-error. For example, one year a plant's latitude is ``55.339722`` and | ||
another year it's ``55.339725`` or in 860 a plant is called ``Barry`` but in 923 it's | ||
once referred to as ``Bary``. We use a "harvesting" process to extract this static (not | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "harvesting" (aka entity resolution) |
||
changing on an annual basis) information, determine the true value, and rehome it in our | ||
entity tables. This reduces the amount of data we have to store and provides users with | ||
a master list of all entitiy types and their characteristics that are reported to a | ||
given source. Read more about our harvesting process in :mod:`pudl.transform.eia`. | ||
""", | ||
}, | ||
"static_eia": { | ||
"title": "EIA Static Tables", | ||
"description": """Static EIA tables are like EIA entity tables in that they pull | ||
from multiple EIA tables, but their purpose is to elucidate encoded language. This is | ||
where acromyns are connected to their full spelling and descriptions. These tables did | ||
not originate as raw EIA tables, rather they are created by us to link commonly used | ||
codes to important descriptors.""", | ||
}, | ||
"static_eia_disabled": { | ||
"title": "EIA Static Tables (disabled)", | ||
"description": "Disabled tables are no longer included in the PUDL DB.", | ||
}, | ||
"eia860": { | ||
"title": "EIA 860", | ||
"description": """Tables derrived from the EIA Form 860. See our | ||
:doc:`../data_sources/eia860` page for more information.""", | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Embedding hard-coded relative paths to particular RST documents within the string descriptions is going to be fragile. It means there's only one location in the documentation that this description can be used to refer to that other page. If it's an absolute document path instead it would work from anywhere (so long as we don't change the structure of the documentation files), but ideally this should probably be a separate piece of metadata that's attached to the data source, and each But since the etl group organization is probably going to impacted or replaced by the use of Dagster to organize the ETL, I'm not sure it makes sense to spend time on that. |
||
}, | ||
"eia861": { | ||
"title": "EIA 861", | ||
"description": "Tables derrived from the EIA Form 861.", | ||
}, | ||
"eia923": { | ||
"title": "EIA 923", | ||
"description": """Tables derrived from the EIA Form 923. See our | ||
:doc:`../data_sources/eia923` page for more information.""", | ||
}, | ||
"eia_bulk_elec": { | ||
"title": "EIA Bulk Electricity Tables", | ||
"description": "Blah", | ||
}, | ||
"epacems": { | ||
"title": "EPA CEMS", | ||
"description": """Tables derrived from the EPA CEMS data. See our | ||
:doc:`../data_sources/epacems` page for more information""", | ||
}, | ||
"static_ferc1": { | ||
"title": "FERC Form 1 Static Tables", | ||
"description": """Static FERC Form 1 tables elucidate encoded language. This is | ||
where acromyns are connected to their full spelling and descriptions. These tables did | ||
not originate as raw FERC Form 1 tables, rather they are created by us to link commonly | ||
used codes to important descriptors.""", | ||
}, | ||
"ferc1": { | ||
"title": "FERC Form 1", | ||
"description": """Tables derrived from the FERC Form 1 tables. See our | ||
:doc:`../data_sources/ferc1` page for more information""", | ||
}, | ||
"ferc1_disabled": { | ||
"title": "FERC Form 1 disabled", | ||
"description": "Disabled tables are no longer included in the PUDL DB.", | ||
}, | ||
"ferc714": { | ||
"title": "FERC Form 714", | ||
"description": "Tables derrived from the FERC Form 714 tables", | ||
}, | ||
"glue": { | ||
"title": "Glue Tables", | ||
"description": "Tables connecting information from multiple sources", | ||
}, | ||
"outputs": { | ||
"title": "Output Tables", | ||
"description": "Blah", | ||
Comment on lines
+230
to
+232
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I left this as blah because we still haven't really incorporated this yet and there's a lot there. Also eventually, maybe output is too big a category and we'll do something like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we should explain here that these tables are not in the database but we use some of the same schema / typing information for dataframes that we derive from database tables. In the near term we're going to need to restructure a lot of this using something other than ETL group, which is really just an ad-hoc grouping of ETL tasks, based on where various tables were getting processed in our old ETL structure, which is completely replaced by Dagster. |
||
}, | ||
"static_pudl": { | ||
"title": "Static PUDL Tables", | ||
"description": """Static PUDL tables elucidate encoded language. This is | ||
where acromyns are connected to their full spelling and descriptions. They're created | ||
to link commonly used codes to important descriptors.""", | ||
Comment on lines
+236
to
+238
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This description isn't really helping me understand. There's really only one static data table -- the political jurisdictions -- which is information we compiled by hand since it was showing up in a lot of different places and we wanted it in the DB. The |
||
}, | ||
} | ||
"""Table categorization by ETL group.""" | ||
|
||
KEYWORDS: dict[str, list[str]] = { | ||
"electricity": [ | ||
"electricity", | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you're comparing two instances of
ETLGroup
using==
. Are you sure that does what you want? How is equality defined for these classes? Do you really want to check whether the ETLGroup names (which are strings) are the same?