FERC 714: transform of hourly demand table (dbf +xbrl) #3842

Merged 47 commits on Sep 26, 2024
Changes from 45 commits
Commits
7d3b11d
first very wip draft of transofmring the hourly 714 table
cmgosnell Sep 12, 2024
f48054b
early processing of datetimes and initial cleaning of timezone codes
cmgosnell Sep 13, 2024
ea25923
lil function suffix cleanup
cmgosnell Sep 16, 2024
b1d5c0a
group the table-specific transforms into staticmethods of a table tra…
cmgosnell Sep 17, 2024
f5bfc62
yay add the hour into the csv report_date early so i'm not oopsies lo…
cmgosnell Sep 17, 2024
b0d1219
lil extra doc clean
cmgosnell Sep 17, 2024
f27b04b
Merge branch 'main' into transform-714-xbrl
cmgosnell Sep 17, 2024
c1e66af
Map FERC 714 XBRL and CSV IDs (#3849)
e-belfer Sep 17, 2024
05408d5
Preliminary fixes to the 714 data source page
aesharpe Sep 17, 2024
0054b4d
integrate the respondent_id_ferc714 map into transforms
cmgosnell Sep 17, 2024
6936a38
Merge branch 'main' into transform-714-xbrl
cmgosnell Sep 18, 2024
1986b80
Add notes on CSV-XBRL ID linkage to docs
e-belfer Sep 18, 2024
46ba25e
Merge branch 'transform-714-xbrl' into ferc714-data-source
e-belfer Sep 18, 2024
780ebb7
Write preliminary transform class and function for XBRL and CSV core_…
aesharpe Sep 19, 2024
978e664
wip first round of respondent table transforming
cmgosnell Sep 19, 2024
30d34d8
Combine XBRL and CSV tables
aesharpe Sep 19, 2024
bbd6e9a
Add forecast to forecast column names
aesharpe Sep 19, 2024
37a0643
Add migration file for new forecast cols
aesharpe Sep 19, 2024
42716aa
finish eia_code mapping and wrap up transforms
cmgosnell Sep 20, 2024
858b744
udpate docs
cmgosnell Sep 20, 2024
549a5ab
udpate docs again lol spaces
cmgosnell Sep 20, 2024
afc60ac
fix forcast to forecast type and add to run() docstring
aesharpe Sep 20, 2024
75e29ae
convert :meth: to :func:
aesharpe Sep 20, 2024
e4c3b95
lower expected forecast year range
aesharpe Sep 20, 2024
cfc9176
fix docs typo
aesharpe Sep 20, 2024
4a3c2bf
Use split/apply/combine for deduping and update assertion
aesharpe Sep 20, 2024
a459344
responding to pr comments mostly doc updates
cmgosnell Sep 23, 2024
956af05
Add new years to Ferc714CheckSpec
aesharpe Sep 23, 2024
6bb79de
update docs
cmgosnell Sep 23, 2024
5cb6492
Merge pull request #3856 from catalyst-cooperative/transform-714-year…
aesharpe Sep 23, 2024
10692cb
Merge pull request #3857 from catalyst-cooperative/transform-714-xbrl…
cmgosnell Sep 23, 2024
94752b5
first pass of adding respondend id tables
cmgosnell Sep 23, 2024
df95744
Merge branch 'transform-714-xbrl' into ferc714-data-source
aesharpe Sep 23, 2024
48e9ff0
add alembic migration for the glue tables
cmgosnell Sep 23, 2024
33085e6
remove the lil post process step
cmgosnell Sep 23, 2024
66b908d
Light edits
aesharpe Sep 23, 2024
25fb6dd
Merge branch 'ferc714-data-source' of https://github.com/catalyst-coo…
aesharpe Sep 23, 2024
74e8fe0
release notes and metadata updates
cmgosnell Sep 23, 2024
0c811e6
Merge pull request #3850 from catalyst-cooperative/ferc714-data-source
aesharpe Sep 23, 2024
9b938c3
Add table description for annual forecast table and fix indentation e…
aesharpe Sep 24, 2024
91c8859
update docs and metadata, plus stop trying to impute midnight jan 1st…
cmgosnell Sep 24, 2024
b101086
Merge branch 'main' into transform-714-xbrl
cmgosnell Sep 24, 2024
d4d93fc
update the validation test expectations for the analysis downstream s…
cmgosnell Sep 24, 2024
708a413
update the settinggggsss omigosh plus restrict the imputations based …
cmgosnell Sep 25, 2024
706c7df
add module-level design notes
cmgosnell Sep 25, 2024
551325d
add move color to the fast test 12 assertion
cmgosnell Sep 25, 2024
20959c8
remove the lil context thing that is no longer necessary
cmgosnell Sep 25, 2024
10 changes: 10 additions & 0 deletions docs/release_notes.rst
@@ -6,6 +6,16 @@ PUDL Release Notes
v2024.X.x (2024-XX-XX)
---------------------------------------------------------------------------------------

New Data Coverage
^^^^^^^^^^^^^^^^^

FERC Form 714
~~~~~~~~~~~~~
* Integrate the 2021-2023 FERC Form 714 data. FERC changed its reporting format
  in 2021 from CSV files to XBRL files. This update integrates the two
  raw data sources and extends the data coverage through 2023. See :issue:`3809`
  and :pr:`3842`.

Schema Changes
^^^^^^^^^^^^^^
* Added :ref:`out_eia__yearly_assn_plant_parts_plant_gen` table. This table associates
76 changes: 57 additions & 19 deletions docs/templates/ferc714_child.rst.jinja
@@ -1,6 +1,9 @@
{% extends "data_source_parent.rst.jinja" %}

{% block background %}
FERC Form 714, otherwise known as the Annual Electric Balancing Authority Area and
Planning Area Report, collects data and provides insights about balancing authority
area and planning area operations.

{% endblock %}

Expand All @@ -13,28 +16,21 @@
{% block availability %}
The data we've integrated from FERC Form 714 includes:

* hourly electricity demand by utility or balancing authority from 2006-2020
* a table identifying the form respondents including their EIA utility or balancing
* Hourly electricity demand by utility or balancing authority.
* Annual demand forecast.
* A table identifying the form respondents including their EIA utility or balancing
  authority ID, which allows us to link the FERC-714 data to other information
  reported in :doc:`eia860` and :doc:`eia861`.

We have not yet had the opportunity to work with the most recent FERC-714 data (2021 and
later), which is now being published using the new XBRL format.

The hourly demand data for 2006-2020 is about 15 million records. There are about 200
respondents that show up in the respondents table.

WIth the EIA IDs, we link the hourly electricity demand to a particular georgraphic
region at the county level, because utilities and balancing authorities report their
service territories in :ref:`core_eia861__yearly_service_territory`, and from that
information we can estimate historical hourly electricity demand by state.
With the EIA IDs we can link the hourly electricity demand to a particular geographic
region at the county level because utilities and balancing authorities report their
service territories in :ref:`core_eia861__yearly_service_territory`. From that
information we estimate historical hourly electricity demand by state.

Plant operators reported in :ref:`core_eia860__scd_plants` and generator ownership
information reported in :ref:`core_eia860__scd_ownership` are linked to
:ref:`core_eia860__scd_utilities` and :ref:`core_eia861__yearly_balancing_authority` and
so can also be linked to the :ref:`core_ferc714__respondent_id` table, as well as the
:ref:`core_epacems__hourly_emissions` unit-level emissions and generation data reported
in :doc:`epacems`.
can therefore be linked to the :ref:`core_ferc714__respondent_id` table.

{% endblock %}

Expand All @@ -56,32 +52,44 @@ formats:
* **2021-present**: Standardized electronic filing using the XBRL (eXtensible Business
  Reporting Language) dialect of XML.

We only have plans to integrate the data from the standardized electronic reporting era
since the format of the earlier data varies for each reporting balancing authority and
utility, and would be very labor intensive to parse and reconcile.
We only plan to integrate the data from the standardized electronic reporting era
(2006+) since the format of the earlier data varies for each reporting balancing authority
and utility, and would be very labor intensive to parse and reconcile.

{% endblock %}

{% block notable_irregularities %}

Timezone errors
---------------

The original hourly electricity demand time series is plagued with timezone and daylight
savings vs. standard time irregularities, which we have done our best to clean up. The
timestamps in the clean data are all in UTC, with a timezone code stored in a separate
column, so that the times can be easily localized or converted. It's certainly not
perfect, but it's much better than the original data and it's easy to work with!
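
As a rough illustration of how UTC timestamps plus a separate timezone code can be
converted back to local time, here is a minimal sketch. The timezone codes, the
fixed standard-time offsets, and the record layout are all assumptions made for this
example, not the actual PUDL schema, and the sketch deliberately ignores daylight
saving time.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical mapping from a timezone code to a fixed standard-time UTC offset.
# These codes and offsets are illustrative assumptions, not PUDL's actual values.
STANDARD_OFFSETS = {
    "EST": timezone(timedelta(hours=-5), "EST"),
    "CST": timezone(timedelta(hours=-6), "CST"),
    "MST": timezone(timedelta(hours=-7), "MST"),
    "PST": timezone(timedelta(hours=-8), "PST"),
}


def to_local_standard_time(utc_timestamp: datetime, tz_code: str) -> datetime:
    """Convert a timezone-aware UTC timestamp to local standard time."""
    if utc_timestamp.tzinfo is None:
        raise ValueError("expected a timezone-aware UTC timestamp")
    return utc_timestamp.astimezone(STANDARD_OFFSETS[tz_code])


# Made-up records: (UTC timestamp, timezone code, demand in MWh).
records = [
    (datetime(2020, 1, 1, 5, 0, tzinfo=timezone.utc), "EST", 1200.0),
    (datetime(2020, 1, 1, 8, 0, tzinfo=timezone.utc), "PST", 950.0),
]

local_records = [
    (to_local_standard_time(ts, code), code, mwh) for ts, code, mwh in records
]
```

Keeping the stored timestamps in UTC means rows from different respondents can be
compared directly, and the conversion to local time only happens when it is needed.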

Sign errors
-----------

Not all respondents use the same sign convention for reporting "demand." The vast
majority consider demand / load that they serve to be a positive number, and so we've
standardized the data to use that convention.
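
One simple way to standardize the sign convention is sketched below. The
median-based heuristic and the function name are assumptions made for illustration;
the text above does not say how PUDL detects which respondents report negative
demand.

```python
from statistics import median


def standardize_sign(demand_mwh: list[float]) -> list[float]:
    """Return the series with demand served expressed as positive numbers.

    Assumption for this sketch: a respondent whose median reported value is
    negative is using the opposite sign convention, so its series is flipped.
    """
    if demand_mwh and median(demand_mwh) < 0:
        return [-value for value in demand_mwh]
    return list(demand_mwh)


# Made-up values for a respondent that reports load as negative numbers.
flipped = standardize_sign([-100.0, -120.0, 5.0])
unchanged = standardize_sign([100.0, 90.0])
```

Flipping the whole series, rather than taking absolute values, preserves any
genuinely opposite-signed observations relative to the rest of the series.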

Reporting gaps
--------------

There are a lot of reporting gaps, especially for smaller respondents. Sometimes these
are brief, and sometimes they are entire years. There are also a number of outliers and
suspicious values (e.g. a long series of identical consecutive values). We have some
tools that we've built to clean up these outliers in
:mod:`pudl.analysis.timeseries_cleaning`.
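
To make the "long series of identical consecutive values" case concrete, here is a
minimal sketch that flags such runs. It is not the algorithm used in
:mod:`pudl.analysis.timeseries_cleaning`; the function name and the run-length
threshold are assumptions chosen for the example.

```python
from itertools import groupby


def flag_identical_runs(values: list[float], min_run_length: int = 4) -> list[bool]:
    """Flag every value that sits in a run of identical consecutive values.

    The threshold of 4 is an arbitrary assumption for this sketch.
    """
    flags: list[bool] = []
    for _, run in groupby(values):
        run_length = len(list(run))
        flags.extend([run_length >= min_run_length] * run_length)
    return flags


# Made-up hourly demand with a suspicious flat stretch in the middle.
hourly_demand = [510.0, 498.0, 498.0, 498.0, 498.0, 530.0]
suspicious = flag_identical_runs(hourly_demand, min_run_length=4)
```

The returned list has one flag per input value, so it can be used directly as a
mask when nulling out or inspecting the flagged hours.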

Respondent-to-balancing-authority inconsistencies
-------------------------------------------------

Because utilities and balancing authorities occasionally change their service
territories or merge, the demand reproted by any individual "respondent" may correspond
territories or merge, the demand reported by any individual "respondent" may correspond
to wildly different consumers in different years. To make it at least somewhat possible
to compare the reported data across time, we've also compiled historical service
territory maps for the respondents based on data reported in :doc:`eia861`. However,
@@ -93,4 +101,34 @@ be found in :mod:`pudl.analysis.service_territory` and :mod:`pudl.analysis.spati
The :mod:`pudl.analysis.state_demand` script brings together all of the above to
estimate historical hourly electricity demand by state for 2006-2020.

Combining XBRL and CSV data
---------------------------

The format of the company identifiers (CIDs) used in the CSV data (2006-2020) and the
XBRL data (2021+) differs. To link respondents between both data formats, we manually
map the IDs from both datasets and create a ``respondent_id_ferc714`` in
:mod:`pudl.package_data.glue.respondent_id_ferc714.csv`.

This CSV builds on the `migrated data
<https://www.ferc.gov/filing-forms/eforms-refresh/migrated-data-downloads>`__ provided
by FERC during the transition from CSV to XBRL data, which notes that:

    Companies that did not have a CID prior to the migration have been assigned a CID that
    begins with R, i.e., a temporary RID. These RIDs will be replaced in future with the
    accurate CIDs and new datasets will be published.

The file names of the migrated data (which correspond to CSV IDs) and the respondent
CIDs in the migrated files provide the basis for ID mapping. Though CIDs are intended to
be static, some of the CIDs in the migrated data weren't found in the actual XBRL data,
and the same respondents were reporting data using different CIDs. To ensure accurate
record matching, we manually reviewed the CIDs for each respondent, matching based on
name and location. Some quirks to note:

* Every matched respondent is mapped 1:1 between the CSV and XBRL data. Respondents
  without a match mostly result from mergers, splits, acquisitions, and companies that
  no longer exist.
* Some CIDs assigned during the migration process do not appear in the data. Because
  FERC intends these CIDs to be permanent, they are still included in the mapping CSV
  in case these respondents reappear. All temporary IDs (beginning with R) were
  removed.
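
The mapping described above can be sketched as a pair of lookups from each native ID
to the PUDL-assigned ``respondent_id_ferc714``. The file layout and every ID value
below are made up for illustration; only the three column names come from this
change, and the real glue CSV may be structured differently.

```python
import csv
import io

# Hypothetical glue file contents. All ID values here are invented.
GLUE_CSV = """respondent_id_ferc714,respondent_id_ferc714_csv,respondent_id_ferc714_xbrl
1,101,C000001
2,102,C000002
3,103,
"""


def build_lookups(glue_text: str) -> tuple[dict[int, int], dict[str, int]]:
    """Build CSV-ID and XBRL-ID lookups to the PUDL-assigned respondent ID."""
    csv_to_pudl: dict[int, int] = {}
    xbrl_to_pudl: dict[str, int] = {}
    for row in csv.DictReader(io.StringIO(glue_text)):
        pudl_id = int(row["respondent_id_ferc714"])
        # A respondent may appear in only one of the two reporting eras,
        # so either native ID can be blank.
        if row["respondent_id_ferc714_csv"]:
            csv_to_pudl[int(row["respondent_id_ferc714_csv"])] = pudl_id
        if row["respondent_id_ferc714_xbrl"]:
            xbrl_to_pudl[row["respondent_id_ferc714_xbrl"]] = pudl_id
    return csv_to_pudl, xbrl_to_pudl


csv_to_pudl, xbrl_to_pudl = build_lookups(GLUE_CSV)
```

With lookups like these, records from either era can be tagged with the same
``respondent_id_ferc714`` before the CSV and XBRL tables are concatenated.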

{% endblock %}
@@ -0,0 +1,91 @@
"""Add my cool lil respondent id glue tables and other 714 xbrl updates

Revision ID: 8fffc1d0399a
Revises: a93bdb8d4fbd
Create Date: 2024-09-24 09:28:45.862748

"""
from alembic import op
import sqlalchemy as sa


# revision identifiers, used by Alembic.
revision = '8fffc1d0399a'
down_revision = 'a93bdb8d4fbd'
branch_labels = None
depends_on = None


def upgrade() -> None:
    # ### commands auto generated by Alembic - please adjust! ###
    op.create_table('core_pudl__assn_ferc714_pudl_respondents',
    sa.Column('respondent_id_ferc714', sa.Integer(), nullable=False, comment='PUDL-assigned ID identifying a respondent to FERC Form 714. This ID associates natively reported respondent IDs from the original CSV and XBRL data sources.'),
    sa.PrimaryKeyConstraint('respondent_id_ferc714', name=op.f('pk_core_pudl__assn_ferc714_pudl_respondents'))
    )
    op.create_table('core_pudl__assn_ferc714_csv_pudl_respondents',
    sa.Column('respondent_id_ferc714', sa.Integer(), nullable=False, comment='PUDL-assigned ID identifying a respondent to FERC Form 714. This ID associates natively reported respondent IDs from the original CSV and XBRL data sources.'),
    sa.Column('respondent_id_ferc714_csv', sa.Integer(), nullable=False, comment='FERC Form 714 respondent ID from CSV reported data - published from years: 2006-2020. This ID is linked to the newer years of reported XBRL data through the PUDL-assigned respondent_id_ferc714 ID. This ID was originally reported as respondent_id. Note that this ID does not correspond to FERC respondent IDs from other forms.'),
    sa.ForeignKeyConstraint(['respondent_id_ferc714'], ['core_pudl__assn_ferc714_pudl_respondents.respondent_id_ferc714'], name=op.f('fk_core_pudl__assn_ferc714_csv_pudl_respondents_respondent_id_ferc714_core_pudl__assn_ferc714_pudl_respondents')),
    sa.PrimaryKeyConstraint('respondent_id_ferc714', 'respondent_id_ferc714_csv', name=op.f('pk_core_pudl__assn_ferc714_csv_pudl_respondents'))
    )
    op.create_table('core_pudl__assn_ferc714_xbrl_pudl_respondents',
    sa.Column('respondent_id_ferc714', sa.Integer(), nullable=False, comment='PUDL-assigned ID identifying a respondent to FERC Form 714. This ID associates natively reported respondent IDs from the original CSV and XBRL data sources.'),
    sa.Column('respondent_id_ferc714_xbrl', sa.Text(), nullable=False, comment='FERC Form 714 respondent ID from XBRL reported data - published from years: 2021-present. This ID is linked to the older years of reported CSV data through the PUDL-assigned respondent_id_ferc714 ID. This ID was originally reported as entity_id. Note that this ID does not correspond to FERC respondent IDs from other forms.'),
    sa.ForeignKeyConstraint(['respondent_id_ferc714'], ['core_pudl__assn_ferc714_pudl_respondents.respondent_id_ferc714'], name=op.f('fk_core_pudl__assn_ferc714_xbrl_pudl_respondents_respondent_id_ferc714_core_pudl__assn_ferc714_pudl_respondents')),
    sa.PrimaryKeyConstraint('respondent_id_ferc714', 'respondent_id_ferc714_xbrl', name=op.f('pk_core_pudl__assn_ferc714_xbrl_pudl_respondents'))
    )
    with op.batch_alter_table('core_ferc714__respondent_id', schema=None) as batch_op:
        batch_op.add_column(sa.Column('respondent_id_ferc714_csv', sa.Integer(), nullable=True, comment='FERC Form 714 respondent ID from CSV reported data - published from years: 2006-2020. This ID is linked to the newer years of reported XBRL data through the PUDL-assigned respondent_id_ferc714 ID. This ID was originally reported as respondent_id. Note that this ID does not correspond to FERC respondent IDs from other forms.'))
        batch_op.add_column(sa.Column('respondent_id_ferc714_xbrl', sa.Text(), nullable=True, comment='FERC Form 714 respondent ID from XBRL reported data - published from years: 2021-present. This ID is linked to the older years of reported CSV data through the PUDL-assigned respondent_id_ferc714 ID. This ID was originally reported as entity_id. Note that this ID does not correspond to FERC respondent IDs from other forms.'))
        batch_op.create_foreign_key(batch_op.f('fk_core_ferc714__respondent_id_respondent_id_ferc714_core_pudl__assn_ferc714_pudl_respondents'), 'core_pudl__assn_ferc714_pudl_respondents', ['respondent_id_ferc714'], ['respondent_id_ferc714'])

    with op.batch_alter_table('core_ferc714__yearly_planning_area_demand_forecast', schema=None) as batch_op:
        batch_op.add_column(sa.Column('summer_peak_demand_forecast_mw', sa.Float(), nullable=True, comment='The maximum forecasted hourly summer load (for the months of June through September).'))
        batch_op.add_column(sa.Column('winter_peak_demand_forecast_mw', sa.Float(), nullable=True, comment='The maximum forecasted hourly winter load (for the months of January through March).'))
        batch_op.add_column(sa.Column('net_demand_forecast_mwh', sa.Float(), nullable=True, comment='Net forecasted electricity demand for the specific period in megawatt-hours (MWh).'))
        batch_op.drop_constraint('fk_core_ferc714__yearly_planning_area_demand_forecast_respondent_id_ferc714_core_ferc714__respondent_id', type_='foreignkey')
        batch_op.create_foreign_key(batch_op.f('fk_core_ferc714__yearly_planning_area_demand_forecast_respondent_id_ferc714_core_pudl__assn_ferc714_pudl_respondents'), 'core_pudl__assn_ferc714_pudl_respondents', ['respondent_id_ferc714'], ['respondent_id_ferc714'])
        batch_op.drop_column('summer_peak_demand_mw')
        batch_op.drop_column('net_demand_mwh')
        batch_op.drop_column('winter_peak_demand_mw')

    with op.batch_alter_table('out_ferc714__respondents_with_fips', schema=None) as batch_op:
        batch_op.drop_constraint('fk_out_ferc714__respondents_with_fips_respondent_id_ferc714_core_ferc714__respondent_id', type_='foreignkey')
        batch_op.create_foreign_key(batch_op.f('fk_out_ferc714__respondents_with_fips_respondent_id_ferc714_core_pudl__assn_ferc714_pudl_respondents'), 'core_pudl__assn_ferc714_pudl_respondents', ['respondent_id_ferc714'], ['respondent_id_ferc714'])

    with op.batch_alter_table('out_ferc714__summarized_demand', schema=None) as batch_op:
        batch_op.drop_constraint('fk_out_ferc714__summarized_demand_respondent_id_ferc714_core_ferc714__respondent_id', type_='foreignkey')
        batch_op.create_foreign_key(batch_op.f('fk_out_ferc714__summarized_demand_respondent_id_ferc714_core_pudl__assn_ferc714_pudl_respondents'), 'core_pudl__assn_ferc714_pudl_respondents', ['respondent_id_ferc714'], ['respondent_id_ferc714'])

    # ### end Alembic commands ###


def downgrade() -> None:
    # ### commands auto generated by Alembic - please adjust! ###
    with op.batch_alter_table('out_ferc714__summarized_demand', schema=None) as batch_op:
        batch_op.drop_constraint(batch_op.f('fk_out_ferc714__summarized_demand_respondent_id_ferc714_core_pudl__assn_ferc714_pudl_respondents'), type_='foreignkey')
        batch_op.create_foreign_key('fk_out_ferc714__summarized_demand_respondent_id_ferc714_core_ferc714__respondent_id', 'core_ferc714__respondent_id', ['respondent_id_ferc714'], ['respondent_id_ferc714'])

    with op.batch_alter_table('out_ferc714__respondents_with_fips', schema=None) as batch_op:
        batch_op.drop_constraint(batch_op.f('fk_out_ferc714__respondents_with_fips_respondent_id_ferc714_core_pudl__assn_ferc714_pudl_respondents'), type_='foreignkey')
        batch_op.create_foreign_key('fk_out_ferc714__respondents_with_fips_respondent_id_ferc714_core_ferc714__respondent_id', 'core_ferc714__respondent_id', ['respondent_id_ferc714'], ['respondent_id_ferc714'])

    with op.batch_alter_table('core_ferc714__yearly_planning_area_demand_forecast', schema=None) as batch_op:
        batch_op.add_column(sa.Column('winter_peak_demand_mw', sa.FLOAT(), nullable=True))
        batch_op.add_column(sa.Column('net_demand_mwh', sa.FLOAT(), nullable=True))
        batch_op.add_column(sa.Column('summer_peak_demand_mw', sa.FLOAT(), nullable=True))
        batch_op.drop_constraint(batch_op.f('fk_core_ferc714__yearly_planning_area_demand_forecast_respondent_id_ferc714_core_pudl__assn_ferc714_pudl_respondents'), type_='foreignkey')
        batch_op.create_foreign_key('fk_core_ferc714__yearly_planning_area_demand_forecast_respondent_id_ferc714_core_ferc714__respondent_id', 'core_ferc714__respondent_id', ['respondent_id_ferc714'], ['respondent_id_ferc714'])
        batch_op.drop_column('net_demand_forecast_mwh')
        batch_op.drop_column('winter_peak_demand_forecast_mw')
        batch_op.drop_column('summer_peak_demand_forecast_mw')

    with op.batch_alter_table('core_ferc714__respondent_id', schema=None) as batch_op:
        batch_op.drop_constraint(batch_op.f('fk_core_ferc714__respondent_id_respondent_id_ferc714_core_pudl__assn_ferc714_pudl_respondents'), type_='foreignkey')
        batch_op.drop_column('respondent_id_ferc714_xbrl')
        batch_op.drop_column('respondent_id_ferc714_csv')

    op.drop_table('core_pudl__assn_ferc714_xbrl_pudl_respondents')
    op.drop_table('core_pudl__assn_ferc714_csv_pudl_respondents')
    op.drop_table('core_pudl__assn_ferc714_pudl_respondents')
    # ### end Alembic commands ###