Commit

Integrate final 2023 EIA 860 data (#3871)
* Update extraction and drop non-existent generators

* Stash in-progress ID mapping

* Update PUDL ID mapping and docs, update data updates documentation

* Check in helpful mapping notebook

* Update release notes and validation tests

* Drop capacity from mapping spreadsheet, fix release notes

* Remove duplicated utility ID

---------

Co-authored-by: Zane Selvans <[email protected]>
e-belfer and zaneselvans authored Sep 30, 2024
1 parent 9255b64 commit 7cff938
Showing 13 changed files with 364 additions and 111 deletions.
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/existing_data_updates.md
@@ -16,7 +16,7 @@ Based on the [Existing Data Updates Docs](https://catalystcoop-pudl.readthedocs.
- [ ] [Update table and column transformations](https://catalystcoop-pudl.readthedocs.io/en/nightly/dev/existing_data_updates.html#update-table-column-transformations)
- [ ] [Update the PUDL db schema](https://catalystcoop-pudl.readthedocs.io/en/nightly/dev/existing_data_updates.html#update-the-pudl-db-schema)
- [ ] [Connect datasets](https://catalystcoop-pudl.readthedocs.io/en/nightly/dev/existing_data_updates.html#connect-datasets)
- [ ] [Update the output routines](https://catalystcoop-pudl.readthedocs.io/en/nightly/dev/existing_data_updates.html#update-the-output-routines-and-run-full-tests)
- [ ] [Run the ETL](https://catalystcoop-pudl.readthedocs.io/en/nightly/dev/existing_data_updates.html#run-the-etl)
- [ ] [Update the output routines and run full tests](https://catalystcoop-pudl.readthedocs.io/en/nightly/dev/existing_data_updates.html#update-the-output-routines-and-run-full-tests)
- [ ] [Run and update data validations](https://catalystcoop-pudl.readthedocs.io/en/nightly/dev/existing_data_updates.html#run-and-update-data-validations)
- [ ] [Update the documentation](https://catalystcoop-pudl.readthedocs.io/en/nightly/dev/existing_data_updates.html#update-the-documentation)
123 changes: 123 additions & 0 deletions devtools/pudl_id_mapping_help.ipynb
@@ -0,0 +1,123 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "f22a1df7",
"metadata": {},
"source": [
"# PUDL ID Mapping Help\n",
"\n",
"This notebook helps to support the manual mapping of FERC to EIA plant IDs. See the [PUDL ID mapping](https://catalystcoop-pudl.readthedocs.io/en/latest/dev/pudl_id_mapping.html) documentation for more information."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5d7c4cba-9e4c-4f00-b120-046d63929ed7",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import pudl\n",
"import pudl.logging_helpers\n",
"from pudl.etl import default_assets, defs\n",
"\n",
"logger = pudl.logging_helpers.get_logger(__name__)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "acada4d1-cb5e-4a9b-a372-52596e2cf5f0",
"metadata": {},
"outputs": [],
"source": [
"plants_eia = defs.load_asset_value(\"out_eia__yearly_plants\")\n",
"plants_pudl = defs.load_asset_value(\"core_pudl__entity_plants_pudl\")\n",
"plants_ferc = defs.load_asset_value(\"out_ferc1__yearly_all_plants\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3f0e2a0c-c94b-4b11-88f9-720094bb768a",
"metadata": {},
"outputs": [],
"source": [
"cols_eia = [\"plant_id_pudl\",\"plant_id_eia\",\"plant_name_eia\",\"utility_name_eia\",\"city\",\"county\", \"latitude\",\"longitude\",\"state\"]\n",
"cols_ferc = [\"plant_id_pudl\",\"plant_id_ferc1\",\"plant_name_ferc1\", \"utility_name_ferc1\", \"capacity_mw\", \"record_id\"]\n",
"plants = pd.merge(\n",
" plants_pudl,\n",
" plants_eia[cols_eia].drop_duplicates(),\n",
" how=\"outer\",\n",
" on=[\"plant_id_pudl\"],\n",
" validate=\"1:m\"\n",
").merge(\n",
" plants_ferc[cols_ferc].drop_duplicates(subset=[col for col in cols_ferc if col != \"record_id\"]),\n",
" how=\"outer\",\n",
" on=[\"plant_id_pudl\"],\n",
" suffixes=(\"_eia\", \"_ferc\")\n",
")\n",
"plants.plant_name_eia = plants.plant_name_eia.str.lower()"
]
},
{
"cell_type": "markdown",
"id": "607c048f",
"metadata": {},
"source": [
"Use the snippet of code below to speed up searching for plant matches. Update the matching ID value in the spreadsheet by linking it to the cell, _not_ by hard-coding the value!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8fda99b8-091f-4c16-958b-51ebffbc95eb",
"metadata": {},
"outputs": [],
"source": [
"name_bit = \"richmond\"\n",
"# If the name search returns too many matches, restrict it by state:\n",
"# set the state below. Rows with no recorded state are always kept.\n",
"state = \"VT\"\n",
"plants[\n",
" (plants.plant_name_eia.str.contains(name_bit)\n",
" | plants.plant_name_pudl.str.contains(name_bit)\n",
" | plants.plant_name_ferc1.str.contains(name_bit))\n",
" & ((plants.state == state) | plants.state.isnull())\n",
"].sort_values([\"latitude\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c6accad0",
"metadata": {},
"outputs": [],
"source": [
"plants_entity = defs.load_asset_value(\"out_eia__entity_plants\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
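The search cell in the notebook above can be sketched as a standalone function. This is a minimal illustration on toy data, not part of the commit: `search_plants` is a hypothetical helper, and the column names are taken from the notebook. It assumes the name columns hold strings (or nulls), and passes `na=False` so that missing names never match.

```python
from typing import Optional

import pandas as pd


def search_plants(
    plants: pd.DataFrame, name_bit: str, state: Optional[str] = None
) -> pd.DataFrame:
    """Return rows whose EIA, PUDL, or FERC 1 plant name contains ``name_bit``.

    Hypothetical helper for illustration only; it is not a PUDL function.
    """
    name_cols = ["plant_name_eia", "plant_name_pudl", "plant_name_ferc1"]
    mask = pd.Series(False, index=plants.index)
    for col in name_cols:
        # Plain substring match, case-insensitive; null names count as no match.
        mask |= plants[col].str.contains(name_bit, case=False, regex=False, na=False)
    if state is not None:
        # Keep rows in the requested state, plus rows with no recorded state.
        mask &= (plants["state"] == state) | plants["state"].isna()
    return plants[mask].sort_values("latitude")


# Toy data standing in for the merged ``plants`` table built in the notebook.
plants = pd.DataFrame(
    {
        "plant_name_eia": ["richmond hydro", None, "chenango solar"],
        "plant_name_pudl": ["Richmond", "Richmond Station", "Chenango"],
        "plant_name_ferc1": [None, "richmond stn", None],
        "state": ["VT", None, "NY"],
        "latitude": [44.4, None, 42.5],
    }
)

richmond_vt = search_plants(plants, "richmond", state="VT")
chenango = search_plants(plants, "chen")
```

Sorting by latitude places likely co-located records next to each other, which is the same reason the notebook sorts on that column.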
33 changes: 17 additions & 16 deletions docs/dev/existing_data_updates.rst
@@ -411,36 +411,37 @@ In some cases you may need to resort to Google Maps. If no coordinates can be fo
then at least the plant's state should be included so that an approximate timezone can
be inferred.

7. Run the ETL
7. Update the Output Routines
-----------------------------
**7.1)** Update the denormalized table outputs and derived analytical routines to
accommodate the new data if necessary.

* Are there new columns that should be incorporated into the output tables?
* Are there new tables that need to have an output function defined for them?

8. Run the ETL
--------------
Once the FERC 1 and EIA utilities and plants have been associated with each other, you
can try and run the ETL with all datasets included. See: :doc:`run_the_etl`.

**7.1)** First run the ETL for just the new year of data, using the ``etl_fast.yml``
**8.1)** First run the ETL for just the new year of data, using the ``etl_fast.yml``
settings file.

**7.2)** Once the fast ETL works, run the full ETL using the ``etl_full.yml`` settings
**8.2)** Once the fast ETL works, run the full ETL using the ``etl_full.yml`` settings
to populate complete FERC 1 & PUDL DBs and EPA CEMS Parquet files.

8. Update the Output Routines and Run Full Tests
------------------------------------------------
**8.1)** With a full PUDL DB, update the denormalized table outputs and derived
analytical routines to accommodate the new data if necessary. These are generally
called from within the :class:`pudl.output.pudltabl.PudlTabl` class.

* Are there new columns that should be incorporated into the output tables?
* Are there new tables that need to have an output function defined for them?
9. Run and Update Data Validations
----------------------------------

**8.2)** To ensure that you fully exercise all of the possible output functions,
**9.1)** To ensure that you fully exercise all of the possible output functions,
run all the integration tests against your live PUDL DB with:

.. code-block:: console

    $ make pytest-integration-full

9. Run and Update Data Validations
-----------------------------------
**9.1)** When the CI tests are passing against all years of data, sanity check the data
**9.2)** When the CI tests are passing against all years of data, sanity check the data
in the database and the derived outputs by running

.. code-block:: console
@@ -450,12 +451,12 @@ in the database and the derived outputs by running
We expect at least some of the validation tests to fail initially because we haven't
updated the number of records we expect to see in each table.

**9.2)** You may also need to update the expected distribution of fuel prices if they
**9.3)** You may also need to update the expected distribution of fuel prices if they
were particularly high or low in the new year of data. Other values like expected heat
content per unit of fuel should be relatively stable. If the required adjustments are
large, or there are other types of validations failing, they should be investigated.

**9.3)** Update the expected number of rows in the minmax_row validation tests. Pay
**9.4)** Update the expected number of rows in the minmax_row validation tests. Pay
attention to how far off of previous expectations the new tables are. E.g. if there
are already 20 years of data, and you're integrating 1 new year of data, probably the
number of rows in the tables should be increasing by around 5% (since 1/20 = 0.05).
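The rule of thumb above can be written out as a quick check. This is a rough sketch under the assumption that each year contributes a similar number of rows; `expected_row_growth` and `growth_looks_plausible` are hypothetical helpers, not part of PUDL's validation tests, and the 50% tolerance is an arbitrary illustrative choice.

```python
def expected_row_growth(existing_years: int, new_years: int = 1) -> float:
    """Fractional row growth expected if every year has a similar row count."""
    return new_years / existing_years


def growth_looks_plausible(
    old_rows: int,
    new_rows: int,
    existing_years: int,
    new_years: int = 1,
    tolerance: float = 0.5,
) -> bool:
    """Check that observed growth is within ``tolerance`` of the expectation.

    ``tolerance`` is relative: 0.5 accepts observed growth between 50% and
    150% of the expected growth. The threshold is an assumption.
    """
    expected = expected_row_growth(existing_years, new_years)
    observed = (new_rows - old_rows) / old_rows
    return abs(observed - expected) <= tolerance * expected


# 20 existing years plus 1 new year: about 5% more rows expected.
print(expected_row_growth(20))
print(growth_looks_plausible(100_000, 105_200, existing_years=20))
print(growth_looks_plausible(100_000, 130_000, existing_years=20))
```

A table that fails a check like this is not necessarily wrong, but it is a prompt to look at why the new year differs so much from the previous ones before updating the expected row counts.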
32 changes: 18 additions & 14 deletions docs/dev/pudl_id_mapping.rst
@@ -91,13 +91,16 @@ a complete database based on the settings files stored in
unmapped IDs to the ``devtools/ferc1-eia-glue`` directory that correspond to unmapped
plants and utilities from FERC 1 and EIA.

If you have already generated a databse without foreign-key constraints, you can run
If you have already generated a database without foreign-key constraints, you can run
just the script that extracts the unmapped IDs with:

.. code-block:: console

    $ pytest test/integration/glue_test.py --live-dbs --save-unmapped-ids

The ``--save-unmapped-ids`` flag saves unmapped plants and utilities in the
``devtools/ferc1-eia-glue`` folder by default.

Assigning PUDL IDs to Unmapped Records
--------------------------------------

@@ -114,10 +117,11 @@ same PUDL ID.
Mapping Plants
^^^^^^^^^^^^^^

The ``unmapped_plants_ferc1/eia.csv`` files should display basic plant information such
as the facility name, utility name, and capacity. We show capacity here so that we can
prioritize which plants to map. The larger the capacity, the more important it is to get
it mapped. Sort the records by capacity so the highest priority records at the top.
The ``missing_plant_id_in_plants_ferc1/eia.csv`` files should display basic plant
information such as the facility name, utility name, and capacity. We show capacity here
so that we can prioritize which plants to map. The larger the capacity, the more
important it is to get it mapped. Sort the records by capacity so the highest priority
records are at the top.

From the FERC and EIA unmapped plants spreadsheets, copy the ``plant_id_eia`` (only in
EIA), ``plant_name_ferc1/eia``, ``utility_id_ferc1/eia``, and ``utility_name_ferc1/eia``
@@ -160,7 +164,8 @@ plant name string (e.g. for ``chenango solar``, you could search for ``chen``,
or ``chenan``). Searching the entire plant tab helps find other records within
both FERC and EIA that may be the same or part of the same facility. Searching
for a piece can help catch misspellings in the plant name, which are more common
in the FERC records.
in the FERC records. Use the ``devtools/pudl_id_mapping_help.ipynb`` notebook to speed
up this process.

* **If co-located EIA plants have distinct plant IDs and no FERC 1 plant:**
they should not be lumped under a single PUDL Plant ID, as that artificially
@@ -202,17 +207,16 @@ Linking FERC1-EIA Records
The following section needs to be updated to include new steps for mapping FERC1
XBRL utilities with DBF utilities.

Copy the information output to the ``unmapped_utils_eia/ferc1.csv`` files and paste it
in the appropriate columns at the bottom of the ``pudl_id_mapping.xlsx`` sheet. Note
that FERC 1 utility information goes in the left-hand columns and EIA utility
information goes in the right-hand columns.
Copy the information output to the ``missing_utility_id_in_utilities_eia/ferc1.csv``
files and paste it in the appropriate columns at the bottom of the
``utility_id_pudl.csv`` sheet. Note that FERC 1 utility information goes in the
left-hand columns and EIA utility information goes in the right-hand columns.

Next, you'll have to manually assign ``utility_id_pudl`` values to each row. There is no
formula you can drag down, so just find the largest ``utility_id_pudl`` and create new
values incrementing from there. To double check whether a utility has already appeared,
drag down the formulas in the ``check_utility_id_ferc1`` and ``check_utility_id_eia``
columns. If there's a match, the correct ``utility_id_pudl`` will show up in the column,
and you can create a reference to the original ``utility_id_pudl`` assignment above.
search by name, creating a reference to the original ``utility_id_pudl`` assignment
above.

Make sure to save the file when you're done!
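
The "find the largest ``utility_id_pudl`` and increment from there" rule can be sketched in pandas. This only illustrates the numbering step and assumes at least one ID is already assigned; it does not replace the manual check for utilities that have already appeared. `assign_new_pudl_ids` is a hypothetical helper and the toy rows are invented.

```python
import pandas as pd


def assign_new_pudl_ids(utils: pd.DataFrame) -> pd.DataFrame:
    """Fill missing ``utility_id_pudl`` values, counting up from the current max.

    Hypothetical helper for illustration; assumes at least one existing ID.
    """
    out = utils.copy()
    missing = out["utility_id_pudl"].isna()
    start = int(out["utility_id_pudl"].max()) + 1
    new_ids = list(range(start, start + int(missing.sum())))
    out.loc[missing, "utility_id_pudl"] = new_ids
    out["utility_id_pudl"] = out["utility_id_pudl"].astype("int64")
    return out


# Toy rows: two utilities already mapped, two newly pasted in without IDs.
utils = pd.DataFrame(
    {
        "utility_name_eia": ["Alpha Power", "Beta Electric", "Gamma Coop", "Delta Light"],
        "utility_id_pudl": [1, 2, None, None],
    }
)
mapped = assign_new_pudl_ids(utils)
```

Existing assignments are left untouched, so a row that turns out to duplicate an earlier utility should still be pointed at the original ``utility_id_pudl`` by hand.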

@@ -235,5 +239,5 @@ Integrating Newly Mapped Records into PUDL
Once you’ve successfully mapped all unmapped PUDL IDs, you’ll want to rerun the ETL!
This ensures that the newly mapped IDs get integrated into the PUDL database and output
tables that folks are using. Make sure to tell everyone else to do so as well so that
you can all use the newly mapped PUDL IDs. But furst, make sure to head back to the
you can all use the newly mapped PUDL IDs. But first, make sure to head back to the
:doc:`existing_data_updates` page to wrap up the validation tests!
4 changes: 4 additions & 0 deletions docs/release_notes.rst
@@ -9,6 +9,10 @@ v2024.X.x (2024-XX-XX)
New Data Coverage
^^^^^^^^^^^^^^^^^

EIA 860
~~~~~~~
* Added EIA 860 final release data from 2023. See :issue:`3684` and PR :pr:`3871`.

FERC Form 714
~~~~~~~~~~~~~
* Integrate 2021-2023 years of the FERC Form 714 data. FERC updated its reporting
4 changes: 1 addition & 3 deletions src/pudl/output/eia.py
@@ -145,18 +145,16 @@ def _out_eia__yearly_generators(
)

# If any generator data is completely empty, drop it.
# These are five known generators that originate from harvesting the plant and
# These are three known generators that originate from harvesting the plant and
# generator IDs found in the plant_id_eia_direct_support_x and
# generator_id_direct_support_x in EIA 860 energy storage tables, in
# order to enable foreign key relationships with these columns.
# They do not show up in any other tables and thus lack data in all columns.
# For more, see issue #3695 and PR #3699.
empty_generator_ids = [
(9170, "3093", "2023-01-01"),
(18170, "B8170", "2023-01-01"),
(34516, "SOL1", "2023-01-01"),
(64966, "GEN1", "2023-01-01"),
(60321, "PV1", "2023-01-01"),
]
out_df = drop_all_null_records_with_multiindex(
out_df, ["plant_id_eia", "generator_id", "report_date"], empty_generator_ids
