Integrate final 2023 EIA 860 data (#3871)

* Update extraction and drop non-existent generators
* Stash in-progress ID mapping
* Update PUDL ID mapping and docs, update data updates documentation
* Check in helpful mapping notebook
* Update release notes and validation tests
* Drop capacity from mapping spreadsheet, fix release notes
* Remove duplicated utility ID

---------

Co-authored-by: Zane Selvans <[email protected]>
catalyst-cooperative · Sep 30, 2024 · 7cff938
1 parent 9255b64
commit 7cff938
Showing 13 changed files with 364 additions and 111 deletions.
diff --git a/.github/ISSUE_TEMPLATE/existing_data_updates.md b/.github/ISSUE_TEMPLATE/existing_data_updates.md
@@ -16,7 +16,7 @@ Based on the [Existing Data Updates Docs](https://catalystcoop-pudl.readthedocs.
 - [ ] [Update table and column transformations](https://catalystcoop-pudl.readthedocs.io/en/nightly/dev/existing_data_updates.html#update-table-column-transformations)
 - [ ] [Update the PUDL db schema](https://catalystcoop-pudl.readthedocs.io/en/nightly/dev/existing_data_updates.html#update-the-pudl-db-schema)
 - [ ] [Connect datasets](https://catalystcoop-pudl.readthedocs.io/en/nightly/dev/existing_data_updates.html#connect-datasets)
+- [ ] [Update the output routines](https://catalystcoop-pudl.readthedocs.io/en/nightly/dev/existing_data_updates.html#update-the-output-routines-and-run-full-tests)
 - [ ] [Run the ETL](https://catalystcoop-pudl.readthedocs.io/en/nightly/dev/existing_data_updates.html#run-the-etl)
-- [ ] [Update the output routines and run full tests](https://catalystcoop-pudl.readthedocs.io/en/nightly/dev/existing_data_updates.html#update-the-output-routines-and-run-full-tests)
 - [ ] [Run and update data validations](https://catalystcoop-pudl.readthedocs.io/en/nightly/dev/existing_data_updates.html#run-and-update-data-validations)
 - [ ] [Update the documentation](https://catalystcoop-pudl.readthedocs.io/en/nightly/dev/existing_data_updates.html#update-the-documentation)
diff --git a/devtools/pudl_id_mapping_help.ipynb b/devtools/pudl_id_mapping_help.ipynb
@@ -0,0 +1,123 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "f22a1df7",
+   "metadata": {},
+   "source": [
+    "# PUDL ID Mapping Help\n",
+    "\n",
+    "This notebook helps to support the manual mapping of FERC to EIA plant IDs. See the [PUDL ID mapping](https://catalystcoop-pudl.readthedocs.io/en/latest/dev/pudl_id_mapping.html) documentation for more information."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5d7c4cba-9e4c-4f00-b120-046d63929ed7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import pudl\n",
+    "import pudl.logging_helpers\n",
+    "from pudl.etl import default_assets, defs\n",
+    "\n",
+    "logger = pudl.logging_helpers.get_logger(__name__)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "acada4d1-cb5e-4a9b-a372-52596e2cf5f0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plants_eia = defs.load_asset_value(\"out_eia__yearly_plants\")\n",
+    "plants_pudl = defs.load_asset_value(\"core_pudl__entity_plants_pudl\")\n",
+    "plants_ferc = defs.load_asset_value(\"out_ferc1__yearly_all_plants\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3f0e2a0c-c94b-4b11-88f9-720094bb768a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cols_eia = [\"plant_id_pudl\",\"plant_id_eia\",\"plant_name_eia\",\"utility_name_eia\",\"city\",\"county\", \"latitude\",\"longitude\",\"state\"]\n",
+    "cols_ferc = [\"plant_id_pudl\",\"plant_id_ferc1\",\"plant_name_ferc1\", \"utility_name_ferc1\", \"capacity_mw\", \"record_id\"]\n",
+    "plants = pd.merge(\n",
+    "    plants_pudl,\n",
+    "    plants_eia[cols_eia].drop_duplicates(),\n",
+    "    how=\"outer\",\n",
+    "    on=[\"plant_id_pudl\"],\n",
+    "    validate=\"1:m\"\n",
+    ").merge(\n",
+    "    plants_ferc[cols_ferc].drop_duplicates(subset=[col for col in cols_ferc if col != \"record_id\"]),\n",
+    "    how=\"outer\",\n",
+    "    on=[\"plant_id_pudl\"],\n",
+    "    suffixes=(\"_eia\", \"_ferc\")\n",
+    ")\n",
+    "plants.plant_name_eia = plants.plant_name_eia.str.lower()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "607c048f",
+   "metadata": {},
+   "source": [
+    "Use the snippet of code below to speed up searching for plant matches. Update the matching ID value in the spreadsheet by linking it to the cell, _not_ by hard-coding the value!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8fda99b8-091f-4c16-958b-51ebffbc95eb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "name_bit = \"richmond\"\n",
+    "# If the name search returns too many matches, narrow it down by state:\n",
+    "# set your state below. Rows with no state recorded are kept as well.\n",
+    "state = \"VT\"\n",
+    "plants[\n",
+    "    (plants.plant_name_eia.str.contains(name_bit)\n",
+    "    | plants.plant_name_pudl.str.contains(name_bit)\n",
+    "    | plants.plant_name_ferc1.str.contains(name_bit))\n",
+    "    & ((plants.state == state) | plants.state.isnull())\n",
+    "].sort_values([\"latitude\"])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c6accad0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "plants_entity = defs.load_asset_value(\"out_eia__entity_plants\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/docs/dev/existing_data_updates.rst b/docs/dev/existing_data_updates.rst
@@ -411,36 +411,37 @@ In some cases you may need to resort to Google Maps. If no coordinates can be fo
 then at least the plant's state should be included so that an approximate timezone can
 be inferred.
 
-7. Run the ETL
+7. Update the Output Routines
+-----------------------------
+**7.1)** Update the denormalized table outputs and derived analytical routines to
+accommodate the new data if necessary.
+
+* Are there new columns that should be incorporated into the output tables?
+* Are there new tables that need to have an output function defined for them?
+
+8. Run the ETL
 --------------
 Once the FERC 1 and EIA utilities and plants have been associated with each other, you
 can try and run the ETL with all datasets included. See: :doc:`run_the_etl`.
 
-**7.1)** First run the ETL for just the new year of data, using the ``etl_fast.yml``
+**8.1)** First run the ETL for just the new year of data, using the ``etl_fast.yml``
 settings file.
 
-**7.2)** Once the fast ETL works, run the full ETL using the ``etl_full.yml`` settings
+**8.2)** Once the fast ETL works, run the full ETL using the ``etl_full.yml`` settings
 to populate complete FERC 1 & PUDL DBs and EPA CEMS Parquet files.
 
-8. Update the Output Routines and Run Full Tests
-------------------------------------------------
-**8.1)** With a full PUDL DB, update the denormalized table outputs and derived
-analytical routines to accommodate the new data if necessary. These are generally
-called from within the :class:`pudl.output.pudltabl.PudlTabl` class.
 
-* Are there new columns that should be incorporated into the output tables?
-* Are there new tables that need to have an output function defined for them?
+9. Run and Update Data Validations
+----------------------------------
 
-**8.2)** To ensure that you fully exercise all of the possible output functions,
+**9.1)** To ensure that you fully exercise all of the possible output functions,
 run all the integration tests against your live PUDL DB with:
 
 .. code-block:: console
 
     $ make pytest-integration-full
 
-9. Run and Update Data Validations
------------------------------------
-**9.1)** When the CI tests are passing against all years of data, sanity check the data
+**9.2)** When the CI tests are passing against all years of data, sanity check the data
 in the database and the derived outputs by running
 
 .. code-block:: console
@@ -450,12 +451,12 @@ in the database and the derived outputs by running
 We expect at least some of the validation tests to fail initially because we haven't
 updated the number of records we expect to see in each table.
 
-**9.2)** You may also need to update the expected distribution of fuel prices if they
+**9.3)** You may also need to update the expected distribution of fuel prices if they
 were particularly high or low in the new year of data. Other values like expected heat
 content per unit of fuel should be relatively stable. If the required adjustments are
 large, or there are other types of validations failing, they should be investigated.
 
-**9.3)** Update the expected number of rows in the minmax_row validation tests. Pay
+**9.4)** Update the expected number of rows in the minmax_row validation tests. Pay
 attention to how far off of previous expectations the new tables are. E.g. if there
 are already 20 years of data, and you're integrating 1 new year of data, probably the
 number of rows in the tables should be increasing by around 5% (since 1/20 = 0.05).
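
The row-count rule of thumb in step 9.4 can be sketched as a small check. This is only an illustration, assuming every year contributes roughly the same number of rows; the function names and the 50% relative tolerance are hypothetical and are not part of PUDL's validation tests.

```python
def expected_growth(existing_years: int, new_years: int = 1) -> float:
    """Rough fractional row growth if each year adds a similar number of rows."""
    if existing_years <= 0:
        raise ValueError("existing_years must be positive")
    return new_years / existing_years


def growth_is_plausible(
    old_rows: int,
    new_rows: int,
    existing_years: int,
    new_years: int = 1,
    tolerance: float = 0.5,  # hypothetical: allow +/-50% around the expected growth
) -> bool:
    """Return True if the observed growth is close to the expected growth."""
    expected = expected_growth(existing_years, new_years)
    actual = (new_rows - old_rows) / old_rows
    return abs(actual - expected) <= tolerance * expected


# 20 existing years plus 1 new year: expect about 5% more rows.
print(expected_growth(20))
print(growth_is_plausible(100_000, 105_000, existing_years=20))
print(growth_is_plausible(100_000, 130_000, existing_years=20))
```

A table that fails a check like this is not necessarily wrong, but a jump far from the expected value is worth investigating before updating the expected row counts.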

diff --git a/docs/dev/pudl_id_mapping.rst b/docs/dev/pudl_id_mapping.rst
@@ -91,13 +91,16 @@ a complete database based on the settings files stored in
 unmapped IDs to the ``devtools/ferc1-eia-glue`` directory that correspond to unmapped
 plants and utilities from FERC 1 and EIA.
 
-If you have already generated a databse without foreign-key constraints, you can run
+If you have already generated a database without foreign-key constraints, you can run
 just the script that extracts the unmapped IDs with:
 
 .. code-block:: console
 
     $ pytest test/integration/glue_test.py --live-dbs --save-unmapped-ids
 
+The ``--save-unmapped-ids`` flag saves unmapped plants and utilities in the
+``devtools/ferc1-eia-glue`` folder by default.
+
 Assigning PUDL IDs to Unmapped Records
 --------------------------------------
 
@@ -114,10 +117,11 @@ same PUDL ID.
 Mapping Plants
 ^^^^^^^^^^^^^^
 
-The ``unmapped_plants_ferc1/eia.csv`` files should display basic plant information such
-as the facility name, utility name, and capacity. We show capacity here so that we can
-prioritize which plants to map. The larger the capacity, the more important it is to get
-it mapped. Sort the records by capacity so the highest priority records at the top.
+The ``missing_plant_id_in_plants_ferc1/eia.csv`` files should display basic plant
+information such as the facility name, utility name, and capacity. We show capacity here
+so that we can prioritize which plants to map. The larger the capacity, the more
+important it is to get it mapped. Sort the records by capacity so the highest priority
+records are at the top.
 
 From the FERC and EIA unmapped plants spreadsheets, copy the ``plant_id_eia`` (only in
 EIA), ``plant_name_ferc1/eia``, ``utility_id_ferc1/eia``, and ``utility_name_ferc1/eia``
@@ -160,7 +164,8 @@ plant name string (e.g. for ``chenango solar``, you could search for ``chen``,
 or ``chenan``). Searching the entire plant tab helps find other records within
 both FERC and EIA that may be the same or part of the same facility. Searching
 for a piece can help catch misspellings in the plant name, which are more common
-in the FERC records.
+in the FERC records. Use the ``devtools/pudl_id_mapping_help.ipynb`` notebook to speed
+up this process.
 
     * **If co-located EIA plants have distinct plant IDs and no FERC 1 plant:**
         they should not be lumped under a single PUDL Plant ID, as that artificially
@@ -202,17 +207,16 @@ Linking FERC1-EIA Records
     The following section needs to be updated to include new steps for mapping FERC1
     XBRL utilities with DBF utilities.
 
-Copy the information output to the ``unmapped_utils_eia/ferc1.csv`` files and paste it
-in the appropriate columns at the bottom of the ``pudl_id_mapping.xlsx``  sheet. Note
-that FERC 1 utility information goes in the left-hand columns and EIA utility
-information goes in the right-hand columns.
+Copy the information output to the ``missing_utility_id_in_utilities_eia/ferc1.csv``
+files and paste it in the appropriate columns at the bottom of the
+``utility_id_pudl.csv`` sheet. Note that FERC 1 utility information goes in the
+left-hand columns and EIA utility information goes in the right-hand columns.
 
 Next, you'll have to manually assign ``utility_id_pudl`` values to each row. There is no
 formula you can drag down, so just find the largest ``utility_id_pudl`` and create new
 values incrementing from there. To double check whether a utility has already appeared,
-drag down the formulas in the ``check_utility_id_ferc1`` and ``check_utility_id_eia``
-columns. If there's a match, the correct ``utility_id_pudl`` will show up in the column,
-and you can create a reference to the original ``utility_id_pudl`` assignment above.
+search by name and, if you find a match, create a reference to the original
+``utility_id_pudl`` assignment above.
 
 Make sure to save the file when you're done!
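
The manual ID assignment described above (reuse the existing ``utility_id_pudl`` when a utility has already appeared, otherwise increment from the largest ID) could be sketched as follows. This is a minimal illustration, not PUDL code: the function name, the plain-dict inputs, and the exact-match comparison on lowercased names are all assumptions, and the real process relies on human judgment when searching by name.

```python
def assign_utility_ids(mapped: dict[str, int], new_names: list[str]) -> dict[str, int]:
    """Return {name: utility_id_pudl} for new_names.

    Hypothetical helper: names already present in ``mapped`` (compared after
    stripping whitespace and lowercasing) reuse their ID; all other names get
    new IDs counting up from the largest existing ID.
    """
    known = {name.strip().lower(): pudl_id for name, pudl_id in mapped.items()}
    next_id = max(known.values(), default=0) + 1
    assigned = {}
    for name in new_names:
        key = name.strip().lower()
        if key not in known:
            known[key] = next_id
            next_id += 1
        assigned[name] = known[key]
    return assigned


# Made-up utility names, for illustration only.
existing = {"Alpha Power": 1, "Beta Electric": 7}
unmapped = ["beta electric ", "Gamma Coop", "Delta Light"]
print(assign_utility_ids(existing, unmapped))
```

Exact matching on normalized names will miss misspellings and renamed utilities, so a sketch like this can only pre-fill candidates that still need a manual review.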
 
@@ -235,5 +239,5 @@ Integrating Newly Mapped Records into PUDL
 Once you’ve successfully mapped all unmapped PUDL IDs, you’ll want to rerun the ETL!
 This ensures that the newly mapped IDs get integrated into the PUDL database and output
 tables that folks are using. Make sure to tell everyone else to do so as well so that
-you can all use the newly mapped PUDL IDs. But furst, make sure to head back to the
+you can all use the newly mapped PUDL IDs. But first, make sure to head back to the
 :doc:`existing_data_updates` page to wrap up the validation tests!
diff --git a/docs/release_notes.rst b/docs/release_notes.rst
@@ -9,6 +9,10 @@ v2024.X.x (2024-XX-XX)
 New Data Coverage
 ^^^^^^^^^^^^^^^^^
 
+EIA 860
+~~~~~~~
+* Added EIA 860 final release data from 2023. See :issue:`3684` and PR :pr:`3871`.
+
 FERC Form 714
 ~~~~~~~~~~~~~
 * Integrate 2021-2023 years of the FERC Form 714 data. FERC updated its reporting

diff --git a/src/pudl/output/eia.py b/src/pudl/output/eia.py
@@ -145,18 +145,16 @@ def _out_eia__yearly_generators(
     )
 
     # If any generator data is completely empty, drop it.
-    # These are five known generators that originate from harvesting the plant and
+    # These are three known generators that originate from harvesting the plant and
     # generator IDs found in the plant_id_eia_direct_support_x and
     # generator_id_direct_support_x in EIA 860 energy storage tables, in
     # order to enable foreign key relationships with these columns.
     # They do not show up in any other tables and thus lack data in all columns.
     # For more, see issue #3695 and PR #3699.
     empty_generator_ids = [
         (9170, "3093", "2023-01-01"),
-        (18170, "B8170", "2023-01-01"),
         (34516, "SOL1", "2023-01-01"),
         (64966, "GEN1", "2023-01-01"),
-        (60321, "PV1", "2023-01-01"),
     ]
     out_df = drop_all_null_records_with_multiindex(
         out_df, ["plant_id_eia", "generator_id", "report_date"], empty_generator_ids