
Archive EPA PCAP data #544

Merged
merged 11 commits into main from epapcap on Jan 28, 2025
Conversation

@e-belfer (Member) commented on Jan 24, 2025

Overview

Closes #524. A takeover of #541.

What problem does this address?
Implements metadata and archiver for EPA PCAP.

What did you change in this PR?
Added an archiver that downloads the XLSX and PDF files into a zipped archive, and metadata for the dataset.
Added the new data to the YAML file.
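
At a high level, the archiver follows the pattern visible in the diff fragments quoted in the review threads below: collect hyperlinks matching a filename pattern, then download each file into a single zip. A simplified sketch using the helper names from the diff; the method name, BASE_URL value, and download_directory attribute here are illustrative, not the exact merged code:

    import re

    # Placeholder for the EPA PCAP landing page scraped by the archiver.
    BASE_URL = "https://www.epa.gov/"

    excel_pattern = re.compile(r"\.xlsx")
    pdf_pattern = re.compile(r"\.pdf")

    async def get_resources(self):
        # Stream every matching XLSX and PDF into one zip archive,
        # tracking which paths have already been added.
        zip_path = self.download_directory / "epapcap.zip"
        data_paths_in_archive = set()
        for pattern in (excel_pattern, pdf_pattern):
            for link in await self.get_hyperlinks(BASE_URL, pattern):
                await self.download_helper(link, zip_path, data_paths_in_archive)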

When trying to make a sandbox archive, I get an HTTP 413 error ("The data value transmitted exceeds the capacity limit.") that doesn't appear when I make a production archive. I emailed Zenodo support on 1/24.

Testing

How did you make sure this worked? How can a reviewer verify this?
See the draft production archive: https://zenodo.org/uploads/14735667 (must be logged in to view).


@e-belfer e-belfer self-assigned this Jan 24, 2025
@e-belfer e-belfer changed the title Epapcap Archive EPA PCAP data Jan 24, 2025
@e-belfer e-belfer requested a review from zaneselvans January 24, 2025 18:58
@e-belfer (Member, Author) commented on Jan 24, 2025

@zaneselvans I've already reviewed this in the course of taking it over, but can't review my own PR, so I'm tagging you for a second look! I'm assuming it'll take some time to hear back from Zenodo, so I could either merge this without a sandbox archive or manually upload the files from the production draft to the sandbox to give us an archive to work against.

@e-belfer e-belfer marked this pull request as ready for review January 24, 2025 20:00
@zaneselvans (Member) left a comment:
Minor non-blocking naming stuff, if you care to update.

I could imagine wanting to partition this by type of jurisdiction -- state, msa, tribe -- but at least all the data is available! Seems like it could be complicated to connect the URLs in the spreadsheets to the appropriate archived PDFs, but maybe the name munging is straightforward?

"disadvantaged communities (LIDACs), and other PCAP elements."
),
"working_partitions": {},
"keywords": sorted({"emissions", "ghg", "epa", "pcap", "cprg", "emissions"}),
@zaneselvans (Member) commented:

Remove dupe of "emissions"

Suggested change
"keywords": sorted({"emissions", "ghg", "epa", "pcap", "cprg", "emissions"}),
"keywords": sorted({"emissions", "ghg", "epa", "pcap", "cprg"}),

    for link in await self.get_hyperlinks(BASE_URL, excel_pattern):
        await self.download_helper(link, zip_path, data_paths_in_archive)

    # Download all PDFs from each searchable table
@zaneselvans (Member) commented:

Wow, great that this was relatively straightforward. I was worried it would be a huge pain.

@e-belfer (Member, Author) replied:

All @nilaykumar's fine handiwork! 🚀
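
For context on the loop quoted above: get_hyperlinks is the archiver helper that scrapes links matching a regex from a page. A rough illustrative stand-in, not the actual pudl-archiver implementation, might look like:

    import re

    import aiohttp
    from bs4 import BeautifulSoup

    async def get_hyperlinks(url: str, pattern: re.Pattern) -> set[str]:
        # Fetch the page and collect every <a href> whose target matches the regex.
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                html = await resp.text()
        soup = BeautifulSoup(html, "html.parser")
        return {
            a["href"]
            for a in soup.find_all("a", href=True)
            if pattern.search(a["href"])
        }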

@@ -398,4 +398,22 @@
        "license_pudl": LICENSES["cc-by-4.0"],
        "contributors": [CONTRIBUTORS["catalyst-cooperative"]],
    },
    "epapcap": {
        "title": "EPA -- Priority Climate Action Plan",
@zaneselvans (Member) commented:

In the past we've put the abbreviated name before the double-dash and the expanded name after it. Are the abbreviations intentionally being left out? They appear in some of the new dataset titles but are missing from others.

Suggested change
"title": "EPA -- Priority Climate Action Plan",
"title": "EPA PCAP -- Priority Climate Action Plan",

@e-belfer (Member, Author) commented on Jan 24, 2025

> I could imagine wanting to partition this by type of jurisdiction -- state, msa, tribe -- but at least all the data is available! Seems like it could be complicated to connect the URLs in the spreadsheets to the appropriate archived PDFs, but maybe the name munging is straightforward?

I agree, but we can always repartition later. It's also a bit tricky because the PDFs won't match these jurisdictional partitions 1:1; as Nilay noted, some PDFs appear in more than one spreadsheet.

@nilaykumar (Contributor) commented:
Yeah, it looked like some jurisdictions had just one file for the multiple categories of data, so it seemed easiest to just throw everything in one bucket.
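
If the dataset were repartitioned along those lines later, the metadata might grow a jurisdiction partition. A purely hypothetical sketch; the merged PR deliberately ships an empty "working_partitions": {} and one undifferentiated bucket:

    # Hypothetical future shape, not part of this PR:
    working_partitions = {"jurisdiction": ["state", "msa", "tribe"]}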

    prefix = "https://www.epa.gov"
    if not link.startswith("http"):
        link = prefix + link
        await self.download_helper(link, zip_path, data_paths_in_archive)
@nilaykumar (Contributor) commented:

I just realized: this call to the download helper looks like it should be dedented out of the if statement --- just in case there are links in the second and third searchable tables that are already absolute URLs, we don't want to skip those.

@e-belfer (Member, Author) replied:

Ah yes that's a great catch, I'll go ahead and fix it!
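
A sketch of the fix being described, using the helper names from the diff above (the exact merged code may differ): the download call moves out of the if body so that already-absolute URLs get downloaded too.

    prefix = "https://www.epa.gov"
    # Only relative links need the EPA prefix prepended...
    if not link.startswith("http"):
        link = prefix + link
    # ...but every link, absolute or relative, should be downloaded.
    await self.download_helper(link, zip_path, data_paths_in_archive)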

@zaneselvans (Member) commented:

@e-belfer I don't know if this is a problem, or expected behavior, but in the logs I saw:

Warning: No files were found with the provided path: epapcap_run_summary.json. No artifacts will be uploaded.

And I also saw some wildcards in the artifact upload/download paths in the workflow file; IIRC there was a change a few months ago to GitHub Actions where they stopped allowing wildcards for security reasons, though I'm not immediately finding the blog post where they mentioned it, or the PUDL issue where it came up so frustratingly when the GHA updated.

@e-belfer (Member, Author) commented on Jan 28, 2025

> @e-belfer I don't know if this is a problem, or expected behavior, but in the logs I saw:
>
> Warning: No files were found with the provided path: epapcap_run_summary.json. No artifacts will be uploaded.
>
> And I also saw some wildcards in the artifact upload/download paths in the workflow file; IIRC there was a change a few months ago to GitHub Actions where they stopped allowing wildcards for security reasons, though I'm not immediately finding the blog post where they mentioned it, or the PUDL issue where it came up so frustratingly when the GHA updated.

There are workflow files successfully attached to the last run I completed - where did you see this?

@e-belfer e-belfer merged commit bdbfa3e into main Jan 28, 2025
3 checks passed
@e-belfer e-belfer deleted the epapcap branch January 28, 2025 21:32
@e-belfer e-belfer added the epapcap EPA Priority Climate Action Plans data label Feb 4, 2025
Labels: epapcap (EPA Priority Climate Action Plans data), new-data
Projects: Status: Done
Development: Successfully merging this pull request may close these issues:
    Write an archiver for the EPA Priority Climate Action Plan Directory
3 participants