
Archive EPA PCAP data #544

Merged
merged 11 commits into main from epapcap on Jan 28, 2025
Conversation

@e-belfer (Member) commented on Jan 24, 2025

Overview

Closes #524. A takeover of #541.

What problem does this address?
Implements metadata and archiver for EPA PCAP.

What did you change in this PR?
Added an archiver that downloads the XLSX and PDF files into a zipped archive, and metadata for the dataset.
Added the new data to the YAML file.
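
At a high level, the archiver follows the pattern visible in the diff fragments quoted in the review threads below: collect hyperlinks matching a filename pattern, then download each file into a single zip. A simplified sketch using the helper names from the diff; the method name, BASE_URL value, and download_directory attribute here are illustrative, not the exact merged code:

    import re

    # Placeholder for the EPA PCAP landing page scraped by the archiver.
    BASE_URL = "https://www.epa.gov/"

    excel_pattern = re.compile(r"\.xlsx")
    pdf_pattern = re.compile(r"\.pdf")

    async def get_resources(self):
        # Stream every matching XLSX and PDF into one zip archive,
        # tracking which paths have already been added.
        zip_path = self.download_directory / "epapcap.zip"
        data_paths_in_archive = set()
        for pattern in (excel_pattern, pdf_pattern):
            for link in await self.get_hyperlinks(BASE_URL, pattern):
                await self.download_helper(link, zip_path, data_paths_in_archive)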

When trying to make a sandbox archive, I get an HTTP 413 error ("The data value transmitted exceeds the capacity limit.") that doesn't appear when I make a production archive. I emailed Zenodo support on 1/24.

Testing

How did you make sure this worked? How can a reviewer verify this?
See the draft production archive: https://zenodo.org/uploads/14735667 (must be logged in to view).


@e-belfer e-belfer self-assigned this Jan 24, 2025
@e-belfer e-belfer changed the title Epapcap Archive EPA PCAP data Jan 24, 2025
@e-belfer e-belfer requested a review from zaneselvans January 24, 2025 18:58
@e-belfer (Member, Author) commented on Jan 24, 2025

@zaneselvans I've already reviewed this in the course of taking it over, but can't review my own PR, so I'm tagging you for a second look! I'm assuming it'll take some time to hear back from Zenodo, so I could either merge this without a sandbox archive or manually upload the files from the production draft to the sandbox to give us an archive to work against.

@e-belfer e-belfer marked this pull request as ready for review January 24, 2025 20:00
@zaneselvans (Member) left a comment:
Minor non-blocking naming stuff, if you care to update.

I could imagine wanting to partition this by type of jurisdiction -- state, msa, tribe -- but at least all the data is available! Seems like it could be complicated to connect the URLs in the spreadsheets to the appropriate archived PDFs, but maybe the name munging is straightforward?

"disadvantaged communities (LIDACs), and other PCAP elements."
),
"working_partitions": {},
"keywords": sorted({"emissions", "ghg", "epa", "pcap", "cprg", "emissions"}),
@zaneselvans (Member) commented:

Remove dupe of "emissions"

Suggested change
"keywords": sorted({"emissions", "ghg", "epa", "pcap", "cprg", "emissions"}),
"keywords": sorted({"emissions", "ghg", "epa", "pcap", "cprg"}),

    for link in await self.get_hyperlinks(BASE_URL, excel_pattern):
        await self.download_helper(link, zip_path, data_paths_in_archive)

    # Download all PDFs from each searchable table
@zaneselvans (Member) commented:

Wow, great that this was relatively straightforward. I was worried it would be a huge pain.

@e-belfer (Member, Author) replied:

All @nilaykumar's fine handiwork! 🚀
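
For context on the loop quoted above: get_hyperlinks is the archiver helper that scrapes links matching a regex from a page. A rough illustrative stand-in, not the actual pudl-archiver implementation, might look like:

    import re

    import aiohttp
    from bs4 import BeautifulSoup

    async def get_hyperlinks(url: str, pattern: re.Pattern) -> set[str]:
        # Fetch the page and collect every <a href> whose target matches the regex.
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                html = await resp.text()
        soup = BeautifulSoup(html, "html.parser")
        return {
            a["href"]
            for a in soup.find_all("a", href=True)
            if pattern.search(a["href"])
        }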

@@ -398,4 +398,22 @@
        "license_pudl": LICENSES["cc-by-4.0"],
        "contributors": [CONTRIBUTORS["catalyst-cooperative"]],
    },
    "epapcap": {
        "title": "EPA -- Priority Climate Action Plan",
@zaneselvans (Member) commented:

In the past we've put the abbreviated name before the double-dash and the expanded name after it. Are the abbreviations intentionally being left out? They appear in some of the new dataset titles but are missing from others.

Suggested change
"title": "EPA -- Priority Climate Action Plan",
"title": "EPA PCAP -- Priority Climate Action Plan",

@e-belfer (Member, Author) commented on Jan 24, 2025

> I could imagine wanting to partition this by type of jurisdiction -- state, msa, tribe -- but at least all the data is available! Seems like it could be complicated to connect the URLs in the spreadsheets to the appropriate archived PDFs, but maybe the name munging is straightforward?

I agree, but we can always repartition later. It's also a bit tricky because the PDFs won't match these jurisdictional partitions 1:1; as Nilay noted, some PDFs appear in more than one spreadsheet.

@nilaykumar (Contributor) commented:
Yeah, it looked like some jurisdictions had just one file for the multiple categories of data, so it seemed easiest to just throw everything in one bucket.
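
If the dataset were repartitioned along those lines later, the metadata might grow a jurisdiction partition. A purely hypothetical sketch; the merged PR deliberately ships an empty "working_partitions": {} and one undifferentiated bucket:

    # Hypothetical future shape, not part of this PR:
    working_partitions = {"jurisdiction": ["state", "msa", "tribe"]}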

    prefix = "https://www.epa.gov"
    if not link.startswith("http"):
        link = prefix + link
        await self.download_helper(link, zip_path, data_paths_in_archive)
@nilaykumar (Contributor) commented:

I just realized: this call to the download helper looks like it should be dedented out of the if statement --- just in case there are links in the second and third searchable tables that are already absolute URLs, we don't want to skip those.

@e-belfer (Member, Author) replied:

Ah yes that's a great catch, I'll go ahead and fix it!
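
A sketch of the fix being described, using the helper names from the diff above (the exact merged code may differ): the download call moves out of the if body so that already-absolute URLs get downloaded too.

    prefix = "https://www.epa.gov"
    # Only relative links need the EPA prefix prepended...
    if not link.startswith("http"):
        link = prefix + link
    # ...but every link, absolute or relative, should be downloaded.
    await self.download_helper(link, zip_path, data_paths_in_archive)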

@zaneselvans (Member) commented:

@e-belfer I don't know if this is a problem, or expected behavior, but in the logs I saw:

Warning: No files were found with the provided path: epapcap_run_summary.json. No artifacts will be uploaded.

And I also saw some wildcards in the artifact upload/download paths in the workflow file; IIRC there was a change a few months ago to GitHub Actions where they stopped allowing wildcards for security reasons, though I'm not immediately finding the blog post where they mentioned it, or the PUDL issue where it came up so frustratingly when the GHA updated.

@e-belfer (Member, Author) commented on Jan 28, 2025

> @e-belfer I don't know if this is a problem, or expected behavior, but in the logs I saw:
>
> Warning: No files were found with the provided path: epapcap_run_summary.json. No artifacts will be uploaded.
>
> And I also saw some wildcards in the artifact upload/download paths in the workflow file; IIRC there was a change a few months ago to GitHub Actions where they stopped allowing wildcards for security reasons, though I'm not immediately finding the blog post where they mentioned it, or the PUDL issue where it came up so frustratingly when the GHA updated.

There are workflow files successfully attached to the last run I completed - where did you see this?

@e-belfer e-belfer merged commit bdbfa3e into main Jan 28, 2025
3 checks passed
@e-belfer e-belfer deleted the epapcap branch January 28, 2025 21:32
@e-belfer e-belfer added the epapcap EPA Priority Climate Action Plans data label Feb 4, 2025
Labels: epapcap (EPA Priority Climate Action Plans data), new-data
Projects: Status: Done
Development: Successfully merging this pull request may close these issues:
    Write an archiver for the EPA Priority Climate Action Plan Directory
3 participants