Archive EPA PCAP data #544
Conversation
@zaneselvans I've already reviewed this in the course of taking it over, but I can't review my own PR, so I'm tagging you for a second look! I'm assuming it'll take some time to hear back from Zenodo, so I could either merge this without a sandbox archive or manually upload the files from the production draft to the sandbox to give us an archive to work against.
Minor non-blocking naming stuff, if you care to update.
I could imagine wanting to partition this by type of jurisdiction -- state, MSA, tribe -- but at least all the data is available! It seems like it could be complicated to connect the URLs in the spreadsheets to the appropriate archived PDFs, but maybe the name munging is straightforward?
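If the archived PDFs keep the basename of their source URLs, the munging might reduce to stripping the URL path. A rough sketch, with the URL and helper name made up for illustration (not from this PR):

from pathlib import PurePosixPath
from urllib.parse import urlparse

def archived_pdf_name(url: str) -> str:
    # Hypothetical: assumes the archived file is named after the URL's basename.
    return PurePosixPath(urlparse(url).path).name

archived_pdf_name("https://www.epa.gov/system/files/documents/example-pcap.pdf")
# -> "example-pcap.pdf"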
"disadvantaged communities (LIDACs), and other PCAP elements." | ||
), | ||
"working_partitions": {}, | ||
"keywords": sorted({"emissions", "ghg", "epa", "pcap", "cprg", "emissions"}), |
Remove dupe of "emissions"
"keywords": sorted({"emissions", "ghg", "epa", "pcap", "cprg", "emissions"}), | |
"keywords": sorted({"emissions", "ghg", "epa", "pcap", "cprg"}), |
for link in await self.get_hyperlinks(BASE_URL, excel_pattern):
    await self.download_helper(link, zip_path, data_paths_in_archive)

# Download all PDFs from each searchable table
Wow, great that this was relatively straightforward. I was worried it would be a huge pain.
All @nilaykumar's fine handiwork! 🚀
@@ -398,4 +398,22 @@
        "license_pudl": LICENSES["cc-by-4.0"],
        "contributors": [CONTRIBUTORS["catalyst-cooperative"]],
    },
    "epapcap": {
        "title": "EPA -- Priority Climate Action Plan",
In the past we've put the abbreviated name before the double-dash and the expanded name after it. Are the abbreviations intentionally being left out? They appear to be present in some of the new dataset titles, but missing in others.
"title": "EPA -- Priority Climate Action Plan", | |
"title": "EPA PCAP -- Priority Climate Action Plan", |
I agree, but we can always repartition later. It's also a bit tricky because the PDFs won't match these jurisdictional partitions 1:1 -- as Nilay noted, some PDFs appear in more than one spreadsheet.
Yeah, it looked like some jurisdictions had just one file covering multiple categories of data, so it seemed easiest to just throw everything in one bucket.
prefix = "https://www.epa.gov"
if not link.startswith("http"):
    link = prefix + link
    await self.download_helper(link, zip_path, data_paths_in_archive)
I just realized: this call to the download helper looks like it should be dedented out of the if statement -- just in case there are links in the second and third searchable tables that are already absolute URLs, we don't want to skip those.
Ah yes that's a great catch, I'll go ahead and fix it!
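A minimal sketch of what the fixed block might look like inside the existing async download loop, reusing the names from the diff above (pdf_links here is a stand-in for however the table links are actually gathered):

prefix = "https://www.epa.gov"
for link in pdf_links:
    if not link.startswith("http"):
        link = prefix + link
    # Dedented out of the if: relative links get the prefix, absolute links
    # pass through unchanged, and both get downloaded.
    await self.download_helper(link, zip_path, data_paths_in_archive)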
@e-belfer I don't know if this is a problem, or expected behavior, but in the logs I saw:
And I also saw some wildcards in the artifact upload/download paths in the workflow file, and IIRC there was a change a few months ago to GitHub Actions where they stopped allowing wildcards for security reasons. Though I'm not immediately finding the blog post where they mentioned it, or the PUDL issue where it came up so frustratingly when the GHA updated.
There are workflow files successfully attached to the last run I completed -- where did you see this?
Overview
Closes #524. A takeover of #541.
What problem does this address?
Implements metadata and archiver for EPA PCAP.
What did you change in this PR?
Added a zipped download of the XLSX and PDF files, and metadata for the dataset.
Added the new data to the YAML file.
When trying to make a sandbox archive, I get a server 413 error ("The data value transmitted exceeds the capacity limit.") that doesn't appear when I make a production archive. I emailed Zenodo support on 1/24.
Testing
How did you make sure this worked? How can a reviewer verify this?
See the draft production archive: https://zenodo.org/uploads/14735667 (must be logged in to view).
To-do list
Tasks