Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract VCE renewable generation profiles and remove deprecated gsutil from workflows #3893

Merged
merged 26 commits into from
Oct 17, 2024

Conversation

e-belfer
Copy link
Member

@e-belfer e-belfer commented Oct 2, 2024

Overview

Closes #3880.

What problem does this address?

Extracts VCE renewable generation profile data into raw dagster assets. Also fixes a nightly build failure due to the deprecation of gsutil (IAM permissions were updated manually and need to be added in terraform, but this is out of scope for this PR).

What did you change?

  • Add a module for vcerare extraction and specify valid run settings
  • Add vcerare to the fast and full ETL runs
  • Swap gsutil with gcloud storage

To-do list

Preview Give feedback

Testing

How did you make sure this worked? How can a reviewer verify this?

To-do list

Preview Give feedback

@e-belfer e-belfer self-assigned this Oct 2, 2024
@e-belfer e-belfer added the vcerare Pertaining to Vibrant Clean Energy's Resource Adequacy Renewable Energy Power datasets label Oct 2, 2024
@e-belfer e-belfer linked an issue Oct 2, 2024 that may be closed by this pull request
@aesharpe aesharpe added the new-data Requests for integration of new data. label Oct 3, 2024
Base automatically changed from vceregen-metadata to main October 3, 2024 19:32
@e-belfer e-belfer marked this pull request as ready for review October 16, 2024 13:46
@e-belfer e-belfer requested a review from aesharpe October 16, 2024 13:46
@e-belfer e-belfer requested a review from jdangerx October 16, 2024 18:50
@e-belfer e-belfer added bug Things that are just plain broken. nightly-builds Anything having to do with nightly builds or continuous deployment. labels Oct 16, 2024
@e-belfer e-belfer changed the title Extract VCE renewable generation profiles Extract VCE renewable generation profiles and remove deprecated gsutil from workflows Oct 16, 2024
Copy link
Member

@jdangerx jdangerx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The gsutil/gcloud storage changes look good to me!

@@ -15,7 +15,7 @@ RUN apt-get update && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*

# Configure gsutil authentication
# Configure gcloud authentication
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sort of surprised this is needed, it's writing to boto.cfg which sounds like it's for AWS. But no need to mess with that now I guess.

@@ -298,13 +298,13 @@ if [[ $ETL_SUCCESS == 0 ]]; then
# If running a tagged release, ensure that outputs can't be accidentally deleted
# It's not clear that an object lock can be applied in S3 with the AWS CLI
if [[ "$GITHUB_ACTION_TRIGGER" == "push" && "$BUILD_REF" == v20* ]]; then
gsutil -m -u catalyst-cooperative-pudl retention temp set "gs://pudl.catalyst.coop/$BUILD_REF/*" 2>&1 | tee -a "$LOGFILE"
gcloud storage --billing-project="catalyst-cooperative-pudl" objects update "gs://pudl.catalyst.coop/$BUILD_REF/*" --temporary-hold 2>&1 | tee -a "$LOGFILE"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: seems like we could use "$GCP_BILLING_PROJECT" instead of "catalyst-cooperative-pudl" here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree!

Copy link
Member

@aesharpe aesharpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good! a few comments about format etc.

I have one overarching question about the dataset name. Technically it is called the "VCE RARE Power Dataset" but we refer to it here as the "VCE RARE generation profiles". I'm not sure how important the last few words after VCE RARE are...I like the way we currently have it because "generation profiles" feels more descriptive than "power dataset" but maybe we should just use the official name. I don't feel strongly about this, just wanted to flag.

Comment on lines +64 to +73
def _load_column_maps(self, column_map_pkg: str) -> dict:
"""Create a dictionary of all column mapping CSVs to use in get_column_map()."""
column_dict = {}
for res_path in importlib.resources.files(column_map_pkg).iterdir():
# res_path is expected to end with ${page}.csv
if res_path.suffix == ".csv":
column_map = self._load_csv(column_map_pkg, res_path.name)
column_dict[res_path.stem] = column_map
return column_dict

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this just a formality seeing as we don't have any column maps for this dataset?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is moving the existing load step into a separate method in the base class, so that i'm able to subclass and change it in the VCE extractor.

Comment on lines 3 to 5
This dataset has 1,000s of columns, so we don't want to manually specify a rename on
import because we'll pivot these to a column. We adapt the standard extraction
infrastructure to simply read in the data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add "in the PUDL pipeline" after "pivot these to a column"

Comment on lines 12 to 14
The drive also contains one more file: vce_county_lat_long_fips_table.csv. This file is
not partitioned, so we always read it in regardless of the partitions configured for the
run.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one more CSV file (because there is also the readme)



def raw_vcerare_asset_factory(part: str) -> AssetsDefinition:
"""An asset factory for VCE hourly renewable generation profiles."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same name comment


@asset(**asset_kwargs)
def _extract(context, raw_vcerare__all_dfs):
"""Extract raw GridPath RA Toolkit renewable energy generation profiles.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like an accidental copy-over from the gridpath extractor.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure is, good catch!



class VCEMetadata(GenericMetadata):
"""Special metadata class for VCE RARE renewable generation profiles."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the RE in rare stands for renewable energy so I think we can take out the word renewable here



class Extractor(CsvExtractor):
"""Extractor for VCE renewable generation profiles."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should call it by RARE name here

@@ -401,6 +401,19 @@ class EiaAeoSettings(GenericDatasetSettings):
years: list[int] = data_source.working_partitions["years"]


class VCERARESettings(GenericDatasetSettings):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the way we've named the other sources is with Pascal case (VceRareSettings)

@@ -0,0 +1 @@
"""CSV file extraction maps for VCE RARE generation profiles."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note for consistency if you want to change this to say "Power Dataset" to match the others.

Comment on lines 584 to +587
phmsagas: PhmsaGasSettings | None = None
nrelatb: NrelAtbSettings | None = None
gridpathratoolkit: GridPathRAToolkitSettings | None = None
vcerare: VCERareSettings | None = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another nit, but I think even the VCE can be lowercased to VceRare... to match the others.



@asset(required_resource_keys={"datastore", "dataset_settings"})
def raw_vcegen__lat_lon_fips(context) -> pd.DataFrame:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be raw_vcerare__

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure should be, I'll update now

@e-belfer e-belfer enabled auto-merge October 17, 2024 12:25
@e-belfer e-belfer added this pull request to the merge queue Oct 17, 2024
Merged via the queue into main with commit 40dd5cc Oct 17, 2024
19 checks passed
@e-belfer e-belfer deleted the extract-vceregen branch October 17, 2024 13:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Things that are just plain broken. new-data Requests for integration of new data. nightly-builds Anything having to do with nightly builds or continuous deployment. vcerare Pertaining to Vibrant Clean Energy's Resource Adequacy Renewable Energy Power datasets
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Write an extractor for the vceregen data
4 participants