-
-
Notifications
You must be signed in to change notification settings - Fork 119
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extract VCE renewable generation profiles and remove deprecated gsutil
from workflows
#3893
Conversation
…ative/pudl into extract-vceregen
For more information, see https://pre-commit.ci
gsutil
from workflows
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The gsutil
/gcloud storage
changes look good to me!
@@ -15,7 +15,7 @@ RUN apt-get update && \ | |||
apt-get clean && \ | |||
rm -rf /var/lib/apt/lists/* | |||
|
|||
# Configure gsutil authentication | |||
# Configure gcloud authentication |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sort of surprised this is needed, it's writing to boto.cfg
which sounds like it's for AWS. But no need to mess with that now I guess.
docker/gcp_pudl_etl.sh
Outdated
@@ -298,13 +298,13 @@ if [[ $ETL_SUCCESS == 0 ]]; then | |||
# If running a tagged release, ensure that outputs can't be accidentally deleted | |||
# It's not clear that an object lock can be applied in S3 with the AWS CLI | |||
if [[ "$GITHUB_ACTION_TRIGGER" == "push" && "$BUILD_REF" == v20* ]]; then | |||
gsutil -m -u catalyst-cooperative-pudl retention temp set "gs://pudl.catalyst.coop/$BUILD_REF/*" 2>&1 | tee -a "$LOGFILE" | |||
gcloud storage --billing-project="catalyst-cooperative-pudl" objects update "gs://pudl.catalyst.coop/$BUILD_REF/*" --temporary-hold 2>&1 | tee -a "$LOGFILE" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: seems like we could use "$GCP_BILLING_PROJECT"
instead of "catalyst-cooperative-pudl"
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good! a few comments about format etc.
I have one overarching question about the dataset name. Technically it is called the "VCE RARE Power Dataset" but we refer to it here as the "VCE RARE generation profiles". I'm not sure how important the last few words after VCE RARE are...I like the way we currently have it because "generation profiles" feels more descriptive than "power dataset" but maybe we should just use the official name. I don't feel strongly about this, just wanted to flag.
def _load_column_maps(self, column_map_pkg: str) -> dict: | ||
"""Create a dictionary of all column mapping CSVs to use in get_column_map().""" | ||
column_dict = {} | ||
for res_path in importlib.resources.files(column_map_pkg).iterdir(): | ||
# res_path is expected to end with ${page}.csv | ||
if res_path.suffix == ".csv": | ||
column_map = self._load_csv(column_map_pkg, res_path.name) | ||
column_dict[res_path.stem] = column_map | ||
return column_dict | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this just a formality seeing as we don't have any column maps for this dataset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is moving the existing load step into a separate method in the base class, so that i'm able to subclass and change it in the VCE extractor.
src/pudl/extract/vcerare.py
Outdated
This dataset has 1,000s of columns, so we don't want to manually specify a rename on | ||
import because we'll pivot these to a column. We adapt the standard extraction | ||
infrastructure to simply read in the data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add "in the PUDL pipeline" after "pivot these to a column"
src/pudl/extract/vcerare.py
Outdated
The drive also contains one more file: vce_county_lat_long_fips_table.csv. This file is | ||
not partitioned, so we always read it in regardless of the partitions configured for the | ||
run. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one more CSV file (because there is also the readme)
src/pudl/extract/vcerare.py
Outdated
|
||
|
||
def raw_vcerare_asset_factory(part: str) -> AssetsDefinition: | ||
"""An asset factory for VCE hourly renewable generation profiles.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same name comment
src/pudl/extract/vcerare.py
Outdated
|
||
@asset(**asset_kwargs) | ||
def _extract(context, raw_vcerare__all_dfs): | ||
"""Extract raw GridPath RA Toolkit renewable energy generation profiles. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like an accidental copy-over from the gridpath extractor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure is, good catch!
src/pudl/extract/vcerare.py
Outdated
|
||
|
||
class VCEMetadata(GenericMetadata): | ||
"""Special metadata class for VCE RARE renewable generation profiles.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the RE in rare stands for renewable energy so I think we can take out the word renewable here
src/pudl/extract/vcerare.py
Outdated
|
||
|
||
class Extractor(CsvExtractor): | ||
"""Extractor for VCE renewable generation profiles.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should call it by RARE name here
src/pudl/settings.py
Outdated
@@ -401,6 +401,19 @@ class EiaAeoSettings(GenericDatasetSettings): | |||
years: list[int] = data_source.working_partitions["years"] | |||
|
|||
|
|||
class VCERARESettings(GenericDatasetSettings): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the way we've named the other sources is with Pascal case (VceRareSettings)
@@ -0,0 +1 @@ | |||
"""CSV file extraction maps for VCE RARE generation profiles.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a note for consistency if you want to change this to say "Power Dataset" to match the others.
phmsagas: PhmsaGasSettings | None = None | ||
nrelatb: NrelAtbSettings | None = None | ||
gridpathratoolkit: GridPathRAToolkitSettings | None = None | ||
vcerare: VCERareSettings | None = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another nit, but I think even the VCE can be lowercased to VceRare... to match the others.
src/pudl/extract/vcerare.py
Outdated
|
||
|
||
@asset(required_resource_keys={"datastore", "dataset_settings"}) | ||
def raw_vcegen__lat_lon_fips(context) -> pd.DataFrame: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be raw_vcerare__
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure should be, I'll update now
Overview
Closes #3880.
What problem does this address?
Extracts VCE renewable generation profile data into raw dagster assets. Also fixes a nightly build failure due to the deprecation of
gsutil
(IAM permissions were updated manually and need to be added in terraform, but this is out of scope for this PR).What did you change?
vcerare
extraction and specify valid run settingsvcerare
to the fast and full ETL runsgsutil
withgcloud storage
To-do list
Testing
How did you make sure this worked? How can a reviewer verify this?
To-do list