Extract VCE renewable generation profiles and remove deprecated gsutil from workflows #3893

Merged: 26 commits, Oct 17, 2024

Commits
9fadfff  Add source metadata for vceregen (aesharpe, Sep 30, 2024)
4036479  Add profiles to vceregen dataset name (aesharpe, Sep 30, 2024)
d8b992e  Remove blank line in description (aesharpe, Sep 30, 2024)
82be4d4  Stash WIP of extraction (e-belfer, Oct 1, 2024)
e044e93  Extract VCE tables to raw dask dfs (e-belfer, Oct 2, 2024)
57ad9a7  Clean up warnings and restore EIA 176 (e-belfer, Oct 2, 2024)
b922328  Revert to pandas concatenation (e-belfer, Oct 2, 2024)
53934f2  Add latlonfips (e-belfer, Oct 2, 2024)
3677f78  Remove dask, coerce dtypes on read-in (e-belfer, Oct 8, 2024)
c870b63  override load_column_maps behavior (e-belfer, Oct 8, 2024)
4a4511d  Merge branch 'main' into extract-vceregen (e-belfer, Oct 8, 2024)
b6b5e6c  Merge branch 'main' into extract-vceregen (e-belfer, Oct 11, 2024)
291ba7d  Update DOI to sandbox and temporarily xfail DOI test (e-belfer, Oct 11, 2024)
3eaebe6  Update regen to rare (e-belfer, Oct 16, 2024)
b324123  Merge branch 'main' into extract-vceregen (e-belfer, Oct 16, 2024)
120451d  Merge branch 'extract-vceregen' of https://github.com/catalyst-cooper… (e-belfer, Oct 16, 2024)
5b98e60  [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot], Oct 16, 2024)
9f6204e  Merge branch 'main' into extract-vceregen (e-belfer, Oct 16, 2024)
77d47a4  Update gsutil in zenodo-cache-sync (e-belfer, Oct 16, 2024)
adfff81  Merge branch 'extract-vceregen' of https://github.com/catalyst-cooper… (e-belfer, Oct 16, 2024)
6dea332  Add back user project (e-belfer, Oct 16, 2024)
7554a36  Update project path (e-belfer, Oct 16, 2024)
9ada9f5  Update project to billing project (e-belfer, Oct 16, 2024)
4158afd  Update dockerfile to replace gsutil with gcloud storage (e-belfer, Oct 16, 2024)
7f45bff  Update dataset name, use GCP_BILLING_PROJECT var (e-belfer, Oct 16, 2024)
e16a4df  Rename fips csv and update init (e-belfer, Oct 17, 2024)
2 changes: 1 addition & 1 deletion .github/workflows/zenodo-cache-sync.yml
@@ -66,7 +66,7 @@ jobs:

- name: Sync internal and public caches
run: |
gsutil -u catalyst-cooperative-pudl -m rsync -dr ${{ env.INTERNAL_ZENODO_CACHE_BUCKET }} ${{ env.PUBLIC_ZENODO_CACHE_BUCKET }}
gcloud storage --billing-project=catalyst-cooperative-pudl rsync -r ${{ env.INTERNAL_ZENODO_CACHE_BUCKET }} ${{ env.PUBLIC_ZENODO_CACHE_BUCKET }}

zenodo-cache-sync-notify:
runs-on: ubuntu-latest
2 changes: 1 addition & 1 deletion docker/Dockerfile
@@ -15,7 +15,7 @@ RUN apt-get update && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*

# Configure gsutil authentication
# Configure gcloud authentication
Member
Sort of surprised this is needed; it's writing to boto.cfg, which sounds like it's for AWS. But no need to mess with that now, I guess.

# hadolint ignore=DL3059
RUN printf '[GoogleCompute]\nservice_account = default' > /etc/boto.cfg

14 changes: 7 additions & 7 deletions docker/gcp_pudl_etl.sh
@@ -71,7 +71,7 @@ function run_pudl_etl() {

function save_outputs_to_gcs() {
echo "Copying outputs to GCP bucket $PUDL_GCS_OUTPUT" && \
gsutil -q -m cp -r "$PUDL_OUTPUT" "$PUDL_GCS_OUTPUT" && \
gcloud storage --quiet cp -r "$PUDL_OUTPUT" "$PUDL_GCS_OUTPUT" && \
rm -f "$PUDL_OUTPUT/success"
}

@@ -85,12 +85,12 @@ function upload_to_dist_path() {
# If the old outputs don't exist, these will exit with status 1, so we
# don't && them with the rest of the commands.
echo "Removing old outputs from $GCS_PATH."
gsutil -q -m -u "$GCP_BILLING_PROJECT" rm -r "$GCS_PATH"
gcloud storage --quiet --billing-project="$GCP_BILLING_PROJECT" rm -r "$GCS_PATH"
echo "Removing old outputs from $AWS_PATH."
aws s3 rm --quiet --recursive "$AWS_PATH"

echo "Copying outputs to $GCS_PATH:" && \
gsutil -q -m -u "$GCP_BILLING_PROJECT" cp -r "$PUDL_OUTPUT/*" "$GCS_PATH" && \
gcloud storage --quiet --billing-project="$GCP_BILLING_PROJECT" cp -r "$PUDL_OUTPUT/*" "$GCS_PATH" && \
echo "Copying outputs to $AWS_PATH" && \
aws s3 cp --quiet --recursive "$PUDL_OUTPUT/" "$AWS_PATH"
else
@@ -113,12 +113,12 @@ function distribute_parquet() {
DIST_PATH="$BUILD_REF"
fi
echo "Copying outputs to $PARQUET_BUCKET/$DIST_PATH" && \
gsutil -q -m -u "$GCP_BILLING_PROJECT" cp -r "$PUDL_OUTPUT/parquet/*" "$PARQUET_BUCKET/$DIST_PATH"
gcloud storage --quiet --billing-project="$GCP_BILLING_PROJECT" cp -r "$PUDL_OUTPUT/parquet/*" "$PARQUET_BUCKET/$DIST_PATH"

# If running a tagged release, ALSO update the stable distribution bucket path:
if [[ "$GITHUB_ACTION_TRIGGER" == "push" && "$BUILD_REF" == v20* ]]; then
echo "Copying outputs to $PARQUET_BUCKET/stable" && \
gsutil -q -m -u "$GCP_BILLING_PROJECT" cp -r "$PUDL_OUTPUT/parquet/*" "$PARQUET_BUCKET/stable"
gcloud storage --quiet --billing-project="$GCP_BILLING_PROJECT" cp -r "$PUDL_OUTPUT/parquet/*" "$PARQUET_BUCKET/stable"
fi
fi
}
@@ -298,13 +298,13 @@ if [[ $ETL_SUCCESS == 0 ]]; then
# If running a tagged release, ensure that outputs can't be accidentally deleted
# It's not clear that an object lock can be applied in S3 with the AWS CLI
if [[ "$GITHUB_ACTION_TRIGGER" == "push" && "$BUILD_REF" == v20* ]]; then
gsutil -m -u catalyst-cooperative-pudl retention temp set "gs://pudl.catalyst.coop/$BUILD_REF/*" 2>&1 | tee -a "$LOGFILE"
gcloud storage --billing-project="$GCP_BILLING_PROJECT" objects update "gs://pudl.catalyst.coop/$BUILD_REF/*" --temporary-hold 2>&1 | tee -a "$LOGFILE"
GCS_TEMPORARY_HOLD_SUCCESS=${PIPESTATUS[0]}
fi
fi

# This way we also save the logs from latter steps in the script
gsutil -q cp "$LOGFILE" "$PUDL_GCS_OUTPUT"
gcloud storage --quiet cp "$LOGFILE" "$PUDL_GCS_OUTPUT"

# Notify slack about entire pipeline's success or failure;
if [[ $ETL_SUCCESS == 0 && \
1 change: 1 addition & 0 deletions src/pudl/etl/__init__.py
@@ -59,6 +59,7 @@
"raw_gridpathratoolkit": [pudl.extract.gridpathratoolkit],
"raw_phmsagas": [pudl.extract.phmsagas],
"raw_nrelatb": [pudl.extract.nrelatb],
"raw_vcerare": [pudl.extract.vcerare],
}


1 change: 1 addition & 0 deletions src/pudl/extract/__init__.py
@@ -26,5 +26,6 @@
gridpathratoolkit,
nrelatb,
phmsagas,
vcerare,
xbrl,
)
17 changes: 11 additions & 6 deletions src/pudl/extract/extractor.py
@@ -49,12 +49,7 @@ def __init__(self, dataset_name: str):
self._dataset_name = dataset_name
self._pkg = f"pudl.package_data.{dataset_name}"
column_map_pkg = self._pkg + ".column_maps"
self._column_map = {}
for res_path in importlib.resources.files(column_map_pkg).iterdir():
# res_path is expected to end with ${page}.csv
if res_path.suffix == ".csv":
column_map = self._load_csv(column_map_pkg, res_path.name)
self._column_map[res_path.stem] = column_map
self._column_map = self._load_column_maps(column_map_pkg)

def get_dataset_name(self) -> str:
"""Returns the name of the dataset described by this metadata."""
@@ -66,6 +61,16 @@ def _load_csv(self, package: str, filename: str) -> pd.DataFrame:
importlib.resources.files(package) / filename, index_col=0, comment="#"
)

def _load_column_maps(self, column_map_pkg: str) -> dict:
"""Create a dictionary of all column mapping CSVs to use in get_column_map()."""
column_dict = {}
for res_path in importlib.resources.files(column_map_pkg).iterdir():
# res_path is expected to end with ${page}.csv
if res_path.suffix == ".csv":
column_map = self._load_csv(column_map_pkg, res_path.name)
column_dict[res_path.stem] = column_map
return column_dict

Comment on lines +64 to +73
Member
Is this just a formality, seeing as we don't have any column maps for this dataset?

Member Author
This is moving the existing load step into a separate method in the base class, so that I'm able to subclass and change it in the VCE extractor.
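
A minimal sketch of the override pattern being described here (illustrative names only, not the full PUDL classes):

```python
# Illustrative sketch of the refactor discussed above, not the actual PUDL
# classes: the base class builds its column maps in a helper method, and a
# dataset-specific subclass can override just that helper.
class GenericMetadataSketch:
    def __init__(self, dataset_name: str):
        column_map_pkg = f"pudl.package_data.{dataset_name}.column_maps"
        self._column_map = self._load_column_maps(column_map_pkg)

    def _load_column_maps(self, column_map_pkg: str) -> dict:
        # In the real class this iterates over the package's CSV files and
        # returns {page: DataFrame}; a placeholder stands in here.
        return {"example_page": "column map DataFrame goes here"}


class VCERareMetadataSketch(GenericMetadataSketch):
    def _load_column_maps(self, column_map_pkg: str) -> dict:
        # VCE RARE has no column maps, so the override returns an empty dict.
        return {}
```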

def _get_partition_selection(self, partition: dict[str, PartitionSelection]) -> str:
"""Grab the partition key."""
partition_names = list(partition.keys())
190 changes: 190 additions & 0 deletions src/pudl/extract/vcerare.py
@@ -0,0 +1,190 @@
"""Extract VCE Resource Adequacy Renewable Energy (RARE) Power Dataset.

This dataset has 1,000s of columns, so we don't want to manually specify a rename on
import because we'll pivot these to a column in the transform step. We adapt the
standard extraction infrastructure to simply read in the data.

Each annual zip folder contains a folder with three files:
Wind_Power_140m_Offshore_county.csv
Wind_Power_100m_Onshore_county.csv
Fixed_SolarPV_Lat_UPV_county.csv

The drive also contains one more CSV file: vce_county_lat_long_fips_table.csv. This gets
read in when the fips partition is set to True.
"""

from collections import defaultdict
from io import BytesIO

import numpy as np
import pandas as pd
from dagster import AssetsDefinition, Output, asset

from pudl import logging_helpers
from pudl.extract.csv import CsvExtractor
from pudl.extract.extractor import GenericMetadata, PartitionSelection, raw_df_factory

logger = logging_helpers.get_logger(__name__)

VCERARE_PAGES = [
"offshore_wind_power_140m",
"onshore_wind_power_100m",
"fixed_solar_pv_lat_upv",
]


class VCERareMetadata(GenericMetadata):
"""Special metadata class for VCE RARE Power Dataset."""

def __init__(self, *args, **kwargs):
"""Initialize the module.

Args:
ds (:class:datastore.Datastore): Initialized datastore.
"""
super().__init__(*args, **kwargs)
self._file_name = self._load_csv(self._pkg, "file_map.csv")

def _load_column_maps(self, column_map_pkg) -> dict:
"""There are no column maps to load, so return an empty dictionary."""
return {}

def get_all_pages(self) -> list[str]:
"""Hard code the page names, which usually are pulled from column rename spreadsheets."""
return VCERARE_PAGES

def get_file_name(self, page, **partition):
"""Returns file name of given partition and page."""
return self._file_name.loc[page, str(self._get_partition_selection(partition))]


class Extractor(CsvExtractor):
"""Extractor for VCE RARE Power Dataset."""

def __init__(self, *args, **kwargs):
"""Initialize the module.

Args:
ds (:class:datastore.Datastore): Initialized datastore.
"""
self.METADATA = VCERareMetadata("vcerare")
super().__init__(*args, **kwargs)

def get_column_map(self, page, **partition):
"""Return empty dictionary, we don't rename these files."""
return {}

def source_filename(self, page: str, **partition: PartitionSelection) -> str:
"""Produce the CSV file name as it will appear in the archive.

The files are nested in an additional folder with the year name inside of the
zipfile, so we add a prefix folder based on the yearly partition to the source
filename.

Args:
page: pudl name for the dataset contents, eg "boiler_generator_assn" or
"coal_stocks"
partition: partition to load. Examples:
{'year': 2009}
{'year_month': '2020-08'}

Returns:
string name of the CSV file
"""
return f"{partition['year']}/{self._metadata.get_file_name(page, **partition)}"

def load_source(self, page: str, **partition: PartitionSelection) -> pd.DataFrame:
"""Produce the dataframe object for the given partition.

Args:
page: pudl name for the dataset contents, eg "boiler_generator_assn" or
"data"
partition: partition to load. Examples:
{'year': 2009}
{'year_month': '2020-08'}

Returns:
pd.DataFrame instance containing CSV data
"""
with (
self.ds.get_zipfile_resource(self._dataset_name, **partition) as zf,
):
# Get list of file names in the zipfile
files = zf.namelist()
# Get the particular file of interest
file = next(
(x for x in files if self.source_filename(page, **partition) in x), None
)

# Read it in using pandas
# Set all dtypes except for the first unnamed hours column
# to be float32 to reduce memory on read-in
dtype_dict = defaultdict(lambda: np.float32)
dtype_dict["Unnamed: 0"] = (
"int" # Set first unnamed column (hours) to be an integer.
)

df = pd.read_csv(BytesIO(zf.read(file)), dtype=dtype_dict)

return df

def process_raw(
self, df: pd.DataFrame, page: str, **partition: PartitionSelection
) -> pd.DataFrame:
"""Append report year to df to distinguish data from other years."""
self.cols_added.append("report_year")
selection = self._metadata._get_partition_selection(partition)
return df.assign(report_year=selection)

def validate(
self, df: pd.DataFrame, page: str, **partition: PartitionSelection
) -> pd.DataFrame:
"""Skip this step, as we aren't renaming any columns."""
return df

def combine(self, dfs: list[pd.DataFrame], page: str) -> pd.DataFrame:
"""Concatenate dataframes into one, take any special steps for processing final page."""
df = pd.concat(dfs, sort=True, ignore_index=True)

return self.process_final_page(df, page)


raw_vcerare__all_dfs = raw_df_factory(Extractor, name="vcerare")


def raw_vcerare_asset_factory(part: str) -> AssetsDefinition:
"""An asset factory for VCE RARE Power Dataset."""
asset_kwargs = {
"name": f"raw_vcerare__{part}",
"required_resource_keys": {"datastore", "dataset_settings"},
}

@asset(**asset_kwargs)
def _extract(context, raw_vcerare__all_dfs):
"""Extract VCE RARE Power Dataset.

Args:
context: dagster keyword that provides access to resources and config.
"""
return Output(value=raw_vcerare__all_dfs[part])

return _extract


raw_vcerare_assets = [raw_vcerare_asset_factory(part) for part in VCERARE_PAGES]


@asset(required_resource_keys={"datastore", "dataset_settings"})
def raw_vcerare__lat_lon_fips(context) -> pd.DataFrame:
"""Extract lat/lon to FIPS and county mapping CSV.

This dataframe is static, so it has a distinct partition from the other datasets and
its extraction is controlled by a boolean in the ETL run.
"""
ds = context.resources.datastore
partition_settings = context.resources.dataset_settings.vcerare
if partition_settings.fips:
return pd.read_csv(
BytesIO(ds.get_unique_resource("vcerare", fips=partition_settings.fips))
)
return pd.DataFrame()
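
One detail of load_source() above worth highlighting is the dtype handling for these very wide CSVs; a self-contained sketch of the same trick on toy data (synthetic columns, not the real RARE files):

```python
# Self-contained illustration of the dtype strategy in load_source(): every
# column defaults to float32, except the unnamed hour-index column.
from collections import defaultdict
from io import StringIO

import numpy as np
import pandas as pd

toy_csv = ",county_a,county_b\n1,0.25,0.50\n2,0.75,1.00\n"  # synthetic stand-in

dtype_dict = defaultdict(lambda: np.float32)
dtype_dict["Unnamed: 0"] = "int"  # keep the hour-of-year column as an integer

df = pd.read_csv(StringIO(toy_csv), dtype=dtype_dict)
print(df.dtypes)  # Unnamed: 0 is integer; every other column is float32
```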
3 changes: 3 additions & 0 deletions src/pudl/package_data/settings/etl_fast.yml
@@ -97,3 +97,6 @@ datasets:
technology_types: ["wind"]
processing_levels: ["extended"]
daily_weather: true
vcerare:
years: [2023]
fips: True
3 changes: 3 additions & 0 deletions src/pudl/package_data/settings/etl_full.yml
@@ -360,3 +360,6 @@
technology_types: ["wind", "solar"]
processing_levels: ["extended"]
daily_weather: true
vcerare:
years: [2019, 2020, 2021, 2022, 2023]
fips: True
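
These YAML entries correspond to the VCERareSettings model added to src/pudl/settings.py further down; a rough usage sketch, assuming it validates like the other GenericDatasetSettings subclasses:

```python
# Hypothetical usage sketch; the field names follow the VCERareSettings model
# added in src/pudl/settings.py below, and the values mirror etl_full.yml.
from pudl.settings import VCERareSettings

vcerare_settings = VCERareSettings(
    years=[2019, 2020, 2021, 2022, 2023],
    fips=True,  # controls whether the lat/lon/FIPS CSV is extracted
)
print(vcerare_settings.years)
```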
1 change: 1 addition & 0 deletions src/pudl/package_data/vcerare/__init__.py
@@ -0,0 +1 @@
"""CSV file extraction maps for VCE RARE Power Dataset."""
4 changes: 4 additions & 0 deletions src/pudl/package_data/vcerare/file_map.csv
@@ -0,0 +1,4 @@
page,2019,2020,2021,2022,2023
offshore_wind_power_140m,Wind_Power_140m_Offshore_county.csv,Wind_Power_140m_Offshore_county.csv,Wind_Power_140m_Offshore_county.csv,Wind_Power_140m_Offshore_county.csv,Wind_Power_140m_Offshore_county.csv
onshore_wind_power_100m,Wind_Power_100m_Onshore_county.csv,Wind_Power_100m_Onshore_county.csv,Wind_Power_100m_Onshore_county.csv,Wind_Power_100m_Onshore_county.csv,Wind_Power_100m_Onshore_county.csv
fixed_solar_pv_lat_upv,Fixed_SolarPV_Lat_UPV_county.csv,Fixed_SolarPV_Lat_UPV_county.csv,Fixed_SolarPV_Lat_UPV_county.csv,Fixed_SolarPV_Lat_UPV_county.csv,Fixed_SolarPV_Lat_UPV_county.csv
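
A short sketch of how this mapping is consumed by VCERareMetadata.get_file_name() and Extractor.source_filename() above (toy reconstruction, only one page shown):

```python
# Toy reconstruction of the lookup: get_file_name() indexes file_map.csv by
# (page, year), and source_filename() prepends the year-named folder that
# wraps the CSVs inside each annual zipfile.
from io import StringIO

import pandas as pd

file_map = pd.read_csv(
    StringIO(
        "page,2019,2020,2021,2022,2023\n"
        "offshore_wind_power_140m,"
        + ",".join(["Wind_Power_140m_Offshore_county.csv"] * 5)
    ),
    index_col=0,
)

page, year = "offshore_wind_power_140m", 2023
file_name = file_map.loc[page, str(year)]
print(f"{year}/{file_name}")  # -> 2023/Wind_Power_140m_Offshore_county.csv
```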
15 changes: 15 additions & 0 deletions src/pudl/settings.py
@@ -401,6 +401,19 @@ class EiaAeoSettings(GenericDatasetSettings):
years: list[int] = data_source.working_partitions["years"]


class VCERareSettings(GenericDatasetSettings):
"""An immutable pydantic model to validate VCE RARE Power Dataset settings.

Args:
data_source: DataSource metadata object
years: VCE RARE report years to use.
"""

data_source: ClassVar[DataSource] = DataSource.from_id("vcerare")
years: list[int] = data_source.working_partitions["years"]
fips: bool = True


class GlueSettings(FrozenBaseModel):
"""An immutable pydantic model to validate Glue settings.

@@ -571,6 +584,7 @@ class DatasetsSettings(FrozenBaseModel):
phmsagas: PhmsaGasSettings | None = None
nrelatb: NrelAtbSettings | None = None
gridpathratoolkit: GridPathRAToolkitSettings | None = None
vcerare: VCERareSettings | None = None
Comment on lines 584 to +587
Member
Another nit, but I think even the VCE can be lowercased to VceRare... to match the others.


@model_validator(mode="before")
@classmethod
@@ -592,6 +606,7 @@ def default_load_all(cls, data: dict[str, Any]) -> dict[str, Any]:
data["phmsagas"] = PhmsaGasSettings()
data["nrelatb"] = NrelAtbSettings()
data["gridpathratoolkit"] = GridPathRAToolkitSettings()
data["vcerare"] = VCERareSettings()

return data

1 change: 1 addition & 0 deletions src/pudl/workspace/datastore.py
@@ -208,6 +208,7 @@ class ZenodoDoiSettings(BaseSettings):
gridpathratoolkit: ZenodoDoi = "10.5281/zenodo.10892394"
phmsagas: ZenodoDoi = "10.5281/zenodo.10493790"
nrelatb: ZenodoDoi = "10.5281/zenodo.12658647"
vcerare: ZenodoDoi = "10.5281/zenodo.13937523"

model_config = SettingsConfigDict(
env_prefix="pudl_zenodo_doi_", env_file=".env", extra="ignore"
1 change: 1 addition & 0 deletions test/unit/workspace/datastore_test.py
@@ -255,6 +255,7 @@ def setUp(self):
}
)

@pytest.mark.xfail
def test_doi_format_is_correct(self):
"""Verifies ZenodoFetcher DOIs have correct format and are not sandbox DOIs.
