pudl-models: ML models developed for PUDL

About

The PUDL project makes US energy data free and open for all. For more information, see the PUDL repo and website.

This repo implements machine learning models that support PUDL. The modelling performed here includes record linkage between datasets and extraction of structured data from unstructured documents. The outputs of these models feed into PUDL tables and are distributed in the PUDL data warehouse.

Project Structure

This repo is split into two main sections: shared tooling is implemented in src/mozilla_sec_eia/library, and the models themselves are implemented in src/mozilla_sec_eia/models.

Models

Each model is contained in its own Dagster code location. This keeps models isolated from each other, allows fine-grained dependency management, and provides useful organization in the Dagster UI. To add a new model, create a new Python module in the src/mozilla_sec_eia/models/ directory. This module should define a single Dagster Definitions object that can be imported from the top level of the module. For an example of how to structure a code location, see src/mozilla_sec_eia/models/sec10k/. After creating a new model, add it to workspace.yaml.

There are three types of Dagster jobs expected in a model code location:

Production Jobs: Production jobs define a pipeline that executes a model and produces outputs, which typically feed into PUDL.
Validation Jobs: Validation jobs are used to test and validate models. They run in a single process, backed by an mlflow run so that results can be logged to a tracking server.
Training Jobs: Training jobs train models and log the results with mlflow for use in production jobs.

There are helper functions in src/mozilla_sec_eia/library/model_jobs.py for constructing each of these jobs. These functions help ensure that each job uses the appropriate executor and is supplied with the resources it needs.

Library

Generic shared tooling for pudl-models is defined in src/mozilla_sec_eia/library/. This includes the helper functions for constructing Dagster jobs discussed above, as well as useful methods for computing validation metrics and an interface to our mlflow tracking server.
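To make the idea of a validation metric concrete, here is a small standalone sketch of precision, recall, and F1 for a record linkage task. It is an illustration only: the function name linkage_metrics and the example record IDs are hypothetical and are not taken from the library module.

```python
def linkage_metrics(
    predicted: set[tuple[str, str]],
    expected: set[tuple[str, str]],
) -> dict[str, float]:
    """Compare predicted record links against a labeled set of true links."""
    true_positives = len(predicted & expected)
    # Guard against empty sets so the metrics are always defined.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical links between records in two datasets.
predicted_links = {("a", "1"), ("b", "2"), ("c", "3"), ("d", "9")}
expected_links = {("a", "1"), ("b", "2"), ("c", "3"), ("e", "5")}

metrics = linkage_metrics(predicted_links, expected_links)
print(metrics)
```

Three of the four predicted links are correct and three of the four true links are found, so precision, recall, and F1 all come out to 0.75 in this example.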

MlFlow

We use a remote mlflow tracking server to aid in the development and management of pudl-models. The mlflow module contains several Dagster resources and IO managers that any model can use to get a simple, seamless interface to the server.

Development

To launch the Dagster UI with all pudl-models loaded, run the command dagster dev in the top level of this repo. This loads the file workspace.yaml, which points to each model. You can also work on a single model in isolation by running the command: dagster dev -m mozilla_sec_eia.models.{your_cool_model}.
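The contents of this repo's workspace.yaml are not shown here. As a hedged sketch of what a workspace file that points to each model could look like, the snippet below renders one python_module entry per model under load_from, which is a form Dagster's workspace format accepts. The helper name render_workspace and the model name your_cool_model are hypothetical.

```python
def render_workspace(model_names: list[str]) -> str:
    """Render workspace.yaml text with one python_module entry per model."""
    lines = ["load_from:"]
    for name in sorted(model_names):
        lines.append(f"  - python_module: mozilla_sec_eia.models.{name}")
    return "\n".join(lines) + "\n"


# sec10k is the example model in this repo; your_cool_model is a placeholder.
print(render_workspace(["sec10k", "your_cool_model"]))
```

Each entry names the same module path that would be passed to dagster dev -m when working on that model in isolation.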

About Catalyst Cooperative

Catalyst Cooperative is a small group of data wranglers and policy wonks organized as a worker-owned cooperative consultancy. Our goal is a more just, livable, and sustainable world. We integrate public data and perform custom analyses to inform public policy (Hire us!). Our focus is primarily on mitigating climate change and improving electric utility regulation in the United States.

Contact Us

For general support, questions, or other conversations around the project that might be of interest to others, check out the GitHub Discussions.
If you'd like to get occasional updates about our projects, sign up for our email list.
Want to schedule a time to chat with us one-on-one? Join us for Office Hours.
Follow us on Twitter: @CatalystCoop
More info on our website: https://catalyst.coop
For private communication about the project or to hire us to provide customized data extraction and analysis, you can email the maintainers: [email protected]
