The PUDL project makes US energy data free and open for all. For more information, see the PUDL repo and website.
This repo implements machine learning models which support PUDL. The types of modelling performed here include record linkage between datasets, and extracting structured data from unstructured documents. The outputs of these models then feed into PUDL tables, and are distributed in the PUDL data warehouse.
This repo is split into two main sections, with shared tooling implemented in src/mozilla_sec_eia/library and actual models implemented in src/mozilla_sec_eia/models.
Each model is contained in its own Dagster code location. This keeps models isolated from each other, allows fine-grained dependency management, and provides useful organization in the Dagster UI. To add a new model, create a new Python module in the src/mozilla_sec_eia/models/ directory. This module should define a single Dagster Definitions object which can be imported from the top level of the module. For an example of how to structure a code location, see src/mozilla_sec_eia/models/sec10k/. After creating a new model, add it to workspace.yaml.
There are three types of Dagster jobs expected in a model code location:
- Production Jobs: Production jobs define a pipeline to execute a model and produce outputs which typically feed into PUDL.
- Validation Jobs: Validation jobs are used to test/validate models. They will be run in a single process with an mlflow run backing them to allow logging results to a tracking server.
- Training Jobs: Training jobs are meant to train models and log results with mlflow for use in production jobs.
There are helper functions in src/mozilla_sec_eia/library/model_jobs.py for constructing each of these jobs. These functions help ensure that each job uses the appropriate executor and is supplied with the necessary resources.
Generic shared tooling for pudl-models is defined in src/mozilla_sec_eia/library/. This includes the helper functions for constructing Dagster jobs discussed above, as well as useful methods for computing validation metrics and an interface to our mlflow tracking server.
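To make "validation metrics" concrete, here is a minimal sketch of the kind of metric that is useful for record linkage: precision and recall over predicted versus expected record pairs. The function precision_recall and the sample IDs are hypothetical illustrations, not the functions or data in the library.

```python
def precision_recall(
    predicted: set[tuple[str, str]],
    expected: set[tuple[str, str]],
) -> tuple[float, float]:
    """Compute precision and recall for a set of predicted record links.

    Each link is a (left_id, right_id) pair. Empty inputs yield 0.0 rather
    than raising a ZeroDivisionError.
    """
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical links between two datasets; one predicted link is wrong.
predicted_links = {("a", "1"), ("b", "2"), ("c", "4")}
expected_links = {("a", "1"), ("b", "2"), ("c", "3")}
precision, recall = precision_recall(predicted_links, expected_links)
```

Here two of the three predicted links are correct, and two of the three expected links are found, so both metrics come out to two thirds.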
We use a remote mlflow tracking server to aid in the development and management of pudl-models. In the mlflow module, there are several Dagster resources and IO managers that can be used in any model to provide a simple, seamless interface to the server.
To launch the Dagster UI with all pudl-models loaded, run the command dagster dev in the top level of this repo. This will load the file workspace.yaml, which points to each model. You can also work on a single model in isolation by running the command: dagster dev -m mozilla_sec_eia.models.{your_cool_model}
Catalyst Cooperative is a small group of data wranglers and policy wonks organized as a worker-owned cooperative consultancy. Our goal is a more just, livable, and sustainable world. We integrate public data and perform custom analyses to inform public policy (Hire us!). Our focus is primarily on mitigating climate change and improving electric utility regulation in the United States.
- For general support, questions, or other conversations around the project that might be of interest to others, check out the GitHub Discussions.
- If you'd like to get occasional updates about our projects, sign up for our email list.
- Want to schedule a time to chat with us one-on-one? Join us for Office Hours.
- Follow us on Twitter: @CatalystCoop
- More info on our website: https://catalyst.coop
- For private communication about the project or to hire us to provide customized data extraction and analysis, you can email the maintainers: [email protected]