diff --git a/README.rst b/README.rst
index 31dba8a38e..1fda87632e 100644
--- a/README.rst
+++ b/README.rst
@@ -37,14 +37,15 @@ The Public Utility Data Liberation Project (PUDL)
     :alt: Zenodo DOI
 
 `PUDL `__ makes US energy data easier to access
-and work with. Hundreds of gigabytes of supposedly public information published
-by government agencies, but in a bunch of different formats that can be hard to
+and work with. Hundreds of gigabytes of public information are published
+by government agencies, but in many different formats that make it hard to
 work with and combine. PUDL takes these spreadsheets, CSV files, and databases
-and turns them into easy to parse, well-documented `tabular data packages `__
-that can be used to create a database, used directly with Python, R, Microsoft
-Access, and lots of other tools.
+and turns them into easy-to-use
+`tabular data packages `__
+that can populate a database or be used directly with Python, R, Microsoft
+Access, and many other tools.
 
-The project currently contains data from:
+The project currently integrates data from:
 
 * `EIA Form 860 `__
 * `EIA Form 923 `__
@@ -52,18 +53,17 @@ The project currently contains data from:
 * `The EPA Integrated Planning Model (IPM) `__
 * `FERC Form 1 `__
 
-We are especially interested in serving researchers, activists, journalists,
+The project is especially meant to serve researchers, activists, journalists,
 and policy makers that might not otherwise be able to afford access to this
-data from commercial data providers.
+data from existing commercial data providers.
 
 Getting Started
 ---------------
 
 Just want to play with some example data? Install `Anaconda `__
-(or `miniconda `__
-if you like the command line) with at least Python 3.7. Then work through the
-following terminal commands:
+(or `miniconda `__) with at
+least Python 3.7. Then work through the following commands.
 
 First, we create and activate conda environment named ``pudl``. All the
 required packages are available from the community maintained ``conda-forge``
@@ -78,11 +78,10 @@ interactively.
     $ conda create -y -n pudl -c conda-forge --strict-channel-priority python=3.7 catalystcoop.pudl jupyter jupyterlab pip
     $ conda activate pudl
 
-Now we create a data management workspace -- a well defined directory structure
-that PUDL will use to organize the data it downloads, processes, and outputs --
-and download the most recent year's worth of data for each of the available
-datasets. You can run ``pudl_setup --help`` and ``pudl_data --help`` for more
-information.
+Now we create a data management workspace called ``pudl-work`` and download
+some data. The workspace is a well defined directory structure that PUDL uses
+to organize the data it downloads, processes, and outputs. You can run
+``pudl_setup --help`` and ``pudl_data --help`` for more information.
 
 .. code-block:: console
 
@@ -91,12 +90,12 @@ information.
     $ pudl_setup pudl-work
     $ pudl_data --sources eia923 eia860 ferc1 epacems epaipm --years 2017 --states id
 
 Now that we have the original data as published by the federal agencies, we can
-run the data processing (ETL = Extract, Transform, Load) pipeline, that turns
-the raw data into an well organized, standardized bundle of data packages.
-This involves a couple of steps: cloning the FERC Form 1 into an SQLite
-database, extracting data from that database and all the other sources and
-cleaning it up, outputting that data into well organized CSV/JSON based data
-packages, and finally loading those data packages into a local database.
+run the ETL (Extract, Transform, Load) pipeline, which turns the raw data into
+a well organized, standardized bundle of data packages. This involves a couple
+of steps: cloning the FERC Form 1 database into SQLite, extracting data from
+that database and all the other sources and cleaning it up, outputting that
+data into well organized CSV/JSON based data packages, and finally loading
+those data packages into a local database.
 
 PUDL provides a script to clone the FERC Form 1 database, controlled by a YAML
 file which you can find in the settings folder. Run it like this:
 
@@ -119,7 +118,7 @@ using. Run the ETL pipeline with this command:
 
 .. code-block:: console
 
     $ pudl_etl pudl-work/settings/etl_example.yml
 
-The generated data packages are made up of CSV and JSON files, that are both
+The generated data packages are made up of CSV and JSON files. They're both
 easy to parse programmatically, and readable by humans. They are also well
 suited to archiving, citation, and bulk distribution. However, to make the
 data easier to query and work with interactively, we typically load it into a
@@ -139,11 +138,6 @@ Jupyter notebook server, and open a notebook of PUDL usage examples:
 
     $ jupyter lab pudl-work/notebook/pudl_intro.ipynb
 
-**NOTE:** The example above requires a computer with at least **4 GB of RAM**
-and **several GB of free disk space**. You will also need to download
-**100s of MB of data**. This could take a while if you have a slow internet
-connection.
-
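+If you want to poke at the data without firing up Jupyter, the data packages
+loaded into SQLite can be queried directly with Python and pandas. The
+database path and table name here are only examples -- adjust them to match
+your own workspace and settings:
+
+.. code-block:: python
+
+    import sqlite3
+
+    import pandas as pd
+
+    # Example path -- use wherever the SQLite database was created in your
+    # PUDL workspace.
+    conn = sqlite3.connect("pudl-work/sqlite/pudl.sqlite")
+
+    # List the tables that were loaded from the data packages...
+    print(pd.read_sql("SELECT name FROM sqlite_master WHERE type='table';", conn))
+
+    # ...and pull one of them into a dataframe.
+    fuel_ferc1 = pd.read_sql("SELECT * FROM fuel_ferc1;", conn)
+    print(fuel_ferc1.head())
+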
 For more details, see `the full PUDL documentation `__
 on Read The Docs.
 
@@ -167,9 +161,11 @@ contribute!
 Licensing
 ---------
 
-The PUDL software is released under the `MIT License `__.
+The PUDL software is released under the
+`MIT License `__.
 `The PUDL documentation `__
-and the data packages we distribute are released under the `Creative Commons Attribution 4.0 License `__.
+and the data packages we distribute are released under the
+`CC-BY-4.0 `__ license.
 
 Contact Us
 ----------
diff --git a/docs/clone_ferc1.rst b/docs/clone_ferc1.rst
index 68aeb1f8b4..5521110f54 100644
--- a/docs/clone_ferc1.rst
+++ b/docs/clone_ferc1.rst
@@ -33,8 +33,8 @@ ETL process. This can be done with the ``ferc1_to_sqlite`` script (which is
 an entrypoint into the :mod:`pudl.convert.ferc1_to_sqlite` module) which is
 installed as part of the PUDL Python package. It takes its instructions from
 a YAML file, an example of which is included in the ``settings`` directory in
-your PUDL workspace. Once you've :ref:`created a ` you can try this
-example:
+your PUDL workspace. Once you've :ref:`created a datastore ` you can
+try this example:
 
 .. code-block:: console
 
@@ -49,14 +49,16 @@ factor of ~10 (to ~8 GB rather than 800 MB). If for some reason you need
 access to those tables, you can create your own settings file and un-comment
 those tables in the list of tables that it directs the script to load.
 
-Note that this script pulls *all* the FERC Form 1 data into a *single*
-database, while FERC distributes a *separate* database for each year. Virtually
-all the database tables contain a ``report_year`` column that indicates which
-year they came from, preventing collisions between records in the merged
-multi-year database that we create. One notable exception is the
-``f1_respondent_id`` table, which maps ``respondent_id`` to the names of the
-respondents. For that table, we have allowed the most recently reported record
-to take precedence, overwriting previous mappings if they exist.
+.. note::
+
+    This script pulls *all* of the FERC Form 1 data into a *single* database,
+    but FERC distributes a *separate* database for each year. Virtually all
+    the database tables contain a ``report_year`` column that indicates which
+    year they came from, preventing collisions between records in the merged
+    multi-year database. One notable exception is the ``f1_respondent_id``
+    table, which maps ``respondent_id`` to the names of the respondents. For
+    that table, we have allowed the most recently reported record to take
+    precedence, overwriting previous mappings if they exist.
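+
+Once the clone exists, it is easy to confirm which years actually made it into
+the merged database. The database path below is just an example -- use
+wherever ``ferc1_to_sqlite`` put the clone in your workspace -- and ``f1_fuel``
+is simply one of the tables that carries a ``report_year`` column:
+
+.. code-block:: python
+
+    import sqlite3
+
+    conn = sqlite3.connect("pudl-work/sqlite/ferc1.sqlite")  # example path
+    years = conn.execute(
+        "SELECT DISTINCT report_year FROM f1_fuel ORDER BY report_year;"
+    ).fetchall()
+    print(years)
+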
 
 Sadly, the FERC Form 1 database is not particularly... relational. The only
 foreign key relationships that exist map ``respondent_id`` fields in the
diff --git a/docs/datapackages.rst b/docs/datapackages.rst
index 46fac8ac70..97e873d546 100644
--- a/docs/datapackages.rst
+++ b/docs/datapackages.rst
@@ -76,6 +76,16 @@ the data packages to populate a local SQLite database.
 `Open an issue on Github `__
 and let us know if you have another example we can add.
 
+SQLite
+^^^^^^
+
+If you want to access the data via SQL, we have provided a script that loads
+a bundle of data packages into a local :mod:`sqlite3` database, e.g.:
+
+.. code-block::
+
+    $ datapkg_to_sqlite --pkg_bundle_name pudl-example
+
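+Once the script finishes, the resulting database can be opened with any
+SQLite-aware tool. For a quick sanity check from the command line (the path
+below is an example -- use wherever the database ends up in your workspace):
+
+.. code-block:: console
+
+    $ sqlite3 pudl-work/sqlite/pudl.sqlite ".tables"
+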
 Python, Pandas, and Jupyter
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
@@ -111,16 +121,6 @@ The R programming language
 Get someone who uses R to give us an example here... maybe we can get someone
 from OKFN to do it?
 
-SQLite
-^^^^^^
-
-If you'd rather access the data via SQL, we have provided a script that loads
-a bundle of the datapackages into a local :mod:`sqlite3` database, e.g.:
-
-.. code-block::
-
-    $ datapkg_to_sqlite --pkg_bundle_name pudl-example
-
 Microsoft Access / Excel
 ^^^^^^^^^^^^^^^^^^^^^^^^^
 
diff --git a/docs/datastore.rst b/docs/datastore.rst
index a7e8859fd2..67f31dbf0d 100644
--- a/docs/datastore.rst
+++ b/docs/datastore.rst
@@ -45,9 +45,7 @@ For more detailed usage information, see:
 
 The downloaded data will be used by the script to populate a datastore under
 the ``data`` directory in your workspace, organized by data source, form, and
-date:
-
-.. code-block::
+date::
 
     data/eia/form860/
     data/eia/form923/
diff --git a/docs/dev_setup.rst b/docs/dev_setup.rst
index 15b65d2957..c65e949b7a 100644
--- a/docs/dev_setup.rst
+++ b/docs/dev_setup.rst
@@ -226,9 +226,6 @@ be hard for humans to catch but are easy for a computer.
   managing and maintaining multi-language pre-commit hooks.
 
-* Set up your editor / IDE to follow our code style guidelines.
-* Run ``pudl_setup`` to create a local data management environment.
-
 -------------------------------------------------------------------------------
 Install and Validate the Data
 -------------------------------------------------------------------------------
 
@@ -279,8 +276,7 @@ already downloaded datastore, you can point the tests at it with
 
     $ tox -v -e travis -- --fast --pudl_in=AUTO
 
-Additional details can be found in our
-:ref:`documentation on testing `.
+Additional details can be found in :ref:`testing`.
 
 -------------------------------------------------------------------------------
 Making a Pull Request
diff --git a/docs/install.rst b/docs/install.rst
index bb3eeb7788..c4203a70cd 100644
--- a/docs/install.rst
+++ b/docs/install.rst
@@ -68,6 +68,8 @@ with PUDL interactively:
 
 You may also want to update your global ``conda`` settings:
 
+.. code-block:: console
+
     $ conda config --add channels conda-forge
     $ conda config --set channel_priority strict
 
@@ -87,7 +89,7 @@ PUDL is also available via the official
 :doc:`dev_setup` documentation.
 
 In addition to making the :mod:`pudl` package available for import in Python,
-installing ``catalystcoop.pudl`` installs the following command line tools:
+installing ``catalystcoop.pudl`` provides the following command line tools:
 
 * ``pudl_setup``
 * ``pudl_data``
@@ -96,10 +98,12 @@ installing ``catalystcoop.pudl`` installs the following command line tools:
 * ``datapkg_to_sqlite``
 * ``epacems_to_parquet``
 
-For information on how to use them, run them with the ``--help`` option. Most
-of them are configured using settings files. Examples are provided with the
-``catalystcoop.pudl`` package, and deployed by running ``pudl_setup`` as
-described below.
+For information on how to use these scripts, each can be run with the
+``--help`` option. ``ferc1_to_sqlite`` and ``pudl_etl`` are configured with
+YAML files. Examples are provided with the ``catalystcoop.pudl`` package, and
+deployed by running ``pudl_setup`` as described below. Additional information
+about the settings files can be found in our documentation on
+:ref:`settings_files`.
 
 .. _install-workspace:
 
@@ -107,7 +111,7 @@ described below.
 -------------------------------------------------------------------------------
 Creating a Workspace
 -------------------------------------------------------------------------------
 
-PUDL needs to know where to store its big pile of input and output data. It
+PUDL needs to know where to store its big piles of inputs and outputs. It
 also provides some example configuration files and `Jupyter `__
 notebooks. The ``pudl_setup`` script lets PUDL know where all this stuff
 should go. We call this a "PUDL workspace":
@@ -120,7 +124,8 @@ Here is the path to the directory where you want PUDL to do its business --
 this is where the datastore will be located, and any outputs that are
 generated will end up. The script will also put a configuration file in your
 home directory, called ``.pudl.yml`` that records the location of this
-workspace and uses it by default in the future.
+workspace and uses it by default in the future. If you run ``pudl_setup`` with
+no arguments, it assumes you want to use the current directory.
 
 The workspace is laid out like this:
 
@@ -164,9 +169,9 @@ run:
 
 .. code-block:: console
 
-    $ conda env create --name=pudl --file=environment.yml
+    $ conda env create --name pudl --file environment.yml
 
-You should probably periodically update the packages installed as part of PUDL,
+You may want to periodically update PUDL and the packages it depends on by
 running the following commands in the directory with ``environment.yml`` in
 it:
diff --git a/docs/new_dataset.rst b/docs/new_dataset.rst
index 43cca887ac..13cba7ce8e 100644
--- a/docs/new_dataset.rst
+++ b/docs/new_dataset.rst
@@ -12,27 +12,27 @@ and improves over time.
 
 In general the process for adding a new data source looks like this:
 
-#. Add the new data source to the ``datastore.py`` module and the
-   ``update_datastore.py`` script.
+#. Add the new data source to the :mod:`pudl.workspace.datastore` module and
+   the ``pudl_data`` script.
 #. Define well normalized data tables for the new data source in the
-   metadata ``pudl/package_data/meta/datapackage/datapackage.json``.
-#. Add a module to the ``extract`` subpackage that generates raw dataframes
-   containing the new data source's information from whatever its original
-   format was.
-#. Add a module to the ``transform`` subpackage that takes those raw
+   metadata, which is stored in
+   ``src/pudl/package_data/meta/datapackage/datapackage.json``.
+#. Add a module to the :mod:`pudl.extract` subpackage that generates raw
+   dataframes containing the new data source's information from whatever its
+   original format was.
+#. Add a module to the :mod:`pudl.transform` subpackage that takes those raw
    dataframes, cleans them up, and re-organizes them to match the new database
    table definitions.
-#. If necessary, add a module to the ``load`` subpackage that takes these
-   clean, transformed dataframes and pushes their contents into the postgres
-   database.
-#. If appropriate, create linkages between the new database tables and other
-   existing data in the database, so they can be used together. Often this
-   means creating some skinny "glue" tables that link one set of unique entity
-   IDs to another.
-#. Update the ``etl.py`` module so that it includes your new data source as
-   part of the ETL (Extract, Transform, Load) process, and any necessary code
-   to the ``cli.py`` script.
-#. Add an output module for the new data source to the ``output`` subpackage.
+#. If necessary, add a module to the :mod:`pudl.load` subpackage that takes
+   these clean, transformed dataframes and exports them to data packages.
+#. If appropriate, create linkages in the table schemas between the tabular
+   resources so they can be used together. Often this means creating some
+   skinny "glue" tables that link one set of unique entity IDs to another.
+#. Update the :mod:`pudl.etl` module so that it includes your new data source
+   as part of the ETL (Extract, Transform, Load) process, and any necessary
+   code to the :mod:`pudl.cli` entrypoint module.
+#. Add an output module for the new data source to the :mod:`pudl.output`
+   subpackage.
 #. Write some unit tests for the new data source, and add them to the
    ``pytest`` suite in the ``test`` directory.
 
@@ -43,13 +43,13 @@ Add dataset to the datastore
 Scripts
 ^^^^^^^
 
-This means editing the ``datastore.py`` module and the datastore update script
-(``scripts/update_datastore.py``) so that they can acquire the data from the
-reporting agencies, and organize it locally in advance of the ETL (Extract,
-Transform, and Load) process. New data sources should be organized under
-``data///`` e.g. ``data/ferc/form1`` or ``data/eia/form923``.
-Larger data sources that are available as compressed zipfiles can be left
-zipped to save local disk space, since ``pandas`` can read zipfiles directly.
+This means editing the :mod:`pudl.workspace.datastore` module and the
+``pudl_data`` script so that they can acquire the data from the
+reporting agencies, and organize it locally in advance of the ETL process.
+New data sources should be organized under ``data///`` e.g.
+``data/ferc/form1`` or ``data/eia/form923``. Larger data sources that are
+available as compressed zipfiles can be left zipped to save local disk space,
+since ``pandas`` can read zipfiles directly.
 
 Organization
 ^^^^^^^^^^^^
 
@@ -68,19 +68,18 @@ should be able to specify subsets of the data to pull or refresh -- e.g. a set
 of years, or a set of states -- especially in the case of large datasets. In
 some cases, opening several download connections in parallel may dramatically
 reduce the time it takes to acquire the data (e.g. pulling don the EPA CEMS
-dataset over FTP). The ``constants.py`` module contains several dictionaries
-which define what years etc. are available for each data source.
+dataset over FTP). The :mod:`pudl.constants` module contains several
+dictionaries which define what years etc. are available for each data source.
 
 Describe Table Metadata
 ^^^^^^^^^^^^^^^^^^^^^^^
 
 Add table description into `resources` in the the mega-data: the metadata
 file that contains all of the PUDL table descriptions
-(``package_data/meta/datapackage/datapackage.json``). The resource descriptions
-must conform to the `Frictionless Data specifications `__,
+(``src/pudl/package_data/meta/datapackage/datapackage.json``). The resource
+descriptions must conform to the `Frictionless Data specifications `__,
 specifically the specifications for a `tabular data resource `__.
-The `table schema specification __` will be
-particularly helpful.
+The `table schema specification `__ will be particularly helpful.
 
 There is also a dictionary in the megadata called "autoincrement", which is
 used for compiling table names that require an auto incremented id column when
@@ -142,19 +141,13 @@ Load the data into the datapackages
 
 Each of the dataframes that comes out of the transform step represents a
 resource that needs to be loaded into the datapackage. Pandas has a native
-:meth:`pandas.DataFrame.to_csv` method for exporting a dataframe to a
-
-Instead, we use postgres’ native
-``COPY_FROM`` function, which is designed for loading large CSV files directly
-into the database very efficiently. Instead of writing the dataframe out to a
-file on disk, we create an in-memory file-like object, and read directly from
-that. For this to work, the corresponding dataframe and database columns need
-to be named identically, and the strings that are read by postgres from the
-in-memory CSV file need to be readily interpretable as the data type that is
-associated with the column in the table definition. Because Python doesn’t have
-a native NA value for integers, but postgres does, just before the dataframes
-are loaded into the database we convert any integer NA sentinel values using a
-little helper function :func:`pudl.helpers.fix_int_na`.
+:meth:`pandas.DataFrame.to_csv` method for exporting a dataframe to a CSV
+file, which is used to output the data to disk.
+
+Because we have not yet taken advantage of the new pandas extension arrays, and
+Python doesn’t have a native NA value for integers, just before the dataframes
+are written to disk we convert any integer NA sentinel values using a little
+helper function :func:`pudl.helpers.fix_int_na`.
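+
+The snippet below is a generic illustration of the problem rather than the
+actual helper: pandas silently casts integer columns containing missing values
+to ``float64``, so a sentinel value is substituted before restoring the
+integer type and writing out the CSV.
+
+.. code-block:: python
+
+    import pandas as pd
+
+    df = pd.DataFrame({"plant_id": [113, 244, None]})
+    print(df["plant_id"].dtype)  # float64, because of the missing value
+
+    # Fill the NA with a sentinel and restore the integer type before export.
+    df["plant_id"] = df["plant_id"].fillna(-1).astype(int)
+    df.to_csv("example_resource.csv", index=False)
+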
 
 Glue the new data to existing data
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
@@ -202,7 +195,5 @@ Test cases need to be created for each new dataset, verifying that the ETL
 process works, and sanity checking the data itself. This is somewhat
 different than traditional software testing, since we're not just testing our
 code -- we're also trying to make sure that the data is in good shape. Those
-exhaustive tests are currently only run locally. Less extensive tests that are
-meant to just check that the code is still working correctly need to be
-integrated into the ``test/travis_ci_test.py`` module, which downloads a small
-sample of each dataset for use in testing.
+exhaustive tests are currently only run locally. See :ref:`testing` for more
+details.
diff --git a/docs/settings_files.rst b/docs/settings_files.rst
index be373949ad..bb0f296aae 100644
--- a/docs/settings_files.rst
+++ b/docs/settings_files.rst
@@ -1,3 +1,5 @@
+.. _settings_files:
+
 ===============================================================================
 Settings Files
 ===============================================================================
@@ -63,9 +65,7 @@ dictionaries should not be changed, but the values of those dictionaries
 should be edited. There are two high-level elements of the settings file:
 ``pkg_bundle_name`` and ``pkg_bundle_settings``. The ``pkg_bundle_name`` will
 be the directory that the bundle of packages described in the settings file.
-The elements and structure of the ``pkg_bundle_settings`` is described below:
-
-.. code-block::
+The elements and structure of the ``pkg_bundle_settings`` are described below::
 
     pkg_bundle_settings
     ├── name : name of data package
diff --git a/docs/usage.rst b/docs/usage.rst
index c1ec1033a3..59fb99e0f0 100644
--- a/docs/usage.rst
+++ b/docs/usage.rst
@@ -54,12 +54,12 @@ directory at ``datapackage/pudl-example`` containing several datapackage
 directories, one for each of the ``ferc1``, ``eia`` (Forms 860 and 923),
 ``epacems-eia``, and ``epaipm`` datasets.
 
-Under the hood, these scripts are extracting a bunch of data from the
-datastore, including a bunch of spreadsheets, CSV files, and binary DBF files,
-generating a SQLite database containing the raw FERC Form 1 data, and combining
-it all into ``pudl-example``, which is a bundle of
-`tabular datapackages `__.
-that can be used together to create a database (or other things).
+Under the hood, these scripts are extracting data from the datastore, including
+spreadsheets, CSV files, and binary DBF files, generating a SQLite database
+containing the raw FERC Form 1 data, and combining it all into
+``pudl-example``, which is a bundle of `tabular datapackages
+`__ that can be used
+together to create a database (or other things).
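+
+If you just want to peek inside one of the packages, the ``datapackage``
+Python library from the Frictionless Data project can read it directly. The
+path and resource name below are examples -- substitute whatever appears in
+your own output directory and ``datapackage.json``:
+
+.. code-block:: python
+
+    import pandas as pd
+    from datapackage import Package
+
+    pkg = Package("datapackage/pudl-example/ferc1/datapackage.json")
+    print(pkg.resource_names)
+
+    # Read one tabular resource into a pandas dataframe.
+    rows = pkg.get_resource("fuel_ferc1").read(keyed=True)
+    df = pd.DataFrame(rows)
+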
 
 Each of the data packages which are part of the bundle have metadata describing
 their structure, stored in a file called ``datapackage.json`` The data itself