
Commit

Final read-through for documentation edits...
zaneselvans committed Sep 17, 2019
1 parent 50c98a8 commit fd2f867
Showing 9 changed files with 112 additions and 124 deletions.
56 changes: 26 additions & 30 deletions README.rst
@@ -37,33 +37,33 @@ The Public Utility Data Liberation Project (PUDL)
:alt: Zenodo DOI

`PUDL <https://catalyst.coop/pudl/>`__ makes US energy data easier to access
and work with. Hundreds of gigabytes of public information are published
by government agencies, but in many different formats that make it hard to
work with and combine. PUDL takes these spreadsheets, CSV files, and databases
and turns them into easy-to-use
`tabular data packages <https://frictionlessdata.io/docs/tabular-data-package/>`__
that can populate a database, or be used directly with Python, R, Microsoft
Access, and many other tools.
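As a sketch of what such a tabular data package looks like, here is a minimal descriptor plus one in-memory CSV resource, built with only the Python standard library. The resource name, field names, and plant data are all made up for illustration:

```python
import csv
import io
import json

# A minimal sketch of a Frictionless "tabular data package": a
# datapackage.json descriptor plus one or more CSV resources. All of the
# names and data below are hypothetical.
descriptor = {
    "name": "example-pudl-style-package",
    "resources": [
        {
            "name": "plants",
            "path": "data/plants.csv",
            "schema": {
                "fields": [
                    {"name": "plant_id", "type": "integer"},
                    {"name": "plant_name", "type": "string"},
                    {"name": "capacity_mw", "type": "number"},
                ]
            },
        }
    ],
}

# The CSV resource the descriptor points at (held in memory here).
plants_csv = "plant_id,plant_name,capacity_mw\n1,Example Station,450.0\n"

# A consumer reads column names and types from the descriptor, then parses
# the CSV accordingly -- no guessing at what the columns mean.
fields = descriptor["resources"][0]["schema"]["fields"]
column_names = [f["name"] for f in fields]
rows = list(csv.DictReader(io.StringIO(plants_csv)))

print(json.dumps(column_names))  # ["plant_id", "plant_name", "capacity_mw"]
print(rows[0]["plant_name"])     # Example Station
```

The point of the format is that the machine-readable schema travels with the data, so any tool (or human) can interpret the CSV without out-of-band documentation.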

The project currently integrates data from:

* `EIA Form 860 <https://www.eia.gov/electricity/data/eia860/>`__
* `EIA Form 923 <https://www.eia.gov/electricity/data/eia923/>`__
* `The EPA Continuous Emissions Monitoring System (CEMS) <https://ampd.epa.gov/ampd/>`__
* `The EPA Integrated Planning Model (IPM) <https://www.epa.gov/airmarkets/national-electric-energy-data-system-needs-v6>`__
* `FERC Form 1 <https://www.ferc.gov/docs-filing/forms/form-1/data.asp>`__

The project is especially meant to serve researchers, activists, journalists,
and policy makers who might not otherwise be able to afford access to this
data from existing commercial data providers.

Getting Started
---------------

Just want to play with some example data? Install
`Anaconda <https://www.anaconda.com/distribution/>`__
(or `miniconda <https://docs.conda.io/en/latest/miniconda.html>`__) with at
least Python 3.7. Then work through the following commands.

First, we create and activate a conda environment named ``pudl``. All the
required packages are available from the community-maintained ``conda-forge``
@@ -78,11 +78,10 @@ interactively.
$ conda create -y -n pudl -c conda-forge --strict-channel-priority python=3.7 catalystcoop.pudl jupyter jupyterlab pip
$ conda activate pudl
Now we create a data management workspace called ``pudl-work`` and download
some data. The workspace is a well defined directory structure that PUDL uses
to organize the data it downloads, processes, and outputs. You can run
``pudl_setup --help`` and ``pudl_data --help`` for more information.

.. code-block:: console
@@ -91,12 +90,12 @@ information.
$ pudl_data --sources eia923 eia860 ferc1 epacems epaipm --years 2017 --states id
Now that we have the original data as published by the federal agencies, we can
run the ETL (Extract, Transform, Load) pipeline, which turns the raw data into
a well organized, standardized bundle of data packages. This involves a couple
of steps: cloning the FERC Form 1 data into an SQLite database, extracting
data from that database and all the other sources and cleaning it up,
outputting that data into well organized CSV/JSON based data packages, and
finally loading those data packages into a local database.
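The Extract-Transform-Load shape can be illustrated in miniature with the standard library. This is a toy, not PUDL's actual code; the records and column names are invented:

```python
import sqlite3

# Toy ETL: "extract" raw records as an agency might publish them,
# "transform" them into a standardized form, and "load" the result into a
# local SQLite database. Illustrative only -- not PUDL's actual code.
raw_records = [  # extract: inconsistent names, numbers stored as text
    {"Plant Name": "  EXAMPLE STATION ", "Capacity (MW)": "450"},
    {"Plant Name": "other plant", "Capacity (MW)": "120.5"},
]

def transform(rec):
    """Trim whitespace, normalize capitalization, coerce types."""
    return (rec["Plant Name"].strip().title(), float(rec["Capacity (MW)"]))

conn = sqlite3.connect(":memory:")  # load target; a real run uses a file
conn.execute("CREATE TABLE plants (plant_name TEXT, capacity_mw REAL)")
conn.executemany("INSERT INTO plants VALUES (?, ?)",
                 [transform(r) for r in raw_records])

loaded = conn.execute(
    "SELECT plant_name, capacity_mw FROM plants ORDER BY plant_name"
).fetchall()
print(loaded)  # [('Example Station', 450.0), ('Other Plant', 120.5)]
```

PUDL's real pipeline does this across hundreds of tables and many years of data, but the extract, clean, and load steps follow the same basic pattern.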

PUDL provides a script to clone the FERC Form 1 database, controlled by a YAML
file which you can find in the settings folder. Run it like this:
@@ -119,7 +118,7 @@ using. Run the ETL pipeline with this command:
$ pudl_etl pudl-work/settings/etl_example.yml
The generated data packages are made up of CSV and JSON files. They're both
easy to parse programmatically, and readable by humans. They are also well
suited to archiving, citation, and bulk distribution. However, to make the
data easier to query and work with interactively, we typically load it into a
@@ -139,11 +138,6 @@ Jupyter notebook server, and open a notebook of PUDL usage examples:
$ jupyter lab pudl-work/notebook/pudl_intro.ipynb
**NOTE:** The example above requires a computer with at least **4 GB of RAM**
and **several GB of free disk space**. You will also need to download
**100s of MB of data**. This could take a while if you have a slow internet
connection.

For more details, see `the full PUDL documentation
<https://catalystcoop-pudl.readthedocs.io/>`__ on Read The Docs.

@@ -167,9 +161,11 @@ contribute!
Licensing
---------

The PUDL software is released under the
`MIT License <https://opensource.org/licenses/MIT>`__.
`The PUDL documentation <https://catalystcoop-pudl.readthedocs.io>`__
and the data packages we distribute are released under the
`CC-BY-4.0 <https://creativecommons.org/licenses/by/4.0/>`__ license.

Contact Us
----------
22 changes: 12 additions & 10 deletions docs/clone_ferc1.rst
@@ -33,8 +33,8 @@ ETL process. This can be done with the ``ferc1_to_sqlite`` script (which is an
entrypoint into the :mod:`pudl.convert.ferc1_to_sqlite` module) which is
installed as part of the PUDL Python package. It takes its instructions from a
YAML file, an example of which is included in the ``settings`` directory in
your PUDL workspace. Once you've :ref:`created a datastore <datastore>` you can
try this example:

.. code-block:: console
@@ -49,14 +49,16 @@ factor of ~10 (to ~8 GB rather than 800 MB). If for some reason you need access
to those tables, you can create your own settings file and un-comment those
tables in the list of tables that it directs the script to load.

.. note::

   This script pulls *all* of the FERC Form 1 data into a *single* database,
   but FERC distributes a *separate* database for each year. Virtually all
   the database tables contain a ``report_year`` column that indicates which
   year they came from, preventing collisions between records in the merged
   multi-year database. One notable exception is the ``f1_respondent_id``
   table, which maps ``respondent_id`` to the names of the respondents. For
   that table, we have allowed the most recently reported record to take
   precedence, overwriting previous mappings if they exist.
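The "most recently reported record wins" rule amounts to merging the yearly mappings in ascending year order. Here is a sketch with invented IDs and names, treating each year's ``f1_respondent_id`` table as a simple dictionary:

```python
# Sketch of the precedence rule: later years overwrite earlier mappings.
# Respondent IDs and names below are made up for illustration.
respondents_by_year = {
    2015: {1: "Old Utility Name", 2: "Second Utility"},
    2017: {1: "New Utility Name", 3: "Third Utility"},
}

merged = {}
for year in sorted(respondents_by_year):      # ascending, so later years...
    merged.update(respondents_by_year[year])  # ...overwrite earlier records

print(merged[1])  # New Utility Name -- the 2017 record took precedence
```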

Sadly, the FERC Form 1 database is not particularly... relational. The only
foreign key relationships that exist map ``respondent_id`` fields in the
20 changes: 10 additions & 10 deletions docs/datapackages.rst
Expand Up @@ -76,6 +76,16 @@ the data packages to populate a local SQLite database.

`Open an issue on Github <https://github.com/catalyst-cooperative/pudl/issues>`__ and let us know if you have another example we can add.

SQLite
^^^^^^

If you want to access the data via SQL, we have provided a script that loads
a bundle of data packages into a local :mod:`sqlite3` database, e.g.:

.. code-block:: console

   $ datapkg_to_sqlite --pkg_bundle_name pudl-example
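Conceptually, that load step uses the package's schema to create each table and then inserts the CSV rows. A minimal sketch with the standard library (hypothetical field names and data; the real ``datapkg_to_sqlite`` script handles many tables, types, and relationships):

```python
import csv
import io
import sqlite3

# Create a table from a (simplified) schema, then load CSV rows into it.
# Field names and data are hypothetical, for illustration only.
fields = [("plant_id", "INTEGER"), ("plant_name", "TEXT")]
csv_text = "plant_id,plant_name\n1,Example Station\n2,Other Plant\n"

conn = sqlite3.connect(":memory:")
columns = ", ".join(f"{name} {sqltype}" for name, sqltype in fields)
conn.execute(f"CREATE TABLE plants ({columns})")

reader = csv.DictReader(io.StringIO(csv_text))
conn.executemany(
    "INSERT INTO plants VALUES (?, ?)",
    [(int(row["plant_id"]), row["plant_name"]) for row in reader],
)

n_rows = conn.execute("SELECT COUNT(*) FROM plants").fetchone()[0]
print(n_rows)  # 2
```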
Python, Pandas, and Jupyter
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@@ -111,16 +121,6 @@ The R programming language
Get someone who uses R to give us an example here... maybe we can get
someone from OKFN to do it?

Microsoft Access / Excel
^^^^^^^^^^^^^^^^^^^^^^^^^

4 changes: 1 addition & 3 deletions docs/datastore.rst
@@ -45,9 +45,7 @@ For more detailed usage information, see:
The downloaded data will be used by the script to populate a datastore under
the ``data`` directory in your workspace, organized by data source, form, and
date::

   data/eia/form860/
   data/eia/form923/
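The layout is just nested directories keyed by source and form, so it can be sketched with ``pathlib``. Only the two EIA subdirectories are taken from the listing above; the FERC and EPA ones are assumptions for illustration, and everything is created under a throwaway temporary directory rather than a real workspace:

```python
import tempfile
from pathlib import Path

# Recreate a datastore-style layout under a temporary directory. The real
# datastore is managed by pudl_data; the ferc/epa subdirectories here are
# assumptions, included only to show the source/form nesting.
root = Path(tempfile.mkdtemp())
for sub in ["eia/form860", "eia/form923", "ferc/form1", "epa/cems"]:
    (root / "data" / sub).mkdir(parents=True)

layout = sorted(
    p.relative_to(root).as_posix()
    for p in (root / "data").rglob("*") if p.is_dir()
)
print(layout)
```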
6 changes: 1 addition & 5 deletions docs/dev_setup.rst
@@ -226,9 +226,6 @@ be hard for humans to catch but are easy for a computer.
managing and maintaining multi-language pre-commit hooks.


* Set up your editor / IDE to follow our code style guidelines.
* Run ``pudl_setup`` to create a local data management environment.

-------------------------------------------------------------------------------
Install and Validate the Data
-------------------------------------------------------------------------------
@@ -279,8 +276,7 @@ already downloaded datastore, you can point the tests at it with
$ tox -v -e travis -- --fast --pudl_in=AUTO
Additional details can be found in :ref:`testing`.

-------------------------------------------------------------------------------
Making a Pull Request
23 changes: 14 additions & 9 deletions docs/install.rst
@@ -68,6 +68,8 @@ with PUDL interactively:
You may also want to update your global ``conda`` settings:

.. code-block:: console

   $ conda config --add channels conda-forge
   $ conda config --set channel_priority strict
@@ -87,7 +89,7 @@ PUDL is also available via the official
:doc:`dev_setup` documentation.

In addition to making the :mod:`pudl` package available for import in Python,
installing ``catalystcoop.pudl`` provides the following command line tools:

* ``pudl_setup``
* ``pudl_data``
@@ -96,18 +98,20 @@ installing ``catalystcoop.pudl`` installs the following command line tools:
* ``datapkg_to_sqlite``
* ``epacems_to_parquet``

For information on how to use these scripts, each can be run with the
``--help`` option. ``ferc1_to_sqlite`` and ``pudl_etl`` are configured with
YAML files. Examples are provided with the ``catalystcoop.pudl`` package, and
deployed by running ``pudl_setup`` as described below. Additional information
about the settings files can be found in our documentation on
:ref:`settings_files`.

.. _install-workspace:

-------------------------------------------------------------------------------
Creating a Workspace
-------------------------------------------------------------------------------

PUDL needs to know where to store its big piles of inputs and outputs. It
also provides some example configuration files and
`Jupyter <https://jupyter.org>`__ notebooks. The ``pudl_setup`` script lets
PUDL know where all this stuff should go. We call this a "PUDL workspace":
@@ -120,7 +124,8 @@ Here <PUDL_DIR> is the path to the directory where you want PUDL to do its
business -- this is where the datastore will be located, and any outputs that
are generated will end up. The script will also put a configuration file in
your home directory, called ``.pudl.yml`` that records the location of this
workspace and uses it by default in the future. If you run ``pudl_setup`` with
no arguments, it assumes you want to use the current directory.
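The ``.pudl.yml`` file is just a small key-value record of where the workspace lives. As a sketch, we can write a plausible version of it (the ``pudl_out`` key name is an assumption; ``pudl_in`` appears elsewhere in these docs) and read it back with a minimal ``key: value`` parser, avoiding a PyYAML dependency in the example:

```python
import tempfile
from pathlib import Path

# Write a hypothetical .pudl.yml into a temporary directory and parse it.
# The key names are assumptions about the file's shape, for illustration.
cfg_path = Path(tempfile.mkdtemp()) / ".pudl.yml"
cfg_path.write_text(
    "pudl_in: /home/user/pudl-work\npudl_out: /home/user/pudl-work\n"
)

settings = {}
for line in cfg_path.read_text().splitlines():
    key, _, value = line.partition(":")  # minimal "key: value" reader
    settings[key.strip()] = value.strip()

print(settings["pudl_in"])  # /home/user/pudl-work
```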

The workspace is laid out like this:

@@ -164,9 +169,9 @@ run:

.. code-block:: console

   $ conda env create --name pudl --file environment.yml
You may want to periodically update PUDL and the packages it depends on
by running the following commands in the directory with ``environment.yml``
in it:

