
Commit

Final read-through for documentation edits...
zaneselvans committed Sep 17, 2019
1 parent 50c98a8 commit fd2f867
Showing 9 changed files with 112 additions and 124 deletions.
56 changes: 26 additions & 30 deletions README.rst
@@ -37,33 +37,33 @@ The Public Utility Data Liberation Project (PUDL)
:alt: Zenodo DOI

`PUDL <https://catalyst.coop/pudl/>`__ makes US energy data easier to access
and work with. Hundreds of gigabytes of public information are published
by government agencies, but in many different formats that make it hard to
work with and combine. PUDL takes these spreadsheets, CSV files, and databases
and turns them into easy-to-use
`tabular data packages <https://frictionlessdata.io/docs/tabular-data-package/>`__
that can populate a database, or be used directly with Python, R, Microsoft
Access, and many other tools.
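As a sketch of what such a tabular data package looks like, here is a minimal descriptor plus one in-memory CSV resource, built with only the Python standard library. The resource name, field names, and plant data are all made up for illustration:

```python
import csv
import io
import json

# A minimal sketch of a Frictionless "tabular data package": a
# datapackage.json descriptor plus one or more CSV resources. All of the
# names and data below are hypothetical.
descriptor = {
    "name": "example-pudl-style-package",
    "resources": [
        {
            "name": "plants",
            "path": "data/plants.csv",
            "schema": {
                "fields": [
                    {"name": "plant_id", "type": "integer"},
                    {"name": "plant_name", "type": "string"},
                    {"name": "capacity_mw", "type": "number"},
                ]
            },
        }
    ],
}

# The CSV resource the descriptor points at (held in memory here).
plants_csv = "plant_id,plant_name,capacity_mw\n1,Example Station,450.0\n"

# A consumer reads column names and types from the descriptor, then parses
# the CSV accordingly -- no guessing at what the columns mean.
fields = descriptor["resources"][0]["schema"]["fields"]
column_names = [f["name"] for f in fields]
rows = list(csv.DictReader(io.StringIO(plants_csv)))

print(json.dumps(column_names))  # ["plant_id", "plant_name", "capacity_mw"]
print(rows[0]["plant_name"])     # Example Station
```

The point of the format is that the machine-readable schema travels with the data, so any tool (or human) can interpret the CSV without out-of-band documentation.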

The project currently integrates data from:

* `EIA Form 860 <https://www.eia.gov/electricity/data/eia860/>`__
* `EIA Form 923 <https://www.eia.gov/electricity/data/eia923/>`__
* `The EPA Continuous Emissions Monitoring System (CEMS) <https://ampd.epa.gov/ampd/>`__
* `The EPA Integrated Planning Model (IPM) <https://www.epa.gov/airmarkets/national-electric-energy-data-system-needs-v6>`__
* `FERC Form 1 <https://www.ferc.gov/docs-filing/forms/form-1/data.asp>`__

The project is especially meant to serve researchers, activists, journalists,
and policy makers who might not otherwise be able to afford access to this
data from existing commercial data providers.

Getting Started
---------------

Just want to play with some example data? Install
`Anaconda <https://www.anaconda.com/distribution/>`__
(or `miniconda <https://docs.conda.io/en/latest/miniconda.html>`__) with at
least Python 3.7. Then work through the following commands.

First, we create and activate a conda environment named ``pudl``. All the
required packages are available from the community-maintained ``conda-forge``
@@ -78,11 +78,10 @@ interactively.
$ conda create -y -n pudl -c conda-forge --strict-channel-priority python=3.7 catalystcoop.pudl jupyter jupyterlab pip
$ conda activate pudl
Now we create a data management workspace called ``pudl-work`` and download
some data. The workspace is a well defined directory structure that PUDL uses
to organize the data it downloads, processes, and outputs. You can run
``pudl_setup --help`` and ``pudl_data --help`` for more information.

.. code-block:: console
@@ -91,12 +90,12 @@ information.
$ pudl_data --sources eia923 eia860 ferc1 epacems epaipm --years 2017 --states id
Now that we have the original data as published by the federal agencies, we can
run the ETL (Extract, Transform, Load) pipeline, which turns the raw data into
a well organized, standardized bundle of data packages. This involves a couple
of steps: cloning the FERC Form 1 data into an SQLite database, extracting
data from that database and all the other sources and cleaning it up,
outputting that data into well organized CSV/JSON based data packages, and
finally loading those data packages into a local database.
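The Extract-Transform-Load shape can be illustrated in miniature with the standard library. This is a toy, not PUDL's actual code; the records and column names are invented:

```python
import sqlite3

# Toy ETL: "extract" raw records as an agency might publish them,
# "transform" them into a standardized form, and "load" the result into a
# local SQLite database. Illustrative only -- not PUDL's actual code.
raw_records = [  # extract: inconsistent names, numbers stored as text
    {"Plant Name": "  EXAMPLE STATION ", "Capacity (MW)": "450"},
    {"Plant Name": "other plant", "Capacity (MW)": "120.5"},
]

def transform(rec):
    """Trim whitespace, normalize capitalization, coerce types."""
    return (rec["Plant Name"].strip().title(), float(rec["Capacity (MW)"]))

conn = sqlite3.connect(":memory:")  # load target; a real run uses a file
conn.execute("CREATE TABLE plants (plant_name TEXT, capacity_mw REAL)")
conn.executemany("INSERT INTO plants VALUES (?, ?)",
                 [transform(r) for r in raw_records])

loaded = conn.execute(
    "SELECT plant_name, capacity_mw FROM plants ORDER BY plant_name"
).fetchall()
print(loaded)  # [('Example Station', 450.0), ('Other Plant', 120.5)]
```

PUDL's real pipeline does this across hundreds of tables and many years of data, but the extract, clean, and load steps follow the same basic pattern.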

PUDL provides a script to clone the FERC Form 1 database, controlled by a YAML
file which you can find in the settings folder. Run it like this:
@@ -119,7 +118,7 @@ using. Run the ETL pipeline with this command:
$ pudl_etl pudl-work/settings/etl_example.yml
The generated data packages are made up of CSV and JSON files. They're both
easy to parse programmatically, and readable by humans. They are also well
suited to archiving, citation, and bulk distribution. However, to make the
data easier to query and work with interactively, we typically load it into a
@@ -139,11 +138,6 @@ Jupyter notebook server, and open a notebook of PUDL usage examples:
$ jupyter lab pudl-work/notebook/pudl_intro.ipynb
**NOTE:** The example above requires a computer with at least **4 GB of RAM**
and **several GB of free disk space**. You will also need to download
**100s of MB of data**. This could take a while if you have a slow internet
connection.

For more details, see `the full PUDL documentation
<https://catalystcoop-pudl.readthedocs.io/>`__ on Read The Docs.

@@ -167,9 +161,11 @@ contribute!
Licensing
---------

The PUDL software is released under the
`MIT License <https://opensource.org/licenses/MIT>`__.
`The PUDL documentation <https://catalystcoop-pudl.readthedocs.io>`__
and the data packages we distribute are released under the
`CC-BY-4.0 <https://creativecommons.org/licenses/by/4.0/>`__ license.

Contact Us
----------
22 changes: 12 additions & 10 deletions docs/clone_ferc1.rst
@@ -33,8 +33,8 @@ ETL process. This can be done with the ``ferc1_to_sqlite`` script (which is an
entrypoint into the :mod:`pudl.convert.ferc1_to_sqlite` module) which is
installed as part of the PUDL Python package. It takes its instructions from a
YAML file, an example of which is included in the ``settings`` directory in
your PUDL workspace. Once you've :ref:`created a datastore <datastore>` you can
try this example:

.. code-block:: console
@@ -49,14 +49,16 @@ factor of ~10 (to ~8 GB rather than 800 MB). If for some reason you need access
to those tables, you can create your own settings file and un-comment those
tables in the list of tables that it directs the script to load.

.. note::

   This script pulls *all* of the FERC Form 1 data into a *single* database,
   but FERC distributes a *separate* database for each year. Virtually all
   the database tables contain a ``report_year`` column that indicates which
   year they came from, preventing collisions between records in the merged
   multi-year database. One notable exception is the ``f1_respondent_id``
   table, which maps ``respondent_id`` to the names of the respondents. For
   that table, we have allowed the most recently reported record to take
   precedence, overwriting previous mappings if they exist.
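The "most recently reported record wins" rule amounts to merging the yearly mappings in ascending year order. Here is a sketch with invented IDs and names, treating each year's ``f1_respondent_id`` table as a simple dictionary:

```python
# Sketch of the precedence rule: later years overwrite earlier mappings.
# Respondent IDs and names below are made up for illustration.
respondents_by_year = {
    2015: {1: "Old Utility Name", 2: "Second Utility"},
    2017: {1: "New Utility Name", 3: "Third Utility"},
}

merged = {}
for year in sorted(respondents_by_year):      # ascending, so later years...
    merged.update(respondents_by_year[year])  # ...overwrite earlier records

print(merged[1])  # New Utility Name -- the 2017 record took precedence
```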

Sadly, the FERC Form 1 database is not particularly... relational. The only
foreign key relationships that exist map ``respondent_id`` fields in the
20 changes: 10 additions & 10 deletions docs/datapackages.rst
Expand Up @@ -76,6 +76,16 @@ the data packages to populate a local SQLite database.

`Open an issue on Github <https://github.com/catalyst-cooperative/pudl/issues>`__ and let us know if you have another example we can add.

SQLite
^^^^^^

If you want to access the data via SQL, we have provided a script that loads
a bundle of data packages into a local :mod:`sqlite3` database, e.g.:

.. code-block:: console

   $ datapkg_to_sqlite --pkg_bundle_name pudl-example
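Conceptually, that load step uses the package's schema to create each table and then inserts the CSV rows. A minimal sketch with the standard library (hypothetical field names and data; the real ``datapkg_to_sqlite`` script handles many tables, types, and relationships):

```python
import csv
import io
import sqlite3

# Create a table from a (simplified) schema, then load CSV rows into it.
# Field names and data are hypothetical, for illustration only.
fields = [("plant_id", "INTEGER"), ("plant_name", "TEXT")]
csv_text = "plant_id,plant_name\n1,Example Station\n2,Other Plant\n"

conn = sqlite3.connect(":memory:")
columns = ", ".join(f"{name} {sqltype}" for name, sqltype in fields)
conn.execute(f"CREATE TABLE plants ({columns})")

reader = csv.DictReader(io.StringIO(csv_text))
conn.executemany(
    "INSERT INTO plants VALUES (?, ?)",
    [(int(row["plant_id"]), row["plant_name"]) for row in reader],
)

n_rows = conn.execute("SELECT COUNT(*) FROM plants").fetchone()[0]
print(n_rows)  # 2
```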
Python, Pandas, and Jupyter
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@@ -111,16 +121,6 @@ The R programming language
Get someone who uses R to give us an example here... maybe we can get
someone from OKFN to do it?

Microsoft Access / Excel
^^^^^^^^^^^^^^^^^^^^^^^^^

4 changes: 1 addition & 3 deletions docs/datastore.rst
@@ -45,9 +45,7 @@ For more detailed usage information, see:
The downloaded data will be used by the script to populate a datastore under
the ``data`` directory in your workspace, organized by data source, form, and
date::

   data/eia/form860/
   data/eia/form923/
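The layout is just nested directories keyed by source and form, so it can be sketched with ``pathlib``. Only the two EIA subdirectories are taken from the listing above; the FERC and EPA ones are assumptions for illustration, and everything is created under a throwaway temporary directory rather than a real workspace:

```python
import tempfile
from pathlib import Path

# Recreate a datastore-style layout under a temporary directory. The real
# datastore is managed by pudl_data; the ferc/epa subdirectories here are
# assumptions, included only to show the source/form nesting.
root = Path(tempfile.mkdtemp())
for sub in ["eia/form860", "eia/form923", "ferc/form1", "epa/cems"]:
    (root / "data" / sub).mkdir(parents=True)

layout = sorted(
    p.relative_to(root).as_posix()
    for p in (root / "data").rglob("*") if p.is_dir()
)
print(layout)
```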
6 changes: 1 addition & 5 deletions docs/dev_setup.rst
@@ -226,9 +226,6 @@ be hard for humans to catch but are easy for a computer.
managing and maintaining multi-language pre-commit hooks.


* Set up your editor / IDE to follow our code style guidelines.
* Run ``pudl_setup`` to create a local data management environment.

-------------------------------------------------------------------------------
Install and Validate the Data
-------------------------------------------------------------------------------
@@ -279,8 +276,7 @@ already downloaded datastore, you can point the tests at it with
$ tox -v -e travis -- --fast --pudl_in=AUTO
Additional details can be found in :ref:`testing`.

-------------------------------------------------------------------------------
Making a Pull Request
23 changes: 14 additions & 9 deletions docs/install.rst
@@ -68,6 +68,8 @@ with PUDL interactively:
You may also want to update your global ``conda`` settings:

.. code-block:: console

   $ conda config --add channels conda-forge
   $ conda config --set channel_priority strict
@@ -87,7 +89,7 @@ PUDL is also available via the official
:doc:`dev_setup` documentation.

In addition to making the :mod:`pudl` package available for import in Python,
installing ``catalystcoop.pudl`` provides the following command line tools:

* ``pudl_setup``
* ``pudl_data``
@@ -96,18 +98,20 @@ installing ``catalystcoop.pudl`` installs the following command line tools:
* ``datapkg_to_sqlite``
* ``epacems_to_parquet``

For information on how to use these scripts, each can be run with the
``--help`` option. ``ferc1_to_sqlite`` and ``pudl_etl`` are configured with
YAML files. Examples are provided with the ``catalystcoop.pudl`` package, and
deployed by running ``pudl_setup`` as described below. Additional information
about the settings files can be found in our documentation on
:ref:`settings_files`.

.. _install-workspace:

-------------------------------------------------------------------------------
Creating a Workspace
-------------------------------------------------------------------------------

PUDL needs to know where to store its big piles of inputs and outputs. It
also provides some example configuration files and
`Jupyter <https://jupyter.org>`__ notebooks. The ``pudl_setup`` script lets
PUDL know where all this stuff should go. We call this a "PUDL workspace":
@@ -120,7 +124,8 @@ Here <PUDL_DIR> is the path to the directory where you want PUDL to do its
business -- this is where the datastore will be located, and any outputs that
are generated will end up. The script will also put a configuration file in
your home directory, called ``.pudl.yml`` that records the location of this
workspace and uses it by default in the future. If you run ``pudl_setup`` with
no arguments, it assumes you want to use the current directory.
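The ``.pudl.yml`` file is just a small key-value record of where the workspace lives. As a sketch, we can write a plausible version of it (the ``pudl_out`` key name is an assumption; ``pudl_in`` appears elsewhere in these docs) and read it back with a minimal ``key: value`` parser, avoiding a PyYAML dependency in the example:

```python
import tempfile
from pathlib import Path

# Write a hypothetical .pudl.yml into a temporary directory and parse it.
# The key names are assumptions about the file's shape, for illustration.
cfg_path = Path(tempfile.mkdtemp()) / ".pudl.yml"
cfg_path.write_text(
    "pudl_in: /home/user/pudl-work\npudl_out: /home/user/pudl-work\n"
)

settings = {}
for line in cfg_path.read_text().splitlines():
    key, _, value = line.partition(":")  # minimal "key: value" reader
    settings[key.strip()] = value.strip()

print(settings["pudl_in"])  # /home/user/pudl-work
```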

The workspace is laid out like this:

@@ -164,9 +169,9 @@ run:

.. code-block:: console

   $ conda env create --name pudl --file environment.yml
You may want to periodically update PUDL and the packages it depends on
by running the following commands in the directory with ``environment.yml``
in it:

