more docs
joezuntz committed Jul 7, 2020
1 parent 43d128d commit d4da69e
Showing 2 changed files with 93 additions and 19 deletions.
65 changes: 46 additions & 19 deletions docs/config.rst
@@ -1,7 +1,9 @@
Pipeline YAML Files
===================

The list of which steps to run in a pipeline, the overall inputs for it, execution information, and directories for output, are all defined in a configuration file in the YAML format.
Two YAML-format configuration files are needed to run a pipeline.

The first describes which steps to run in a pipeline, the overall inputs for it, execution information, and directories for outputs. It is described on this page. It includes the path to the second file (see `Config`_ below), which is described in more depth on the page :ref:`config2`.

Here is an example, from ``test/test.yml``. The different pieces are described below.

@@ -61,26 +63,18 @@ Here is an example, from ``test/test.yml``. The different pieces are described
# Put the logs from the individual stages in this directory:
log_dir: ./test/logs
# These will be run before and after the pipeline respectively
pre_script: ""
post_script: ""
Launcher
--------

The ``launcher`` parameter should be a dictionary that configures the workflow manager used to launch the jobs.

The ``name`` item in the dictionary sets which launcher is used. These options are currently allowed: ``mini``, ``parsl``, and ``cwl``.

See the :ref:`launchers` page for information on these launchers, and the other options they take.


Site
----

The ``site`` parameter should be a dictionary that configures the machine on which you are running the pipeline.

The ``name`` item in the dictionary sets which site is used. These options are currently allowed: ``local``, ``cori-batch``, and ``cori-interactive``.

See the :ref:`sites` page for information on these sites, and the other options they take.

Modules
-------

The ``modules`` option, which is a string, consists of the names of python modules to import and search for pipeline stages (with spaces between each).

Each module is imported at the start of the pipeline. For a stage to be found, it should be imported somewhere in the chain of imports under ``__init__.py`` in one of the packages listed here. You can specify subpackages, like ``module.submodule`` in this list after ``module`` if you need to.

The ``python_paths`` option can be set to a single string or list of strings, and gives paths to add to python's ``sys.path`` before attempting the import above.
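
The behaviour of ``modules`` and ``python_paths`` can be sketched roughly as follows. This is an illustration of the description above, not the code ceci itself runs. The helper name ``import_stage_modules`` is hypothetical, and appending to ``sys.path`` (rather than inserting at the front) is an assumption.

.. code-block:: python

    import importlib
    import sys


    def import_stage_modules(modules, python_paths=None):
        """Import each module named in a space-separated string.

        Hypothetical helper: a sketch of the documented behaviour only.
        """
        # python_paths may be a single string or a list of strings
        if python_paths is None:
            python_paths = []
        elif isinstance(python_paths, str):
            python_paths = [python_paths]

        # Make the extra paths visible before attempting any import
        # (appending rather than prepending is an assumption)
        for path in python_paths:
            if path not in sys.path:
                sys.path.append(path)

        # Module names are separated by whitespace
        return [importlib.import_module(name) for name in modules.split()]


    # Standard-library modules stand in for real pipeline packages here
    imported = import_stage_modules("json xml.dom", python_paths="./my_stages")
    print([m.__name__ for m in imported])

Listing ``xml.dom`` after a top-level name mirrors the ``module module.submodule`` pattern mentioned above.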

Stages
------
@@ -105,6 +99,26 @@ Each dictionary represents one stage, and has these options, with the defaults a
``nprocess`` is the total number of processes (across all nodes, not per-node). Process-level parallelism is currently implemented only using MPI, but if you need other approaches please open an issue.


Launcher
--------

The ``launcher`` parameter should be a dictionary that configures the workflow manager used to launch the jobs.

The ``name`` item in the dictionary sets which launcher is used. These options are currently allowed: ``mini``, ``parsl``, and ``cwl``.

See the :ref:`launchers` page for information on these launchers, and the other options they take.


Site
----

The ``site`` parameter should be a dictionary that configures the machine on which you are running the pipeline.

The ``name`` item in the dictionary sets which site is used. These options are currently allowed: ``local``, ``cori-batch``, and ``cori-interactive``.

See the :ref:`sites` page for information on these sites, and the other options they take.
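
As a small illustration, the allowed ``name`` values listed above could be checked like this before a pipeline is launched. The function ``check_launcher_and_site`` is hypothetical and is not part of ceci. Treating a missing ``name`` as an error is also an assumption.

.. code-block:: python

    # Names taken from the Launcher and Site sections above
    ALLOWED_LAUNCHERS = ("mini", "parsl", "cwl")
    ALLOWED_SITES = ("local", "cori-batch", "cori-interactive")


    def check_launcher_and_site(pipeline_config):
        """Raise ValueError if a launcher or site name is not recognised.

        Hypothetical helper: pipeline_config is the parsed pipeline YAML
        file, represented as a plain dictionary.
        """
        checks = (("launcher", ALLOWED_LAUNCHERS), ("site", ALLOWED_SITES))
        for key, allowed in checks:
            # Each section is expected to be a dictionary with a "name" item
            name = pipeline_config.get(key, {}).get("name")
            if name not in allowed:
                raise ValueError(
                    f"{key} name {name!r} is not one of {', '.join(allowed)}"
                )


    check_launcher_and_site(
        {"launcher": {"name": "mini"}, "site": {"name": "local"}}
    )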


Inputs
------

@@ -117,7 +131,7 @@ Config

The parameter ``config`` is required, and should be set to a path to another input YAML config file.

That file should contain
See the :ref:`config2` page for what that file should contain.

Resume
------
@@ -133,4 +147,17 @@ Directories

The parameter ``output_dir`` is required, and should be set to a directory where all the outputs from the pipeline will be saved. If the directory does not exist it will be created.

If the resume parameter is set to True, then this is the directory that will be checked for existing outputs.

The parameter ``log_dir`` is required, and should be set to a directory where the printed output of the stages will be saved, in one file per stage.

Scripts
-------

Two parameters can be set to run additional scripts before or after the pipeline. You can use them to perform checks or process results.

Any executable specified by ``pre_script`` will be run before the pipeline. If it returns a non-zero status then the pipeline will not be run and an exception will be raised.

Any executable specified by ``post_script`` will be run after the pipeline, but only if the pipeline completes successfully. If the post_script returns a non-zero status then it will be returned as the ceci exit code, but no exception will be raised.

Both scripts are called with the same arguments as the original executable was called with.
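
The rules above can be summarised in a short sketch. It is not ceci's implementation: the function ``run_with_scripts`` and its ``run_pipeline`` callable are hypothetical, and the use of ``RuntimeError`` is an assumption, since the text only says that an exception is raised.

.. code-block:: python

    import subprocess


    def run_with_scripts(run_pipeline, pre_script="", post_script="", args=()):
        """Run optional scripts around a pipeline and return an exit code.

        Hypothetical sketch: run_pipeline is any callable that returns 0
        on success; args are the arguments the original executable was
        called with.
        """
        if pre_script:
            status = subprocess.call([pre_script, *args])
            if status != 0:
                # The pipeline is never started if the pre_script fails
                raise RuntimeError(f"pre_script failed with status {status}")

        status = run_pipeline()
        if status != 0:
            # The post_script only runs after a successful pipeline
            return status

        if post_script:
            # A non-zero status becomes the exit code; nothing is raised
            return subprocess.call([post_script, *args])
        return 0


    # With both scripts left empty only the pipeline callable runs
    print(run_with_scripts(lambda: 0))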
47 changes: 47 additions & 0 deletions docs/config2.rst
@@ -0,0 +1,47 @@
.. _config2:

Config YAML files
=================

The second configuration file needed for a pipeline configures the individual stages that the pipeline runs.

Each pipeline stage specifies any configuration options it can take as part of its class definition, in the ``config_options`` dictionary. This can either specify a default value for the config option, or if there is no sensible default, the type of the option expected (str, int, etc.).


Search sequence
---------------

The following places will be searched for config values:

- The command line (if you are running the stage stand-alone, not as part of a pipeline)
- The stage's section in this file
- The ``global`` section in this file
- Any default value specified in the ``config_options``

If no value is found and there is no default, an error is raised.
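
The search sequence can be sketched as below, assuming the list above is in priority order (highest first). The function ``find_config_value`` is hypothetical and is not ceci's API. Each source is represented as a plain dictionary, and ``ValueError`` is an assumed exception type.

.. code-block:: python

    def find_config_value(name, config_options, stage_section=None,
                          global_section=None, command_line=None):
        """Return the value of one option, searching each source in turn.

        Hypothetical sketch. config_options maps option names to either
        a default value or, when there is no sensible default, a type.
        """
        # Search the explicit sources first, highest priority first
        for source in (command_line, stage_section, global_section):
            if source and name in source:
                return source[name]

        # Fall back to the default declared by the stage, if any
        default = config_options[name]
        if isinstance(default, type):
            # A bare type such as int means "required, no default"
            raise ValueError(f"No value or default for option {name!r}")
        return default


    config_options = {"chunk_rows": 50000, "nz": int, "zmax": float}
    global_section = {"chunk_rows": 100000}
    stage_section = {"nz": 301}

    # chunk_rows comes from the global section, nz from the stage section
    print(find_config_value("chunk_rows", config_options,
                            stage_section, global_section))
    print(find_config_value("nz", config_options,
                            stage_section, global_section))

Asking for ``zmax`` with these inputs would raise the error, because no source supplies it and ``config_options`` gives only its type.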


Here's an example file:

.. code-block:: yaml

    global:
        chunk_rows: 100000
        pixelization: healpix
        nside: 512
        sparse: True
    TXGCRTwoCatalogInput:
        metacal_dir: /global/cscratch1/sd/desc/DC2/data/Run2.2i/dpdd/Run2.2i-t3828/metacal_table_summary
        photo_dir: /global/cscratch1/sd/desc/DC2/data/Run2.2i/dpdd/Run2.2i-t3828/object_table_summary
    TXIngestRedmagic:
        lens_zbin_edges: [0.1, 0.3, 0.5]
    PZPDFMLZ:
        nz: 301
        zmax: 3.0
    ...

The keys here are the names of pipeline stages. The ``global`` section can be searched by any stage.
