From bdc86facfac28c82f2f4e2c9476b03e68d322a79 Mon Sep 17 00:00:00 2001
From: Gary Pavlis <pavlis@indiana.edu>
Date: Fri, 15 Dec 2023 07:40:57 -0500
Subject: [PATCH] updates for database v2 late changes and assorted
 wordsmithing

---
 docs/source/user_manual/CRUD_operations.rst   | 27 ++++++++--
 .../user_manual/development_strategies.rst    | 51 +++++++++++++------
 docs/source/user_manual/normalization.rst     | 41 ++++++++-------
 .../user_manual/parallel_processing.rst       |  4 +-
 4 files changed, 81 insertions(+), 42 deletions(-)

diff --git a/docs/source/user_manual/CRUD_operations.rst b/docs/source/user_manual/CRUD_operations.rst
index ab1727493..115437c5e 100644
--- a/docs/source/user_manual/CRUD_operations.rst
+++ b/docs/source/user_manual/CRUD_operations.rst
@@ -202,11 +202,11 @@ and most up to date usage:
     "task".  That means, writes are performed in parallel by ensemble.
 
 Note when saving seismic data, the `save_data` method, by default,
-returns only the `ObjectId` of the document saved.  Similarly, by
-default `write_distributed_data` returns a bag/RDD of `ObjectID`s.
-Both have an option to return a copy of the data saved to allow their
-use for an intermediate save during a workflow, but be warned that is not
-the default.  The default was found to be important to avoid
+returns only the `ObjectId` of the document saved.  Returning
+a copy of the data is an option.
+`write_distributed_data` is more dogmatic:  it always returns only
+a python list of `ObjectId`s.
+Returning only ids was found to be important to avoid
 memory faults that can happen when a workflow computation is initiated in
 the standard way (e.g. in dask calling the bag "compute" method.).
 If the last step in the workflow is a save and the bag/RDD contains the
@@ -245,6 +245,23 @@ should recognize:
     in the section on reading data.
 #.  The writers all have a `save_history` to save the object-level history.
     That data is stored in a separate collection called `history`.
+#.  Writers have a `mode` argument that must be one of "promiscuous",
+    "cautious", or "pedantic".   Readers also use this argument, but for
+    writing it controls how strictly the schema is enforced on the output.
+    The default ("promiscuous") does no schema enforcement at all.
+    All modes, however, dogmatically enforce one rule:  any attribute
+    key recognized as having been loaded by normalization is erased before
+    the save.  In MsPASS normalization data are normally loaded by prepending
+    the name of the collection to the attribute key.  e.g. the latitude of
+    a station ("lat" in the MsPASS schema) stored in the channel collection
+    would be loaded with the key "channel_lat".   Attributes whose keys begin
+    with one of the standard collection names ("site", "channel", and "source")
+    will always be erased before the wf document is saved.  When mode
+    is set to "cautious" the writer will attempt to correct any type mismatches
+    and log an error if any issues are detected.  In "pedantic" mode any
+    type mismatches will cause the datum to be killed before the save.
+    "pedantic" is rarely advised for writing unless one is writing to a
+    file format that is dogmatic about attribute types.  A simple example
+    of using the `mode` argument is shown after this list.
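+
+For illustration, a save of a single datum in "cautious" mode would look
+something like the following sketch, where `d` is assumed to be a
+`TimeSeries` created earlier in a workflow:
+
+.. code-block:: python
+
+    # "cautious" enforces the schema but tries to repair type mismatches,
+    # logging an error for anything it changes or cannot fix
+    db.save_data(d, mode="cautious")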
 
 
 Read
diff --git a/docs/source/user_manual/development_strategies.rst b/docs/source/user_manual/development_strategies.rst
index adfeff023..a01ce0208 100644
--- a/docs/source/user_manual/development_strategies.rst
+++ b/docs/source/user_manual/development_strategies.rst
@@ -23,9 +23,7 @@ discussed in common tutorials found by searching with the keywords
 #. Inserting matplotlib graphics to visualize data at an intermediate stage of
    processing.
 #. If you have a box that throws a mysterious exception that is not self-explanatory
-   the `%debug` magic command can be useful.   After the box with the
-   exception insert a code box, add the magic command, and execute push
-   the run button for the new cell.
+   the `%debug` magic command can be useful.
    To use this feature add a new code box after the cell with problems, put
    that command in the box, and push the jupyter run button.  You will get
    an ipython prompt you can use to investigate variables defined where the
@@ -105,20 +103,41 @@ following:
     resources easily found with a web search.
 #.  To test a python function with the mspass container, copy your python
     code to a directory you mount with the appropriate docker or singularity run
-    incantation.  When the container is running launch a terminal window with
-    the jupyter interface and copy the python code to an appropriate
-    directory in the mspass installation file tree in the container file
-    system (currently `/opt/conda/lib/python3.10/site-packages/mspasspy`
-    but not the `python3.10` could change as new python versions appear.)
-    Then you can import your module as within the container using
-    the standard import syntax with the top level, in this case,
-    being defined as `mspasspy`.  To be more concrete if you have a new
-    python module file called `myfancystuff.py` you copy into the `algorithms`
-    directory you could expose a test script in a notebook to the new module with
-    something like this:  `import mspasspy.algorithms.myfancystuff as mystuff`.
-#.  Once you are finished testing you can do one of two things. (a) submit
+    incantation.  The simplest way to do that is to just put your python
+    script in the same directory as your notebook.   In that case, the
+    notebook code need only include a simple `import`.   e.g. if you have
+    your code saved in a file `mymodule.py` and you want to use a function
+    in that module called `myfunction`, in your notebook you would just
+    enter this simple, fairly standard command:
+
+    .. code-block:: python
+
+      from mymodule import myfunction
+
+    If `mymodule` is located in a different directory use the
+    docker "--mount" option or apptainer/singularity "-B" options to
+    "bind" that directory to the container.   For example, suppose we have
+    module `mymodule.py` stored in a directory called `/home/myname/python`.
+    With docker this could be mounted on the standard container
+    with the following incantation:
+
+    .. code-block:: bash
+
+      docker run --mount src=/home/myname/python,target=/mnt,type=bind -p 8888:8888 mspass/mspass
+
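+    With apptainer or singularity the comparable command uses the "-B" bind
+    option.  The sketch below assumes the container was pulled to a local
+    image file named `mspass_latest.sif`:
+
+    .. code-block:: bash
+
+      apptainer run -B /home/myname/python:/mnt mspass_latest.sif
+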
+    To make that module accessible with the same import command as above you
+    would need to change the python search path.  For this example, you could
+    use this incantation:
+
+    .. code-block:: python
+
+      import sys
+      sys.path.append('/mnt')
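+      # with the path extended the same import as above now works
+      from mymodule import myfunction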
+
+#.  Once you are finished testing you can do one of two things to make
+    it a more durable feature. (a) Assimilate
+    your module into mspass and submit
     you code as a pull request to the github site for mspass.   If accepted it
-    becomes part of mspass.  (b) build a custom docker container that
+    becomes part of mspass.  (b) Build a custom docker container that
     adds your software as an extension of the mspass container.  The docker
     documentation and the examples in the top level directory for the MsPASS
     source code tree should get you started.  It is beyond the scope of this
diff --git a/docs/source/user_manual/normalization.rst b/docs/source/user_manual/normalization.rst
index e65f9d70d..4e7094d26 100644
--- a/docs/source/user_manual/normalization.rst
+++ b/docs/source/user_manual/normalization.rst
@@ -29,7 +29,7 @@ For small datasets these issues can be minor, but for very large
 data sets we have found poorly designed normalization algorithms
 can be a serious bottleneck to performance.
 A key difference all users need to appreciate
-is that with a relational database a "join" is always a global operation done between all
+is that with a relational database, a "join" is always a global operation done between all
 tuples in two relations (tables or table subsets).  In MongoDB
 normalization is an atomic operation made one document (recall a document
 is analogous to a tuple) at a time.  Because all database operations are
@@ -49,7 +49,7 @@ tables are easily loaded into a Dataframe with one line of python code
 (:code:`read_csv`).  That abstraction is possible because a MongoDB "collection"
 is just an alternative way to represent a table (relation).
 
-Before proceeding it is important to give a pair of definitions we used repeatedly
+Before proceeding it is important to give a pair of definitions we use repeatedly
 in the text below.   We define the :code:`normalizing` collection/table as the
 smaller collection/table holding the repetitious data we aim to cross-reference.
 In addition, when we use the term :code:`target of normalization`
@@ -93,7 +93,7 @@ be accomplished one of two ways:
 Both approaches utilize the concept of a :code:`normalization operator`
 we discuss in detail in this section.  Readers familiar with relational
 database concept may find it helpful to view a :code:`normalization operator`
-as equivalent the operation used to define a database join.
+as equivalent to the operation used to define a database join.
 
 This section focuses on the first approach.   The second is covered in
 a later section below. The most common operators for normalization while
@@ -146,9 +146,9 @@ makes it inevitably slower than the comparable Id-based algorithm
 `py:class:<mspasspy.db.normalize.ObjectIdMatcher>`.
 We suggest that unless you are absolutely certain of the
 completeness of the :code:`channel` collection, you should use the
-Id-based method discussed here for doing normalization while readng.
+Id-based method discussed here for doing normalization while reading.
 
-Because miniseed normalization is so fundamental to modern data
+Because miniseed normalization is so fundamental to handling modern seismology data,
 we created a special python function called
 :py:func:`normalize_mseed <mspasspy.db.normalize.normalize_mseed>`.
 It is used for defining :code:`channel_id`
@@ -233,13 +233,16 @@ The following does the same operation as above in parallel with dask
   channel_matcher = MiniseedMatcher(db)
   # loop over all wf_miniseed records
   cursor = db.wf_miniseed.find({})
-  dataset = read_distributed_data(db,normalize=[channel_matcher])
+  dataset = read_distributed_data(cursor,
+                   normalize=[channel_matcher],
+                   collection='wf_miniseed',
+                )
   # porocessing steps as map operators follow
   # normally terminate with a save
   dataset.compute()
 
 Reading ensembles with normalization is similar.   The following is a
-serial job that reads ensembles and normalizes each ensemble with data from
+serial job that reads ensembles and normalizes the ensemble with data from
 the source and channel collections.  It assumes source_id was defined
 previously.
 
@@ -258,7 +261,7 @@ previously.
   sourceid_list = db.wf_miniseed.distinct("source_id")
   for srcid in sourceid_list:
     cursor = db.wf_miniseed.find({"source_id" : srcid})
-    ensemble = db.read_ensemble_data(cursor,
+    ensemble = db.read_data(cursor,
        normalize=[channel_matcher],
        normalize_ensemble=[source_matcher])
     # processing functions for ensembles to follow here
@@ -317,7 +320,7 @@ Next, the parallel version of the job immediately above:
   channel_matcher = MiniseedMatcher(db)
   # loop over all wf_miniseed records
   cursor = db.wf_miniseed.find({})
-  dataset = read_distributed_data(db,collection="wf_miniseed")
+  dataset = read_distributed_data(cursor,collection="wf_miniseed")
   dataset = dataset.map(normalize,channel_matcher)
   # processing steps as map operators follow
   # normally terminate with a save
@@ -523,7 +526,7 @@ different keys to access attributes stored in the database and
 the equivalent keys used to access the same data in a workflow.
 In addition, there is a type mismatch between a document/tuple/row
 abstraction in a MongoDB document and the internal use by the matcher
-class family.  That is, pymongo treats represents a "document" as a
+class family.  That is, pymongo represents a "document" as a
 python dictionary while the matchers require posting the same data to
 the MsPASS Metadata container to work more efficiently with the C++
 code base that defines data objects.
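+
+The key naming issue is easily illustrated.   The sketch below assumes a
+`TimeSeries` called `d` that was already normalized with the channel
+collection;  the net and sta values are only hypothetical:
+
+.. code-block:: python
+
+  # pymongo returns a document as a plain python dictionary
+  doc = db.channel.find_one({"net": "IU", "sta": "ANMO"})
+  print(doc["lat"])   # key used to store the latitude in the channel collection
+  # after normalization the same value is carried by the data object
+  # under the collection-prefixed key
+  print(d["channel_lat"])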
@@ -555,7 +558,7 @@ is a hyperlink to the docstring for the class:
    * - :py:class:`OriginTimeMatcher <mspasspy.db.normalize.OriginTimeMatcher>`
      - match data with start time defined by event origin time
 
-Noting currently all of these have database query versions that differ only
+Note that currently all of these have database query versions that differ only
 by have "DB" embedded in the class name
 (e.g. the MongoDB version of :code:`EqualityMatcher` is :code:`EqualityDBMatcher`.)
 
@@ -581,8 +584,8 @@ idea is most clearly seen by a simple example.
   attribute_list = ['_id','lat','lon','elev']
   matcher = ObjectIdMatcher(db,collection="site",attributes_to_load=attribute_list)
   # This says load the entire dataset presumed staged to MongoDB
-  cursor = db.wf_miniseed.find({})   #handle to entire data set
-  dataset = read_distributed_data(cursor)  # dataset returned is a bag
+  cursor = db.wf_TimeSeries.find({})   #handle to entire data set
+  dataset = read_distributed_data(cursor,collection='wf_TimeSeries')  # dataset returned is a bag
   dataset = dataset.map(normalize,matcher)
   # additional workflow elements and usually ending with a save would be here
   dataset.compute()
@@ -591,10 +594,10 @@ This example loads receiver coordinate information from data that was assumed
 previously loaded into MongoDB in the "site" collection.  It assumes
 matching can be done using the site collection ObjectId loaded with the
 waveform data at read time with the key "site_id".   i.e. this is an
-inline version of what could also be accomplished (more slowly) by
-calling :code:`read_distribute_data` with "site" in the normalize list.
+inline version of what could also be accomplished by
+calling :code:`read_distributed_data` with a matcher for site in the normalize list.
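+
+For comparison, the normalize-while-reading form of the same example would
+be something like this sketch, reusing the `matcher` constructed above:
+
+.. code-block:: python
+
+  cursor = db.wf_TimeSeries.find({})
+  dataset = read_distributed_data(cursor,
+                   normalize=[matcher],
+                   collection='wf_TimeSeries',
+                )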
 
-Key things this example demonstrates in common to all in-line
+Key things the in-line example above demonstrates that are common to all in-line
 normalization workflows are:
 
 +  :code:`normalize` appears only as arg0 of a map operation (dask syntax -
@@ -783,7 +786,7 @@ We know of three solutions to that problem:
     :py:class:`DictionaryCacheMatcher <mspasspy.db.normalize.DictionaryCacheMatcher>`,
     and :py:class:`DataFrameCacheMatcher <mspasspy.db.normalize.DataFrameCacheMatcher>`).
     One could also build directly on the base class, but we can think of no
-    example where would be preferable to extending one of the intermediate
+    example where that would be preferable to extending one of the intermediate
     classes.  The remainder of this section focuses only on some hints for
     extending one of the intermediate classes.
 
@@ -862,7 +865,7 @@ intermediate classes you should use to build your custom matcher are:
 -  The :py:class:`DatabaseMatcher <mspasspy.db.normalize.DatabaseMatcher>`
    requires implementing only one method called
    :py:meth:`query_generator <mspasspy.db.normalize.DatabaseMatcher.query_generator>`.
-   Tha method needs to create a python dictionary in pymongo syntax that is to
+   That method needs to create a python dictionary in pymongo syntax that is to
    be applied to the normalizing collection.  That query would normally be
    constructed from one or more Metadata attributes in a data object but
    time queries may also want to use the data start time and endtime available
@@ -878,7 +881,7 @@ intermediate classes you should use to build your custom matcher are:
    The other method,
    :py:meth:`db_make_cache_id <mspasspy.db.normalize.DictionaryCacheMatcher.db_make_cache_id>`,
    needs to do the same thing and create identical keys.
-   The difference being that
+   The difference between the two is that
    :py:meth:`db_make_cache_id <mspasspy.db.normalize.DictionaryCacheMatcher.db_make_cache_id>`
    is used as the data loader to create the dictionary-based cache while
    :py:meth:`cache_id <mspasspy.db.normalize.DictionaryCacheMatcher.cache_id>`
diff --git a/docs/source/user_manual/parallel_processing.rst b/docs/source/user_manual/parallel_processing.rst
index b4859d9e8..9cf67485c 100644
--- a/docs/source/user_manual/parallel_processing.rst
+++ b/docs/source/user_manual/parallel_processing.rst
@@ -8,7 +8,7 @@ One of the primary goals of MsPASS was a framework to make
 parallel processing of seismic data no more difficult than running
 a typical python script on a desktop machine.   In modern IT lingo
 our goals was a "scalable" framework.  The form of parallelism we
-exploit is a one of a large class of problems that can reduced to
+exploit is one of a large class of problems that can be reduced to
 what is called a directed cyclic graph (DAG) in computer science.
 Any book on "big data" will discuss this concept.
 Chapter 1 of Daniel (2019) has a particularly useful description using
@@ -58,7 +58,7 @@ computer's memory.   Spark refers to this abstraction as a
 Resiliant Distributed Dataset (RDD) while Dask calls the same thing a "bag".
 
 In MsPASS the normal content of a bag/RDD is a dataset made up of *N*
-MsPASS data objects:  TimeSeries, Seismogram, or one of the ensemble of
+MsPASS data objects:  `TimeSeries`, `Seismogram`, or an ensemble of
 either of the atomic types.   An implicit assumption in the current
 implementation is that any processing
 was proceeded by a data assembly and validation phase.