updates for database v2 late changes and assorted wordsmithing

pavlis committed Dec 15, 2023
1 parent aa3c477 commit bdc86fa
Showing 4 changed files with 81 additions and 42 deletions.
27 changes: 22 additions & 5 deletions docs/source/user_manual/CRUD_operations.rst
@@ -202,11 +202,11 @@ and most up to date usage:
"task". That means, writes are performed in parallel by ensemble.

Note when saving seismic data, the `save_data` method, by default,
returns only the `ObjectId` of the document saved. Returning
a copy of the data is an option.
`write_distributed_data` is more dogmatic and always only returns
a python list of `ObjectID`s.
The default was found to be important to avoid
memory faults that can happen when a workflow computation is initiated in
the standard way (e.g., in dask, calling the bag's "compute" method).
If the last step in the workflow is a save and the bag/RDD contains the
@@ -245,6 +245,23 @@ should recognize:
in the section on reading data.
#. The writers all have a `save_history` option to save the object-level history.
That data is stored in a separate collection called `history`.
#. Writers have a `mode` argument that must be one of "promiscuous",
   "cautious", or "pedantic". Readers also use this argument, but
   for writing it controls how strongly a schema is enforced on the output.
   The default ("promiscuous") does no schema enforcement at all.
   All modes, however, do dogmatically enforce one rule: any attribute
   key interpreted as having been loaded by normalization is erased before the save.
   In MsPASS normalization data are normally loaded by prepending
   the name of the collection to the attribute name. e.g. the latitude of
   a station ("lat" in the MsPASS schema) stored in the channel collection
   would be loaded with the key "channel_lat". Attributes with keys prefixed by
   one of the standard collection names ("site", "channel", and "source")
   will always be erased before the wf document is saved. When mode
   is set to "cautious" the writer will attempt to correct any type mismatches
   and log an error if any issues are detected. In "pedantic" mode any
   type mismatches will cause the datum to be killed before the save.
   "pedantic" is rarely advised for writing unless one is writing to
   files with a format that is dogmatic about attribute types
   (see the sketch immediately after this list).
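
For example, a serial save with type checking enabled might look like the
following. This is a minimal sketch only: `db` is assumed to be an already
constructed `Database` handle, `ts` a `TimeSeries`, and the keyword argument
names should be verified against the `save_data` docstring.

.. code-block:: python

   # minimal sketch; argument names follow the description above and
   # should be checked against the save_data docstring
   ts = db.save_data(
       ts,
       mode="cautious",      # attempt to fix type mismatches, log any errors
       return_data=True,     # ask for the datum back instead of only the ObjectId
   )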


Read
51 changes: 35 additions & 16 deletions docs/source/user_manual/development_strategies.rst
@@ -23,9 +23,7 @@ discussed in common tutorials found by searching with the keywords
#. Inserting matplotlib graphics to visualize data at an intermediate stage of
processing.
#. If you have a box that throws a mysterious exception that is not self-explanatory,
   the `%debug` magic command can be useful.
   To use this feature add a new code box after the cell with problems, put
   that command in the box, and push the jupyter run button. You will get
   an ipython prompt you can use to investigate variables defined where the
@@ -105,20 +103,41 @@ following:
resources easily found with a web search.
#. To test a python function with the mspass container, copy your python
code to a directory you mount with the appropriate docker or singularity run
incantation. The simplest way to do that is to just put your python
script in the same directory as your notebook. In that case, the
notebook code need only include a simple `import`. e.g. if you have
your code saved in a file `mymodule.py` and you want to use a function
in that module called `myfunction`, in your notebook you would just
enter this simple, fairly standard command:

.. code-block:: python

   from mymodule import myfunction

If `mymodule` is located in a different directory use the
docker "--mount" option or apptainer/singularity "-B" options to
"bind" that directory to the container. For example, suppose we have
module `mymodule.py` stored in a directory called `/home/myname/python`.
With docker this could be mounted on the standard container
with the following incantation:

.. code-block:: bash

   docker run --mount src=/home/myname/python,target=/mnt,type=bind -p 8888:8888 mspass/mspass

To make that module accessible with the same import command as above you
would need to change the python search path. For this example, you could
use this incantation:

.. code-block:: python

   import sys
   sys.path.append('/mnt')

#. Once you are finished testing you can do one of two things to make
   it a more durable feature. (a) Assimilate
   your module into mspass and submit
   your code as a pull request to the github site for mspass. If accepted it
   becomes part of mspass. (b) Build a custom docker container that
adds your software as an extension of the mspass container. The docker
documentation and the examples in the top level directory for the MsPASS
source code tree should get you started. It is beyond the scope of this
41 changes: 22 additions & 19 deletions docs/source/user_manual/normalization.rst
@@ -29,7 +29,7 @@ For small datasets these issues can be minor, but for very large
data sets we have found poorly designed normalization algorithms
can be a serious bottleneck to performance.
A key difference all users need to appreciate
is that with a relational database, a "join" is always a global operation done between all
tuples in two relations (tables or table subsets). In MongoDB
normalization is an atomic operation made one document (recall a document
is analogous to a tuple) at a time. Because all database operations are
@@ -49,7 +49,7 @@ tables are easily loaded into a Dataframe with one line of python code
(:code:`read_csv`). That abstraction is possible because a MongoDB "collection"
is just an alternative way to represent a table (relation).

Before proceeding it is important to give a pair of definitions we use repeatedly
in the text below. We define the :code:`normalizing` collection/table as the
smaller collection/table holding the repetitious data we aim to cross-reference.
In addition, when we use the term :code:`target of normalization`
@@ -93,7 +93,7 @@ be accomplished one of two ways:
Both approaches utilize the concept of a :code:`normalization operator`
we discuss in detail in this section. Readers familiar with relational
database concept may find it helpful to view a :code:`normalization operator`
as equivalent to the operation used to define a database join.

This section focuses on the first approach. The second is covered in
a later section below. The most common operators for normalization while
@@ -146,9 +146,9 @@ makes it inevitably slower than the comparable Id-based algorithm
:py:class:`ObjectIdMatcher <mspasspy.db.normalize.ObjectIdMatcher>`.
We suggest that unless you are absolutely certain of the
completeness of the :code:`channel` collection, you should use the
Id-based method discussed here for doing normalization while reading.

Because miniseed normalization is so fundamental to modern seismology data,
we created a special python function called
:py:func:`normalize_mseed <mspasspy.db.normalize.normalize_mseed>`.
It is used for defining :code:`channel_id`
@@ -233,13 +233,16 @@ The following does the same operation as above in parallel with dask
channel_matcher = MiniseedMatcher(db)
# loop over all wf_miniseed records
cursor = db.wf_miniseed.find({})
dataset = read_distributed_data(cursor,
    normalize=[channel_matcher],
    collection='wf_miniseed',
)
# processing steps as map operators follow
# normally terminate with a save
dataset.compute()
Reading ensembles with normalization is similar. The following is a
serial job that reads ensembles and normalizes the ensemble with data from
the source and channel collections. It assumes source_id was defined
previously.

@@ -258,7 +261,7 @@ previously.
sourceid_list = db.wf_miniseed.distinct("source_id")
for srcid in sourceid_list:
cursor = db.wf_miniseed.find({"source_id" : srcid})
ensemble = db.read_data(cursor,
normalize=[channel_matcher],
normalize_ensemble=[source_matcher])
# processing functions for ensembles to follow here
@@ -317,7 +320,7 @@ Next, the parallel version of the job immediately above:
channel_matcher = MiniseedMatcher(db)
# loop over all wf_miniseed records
cursor = db.wf_miniseed.find({})
dataset = read_distributed_data(cursor,collection="wf_miniseed")
dataset = dataset.map(normalize,channel_matcher)
# processing steps as map operators follow
# normally terminate with a save
@@ -523,7 +526,7 @@ different keys to access attributes stored in the database and
the equivalent keys used to access the same data in a workflow.
In addition, there is a type mismatch between a document/tuple/row
abstraction in a MongoDB document and the internal use by the matcher
class family. That is, pymongo represents a "document" as a
python dictionary while the matchers require posting the same data to
the MsPASS Metadata container to work more efficiently with the C++
code base that defines data objects.
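
The conversion the matchers do internally can be pictured with a small,
hypothetical sketch like the following; `db` is assumed to be an already
constructed `Database` handle, and the loop is only an illustration of the
idea, not the actual matcher implementation.

.. code-block:: python

   # illustration only: copy a pymongo document (a python dict) into
   # a MsPASS Metadata container using the same keys
   from mspasspy.ccore.utility import Metadata

   doc = db.site.find_one({})   # pymongo returns a python dictionary
   md = Metadata()
   for key, value in doc.items():
       md[key] = value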
@@ -555,7 +558,7 @@ is a hyperlink to the docstring for the class:
* - :py:class:`OriginTimeMatcher <mspasspy.db.normalize.OriginTimeMatcher>`
- match data with start time defined by event origin time

Note that currently all of these have database query versions that differ only
by having "DB" embedded in the class name
(e.g. the MongoDB version of :code:`EqualityMatcher` is :code:`EqualityDBMatcher`.)

@@ -581,8 +584,8 @@ idea is most clearly seen by a simple example.
attribute_list = ['_id','lat','lon','elev']
matcher = ObjectIdMatcher(db,collection="site",attributes_to_load=attribute_list)
# This says load the entire dataset presumed staged to MongoDB
cursor = db.wf_TimeSeries.find({}) #handle to entire data set
dataset = read_distributed_data(cursor,collection='wf_TimeSeries') # dataset returned is a bag
dataset = dataset.map(normalize,matcher)
# additional workflow elements and usually ending with a save would be here
dataset.compute()
@@ -591,10 +594,10 @@ This example loads receiver coordinate information from data that was assumed
previously loaded into MongoDB in the "site" collection. It assumes
matching can be done using the site collection ObjectId loaded with the
waveform data at read time with the key "site_id". i.e. this is an
inline version of what could also be accomplished by
calling :code:`read_distributed_data` with a matcher for site in the normalize list.

Key things this example demonstrates common to all in-line
normalization workflows are:

+ :code:`normalize` appears only as arg0 of a map operation (dask syntax -
@@ -783,7 +786,7 @@ We know of three solutions to that problem:
:py:class:`DictionaryCacheMatcher <mspasspy.db.normalize.DictionaryCacheMatcher>`,
and :py:class:`DataFrameCacheMatcher <mspasspy.db.normalize.DataFrameCacheMatcher>`).
One could also build directly on the base class, but we can think of no
example where that would be preferable to extending one of the intermediate
classes. The remainder of this section focuses only on some hints for
extending one of the intermediate classes.
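
As a preview of those hints, here is a minimal, untested sketch of a custom
matcher built on the database-driven intermediate class. The exact
`query_generator` signature and the convention of returning `None` when no
query can be built are assumptions that should be checked against the class
docstrings.

.. code-block:: python

   # minimal sketch only; verify the method signature against the
   # DatabaseMatcher docstrings before using anything like this
   from mspasspy.db.normalize import DatabaseMatcher

   class StationNameMatcher(DatabaseMatcher):
       """Hypothetical matcher that matches documents on the 'sta' key."""

       def query_generator(self, d):
           # build a pymongo query dictionary from a Metadata attribute
           if not d.is_defined("sta"):
               return None   # assumed convention for "no query possible"
           return {"sta": d["sta"]}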

@@ -862,7 +865,7 @@ intermediate classes you should use to build your custom matcher are:
- The :py:class:`DatabaseMatcher <mspasspy.db.normalize.DatabaseMatcher>`
requires implementing only one method called
:py:meth:`query_generator <mspasspy.db.normalize.DatabaseMatcher.query_generator>`.
That method needs to create a python dictionary in pymongo syntax that is to
be applied to the normalizing collection. That query would normally be
constructed from one or more Metadata attributes in a data object but
time queries may also want to use the data start time and endtime available
@@ -878,7 +881,7 @@ intermediate classes you should use to build your custom matcher are:
The other method,
:py:meth:`db_make_cache_id <mspasspy.db.normalize.DictionaryCacheMatcher.db_make_cache_id>`,
needs to do the same thing and create identical keys.
The difference between the two is that
:py:meth:`db_make_cache_id <mspasspy.db.normalize.DictionaryCacheMatcher.db_make_cache_id>`
is used as the data loader to create the dictionary-based cache while
:py:meth:`cache_id <mspasspy.db.normalize.DictionaryCacheMatcher.cache_id>`
4 changes: 2 additions & 2 deletions docs/source/user_manual/parallel_processing.rst
@@ -8,7 +8,7 @@ One of the primary goals of MsPASS was a framework to make
parallel processing of seismic data no more difficult than running
a typical python script on a desktop machine. In modern IT lingo
our goal was a "scalable" framework. The form of parallelism we
exploit is one of a large class of problems that can be reduced to
what is called a directed acyclic graph (DAG) in computer science.
Any book on "big data" will discuss this concept.
Chapter 1 of Daniel (2019) has a particularly useful description using
@@ -58,7 +58,7 @@ computer's memory. Spark refers to this abstraction as a
Resilient Distributed Dataset (RDD) while Dask calls the same thing a "bag".

In MsPASS the normal content of a bag/RDD is a dataset made up of *N*
MsPASS data objects: `TimeSeries`, `Seismogram`, or an ensemble of
either of the atomic types. An implicit assumption in the current
implementation is that any processing
was preceded by a data assembly and validation phase.
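
To make the bag abstraction concrete, here is a tiny, self-contained dask
sketch in which plain numbers stand in for MsPASS data objects; it illustrates
only the map/compute pattern, not an actual MsPASS workflow.

.. code-block:: python

   # toy illustration of the bag abstraction; in a real MsPASS job the
   # elements would be TimeSeries or Seismogram objects from a reader
   import dask.bag as dbg

   data = dbg.from_sequence([1.0, 2.0, 3.0, 4.0])  # stand-in for N data objects
   data = data.map(lambda x: 2.0 * x)              # a map operator
   print(data.compute())                           # lazy evaluation happens here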
