From bdc86facfac28c82f2f4e2c9476b03e68d322a79 Mon Sep 17 00:00:00 2001
From: Gary Pavlis <pavlis@indiana.edu>
Date: Fri, 15 Dec 2023 07:40:57 -0500
Subject: [PATCH] updates for database v2 late changes and assorted
 wordsmithing

---
 docs/source/user_manual/CRUD_operations.rst    | 27 ++++++++--
 .../user_manual/development_strategies.rst     | 51 +++++++++++++------
 docs/source/user_manual/normalization.rst      | 41 ++++++++-------
 .../user_manual/parallel_processing.rst        |  4 +-
 4 files changed, 81 insertions(+), 42 deletions(-)

diff --git a/docs/source/user_manual/CRUD_operations.rst b/docs/source/user_manual/CRUD_operations.rst
index ab1727493..115437c5e 100644
--- a/docs/source/user_manual/CRUD_operations.rst
+++ b/docs/source/user_manual/CRUD_operations.rst
@@ -202,11 +202,11 @@ and most up to date usage:
 "task".  That means, writes are performed in parallel by ensemble.

 Note when saving seismic data, the `save_data` method, by default,
-returns only the `ObjectId` of the document saved.  Similarly, by
-default `write_distributed_data` returns a bag/RDD of `ObjectID`s.
-Both have an option to return a copy of the data saved to allow their
-use for an intermediate save during a workflow, but be warned that is not
-the default.  The default was found to be important to avoid
+returns only the `ObjectId` of the document saved.  Returning
+a copy of the data is an option.
+`write_distributed_data` is more dogmatic and always returns only
+a python list of `ObjectId`s.
+The default was found to be important to avoid
 memory faults that can happen when a workflow computation is initiated
 in the standard way (e.g. in dask calling the bag "compute" method.).
 If the last step in the workflow is a save and the bag/RDD contains the
@@ -245,6 +245,23 @@ should recognize:
    in the section on reading data.
 #. The writers all have a `save_history` to save the object-level history.
    That data is stored in a separate collection called `history`.
+#. Writers have a `mode` argument that must be one of "promiscuous",
+   "cautious", or "pedantic".  Readers also use this argument, but
+   for writing it controls how strictly the schema is enforced on the output.
+   The default ("promiscuous") does no schema enforcement at all.
+   All modes, however, do dogmatically enforce one rule.  Any attribute
+   key recognized as loaded by normalization is erased before the save.
+   In MsPASS, normalization data are normally loaded by prepending
+   the name of the collection to the attribute key.  e.g. the latitude of
+   a station ("lat" in the MsPASS schema) stored in the channel collection
+   would be loaded with the key "channel_lat".  Attributes whose keys begin with
+   one of the standard collection names ("site", "channel", and "source")
+   will always be erased before the wf document is saved.  When mode
+   is set to "cautious" the writer will attempt to correct any type mismatches
+   and log an error if any issues are detected.  In "pedantic" mode any
+   type mismatches will cause the datum to be killed before the save.
+   "pedantic" is rarely advised for writing unless one is writing to a
+   file format that is dogmatic about attribute types.

 Read

diff --git a/docs/source/user_manual/development_strategies.rst b/docs/source/user_manual/development_strategies.rst
index adfeff023..a01ce0208 100644
--- a/docs/source/user_manual/development_strategies.rst
+++ b/docs/source/user_manual/development_strategies.rst
@@ -23,9 +23,7 @@ discussed in common tutorials found by searching with the keywords
 #. Inserting matplotlib graphics to visualize data at an
    intermediate stage of processing.
 #. If you have a box that throws a mysterious exception that is not
    self-explanatory
-   the `%debug` magic command can be useful.  After the box with the
-   exception insert a code box, add the magic command, and execute push
-   the run button for the new cell.
+   the `%debug` magic command can be useful.  To use this feature, add a new code box after the cell with problems, put that command in the box, and push the jupyter run button.
    You will get an ipython prompt you can use to
    investigate variables defined where the
@@ -105,20 +103,41 @@ following:
    resources easily found with a web search.
 #. To test a python function with the mspass container, copy your python
    code to a directory you mount with the appropriate docker or singularity run
-   incantation.  When the container is running launch a terminal window with
-   the jupyter interface and copy the python code to an appropriate
-   directory in the mspass installation file tree in the container file
-   system (currently `/opt/conda/lib/python3.10/site-packages/mspasspy`
-   but not the `python3.10` could change as new python versions appear.)
-   Then you can import your module as within the container using
-   the standard import syntax with the top level, in this case,
-   being defined as `mspasspy`.  To be more concrete if you have a new
-   python module file called `myfancystuff.py` you copy into the `algorithms`
-   directory you could expose a test script in a notebook to the new module with
-   something like this:  `import mspasspy.algorithms.myfancystuff as mystuff`.
-#. Once you are finished testing you can do one of two things. (a) submit
+   incantation.  The simplest way to do that is to just put your python
+   script in the same directory as your notebook.  In that case, the
+   notebook code need only include a simple `import`.  e.g. if you have
+   your code saved in a file `mymodule.py` and you want to use a function
+   in that module called `myfunction`, in your notebook you would just
+   enter this simple, fairly standard command:
+
+   .. code-block:: python
+
+      from mymodule import myfunction
+
+   If `mymodule` is located in a different directory, use the
+   docker "--mount" option or apptainer/singularity "-B" options to
+   "bind" that directory to the container.  For example, suppose we have
+   module `mymodule.py` stored in a directory called `/home/myname/python`.
+   With docker this could be mounted on the standard container
+   with the following incantation:
+
+   .. code-block:: bash
+
+      docker run --mount src=/home/myname/python,target=/mnt,type=bind -p 8888:8888 mspass/mspass
+
+   To make that module accessible with the same import command as above you
+   would need to change the python search path.  For this example, you could
+   use this incantation:
+
+   .. code-block:: python
+
+      import sys
+      sys.path.append('/mnt')
+
+#. Once you are finished testing you can do one of two things to make
+   it a more durable feature.  (a) Assimilate
+   your module into mspass and submit
    you code as a pull request to the github site for mspass. If accepted it
-   becomes part of mspass. (b) build a custom docker container that
+   becomes part of mspass. (b) Build a custom docker container that
    adds your software as an extension of the mspass container. The docker
    documentation and the examples in the top level directory for the MsPASS
    source code tree should get you started.
 It is beyond the scope of this

diff --git a/docs/source/user_manual/normalization.rst b/docs/source/user_manual/normalization.rst
index e65f9d70d..4e7094d26 100644
--- a/docs/source/user_manual/normalization.rst
+++ b/docs/source/user_manual/normalization.rst
@@ -29,7 +29,7 @@ For small datasets these issues can be minor, but
 for very large data sets we have found poorly designed normalization
 algorithms can be a serious bottleneck to performance.
 A key difference all users need to appreciate
-is that with a relational database a "join" is always a global operation done between all
+is that with a relational database, a "join" is always a global operation done between all
 tuples in two relations (tables or table subsets).  In MongoDB
 normalization is an atomic operation made one document (recall a
 document is analogous to a tuple) at a time.  Because all database operations are
@@ -49,7 +49,7 @@ tables are easily loaded into a Dataframe with one line
 of python code (:code:`read_csv`).  That abstraction is possible
 because a MongoDB "collection" is just an alternative way to represent a table (relation).

-Before proceeding it is important to give a pair of definitions we used repeatedly
+Before proceeding, it is important to give a pair of definitions we use repeatedly
 in the text below.  We define the :code:`normalizing` collection/table
 as the smaller collection/table holding the repetitious data we aim to cross-reference.
 In addition, when we use the term :code:`target of normalization`
@@ -93,7 +93,7 @@ be accomplished one of two ways:
 Both approaches utilize the concept of a :code:`normalization operator` we
 discuss in detail in this section.  Readers familiar with relational
 database concept may find it helpful to view a :code:`normalization operator`
-as equivalent the operation used to define a database join.
+as equivalent to the operation used to define a database join.
 This section focuses on the first approach.  The second is covered
 in a later section below.  The most common operators for normalization while
@@ -146,9 +146,9 @@ makes it inevitably slower than the comparable Id-based algorithm
 `py:class:<mspasspy.db.normalize.ObjectIdMatcher>`.
 We suggest that unless you are absolutely certain of the completeness
 of the :code:`channel` collection, you should use the
-Id-based method discussed here for doing normalization while readng.
+Id-based method discussed here for doing normalization while reading.

-Because miniseed normalization is so fundamental to modern data
+Because miniseed normalization is so fundamental to working with modern seismology data,
 we created a special python function called
 :py:func:`normalize_mseed <mspasspy.db.normalize.normalize_mseed>`.
 It is used for defining :code:`channel_id`
@@ -233,13 +236,16 @@ The following does the same operation as above in parallel with dask
     channel_matcher = MiniseedMatcher(db)
     # loop over all wf_miniseed records
     cursor = db.wf_miniseed.find({})
-    dataset = read_distributed_data(db,normalize=[channel_matcher])
+    dataset = read_distributed_data(cursor,
+                         normalize=[channel_matcher],
+                         collection='wf_miniseed',
+                        )
     # porocessing steps as map operators follow
     # normally terminate with a save
     dataset.compute()

 Reading ensembles with normalization is similar.  The following is a
-serial job that reads ensembles and normalizes each ensemble with data from
+serial job that reads ensembles and normalizes the ensemble with data from
 the source and channel collections.  It assumes source_id was defined
 previously.

@@ -258,7 +261,7 @@ previously.
     sourceid_list = db.wf_miniseed.distinct("source_id")
     for srcid in sourceid_list:
         cursor = db.wf_miniseed.find({"source_id" : srcid})
-        ensemble = db.read_ensemble_data(cursor,
+        ensemble = db.read_data(cursor,
                        normalize=[channel_matcher],
                        normalize_ensemble=[source_matcher])
         # processing functions for ensembles to follow here
@@ -317,7 +320,7 @@ Next, the parallel version of the job immediately above:
     channel_matcher = MiniseedMatcher(db)
     # loop over all wf_miniseed records
     cursor = db.wf_miniseed.find({})
-    dataset = read_distributed_data(db,collection="wf_miniseed")
+    dataset = read_distributed_data(cursor,collection="wf_miniseed")
     dataset = dataset.map(normalize,channel_matcher)
     # processing steps as map operators follow
     # normally terminate with a save
@@ -523,7 +526,7 @@ different keys to access attributes stored in the database and the
 equivalent keys used to access the same data in a workflow.
 In addition, there is a type mismatch between a document/tuple/row
 abstraction in a MongoDB document and the internal use by the matcher
-class family.  That is, pymongo treats represents a "document" as a
+class family.  That is, pymongo represents a "document" as a
 python dictionary while the matchers require posting the same data
 to the MsPASS Metadata container to work more efficiently with
 the C++ code base that defines data objects.
@@ -555,7 +558,7 @@ is a hyperlink to the docstring for the class:
    * - :py:class:`OriginTimeMatcher <mspasspy.db.normalize.OriginTimeMatcher>`
      - match data with start time defined by event origin time

-Noting currently all of these have database query versions that differ only
+Note that currently all of these have database query versions that differ only
 by have "DB" embedded in the class name
 (e.g. the MongoDB version of :code:`EqualityMatcher` is :code:`EqualityDBMatcher`.)
@@ -581,8 +584,8 @@ idea is most clearly seen by a simple example.
     attribute_list = ['_id','lat','lon','elev']
     matcher = ObjectIdMatcher(db,collection="site",attributes_to_load=attribute_list)
     # This says load the entire dataset presumed staged to MongoDB
-    cursor = db.wf_miniseed.find({})  #handle to entire data set
-    dataset = read_distributed_data(cursor)  # dataset returned is a bag
+    cursor = db.wf_TimeSeries.find({})  #handle to entire data set
+    dataset = read_distributed_data(cursor,collection='wf_TimeSeries')  # dataset returned is a bag
     dataset = dataset.map(normalize,matcher)
     # additional workflow elements and usually ending with a save would be here
     dataset.compute()

 This example loads receiver coordinate information from data that was
 assumed previously loaded into MongoDB in the "site" collection.
 It assumes matching can be done using the site collection ObjectId
 loaded with the waveform data at read time with the key "site_id".  i.e. this is an
-inline version of what could also be accomplished (more slowly) by
-calling :code:`read_distribute_data` with "site" in the normalize list.
+inline version of what could also be accomplished by
+calling :code:`read_distributed_data` with a matcher for site in the normalize list.

-Key things this example demonstrates in common to all in-line
+Key things this example demonstrates common to all in-line
 normalization workflows are:

 + :code:`normalize` appears only as arg0 of a map operation (dask syntax -
@@ -783,7 +786,7 @@ We know of three solutions to that problem:
   :py:class:`DictionaryCacheMatcher <mspasspy.db.normalize.DictionaryCacheMatcher>`, and
   :py:class:`DataFrameCacheMatcher <mspasspy.db.normalize.DataFrameCacheMatcher>`).
   One could also build directly on the base class, but we can think of no
-  example where would be preferable to extending one of the intermediate
+  example where that would be preferable to extending one of the intermediate
  classes.  The remainder of this section focuses only on some hints for
  extending one of the intermediate classes.
@@ -862,7 +865,7 @@ intermediate classes you should use to build your custom matcher are:
 - The :py:class:`DatabaseMatcher <mspasspy.db.normalize.DatabaseMatcher>`
   requires implementing only one method called
   :py:meth:`query_generator <mspasspy.db.normalize.DatabaseMatcher.query_generator>`.
-  Tha method needs to create a python dictionary in pymongo syntax that is to
+  That method needs to create a python dictionary in pymongo syntax that is to
   be applied to the normalizing collection.  That query would normally be
   constructed from one or more Metadata attributes in a data object but
   time queries may also want to use the data start time and endtime available
@@ -878,7 +881,7 @@
   The other method,
   :py:meth:`db_make_cache_id <mspasspy.db.normalize.DictionaryCacheMatcher.db_make_cache_id>`,
   needs to do the same thing and create identical keys.
-  The difference being that
+  The difference between the two is that
   :py:meth:`db_make_cache_id <mspasspy.db.normalize.DictionaryCacheMatcher.db_make_cache_id>`
   is used as the data loader to create the dictionary-based cache while
   :py:meth:`cache_id <mspasspy.db.normalize.DictionaryCacheMatcher.cache_id>`
diff --git a/docs/source/user_manual/parallel_processing.rst b/docs/source/user_manual/parallel_processing.rst
index b4859d9e8..9cf67485c 100644
--- a/docs/source/user_manual/parallel_processing.rst
+++ b/docs/source/user_manual/parallel_processing.rst
@@ -8,7 +8,7 @@ One of the primary goals of MsPASS was a framework to make
 parallel processing of seismic data no more difficult than running
 a typical python script on a desktop machine.  In modern IT lingo
 our goals was a "scalable" framework.  The form of parallelism we
-exploit is a one of a large class of problems that can reduced to
+exploit is one of a large class of problems that can be reduced to
 what is called a directed cyclic graph (DAG) in computer science.
 Any book on "big data" will discuss this concept.
 Chapter 1 of Daniel (2019) has a particularly useful description using
@@ -58,7 +58,7 @@ computer's memory.  Spark refers to this abstraction as a
 Resiliant Distributed Dataset (RDD) while Dask calls the same thing a
 "bag".
 In MsPASS the normal content of a bag/RDD is a dataset made up of *N*
-MsPASS data objects:  TimeSeries, Seismogram, or one of the ensemble of
+MsPASS data objects:  `TimeSeries`, `Seismogram`, or an ensemble of
 either of the atomic types.  An implicit assumption in the current
 implementation is that any processing was proceeded by a data assembly
 and validation phase.
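
For reference, the following is a minimal sketch of the bag-based
read/normalize/map pattern that the normalization and parallel_processing
sections above describe.  The calls to `read_distributed_data`,
`MiniseedMatcher`, `normalize`, and `map` mirror the examples in those
sections; the import locations and the `DBClient` database construction
are assumptions that may need adjustment for a particular MsPASS
installation.

.. code-block:: python

   # Sketch only:  the import paths below are assumptions and may differ
   # between MsPASS versions.
   from mspasspy.db.client import DBClient
   from mspasspy.db.normalize import MiniseedMatcher, normalize
   from mspasspy.io.distributed import read_distributed_data

   # Assumes MongoDB is running and a wf_miniseed collection was built by
   # an earlier import/indexing step; "mydatabase" is a hypothetical name.
   dbclient = DBClient()
   db = dbclient.get_database("mydatabase")

   # Normalizing operator constructed once and reused for every datum
   channel_matcher = MiniseedMatcher(db)

   # Handle to the entire data set
   cursor = db.wf_miniseed.find({})

   # Build the parallel container (a dask bag of TimeSeries objects)
   dataset = read_distributed_data(cursor, collection="wf_miniseed")

   # In-line normalization applied as a map operator
   dataset = dataset.map(normalize, channel_matcher)

   # additional map operators would go here; normally terminate with a save
   dataset.compute()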