
Major Revision of database and distributed #472

Merged
merged 81 commits into master from cleanup_database on Jan 16, 2024

Conversation

@pavlis (Collaborator) commented Nov 20, 2023

This is the major revision of the database.py and distributed.py modules that has been in the works for months. It was originally forked from a now-ancient version of master, so merging will be a challenge. The list is long, but here is a summary of the main changes:

  1. The core Database class was drastically changed. My original objective was to reduce redundant code I saw in the older version; that is why this branch has the now-misleading name "cleanup_database". In the process I streamlined common code into some new private methods. I realized those changes made it relatively easy to make the save_data and read_data methods more generic: they now handle atomic data and ensembles through the same methods. The older versions with "ensemble" in their name were retained but now generate a "this method is deprecated" warning. A big change in behavior for save_data is that by default it now returns only the ObjectId of the document that defines that datum (ensembles return a list of ObjectIds by default). There is a new "return_data" boolean that can be set True to mimic the old behavior of returning the data object (a usage sketch follows this list).
  2. Handling of ensembles is now standardized in Database and made more like the handling of atomic data. That includes changes that make ensembles behave like atomic data with respect to the live/dead concept. Old scripts with ensembles will break with this version because "live" is no longer a method but an attribute of the class.
  3. The use of the Undertaker class has been extended. It should now be viewed, as the name suggests, as the standard way to handle dead data. The Database class now contains an Undertaker instance as self.stedronsky (Stedronsky was the local undertaker when I was growing up in rural South Dakota - a bad programming joke).
  4. The distributed.py module has been almost completely rewritten. Like Database, the readers and writers now handle ensembles as well as atomic data in bag/RDD containers. The implementation details are important but best punted to "read the docstring". New revisions in the user manual in a related branch also document the new API. An important new feature is that saves (write_distributed_data) for both atomic data and ensembles use a bulk write to reduce database traffic. Both also add a new "post_elog" boolean that can be used to further reduce database traffic by posting error log data as subdocuments in the wf documents. Note that this feature has not yet been added to the standard schema definition.
  5. The pytest file for distributed is completely new. The one for database.py covers the same content but has major changes to match the revised behavior of the Database methods.
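
For illustration, a minimal sketch of the new save_data return convention described in item 1, assuming a running MongoDB instance and data objects (ts, ensemble) already created by a reader; names follow the PR description, but exact signatures may differ:

from mspasspy.db.client import DBClient

# Connect and get a Database instance (connection details are assumed)
dbclient = DBClient("localhost")
db = dbclient.get_database("mydatabase")

# New default: save_data returns only the ObjectId of the wf document
oid = db.save_data(ts)

# Ensembles go through the same method and return a list of ObjectIds
oidlist = db.save_data(ensemble)

# The new return_data boolean restores the old behavior of returning
# the (possibly updated) data object itself
ts = db.save_data(ts, return_data=True)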

There are probably other changes I have forgotten. The point is that these are huge changes.

I'm submitting this pull request now, but there are some additional things that definitely need to be done:

  1. Some tests that work on my local system are failing on GitHub. I will need to resolve that.
  2. All the Python code needs reformatting with black. I don't want to do that until item 1 is resolved.
  3. The rest of the development team needs to review this material and suggest changes before we even attempt to merge.

pavlis added 30 commits July 10, 2023 10:15
@pavlis (Collaborator, Author) commented Dec 12, 2023

This branch is getting close to being ready to merge. Current status:

  1. I did some minor revisions locally before merging master into this branch. The merge with master was much easier than I had feared.
  2. The big residual problem is with Spark. After a major revision of my development system I now have a close clone of the master docker container setup for testing. I pushed this version today to verify the tests fail on GitHub the same way they do here; that seems to be true. There is some residual problem with initialization of the full test suite causing these failures. When I run the single file python/tests/test_distributed.py by itself, it succeeds with no errors; when run in the full chain it fails. I am guessing there is some initialization I don't understand in the overall setup that is doing a global initialization of SparkContext. That is wrong and needs to be fixed, but I need help because I can't find it. (A sketch of one possible mitigation follows this list.)
  3. I need to go back and do some black formatting, but that is minor.
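
Not part of this PR, but for reference: one conventional way to avoid colliding SparkContext initializations in a pytest suite is a session-scoped fixture built on SparkContext.getOrCreate. A minimal sketch, assuming pyspark is installed:

import pytest
from pyspark import SparkConf, SparkContext

@pytest.fixture(scope="session")
def spark_context():
    # getOrCreate returns any context already initialized elsewhere in
    # the suite, avoiding the "only one SparkContext" error
    conf = SparkConf().setMaster("local[2]").setAppName("mspass-test-suite")
    sc = SparkContext.getOrCreate(conf)
    yield sc
    sc.stop()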

@pavlis (Collaborator, Author) commented Dec 13, 2023

The latest update is close, but still has two residual problems I need help fixing:

  1. Assuming I get the same result as in my local test, two pytest files are failing. Both errors are in sections of MsPASS I have never dealt with before. Like too many of our test scripts, there are no comments or docstring info to guide me, so I am tossing it back to someone else in the group to fix the three offending sections. (Warning: checks are in progress as I write this, so three may be wrong - that is what happened locally.)
  2. Outside pytest, I was working on our tutorial notebooks. I discovered a problem with the new Database.save_data method that must be fixed. When running the "getting_started" tutorial in "notebooks" I found that when that notebook downloads data with obspy's get_waveforms method, converts the data to TimeSeries objects, and runs save_data, the channel information (net, sta, chan) is stripped when the default mode is "promiscuous" (a reproduction sketch follows this list). Fixing the problem is awkward only because my local system remains a bit unstable after upgrading to Ubuntu 22.04 and I'm unable to make the local mspass installation run with my favorite IDE, Spyder. I'll hack on this and may solve it before item 1 is resolved, but I wanted to make it clear that this is a big problem that must be fixed before this branch should be considered for merging to master.
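
A minimal sketch of the workflow that exposed the problem in item 2, assuming the standard obspy FDSN client and mspasspy's Trace2TimeSeries converter; db is assumed to be an mspasspy Database instance, and the net/sta/chan/time values are placeholders:

from obspy import UTCDateTime
from obspy.clients.fdsn import Client
from mspasspy.util.converter import Trace2TimeSeries

client = Client("IRIS")
t0 = UTCDateTime("2020-01-01T00:00:00")
st = client.get_waveforms("IU", "ANMO", "00", "BHZ", t0, t0 + 600)

for tr in st:
    ts = Trace2TimeSeries(tr)
    # With the default mode="promiscuous" this save was observed to
    # strip net, sta, and chan from the saved wf document
    oid = db.save_data(ts)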

Changes here implement a new way to automatically delete, on all save modes, attributes that are present due to normalization. Uses the "collection_" prefix to define such attributes. Tests for update were cleaned up when I found they were failing.
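
A rough sketch of the deletion idea in that commit message; the helper name and collection list here are hypothetical, and the real logic lives inside Database:

def erase_normalized(doc, collections=("site", "channel", "source")):
    # Drop attributes loaded by normalization, identified by a
    # "collection_" prefix such as "channel_lat" or "site_elev"
    prefixes = tuple(c + "_" for c in collections)
    return {k: v for k, v in doc.items() if not k.startswith(prefixes)}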
@pavlis (Collaborator, Author) commented Dec 20, 2023

Testing this with our getting_started tutorial in the docker container revealed a blemish I don't know how to fix. When running a dask parallel job I get a ton of warnings posted with this line:

/tmp/ipykernel_223/2302293414.py:46: DeprecationWarning: `alltrue` is deprecated as of NumPy 1.25.0, and will be removed in NumPy 2.0. Please use `all` instead.

Probably a setup change we need to fix. I suspect the current docker container will generate the same warning, since the parts referenced in this branch are now identical to master. I'm not sure how to test that other than running the same tutorial.
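
The warning itself points at its own fix: NumPy deprecated the alltrue alias in 1.25 in favor of all. A minimal sketch of the replacement:

import numpy as np

a = np.array([True, True, False])

# Deprecated as of NumPy 1.25 and removed in 2.0:
#   result = np.alltrue(a)

# Replacement with identical semantics:
result = np.all(a)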

@pavlis (Collaborator, Author) commented Jan 4, 2024

@wangyinz can you look at the latest run here and see if you can figure out what is wrong? Spark is throwing some error and I'm not sure how to even begin debugging it. I'm unclear how this error could be associated with any of the changes in this branch, but I suspect that is because the issue is something outside Spark that I'm not seeing.

@wangyinz (Member) commented Jan 4, 2024

The error is actually at https://github.com/mspass-team/mspass/actions/runs/7409175173/job/20158904073#step:11:314 in the error logs. It says the root cause is Database.read_data: arg0 has unsupported type=<class 'mspasspy.ccore.seismic.TimeSeries'>. I can see why it errors out this way. Previously we had the following:

try:
    # works for a dict (or any type defining operator[]) holding an "_id"
    oid = object_id["_id"]
except:
    # fall back: assume the argument is already an ObjectId
    oid = object_id

This way, it supports not only dict but really any type that supports the [ ] operator; it leverages Python's duck typing. However, in this branch we have a strict type check instead, and that is why it failed. While a TimeSeries or Metadata does behave like a dict, it is not a dict type object. I think what we should do here is relax the strict type check for dict and go back to the old way of accepting any dict-like type.

@pavlis (Collaborator, Author) commented Jan 4, 2024

Ok, that explains that error. BTW, the reason I didn't catch it is that I get a different error when I run pytest locally. I have some weird version skew problem that is causing a "codec error" in the Database constructor. I fixated on that and didn't even look at the output on GitHub to realize it contained a different error. Warning: you may see some additional failed pushes before I resolve this.

A different issue is that I don't think the approach used in the old version for duck typing is the right one for this particular problem. Part of that is a constraint I put on what the new implementation of read_data assumes about its input. The input needs to define operator[] (in C++ terms), but testing by fetching a specific attribute, "_id", seems not very robust. I'm also not sure just defining operator[] is enough any longer. The new doc2md function centralizes the behavior, and the input must support running this command: md = Metadata(doc); that is the default promiscuous mode. In cautious and pedantic mode it uses a loop over the keys and runs md[k] = doc[k], which pretty much means operator[].

The fix I made is to allow dict or Metadata in the same isinstance test. It will then also accept a TimeSeries or other seismic data object as input, because all are subclasses of Metadata. In the version I just made, however, I warn in the docstring that you should not use a seismic data object as arg0 to read_data, as that is intrinsically a bit dangerous: it violates an implicit assumption of the algorithm.
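
A hedged sketch of the logic described in the last two comments; doc2md is named in the thread, but this body is illustrative and assumes Metadata can be constructed from a dict:

from mspasspy.ccore.utility import Metadata

def doc2md(doc, mode="promiscuous"):
    # Accept dict or Metadata; seismic data objects also pass this test
    # because they are subclasses of Metadata (use with care - see above)
    if not isinstance(doc, (dict, Metadata)):
        raise TypeError("doc2md: arg0 must be a dict or a Metadata subclass")
    if mode == "promiscuous":
        return Metadata(doc)
    # cautious/pedantic modes copy key by key so each attribute can be
    # validated against the schema (validation omitted in this sketch)
    md = Metadata()
    for k in doc.keys():
        md[k] = doc[k]
    return md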

codecov bot commented Jan 4, 2024

Codecov Report

Attention: 174 lines in your changes are missing coverage. Please review.

Comparison is base (5ec0bd0) 54.31% compared to head (c790098) 53.55%.

❗ Current head c790098 differs from pull request most recent head 565da0b. Consider uploading reports for the commit 565da0b to get more accurate results

Files                                     | Patch % | Missing lines
python/mspasspy/util/Undertaker.py        | 30.45%  | 121
cxx/src/lib/io/fileio.cc                  | 38.00%  | 31
cxx/python/seismic/seismic_py.cc          | 33.33%  | 10
python/mspasspy/db/normalize.py           | 30.76%  | 9
cxx/src/lib/seismic/TimeSeries.cc         | 50.00%  | 2
cxx/include/mspass/utility/ErrorLogger.h  | 0.00%   | 1
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #472      +/-   ##
==========================================
- Coverage   54.31%   53.55%   -0.77%     
==========================================
  Files         144      144              
  Lines       21956    22341     +385     
==========================================
+ Hits        11925    11964      +39     
- Misses      10031    10377     +346     


@pavlis (Collaborator, Author) commented Jan 4, 2024

Hmm - I do not quite understand why the tests just failed. The last commit was some minor changes with black to a few test files. Now it is failing with a "file exists" error. @wangyinz, how are such things managed on GitHub? I think this is in the tests for the new distributed module where it tests reading and writing files. I thought I had the fixtures set up to remove files on exit, and they seemed to do that locally, so I'm a bit puzzled. Not sure what the solution is.
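
For reference, one robust pattern for guaranteed cleanup is to build test output paths on pytest's tmp_path fixture, which creates a fresh per-test directory that pytest itself removes. A minimal sketch, not the actual test code in this PR:

import pytest

@pytest.fixture
def scratch_file(tmp_path):
    # tmp_path is unique per test, so leftover files cannot collide
    # across runs even if a test aborts before cleanup
    path = tmp_path / "test_distributed_output.ms"
    yield str(path)
    # Explicit removal is redundant with tmp_path but shown for clarity
    if path.exists():
        path.unlink()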

On a different note, I see someone in the group (maybe me) needs to write a pytest script to test Undertaker. It is currently tested only indirectly through the database reader and writer tests handling killed data. We really should have a comprehensive test file.

@wangyinz (Member) commented Jan 5, 2024

This is a weird one. You can see that the "pull_request" workflow failed but the "push" workflow ran through. I can see the difference is here: it actually did an automatic merge into the master branch, and that is what is failing. I guess this is really just telling us there will be problems if this is not merged carefully.

@wangyinz (Member) commented
We will merge this branch for now while @Aristoeu is working on adding more tests.

@wangyinz merged commit c8021f6 into master on Jan 16, 2024
12 checks passed
@wangyinz deleted the cleanup_database branch on January 16, 2024 at 16:27