
Major Revision of database and distributed #472

Merged
merged 81 commits into master from cleanup_database on Jan 16, 2024

Conversation

@pavlis (Collaborator) commented Nov 20, 2023

This is the major revision of the database.py and distributed.py modules that has been in the works for months. It was originally forked from a now-ancient version of master, so merging will be a challenge. The list is long, but here is a summary of the main changes:

  1. The core Database class was drastically changed. My original objective was to reduce redundant code I saw in the older version; that is why this branch has the now-misleading name "cleanup_database". In the process I streamlined common code into some new private methods. I realized those changes made it relatively easy to make the save_data and read_data methods more generic: they now handle atomic data and ensembles through the same methods. The older versions with "ensemble" in their name were retained but now generate a "this method is deprecated" warning. A big change in behavior for save_data is that by default it now returns only the ObjectId of the document that defines that datum (ensembles return a list of ObjectIds by default). There is a new "return_data" boolean that can be set True to mimic the old behavior of returning the data object (a usage sketch follows this list).
  2. Handling of ensembles is now standardized in Database and made more like the handling of atomic data. That includes changes that make ensembles behave like atomic data with respect to the live/dead concept. Old scripts with ensembles will break with this version because "live" is no longer a method but an attribute of the class.
  3. The use of the Undertaker class has been extended. It should now be viewed, as the name suggests, as the standard way to handle dead data. The Database class now contains an Undertaker instance as self.stedronsky (Stedronsky was the local undertaker when I was growing up in rural South Dakota - a bad programming joke).
  4. The distributed.py module has been almost completely rewritten. Like Database, the readers and writers now handle ensembles as well as atomic data in bag/RDD containers. The implementation details are important but best punted to "read the docstring". New revisions in the user manual in a related branch also document the new API. An important new feature is that saves (write_distributed_data) for both atomic data and ensembles use a bulk write to reduce database traffic. Both also add a new "post_elog" boolean that can be used to further reduce database traffic by posting error log data as subdocuments in the wf documents. Note that this feature has not yet been added to the standard schema definition.
  5. The pytest file for distributed is completely new. The one for database.py covers the same content but has major changes to match the revised behavior of the Database methods.
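
For illustration, a minimal sketch of the new save_data return convention described in item 1, assuming a running MongoDB instance and data objects (ts, ensemble) already created by a reader; names follow the PR description, but exact signatures may differ:

from mspasspy.db.client import DBClient

# Connect and get a Database instance (connection details are assumed)
dbclient = DBClient("localhost")
db = dbclient.get_database("mydatabase")

# New default: save_data returns only the ObjectId of the wf document
oid = db.save_data(ts)

# Ensembles go through the same method and return a list of ObjectIds
oidlist = db.save_data(ensemble)

# The new return_data boolean restores the old behavior of returning
# the (possibly updated) data object itself
ts = db.save_data(ts, return_data=True)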

There are probably other changes I have forgotten. The point is that these are huge changes.

I'm submitting this pull request now, but there are some additional things that definitely need to be done:

  1. Some tests that work on my local system are failing on GitHub. I will need to resolve that.
  2. All the Python code needs reformatting with black. I don't want to do that until item 1 is resolved.
  3. The rest of the development team needs to review this material and suggest changes before we even attempt to merge.

pavlis added 30 commits July 10, 2023 10:15
@pavlis (Collaborator, Author) commented Dec 12, 2023

This branch is getting close to being ready to merge. Current status:

  1. I did some minor revisions locally before merging master into this branch. The merge with master was much easier than I had feared.
  2. The big residual problem is with Spark. After a major revision of my development system I now have a close clone of the master docker container setup for testing. I pushed this version today to verify the tests fail on GitHub the same way they do here; that seems to be true. There is some residual problem with initialization of the full test suite causing these failures. When I run the single file python/tests/test_distributed.py by itself, it succeeds with no errors; when run in the full chain it fails. I am guessing there is some initialization I don't understand in the overall setup that is doing a global initialization of SparkContext. That is wrong and needs to be fixed, but I need help because I can't find it. (A sketch of one possible mitigation follows this list.)
  3. I need to go back and do some black formatting, but that is minor.
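
Not part of this PR, but for reference: one conventional way to avoid colliding SparkContext initializations in a pytest suite is a session-scoped fixture built on SparkContext.getOrCreate. A minimal sketch, assuming pyspark is installed:

import pytest
from pyspark import SparkConf, SparkContext

@pytest.fixture(scope="session")
def spark_context():
    # getOrCreate returns any context already initialized elsewhere in
    # the suite, avoiding the "only one SparkContext" error
    conf = SparkConf().setMaster("local[2]").setAppName("mspass-test-suite")
    sc = SparkContext.getOrCreate(conf)
    yield sc
    sc.stop()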

@pavlis (Collaborator, Author) commented Dec 13, 2023

The latest update is close, but still has two residual problems I need help fixing:

  1. Assuming I get the same result as in my local test, two pytest files are failing. Both errors are in sections of MsPASS I have never dealt with before. Like too many of our test scripts, there are no comments or docstring info to guide me, so I am tossing it back to someone else in the group to fix the three offending sections. (Warning: checks are in progress as I write this, so three may be wrong - that is what happened locally.)
  2. Outside pytest, I was working on our tutorial notebooks. I discovered a problem with the new Database.save_data method that must be fixed. When running the "getting_started" tutorial in "notebooks" I found that when that notebook downloads data with obspy's get_waveforms method, converts the data to TimeSeries objects, and runs save_data, the channel information (net, sta, chan) is stripped when the default mode is "promiscuous" (a reproduction sketch follows this list). Fixing the problem is awkward only because my local system remains a bit unstable after upgrading to Ubuntu 22.04 and I'm unable to make the local mspass installation run with my favorite IDE, Spyder. I'll hack on this and may solve it before item 1 is resolved, but I wanted to make it clear that this is a big problem that must be fixed before this branch should be considered for merging to master.
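
A minimal sketch of the workflow that exposed the problem in item 2, assuming the standard obspy FDSN client and mspasspy's Trace2TimeSeries converter; db is assumed to be an mspasspy Database instance, and the net/sta/chan/time values are placeholders:

from obspy import UTCDateTime
from obspy.clients.fdsn import Client
from mspasspy.util.converter import Trace2TimeSeries

client = Client("IRIS")
t0 = UTCDateTime("2020-01-01T00:00:00")
st = client.get_waveforms("IU", "ANMO", "00", "BHZ", t0, t0 + 600)

for tr in st:
    ts = Trace2TimeSeries(tr)
    # With the default mode="promiscuous" this save was observed to
    # strip net, sta, and chan from the saved wf document
    oid = db.save_data(ts)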

Changes here implement a new way to automatically delete, on all save modes, attributes that are present due to normalization. Uses the "collection_" prefix to define such attributes. Tests for update were cleaned up when I found they were failing.
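
A rough sketch of the deletion idea in that commit message; the helper name and collection list here are hypothetical, and the real logic lives inside Database:

def erase_normalized(doc, collections=("site", "channel", "source")):
    # Drop attributes loaded by normalization, identified by a
    # "collection_" prefix such as "channel_lat" or "site_elev"
    prefixes = tuple(c + "_" for c in collections)
    return {k: v for k, v in doc.items() if not k.startswith(prefixes)}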
@pavlis (Collaborator, Author) commented Dec 20, 2023

Testing this with our getting_started tutorial in the docker container revealed a blemish I don't know how to fix. When running a dask parallel job I get a ton of warnings posted with this line:

/tmp/ipykernel_223/2302293414.py:46: DeprecationWarning: `alltrue` is deprecated as of NumPy 1.25.0, and will be removed in NumPy 2.0. Please use `all` instead.

Probably a setup change we need to fix. I suspect the current docker container will generate the same warning, since the parts referenced in this branch are now identical to master. I'm not sure how to test that other than running the same tutorial.
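
The warning itself points at its own fix: NumPy deprecated the alltrue alias in 1.25 in favor of all. A minimal sketch of the replacement:

import numpy as np

a = np.array([True, True, False])

# Deprecated as of NumPy 1.25 and removed in 2.0:
#   result = np.alltrue(a)

# Replacement with identical semantics:
result = np.all(a)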

@pavlis (Collaborator, Author) commented Jan 4, 2024

@wangyinz can you look at the latest run here and see if you can figure out what is wrong? Spark is throwing some error and I'm not sure how to even begin debugging it. I'm unclear how this error could be associated with any of the changes in this branch, but I suspect that is because the issue is something outside Spark that I'm not seeing.

@wangyinz (Member) commented Jan 4, 2024

The error is actually at https://github.com/mspass-team/mspass/actions/runs/7409175173/job/20158904073#step:11:314 in the error logs. It says the root cause is Database.read_data: arg0 has unsupported type=<class 'mspasspy.ccore.seismic.TimeSeries'>. I can see why it errors out this way. Previously we had the following:

try:
    # works for a dict (or any type defining operator[]) holding an "_id"
    oid = object_id["_id"]
except:
    # fall back: assume the argument is already an ObjectId
    oid = object_id

This way, it supports not only dict but really any type that supports the [ ] operator; it leverages Python's duck typing. However, in this branch we have a strict type check instead, and that is why it failed. While a TimeSeries or Metadata does behave like a dict, it is not a dict type object. I think what we should do here is relax the strict type check for dict and go back to the old way of accepting any dict-like type.

@pavlis (Collaborator, Author) commented Jan 4, 2024

Ok, that explains that error. BTW, the reason I didn't catch it is that I get a different error when I run pytest locally. I have some weird version skew problem that is causing a "codec error" in the Database constructor. I fixated on that and didn't even look at the output on GitHub to realize it contained a different error. Warning: you may see some additional failed pushes before I resolve this.

A different issue is that I don't think the approach used in the old version for duck typing is the right one for this particular problem. Part of that is a constraint I put on what the new implementation of read_data assumes about its input. The input needs to define operator[] (in C++ terms), but testing by fetching a specific attribute, "_id", seems not very robust. I'm also not sure just defining operator[] is enough any longer. The new doc2md function centralizes the behavior, and the input must support running this command: md = Metadata(doc); that is the default promiscuous mode. In cautious and pedantic mode it uses a loop over the keys and runs md[k] = doc[k], which pretty much means operator[].

The fix I made is to allow dict or Metadata in the same isinstance test. It will then also accept a TimeSeries or other seismic data object as input, because all are subclasses of Metadata. In the version I just made, however, I warn in the docstring that you should not use a seismic data object as arg0 to read_data, as that is intrinsically a bit dangerous: it violates an implicit assumption of the algorithm.
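
A hedged sketch of the logic described in the last two comments; doc2md is named in the thread, but this body is illustrative and assumes Metadata can be constructed from a dict:

from mspasspy.ccore.utility import Metadata

def doc2md(doc, mode="promiscuous"):
    # Accept dict or Metadata; seismic data objects also pass this test
    # because they are subclasses of Metadata (use with care - see above)
    if not isinstance(doc, (dict, Metadata)):
        raise TypeError("doc2md: arg0 must be a dict or a Metadata subclass")
    if mode == "promiscuous":
        return Metadata(doc)
    # cautious/pedantic modes copy key by key so each attribute can be
    # validated against the schema (validation omitted in this sketch)
    md = Metadata()
    for k in doc.keys():
        md[k] = doc[k]
    return md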

codecov bot commented Jan 4, 2024

Codecov Report

Attention: 174 lines in your changes are missing coverage. Please review.

Comparison is base (5ec0bd0) 54.31% compared to head (c790098) 53.55%.

❗ Current head c790098 differs from pull request most recent head 565da0b. Consider uploading reports for the commit 565da0b to get more accurate results

Files                                     | Patch % | Missing lines
python/mspasspy/util/Undertaker.py        | 30.45%  | 121
cxx/src/lib/io/fileio.cc                  | 38.00%  | 31
cxx/python/seismic/seismic_py.cc          | 33.33%  | 10
python/mspasspy/db/normalize.py           | 30.76%  | 9
cxx/src/lib/seismic/TimeSeries.cc         | 50.00%  | 2
cxx/include/mspass/utility/ErrorLogger.h  | 0.00%   | 1
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #472      +/-   ##
==========================================
- Coverage   54.31%   53.55%   -0.77%     
==========================================
  Files         144      144              
  Lines       21956    22341     +385     
==========================================
+ Hits        11925    11964      +39     
- Misses      10031    10377     +346     


@pavlis (Collaborator, Author) commented Jan 4, 2024

Hmm - I do not quite understand why the tests just failed. The last commit was some minor changes with black to a few test files. Now it is failing with a "file exists" error. @wangyinz, how are such things managed on GitHub? I think this is in the tests for the new distributed module where it tests reading and writing files. I thought I had the fixtures set up to remove files on exit, and they seemed to do that locally, so I'm a bit puzzled. Not sure what the solution is.
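
For reference, one robust pattern for guaranteed cleanup is to build test output paths on pytest's tmp_path fixture, which creates a fresh per-test directory that pytest itself removes. A minimal sketch, not the actual test code in this PR:

import pytest

@pytest.fixture
def scratch_file(tmp_path):
    # tmp_path is unique per test, so leftover files cannot collide
    # across runs even if a test aborts before cleanup
    path = tmp_path / "test_distributed_output.ms"
    yield str(path)
    # Explicit removal is redundant with tmp_path but shown for clarity
    if path.exists():
        path.unlink()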

On a different note, I see someone in the group (maybe me) needs to write a pytest script to test Undertaker. It is currently tested only indirectly through the database reader and writer tests handling killed data. We really should have a comprehensive test file.

@wangyinz (Member) commented Jan 5, 2024

This is a weird one. You can see that the "pull_request" workflow failed but the "push" workflow ran through. I can see the difference is here: it actually did an automatic merge into the master branch, and that is what is failing. I guess this is really just telling us there will be problems if this is not merged carefully.

@wangyinz (Member) commented
We will merge this branch for now while @Aristoeu is working on adding more tests.

@wangyinz merged commit c8021f6 into master on Jan 16, 2024
12 checks passed
@wangyinz deleted the cleanup_database branch on January 16, 2024 at 16:27