Major Revision of database and distributed #472
Conversation
…orted in ascending order
…I leave this for a while
…ing problems in tests
… github for group meeting discussion
… cleanup_database
This branch is getting close to being ready to merge. Current status:
The latest update is close, but still has two residual problems I need help fixing:
Changes here implement a new way to automatically delete, on all save modes, attributes that are present because of normalization. The "collection_" prefix is used to define such attributes. Tests for update were cleaned up when I found they were failing.
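For illustration only, here is a minimal sketch of that prefix-based cleanup; the helper name and the list of normalizing collections are hypothetical, not the actual implementation:

```python
# Hypothetical sketch of prefix-based cleanup of normalized attributes.
# Collection names and the helper name are illustrative, not the real API.
NORMALIZATION_COLLECTIONS = ["channel", "site", "source"]

def strip_normalized_attributes(doc: dict) -> dict:
    """Return a copy of doc with attributes loaded by normalization removed.

    Attributes are assumed to be keyed with a '<collection>_' prefix,
    e.g. 'channel_lat' would have come from the channel collection.
    """
    prefixes = tuple(c + "_" for c in NORMALIZATION_COLLECTIONS)
    return {k: v for k, v in doc.items() if not k.startswith(prefixes)}
```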
Testing this with our getting_started tutorial with the docker container revealed a blemish I don't know how to fix. When running a dask parallel run I get a ton of warnings posted with this line:
Probably a setup change we need to fix. I suspect the current docker container will generate the same error, since the parts referenced in this branch are now identical to master. Not sure how you can test that other than running the same tutorial.
@wangyinz can you look at the latest run here and see if you can figure out what is wrong? Spark is throwing some error and I'm not sure how to even debug it. I'm unclear how this error could be associated with any of the changes in this branch, but suspect the issue is something outside Spark that I'm not seeing.
The error is actually at https://github.com/mspass-team/mspass/actions/runs/7409175173/job/20158904073#step:11:314 in the error logs. It points to the root cause in mspass/python/mspasspy/db/database.py, lines 483 to 486 (commit 5ec0bd0).
This way, it supports not only dict but really any type that supports the [ ] operator. It leverages Python's duck typing. However, in this branch we have a strict type check instead, and that's the reason it failed. While a TimeSeries or Metadata object behaves like a dict, it is not a dict type object. I think what we should do here is relax the strict type check for dict and go back to the old way of accepting any dict-like type.
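To paraphrase the contrast (an illustrative sketch, not the actual code at those lines):

```python
# Old style (duck typing): works for dict, Metadata, TimeSeries, etc.,
# because it only requires the [] operator.
def get_id_ducktyped(doc):
    return doc["_id"]

# New style (strict check): rejects Metadata and TimeSeries even though
# they behave like a dict, because they are not dict instances.
def get_id_strict(doc):
    if not isinstance(doc, dict):
        raise TypeError("doc must be a dict")
    return doc["_id"]
```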
Ok, that explains that error. BTW, the reason I didn't catch it is that I get a different error when I run pytest locally. I have some weird version skew problem that is causing a "codec error" in the Database constructor. I forgot that and didn't even look at the output on github to realize it contained a different error. Be warned that you may see some additional failed pushes before I resolve this. A different issue is that I don't think the approach used in the old version for duck typing is the right one for this particular problem. Part of that is a constraint I put on what the new implementation accepts. The fix I made is to allow dict or Metadata in the same isinstance test. It will then also accept a TimeSeries or other seismic data object as an input pattern, because all are subclasses of Metadata. In the version I just made, however, I warn in the docstring that you should not use a seismic data object as arg0.
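A sketch of that fix, assuming the usual mspasspy import path for Metadata; the surrounding function is hypothetical:

```python
from mspasspy.ccore.utility import Metadata

def validate_arg0(arg0):
    # dict and Metadata are both accepted; TimeSeries, Seismogram, and the
    # ensemble objects pass too because they are subclasses of Metadata.
    if not isinstance(arg0, (dict, Metadata)):
        raise TypeError("arg0 must be a dict or a Metadata-like object")
```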
Codecov Report

@@            Coverage Diff             @@
##           master     #472      +/-   ##
==========================================
- Coverage   54.31%   53.55%   -0.77%
==========================================
  Files         144      144
  Lines       21956    22341     +385
==========================================
+ Hits        11925    11964      +39
- Misses      10031    10377     +346

☔ View full report in Codecov by Sentry.
Hmm - I do not quite understand why the tests just failed. The last commit was some minor changes with black to a few test files. Now it is failing with a file-exists error. @wangyinz, how are such things managed on github? I think this is in the tests for the new distributed module where it tests reading and writing files. I thought I had the fixtures set up to remove files on exit, and it seemed to do that locally, so I'm a bit puzzled. Not sure what the solution is. On a different note, I see someone in the group (maybe me) needs to write a pytest script to test Undertaker. It is currently tested only indirectly through the database reader and writer tests that handle killed data. We really should have a comprehensive test file.
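For reference, this is the yield-style teardown pattern I believe the fixtures were meant to follow (the file name here is hypothetical):

```python
import os
import pytest

@pytest.fixture
def test_files():
    paths = ["test_distributed_output.ms"]  # hypothetical file name
    yield paths
    # Teardown runs even if the test fails, so leftover files from a
    # previous run cannot trigger a file-exists error on CI.
    for p in paths:
        if os.path.exists(p):
            os.remove(p)
```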
This is a weird one. You can see that the "pull_request" workflow failed but the "push" workflow ran through. I can see the difference is here. It actually did an automatic merge into the master branch, and that merge result is what is failing. I guess this is really just telling us there will be problems if we don't merge it carefully.
… tests of a new function
We will merge this branch for now while @Aristoeu is working on adding more tests.
This is the major revision of the database.py and distributed.py modules that has been in the works for months. It was originally forked from a now ancient version of master, so merging will be a challenge. The list is long, but here is a summary of the main changes:

- The Database class was drastically changed. My original objective was to reduce the redundant code I saw in the older version; that is why this branch has the now misleading name "cleanup_database". In the process I streamlined common code into some new private members. I realized those changes made it relatively easy to make the save_data and read_data methods more generic: they now handle atomic data and ensembles through the same methods. The older versions with "ensemble" in their name were retained but now generate a "this method is deprecated" warning. A big change in behavior for the save_data method is that by default it now returns only the ObjectId of the document that defines that datum (ensembles return a list of ObjectIds by default). There is a new "return_data" boolean that can be set True to mimic the current behavior (see the usage sketch after this list).
- The Undertaker class has been extended. It should now be viewed, as the name suggests, as the standard way to handle dead data. The Database class now contains an Undertaker instance as self.stedronsky (Stedronsky was the local undertaker when I was growing up in rural South Dakota - a bad programming joke).
- The new writers (write_distributed_data) for both atomic data and ensembles use a bulk write to reduce database traffic on a save. Both also add a new "post_elog" boolean that can be used to further reduce database traffic by posting error log data as subdocuments in the wf documents. Note that feature has not yet been added to the standard schema definition.

There are probably others I have forgotten. The point is there are huge changes here.
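As a usage sketch of the new save_data return behavior (signature details assumed from this description, not verified against the code):

```python
# Default: only the ObjectId of the wf document is returned.
wfid = db.save_data(ts)

# Set return_data=True to get the (possibly modified) datum back,
# mimicking the old behavior.
ts = db.save_data(ts, return_data=True)

# Ensembles return a list of ObjectIds by default.
idlist = db.save_data(ensemble)
```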
I'm submitting this pull request, but currently there are some additional things that most definitely will need to be done: