Improve OriginTimeMatcher #592

Merged
merged 6 commits into from
Jan 14, 2025

pavlis commented Jan 5, 2025

While working with the extended USArray data set I discovered a gap in our implementation of the OriginTimeMatcher class. I tried to run the bulk_normalize function on a database with around 3 million wf documents. The aim was to produce a clean database with "channel_id" and "source_id" set so I could use id matching for processing this large data set. It turned out that bulk_normalize requires the matcher to implement the "find_doc" method, which OriginTimeMatcher previously did not have.

This revision removes the find_doc deficiency in OriginTimeMatcher, but I went one step further. I realized that a generic version of find_doc was possible in the base class BasicMatcher, and I implemented that. However, I also had to create an override of the generic method in OriginTimeMatcher because several features of that class did not mesh with the simple concepts of the generic method. After writing the new find_doc method I ended up overriding find_one in OriginTimeMatcher as well. The previous version did not handle a common issue with this matcher: a time interval match is soft, and there is a far-from-zero probability that the time interval in a match contains multiple earthquakes. find_one and find_doc now both resolve that ambiguity with the obvious choice: when more than one source matches, select the one whose origin time comes closest to the time projected from the waveform start time, which is the basic idea behind this class.
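
For readers who want the tie-breaking rule spelled out, here is a minimal sketch in plain Python. It is not the code in this PR. The function names, the `time` key, the `t0_offset` and `tolerance` parameters, and the sign convention (projected origin time = waveform start time minus offset) are all assumptions made for illustration.

```python
def candidates_in_window(sources, start_time, t0_offset=0.0, tolerance=4.0,
                         time_key="time"):
    """Return every source document whose origin time lies within
    +/- tolerance of the projected time (hypothetical helper)."""
    # Assumed convention: the projected origin time is the waveform
    # start time shifted back by a fixed offset.
    projected = start_time - t0_offset
    return [doc for doc in sources
            if abs(doc[time_key] - projected) <= tolerance]


def pick_closest_origin(candidates, start_time, t0_offset=0.0,
                        time_key="time"):
    """Resolve an ambiguous interval match by choosing the candidate
    whose origin time is closest to the projected time.
    Returns None when there are no candidates."""
    if not candidates:
        return None
    projected = start_time - t0_offset
    return min(candidates, key=lambda doc: abs(doc[time_key] - projected))


# Two events fall inside the same 4 s window; the closer one wins.
sources = [
    {"_id": 1, "time": 1000.0},
    {"_id": 2, "time": 1003.0},
    {"_id": 3, "time": 2000.0},
]
matches = candidates_in_window(sources, start_time=1002.5)
best = pick_closest_origin(matches, start_time=1002.5)
```

With these made-up values, two documents fall inside the window and the one at 1003.0 s is selected because it is 0.5 s from the projected time, versus 2.5 s for the other.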

pavlis commented Jan 5, 2025

Postscript to above: this first push has a known deficiency. The pytests for the OriginTimeMatcher are grossly inadequate. In fact, all the tests for the normalize module are highly deficient.

@pavlis pavlis force-pushed the improve_otmatcher branch from 873beeb to 9791396 Compare January 13, 2025 15:36

pavlis commented Jan 13, 2025

I think this branch is ready to merge. It even partially addresses the inadequacies of the tests for OriginTimeMatcher. They are still far from complete but better than they were. The main thing is that this branch now allows this tool to be run with bulk_normalize to set source_id matching values on wf documents. That probably remains the best way to normalize large data sets, rather than doing it on the fly. The advantage of normalizing before processing is that you can evaluate how well it worked with MongoDB queries instead of post mortem.
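
As an illustration of that kind of check, the sketch below counts how many wf documents ended up without a normalization id. It is an assumption-laden example, not part of this PR: the helper name is hypothetical, and it only relies on the standard pymongo `Collection.count_documents` method and the MongoDB `$exists` operator.

```python
def normalization_summary(collection, id_key="source_id"):
    """Summarize how many documents in a wf collection have id_key set.

    `collection` is any object with a pymongo-style
    count_documents(filter) method (hypothetical helper).
    """
    total = collection.count_documents({})
    # Documents where the normalization key was never set.
    unset = collection.count_documents({id_key: {"$exists": False}})
    return {"total": total, "matched": total - unset, "unmatched": unset}


# Usage against a live database (collection name is an assumption):
#   from pymongo import MongoClient
#   db = MongoClient()["mydb"]
#   print(normalization_summary(db["wf_miniseed"]))
```

A large "unmatched" count is a cue to inspect those documents with further queries before starting a long processing run.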

@wangyinz wangyinz merged commit f1ec089 into master Jan 14, 2025
10 checks passed
@wangyinz wangyinz deleted the improve_otmatcher branch January 14, 2025 17:46