Improve OriginTimeMatcher #592

pavlis · 2025-01-05T15:25:28Z

While working with the extended USArray data set I discovered a gap in our implementation of the OriginTimeMatcher class. I tried to run the bulk_normalize function on a database with around 3 million wf documents. The aim was to produce a clean database with "channel_id" and "source_id" set so I could use id matching when processing this large data set. It turned out that bulk_normalize requires the matcher to implement the "find_doc" method, and OriginTimeMatcher did not previously have one.

This revision removes the find_doc deficiency in OriginTimeMatcher, but I went one step further. I realized that a generic version of find_doc was possible in the base class BasicMatcher, and I implemented that. However, I also had to create an override of the generic method in OriginTimeMatcher because several features of that class did not mesh with the simple concepts of the generic method. After writing the new find_doc method I ended up overriding find_one in OriginTimeMatcher as well. The previous version did not handle a common issue with this matcher: a time interval match is soft, and there is a far-from-zero probability that the time interval used in a match contains multiple earthquakes. find_one and find_doc now both resolve that ambiguity with the obvious choice: when more than one origin falls in the interval, select as the unique match the one whose origin time comes closest to the time projected from the waveform start time, which is the basic idea behind this class.
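For readers skimming the thread, here is a minimal sketch of the closest-time rule described above. It is not the code in this branch: the function name `closest_origin`, the parameter names `t0offset` and `tolerance`, the `"time"` key, and the sign convention (projected origin time = start time minus `t0offset`) are all assumptions made for illustration, and plain dictionaries stand in for source documents.

```python
def closest_origin(candidates, start_time, t0offset=0.0, tolerance=4.0, time_key="time"):
    """Pick the candidate whose origin time is closest to the projected time.

    Hypothetical helper for illustration only; not part of the real API.
    """
    # Assumed convention: the origin time is expected near start_time - t0offset.
    projected = start_time - t0offset
    # Soft match: keep every candidate inside the +/- tolerance window.
    in_window = [
        doc for doc in candidates if abs(doc[time_key] - projected) <= tolerance
    ]
    if not in_window:
        return None
    # Resolve the ambiguity by taking the smallest absolute time mismatch.
    return min(in_window, key=lambda doc: abs(doc[time_key] - projected))


# Two events fall inside the window; a third is far away.
sources = [
    {"source_id": "A", "time": 1000.0},
    {"source_id": "B", "time": 1003.0},
    {"source_id": "C", "time": 2000.0},
]
match = closest_origin(sources, start_time=1102.0, t0offset=100.0, tolerance=4.0)
print(match["source_id"])
```

With these made-up numbers the projected time is 1002.0, so both A and B are in the window and B is selected because it is 1 s away rather than 2 s.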

pavlis · 2025-01-05T15:27:04Z

Postscript to above: this first push has a known deficiency. The pytests for the OriginTimeMatcher are grossly inadequate. In fact, all the tests for the normalize module are highly deficient.

pavlis · 2025-01-13T21:33:41Z

I think this branch is ready to merge. It even partially addresses the inadequacies of the tests for OriginTimeMatcher. They are still far from complete but better than they were. The main thing is that this branch now allows this tool to be run with bulk_normalize to create source_id matching values on wf documents. That probably remains the best way to normalize large data sets, rather than doing it on the fly. The advantage of normalizing before processing is that you can evaluate how well it worked with MongoDB queries instead of post mortem.
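As an illustration of the kind of check that pre-normalization makes possible, the sketch below counts wf documents that did not receive a `source_id`. It assumes only pymongo's `Collection.count_documents` and the `$exists` query operator; the helper name `normalization_summary` is hypothetical, and `FakeCollection` is a tiny in-memory stand-in so the example runs without a MongoDB server.

```python
def normalization_summary(collection, id_key="source_id"):
    """Count documents with and without the normalization id set.

    `collection` is anything with a pymongo-style count_documents method.
    """
    total = collection.count_documents({})
    unmatched = collection.count_documents({id_key: {"$exists": False}})
    return {"total": total, "matched": total - unmatched, "unmatched": unmatched}


class FakeCollection:
    """In-memory stand-in; supports only {} and {key: {"$exists": bool}}."""

    def __init__(self, docs):
        self.docs = list(docs)

    def count_documents(self, query):
        def satisfies(doc):
            # Every key in the query must match its $exists condition.
            return all(
                (key in doc) == cond["$exists"] for key, cond in query.items()
            )

        return sum(1 for doc in self.docs if satisfies(doc))


# Made-up documents: one of three has no source_id.
wf = FakeCollection(
    [
        {"_id": 1, "source_id": "A"},
        {"_id": 2},
        {"_id": 3, "source_id": "B"},
    ]
)
summary = normalization_summary(wf)
print(summary)
```

Against a real database the same function could be passed a pymongo collection handle for whichever wf collection was normalized.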

pavlis added 6 commits January 10, 2025 07:58

Replace private function with simpler isinstance test that exploits in…

39bc724

…heritance

Major enhancement of OriginTimeMatcher to allow it to work with bulk_…

e5466d9

…normalize

Add error handler and update test script

0b7b7ca

Fix bugs found with updated test script

dd8f226

fix test script for OriginTimeMatcher

92dd627

reformat with black

9791396

pavlis force-pushed the improve_otmatcher branch from 873beeb to 9791396 · January 13, 2025 15:36

wangyinz merged commit f1ec089 into master Jan 14, 2025
10 checks passed

wangyinz deleted the improve_otmatcher branch January 14, 2025 17:46
