FC management scheme for processing #319

Open
kkappler opened this issue Jan 27, 2024 · 0 comments
kkappler commented Jan 27, 2024

FCs (Fourier coefficients) have been stored in aurora according to their run_ids. This can cause some unwanted behaviour:

Consider a case where we are processing a remote reference dataset in which station runs do not overlap cleanly. Below is a screengrab from processing CAS04 with NVR08: the first table is a simple run summary of the available data, and the second table is the kernel dataset dataframe.
Note that in the first table, run b is unique at station CAS04 and is approximately 10 days long. However, in the kernel dataset there are two instances where run b from CAS04 is referenced:

  1. Rows 0, 1 pair run a from the reference station (NVR08) with run b from CAS04.
  2. Rows 2, 3 pair run b from the reference station (NVR08) with run b from CAS04.

So, when these data are processed with the current flow, the logic proceeds as follows:
Iteration begins over kernel_dataset.df, and on the first row a 2860 s chunk of run b is extracted from the mth5, STFT-ed, and stored under:
/Experiment/Surveys/CONUS_South/Stations/NVR08/Fourier_Coefficients/b/
but then, on row 2, a 769090 s chunk of run b is extracted from the mth5, STFT-ed, and stored under the same group, overwriting the previous data. This might process correctly the first time, but it will likely fail the second time.

The second time we process the file, pre-existing FCs will be detected, and on rows 0, 1 of the df the STFT objects loaded will be 2860 s from NVR08 and 769090 s from CAS04.

[screengrab: run summary table and kernel dataset dataframe for CAS04 processed with NVR08]
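To make the collision concrete, here is a minimal sketch; the tuple layout and the loop are illustrative stand-ins for the kernel_dataset iteration (not aurora's actual code), and only the station/run pairings are taken from the tables above.

```python
# Illustrative sketch of the collision: each kernel_dataset.df row is a
# (station, run) entry, but the FC storage path is keyed only by the
# acquisition run id, so the two rows that reference CAS04 run b map to
# the same HDF5 group.  Tuple layout is illustrative, not the actual
# kernel_dataset schema.

rows = [
    (0, "NVR08", "a"),
    (1, "CAS04", "b"),  # chunk of run b overlapping NVR08 run a
    (2, "NVR08", "b"),
    (3, "CAS04", "b"),  # different chunk of run b overlapping NVR08 run b
]

def fc_group_path(survey: str, station: str, run_id: str) -> str:
    # Current behaviour: the path depends only on the acquisition run id.
    return f"/Experiment/Surveys/{survey}/Stations/{station}/Fourier_Coefficients/{run_id}/"

written = {}
for i, station, run in rows:
    path = fc_group_path("CONUS_South", station, run)
    if path in written:
        print(f"row {i}: overwrites FCs already written at row {written[path]} -> {path}")
    written[path] = i
```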

There are workarounds for this, but it is not clear which is best.

  1. Remove the save_fc option that saves on the fly, and replace it with a separate (optional) "build_fcs" step. FCs would be built for complete runs, and the spectrogram loader would then use indexing on the stored FCs to load the appropriate sub-run.

Perhaps the cleanest way to do this would be to process each station separately using "single station" processing with save_fc=True; then all runs would be stored completely.

The only thing that would then need to be checked/fixed for RR processing is to use start and end times when loading FCs from the MTH5.

  2. Each processing run (a row of kernel_dataset.df) could be given an id, and the FCs could be saved under this "processing_run_id" rather than under the "acquisition_run_id".

  3. The hackaround: restrict each run to appear only once in the kernel_dataset dataframe (see the sketch after this list).
[screengrab: kernel_dataset dataframe illustrating the restricted runs]
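A minimal sketch of that hackaround, assuming hypothetical kernel_dataset.df column names (station_id, run_id, duration) and a keep-the-longest-chunk selection rule; neither the names nor the rule is taken from aurora itself.

```python
import pandas as pd

def restrict_runs_to_single_row(df: pd.DataFrame) -> pd.DataFrame:
    """Keep one row per (station_id, run_id), preferring the longest chunk.

    Column names and the selection rule are illustrative assumptions.
    """
    idx = df.groupby(["station_id", "run_id"])["duration"].idxmax()
    return df.loc[sorted(idx)].reset_index(drop=True)

# Illustrative example: two rows reference CAS04 run b; only the longer
# chunk survives, so each run appears once in the restricted dataframe.
df = pd.DataFrame(
    {
        "station_id": ["NVR08", "CAS04", "NVR08", "CAS04"],
        "run_id": ["a", "b", "b", "b"],
        "duration": [2860, 2860, 769090, 769090],
    }
)
print(restrict_runs_to_single_row(df))
```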

kkappler self-assigned this Jan 27, 2024
kkappler added a commit that referenced this issue Mar 16, 2024
- add temporary handling for issue #319
- add simple_solve and direct_solve tf helpers
- add tests for helpers