FCs have been stored according to their acquisition run_ids in aurora. This can cause some unwanted behaviour:
Consider a case where we are processing a remote reference dataset in which station runs do not overlap cleanly. Below is a screenshot from processing CAS04 with NVR08. The first table is a simple run summary of the available data; the second table is the kernel dataset dataframe.
Note that in the first table, run b is unique at station CAS04, and that run is approximately 10 days long. However, in the kernel dataset, run b from CAS04 is referenced twice:
Rows 0 and 1 pair run a from the reference station (NVR08) with run b from CAS04.
Rows 2 and 3 pair run b from the reference station (NVR08) with run b from CAS04.
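To make the duplication concrete, here is a minimal sketch of the pairing table described above. The column names and layout are illustrative only; the real kernel_dataset.df carries many more columns.

```python
import pandas as pd

# Hypothetical reconstruction of the kernel dataset dataframe described
# above (column names are illustrative, not the real schema).
df = pd.DataFrame(
    {
        "station": ["NVR08", "CAS04", "NVR08", "CAS04"],
        "run": ["a", "b", "b", "b"],
    }
)

# Run "b" at CAS04 appears twice -- once per overlapping reference run.
counts = df[df.station == "CAS04"].run.value_counts()
print(counts["b"])  # 2
```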
So when these data are processed according to the current flow, we encounter the following logic:
Iteration begins over kernel_dataset.df and in the first row a 2860 second chunk of run b is extracted from the mth5, STFT-ed, and then it gets stored under: /Experiment/Surveys/CONUS_South/Stations/NVR08/Fourier_Coefficients/b/
but then, on row 2, a 769090 second chunk of run b is extracted from the mth5, STFT-ed, and stored under the same level, overwriting the previous data. This might process correctly the first time, but it will likely fail the second time.
The second time we process the file, FCs will be detected, and on rows 0 and 1 of the df the STFT objects loaded will be the 2860 s chunk from NVR08 and the 769090 s chunk from CAS04.
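The collision can be sketched in a few lines: because FCs are keyed only by the acquisition run_id, the second STFT of run b silently clobbers the first. This is a toy stand-in for the HDF5 group, not the mth5 API.

```python
# Minimal sketch of the collision: FCs keyed only by the acquisition
# run id, so a second STFT of run "b" overwrites the first.
fc_store = {}  # stands in for the .../Fourier_Coefficients/ HDF5 group


def store_fcs(run_id, stft_seconds):
    # Keyed by run_id alone -- no room for two chunks of the same run.
    fc_store[run_id] = stft_seconds


store_fcs("b", 2860)    # first kernel-dataset row: 2860 s chunk
store_fcs("b", 769090)  # row 2: the 769090 s chunk clobbers the first
print(fc_store["b"])  # 769090
```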
There are workarounds for this, but it is not clear which is best.
Remove the save_fc option that saves on the fly. We would replace this with a separate (optional) "build_fcs" step. FCs would be built for complete runs, and the spectrogram loader would then use indexing on the stored FCs to load the appropriate sub-run.
Perhaps the cleanest way to do this would be to process each station separately using "single station" processing with save_fc=True. Then all runs would get stored completely.
The only thing that then needs to be checked/fixed for RR is to use start and end times when loading FCs from the MTH5.
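If FCs for a complete run are stored as a time-indexed object, slicing the overlapping window at load time is straightforward. A minimal sketch, using pandas label-based time slicing as a stand-in (this is not the mth5 loading API):

```python
import pandas as pd

# Pretend these are FCs for a complete 10-day run, indexed by time.
full_run = pd.Series(
    range(10),
    index=pd.date_range("2020-06-02", periods=10, freq="D"),
)

# A remote-reference job would slice only the overlapping window
# using the start and end times from the kernel-dataset row.
start, end = "2020-06-04", "2020-06-06"
sub_run = full_run.loc[start:end]
print(len(sub_run))  # 3
```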
Each processing_run (row of kernel_dataset.df) could be given an id. The FCs could then be saved under this "processing_run_id" rather than under the "acquisition_run_id".
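One way to mint such ids is to derive one per row of kernel_dataset.df, so FCs from different processing windows never share a storage key. The naming scheme below is purely illustrative:

```python
import pandas as pd

# Illustrative kernel-dataset rows (column names are hypothetical).
df = pd.DataFrame(
    {
        "station": ["NVR08", "CAS04", "NVR08", "CAS04"],
        "run": ["a", "b", "b", "b"],
    }
)

# One unique processing_run_id per row: station, run, and row index.
df["processing_run_id"] = [
    f"{row.station}_{row.run}_{i:03d}" for i, row in df.iterrows()
]
print(df.processing_run_id.tolist())
# ['NVR08_a_000', 'CAS04_b_001', 'NVR08_b_002', 'CAS04_b_003']
```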
The Hackaround:
Restrict each run to appear only once in the kernel_dataset dataframe.
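The hackaround amounts to dropping duplicate (station, run) rows before iterating. A sketch with pandas drop_duplicates (not how aurora currently builds the dataframe):

```python
import pandas as pd

# Illustrative kernel-dataset rows with run "b" at CAS04 duplicated.
df = pd.DataFrame(
    {
        "station": ["NVR08", "CAS04", "NVR08", "CAS04"],
        "run": ["a", "b", "b", "b"],
    }
)

# Keep only the first row per (station, run) pair, so each acquisition
# run is processed (and its FCs stored) at most once.
deduped = df.drop_duplicates(subset=["station", "run"], keep="first")
print(len(deduped))  # 3 -- the second CAS04/b row is dropped
```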