Report model and DataStore failure modes #268
Conversation
This looks good. There are a lot of small comments -- some are questions/musing, a few are things that should be fixed. There is some merge work to do to make it consistent with the merge from the stress test (esp. with regard to reference finding).
I have a couple of bigger questions/comments, mostly about what we expect the Report class to be for.
@@ -320,7 +320,7 @@ def string_to_bitflag(value, dictionary):
    original_keyword = keyword
    keyword = EnumConverter.c(keyword)
    if keyword not in dictionary:
-       raise ValueError(f'Keyword "{original_keyword}" not recognized in dictionary')
+       raise ValueError(f'Keyword "{original_keyword.strip()}" not recognized in dictionary')
For the error, perhaps we don't want to strip the spaces, just in case it's spurious leading/trailing spaces that are the problem.
Yeah, that's what I was thinking, but we also remove all spaces (leading or otherwise) in the `convert` method. So leaving the spaces in the error message would actually lead you down the wrong path (as it did for me).
I don't see the `.strip()` in `convert`. (I looked for it before making this comment.) What am I missing?
There is a `replace(' ', '')` command in the `c` function.
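The exchange above can be made concrete with a small sketch. This is a hypothetical reconstruction, not the actual SeeChange code: the converter (standing in for `EnumConverter.c`) removes *all* spaces, so stray leading/trailing spaces can never be the cause of a lookup failure, which is why stripping them in the error message is safe and less misleading.

```python
def convert_keyword(keyword):
    """Normalize a keyword the way the c() helper is described to work:
    lowercase and remove all spaces (not just leading/trailing ones)."""
    return keyword.lower().replace(' ', '')


def string_to_bitflag(value, dictionary):
    """Turn a comma-separated string of keywords into a bitflag integer.

    `dictionary` maps normalized keyword -> bit position (an assumption
    about the real code's layout, for illustration only).
    """
    total = 0
    for keyword in value.split(','):
        original_keyword = keyword
        keyword = convert_keyword(keyword)
        if keyword not in dictionary:
            # Report the stripped keyword: convert_keyword removes spaces
            # anyway, so spurious whitespace is never the real problem.
            raise ValueError(
                f'Keyword "{original_keyword.strip()}" not recognized in dictionary'
            )
        total += 2 ** dictionary[keyword]
    return total
```

With this normalization, `'detection'` and `'  Detection '` map to the same bit, so the error path only ever fires for genuinely unknown keywords.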
    6: 'detection',   # creates a SourceList from a subtraction Image
    7: 'cutting',     # creates Cutouts from a subtraction Image
    8: 'measuring',   # creates Measurements from Cutouts
    # TODO: add R/B scores and maybe an extra step for finalizing a report
That extra step in the TODO would also include alert production.
(My plan is not to save the full alert text, because we can reconstruct them from what's already in the database. What's more, we'll probably have something like kowalski running which will itself be saving all the alert text.)
(I view alerts as transitory things (as the name suggests), but I have heard a lot of people talk about alerts as if they are a database.)
Yeah, we've had this discussion recently. I think users expect to be able to search through alerts in a database; it's just not clear who will be managing it. I don't think an alerts database will cost us much, since we already keep all the data on the objects and measurements, and the alerts will just point to those objects.
    8: 'detections',
    9: 'cutouts',
    10: 'measurements',
    # 11: 'rb_scores',
Should ref_image be in here too?
I don't think so. I was thinking about this list as a progress bar that fills up with things made by the pipeline. The reference would be a fixture that is needed before you even start.
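The "progress bar" idea can be sketched in a few lines. This is an illustrative reconstruction (helper names are assumptions, not the real API): each pipeline step owns one bit position from the dictionary in the diff, and the report's bitflag fills up as steps complete.

```python
# Step numbers/names mirror the diff above; the helpers are hypothetical.
PROCESS_STEPS = {
    6: 'detection',   # creates a SourceList from a subtraction Image
    7: 'cutting',     # creates Cutouts from a subtraction Image
    8: 'measuring',   # creates Measurements from Cutouts
}
STEP_BITS = {name: number for number, name in PROCESS_STEPS.items()}


def mark_step_done(bitflag, step_name):
    """Set the bit for a completed step."""
    return bitflag | (1 << STEP_BITS[step_name])


def step_is_done(bitflag, step_name):
    """Check whether a step's bit is set."""
    return bool(bitflag & (1 << STEP_BITS[step_name]))
```

Note that a reference image has no bit here, consistent with the comment above: it is a precondition of the run, not something the pipeline produces.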
@@ -338,6 +340,32 @@ def get_code_version(cls, session=None):
        code_version = session.scalars(sa.select(CodeVersion).order_by(CodeVersion.id.desc())).first()
        return code_version

    def merge_concurrent(self, session=None, commit=True):
I'm assuming there's no updating like happens with the Exposure merge because a provenance with a given ID will always be identical (by construction).
Yes, in principle it would be the same, although it might have a different `is_testing`; but I don't think that would differ between instances.
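The point being discussed, that no field-by-field update is needed because a provenance with a given ID is identical by construction, can be illustrated with a simplified, self-contained sketch (a plain dict stands in for the database table; this is not the actual `merge_concurrent` implementation):

```python
# Stand-in for the provenances table, keyed by the content-derived id.
_provenance_table = {}


def merge_concurrent(provenance):
    """Insert a provenance, tolerating concurrent inserters.

    Because the id is derived from the provenance's contents, two processes
    building the same provenance can race safely: the loser simply adopts
    the already-stored row, which is identical by construction, so no
    per-field update (as in the Exposure merge) is required.
    """
    existing = _provenance_table.get(provenance['id'])
    if existing is not None:
        return existing
    _provenance_table[provenance['id']] = provenance
    return provenance
```

In the real database-backed version, the race would instead surface as a unique-constraint violation that the loser handles by re-querying for the existing row.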
@@ -121,6 +123,31 @@ def __init__(self, **kwargs):
            "verbose", 0, int, "Level of verbosity (0=quiet).", critical=False
        )

        self.inject_warnings = self.add_par(
These, and `do_warning_exception_hangup_injection_here`, are just for testing purposes, yes?
(I have failed to come up with a reason why we might ever want to use this in production.)
Just for testing.
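The test-only fault injection being discussed can be sketched as follows. This is a hypothetical illustration (function names are assumptions): a step checks an `inject_warnings` flag and emits a synthetic warning, so the Report's warning-collecting machinery can be exercised without a real failure.

```python
import warnings


def run_step(data, inject_warnings=False):
    """A stand-in pipeline step that optionally injects a warning."""
    if inject_warnings:
        warnings.warn('Injected warning for testing the Report machinery')
    return data  # real processing would happen here


def run_step_capturing(data, inject_warnings=False):
    """Run the step while recording any warnings it raised, the way the
    pipeline might collect them onto the datastore/report."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        result = run_step(data, inject_warnings=inject_warnings)
    return result, [str(w.message) for w in caught]
```

This keeps the injection entirely behind a parameter that defaults off, so production runs are unaffected.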
        raise RuntimeError('Failed to create the provenance tree!') from e

    try:  # must make sure the report is on the DB
        report = Report(exposure=ds.exposure, section_id=ds.section_id)
So, if I understand correctly, every time we run the top level pipeline, a new report gets created, yes?
(This sounds like a reasonable way to do it.)
yes.
This looks good.
@@ -59,4 +59,5 @@ jobs:

    - name: run test
      run: |
-       TEST_SUBFOLDER=models docker compose run runtests
+       shopt -s nullglob
+       TEST_SUBFOLDER=$(ls tests/models/test_{a..l}*.py) docker compose run runtests
I'm guessing this split is because the model tests have been slow?
I think the reason is the NOIRLab server; I've been seeing hangs trying to get images from there. One thing we might want to do is separate out the tests of pulling exposures from that server, and have them run only when an env var is set; then, by default, disable that env var, so the github tests won't hang up when the NOIRLab server is slow. The exposure(s) we need for the bulk of the decam tests can go to one of our archives where we store our other test images and such.
I don't think I'm downloading anything from NOIRLab anymore. The tests are slow because there are a lot of them and many of them produce an entire datastore for the test. I might look into shortening that.
@@ -311,7 +319,7 @@ def scan_datastore(self, ds, process_step, session=None):
        if self.warnings is None or self.warnings == '':
            self.warnings = new_string
        else:
-           self.warnings += ', ' + new_string
+           self.warnings += '\n***|***|***\n' + new_string
Now I understand what you meant in that comment; I was confused at first, because I thought `"***"` meant a warning, not a literal `"***"`, and couldn't see why there would be three warnings separated by `|` followed by a newline. This here makes much more sense.
Yeah I didn't think I made it very clear. But I think this will be good.
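The separator pattern from the diff above can be sketched in isolation (the helper names here are assumptions, for illustration): warnings are concatenated with a distinctive delimiter so they can later be split apart without colliding with commas inside the warning text, which was the problem with the old `', '` separator.

```python
SEPARATOR = '\n***|***|***\n'


def append_warning(existing, new_string):
    """Accumulate warnings the way scan_datastore does in the diff."""
    if existing is None or existing == '':
        return new_string
    return existing + SEPARATOR + new_string


def split_warnings(text):
    """Recover the individual warnings from the accumulated string."""
    return text.split(SEPARATOR) if text else []
```

Any unambiguous token would do; the important property is that it never appears inside a legitimate warning message.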
@@ -59,4 +59,5 @@ jobs:

    - name: run test
      run: |
-       TEST_SUBFOLDER=models docker compose run runtests
+       shopt -s nullglob
+       TEST_SUBFOLDER=$(ls tests/models/test_{a..l}*.py) docker compose run runtests
I'm guessing this is because the model tests have been slow? I think it may be because the NOIRLab server has been having problems. It's been slow for the last several days; the problems we had last week with the Gaia stars may have been part of the same thing.
I made Issue #289 suggesting that we remove the NOIRLab dependence from most of the tests, and hide the ones that still have it behind either an env var that skips them by default, or a mark indicating that we expect them to fail sometimes.
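For readers unfamiliar with the shell brace-glob in the workflow change (`tests/models/test_{a..l}*.py`), here is an illustrative Python equivalent (the directory path and function name are assumptions): it selects only the model test files whose names start with letters a through l, so the suite can be split into smaller chunks.

```python
import string
from pathlib import Path


def first_half_tests(test_dir='tests/models'):
    """Collect test files matching test_{a..l}*.py, like the shell glob."""
    files = []
    for letter in string.ascii_lowercase[:12]:  # 'a' through 'l'
        files.extend(sorted(Path(test_dir).glob(f'test_{letter}*.py')))
    return files
```

Note the workflow also sets `shopt -s nullglob` so that letters with no matching file expand to nothing instead of a literal pattern.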
We add the Report model, which tracks the progress of each pipeline run on a single section of an Exposure.
The report first and foremost tracks the start and (optional) end time of the pipeline.
If the pipeline didn't finish (because it is still running or hanging), it will have a null end time.
If the pipeline raised an exception, the end time should still be recorded.
There is also room to track any error that stopped the pipeline, warnings that came up, and the memory usage/runtime of each step.
We also use bitflags to record which processing steps were done, which pipeline products were already instantiated when the pipeline finished, and which were successfully saved to disk/database (this one is a little tricky!).
To make this work we also need to change how pipeline objects handle errors, making sure they always return a datastore with appended warnings and errors so they can be saved to the report.