Report model and DataStore failure modes #268

Merged: 24 commits into c3-time-domain:main, May 23, 2024

Conversation

@guynir42 (Member) commented May 9, 2024

We add the Report model that tracks the progress of each pipeline run on a single section of an Exposure.

The report first and foremost tracks the start and (optional) end time of the pipeline run.
If the pipeline didn't finish (because it is still running, or hung), the report will have a null end time.
If the pipeline raised an exception, the end time should still be recorded.

There is also room to track any error that stopped the pipeline, warnings that come up, and the memory/runtime of each step.

We also use bitflags to record which processing steps were done, which pipeline products were already instantiated when the pipeline was finished, and which were successfully saved to disk/database (this one is a little tricky!).

To get this to work, we also need to change how pipeline objects handle errors, making sure they always return a DataStore with appended warnings and errors, so those can be saved to the report.
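As a rough illustration of the bitflag bookkeeping (a hypothetical sketch: the step names echo the diff discussed below, but this class is illustrative, not the actual Report model):

PROCESS_STEPS = {
    6: 'detection',
    7: 'cutting',
    8: 'measuring',
}
STEP_BITS = {name: bit for bit, name in PROCESS_STEPS.items()}

class ReportSketch:
    def __init__(self):
        self.progress_bitflag = 0  # one bit per completed processing step

    def append_step(self, step_name):
        self.progress_bitflag |= 1 << STEP_BITS[step_name]

    def completed_steps(self):
        return [name for name, bit in STEP_BITS.items()
                if self.progress_bitflag & (1 << bit)]

report = ReportSketch()
report.append_step('detection')
report.append_step('cutting')
print(report.completed_steps())  # ['detection', 'cutting']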

@guynir42 marked this pull request as ready for review May 16, 2024 17:50
@guynir42 requested a review from rknop May 16, 2024 17:50
@rknop (Contributor) left a comment:

This looks good. There are a lot of small comments -- some are questions/musings, a few are things that should be fixed. There is some merge work to do to make it consistent with the merge from the stress test (esp. with regard to reference finding).

I have a couple of bigger questions/comments, mostly about what we expect the Report class to be for.

@@ -320,7 +320,7 @@ def string_to_bitflag(value, dictionary):
         original_keyword = keyword
         keyword = EnumConverter.c(keyword)
         if keyword not in dictionary:
-            raise ValueError(f'Keyword "{original_keyword}" not recognized in dictionary')
+            raise ValueError(f'Keyword "{original_keyword.strip()}" not recognized in dictionary')
@rknop (Contributor):

For the error, perhaps we don't want to strip the spaces, just in case it's spurious leading/trailing spaces that are the problem.

@guynir42 (Member, Author):

Yeah, that's what I was thinking, but we also remove all spaces (leading or otherwise) in the convert method. So leaving the spaces in the error message is actually going to lead you down the wrong path (as it did for me).

@rknop (Contributor):

I don't see the .strip() in convert. (I looked for it before making this comment.) What am I missing?

@guynir42 (Member, Author):

There is a replace(' ', '') call in the c function.
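For readers following along, a rough reconstruction of the two helpers (the body of c is inferred from this thread, not copied verbatim, and the dictionary is assumed to map keyword to bit position):

class EnumConverter:
    @classmethod
    def c(cls, keyword):
        # inferred behavior: lowercase and remove ALL spaces, interior ones included
        return keyword.lower().replace(' ', '')

def string_to_bitflag(value, dictionary):
    output = 0
    for keyword in value.split(','):
        original_keyword = keyword
        keyword = EnumConverter.c(keyword)
        if keyword not in dictionary:
            # leading/trailing spaces are removed by c(), so they can never be
            # the cause of the failure; stripping them from the echoed keyword
            # avoids pointing the user at the wrong problem
            raise ValueError(f'Keyword "{original_keyword.strip()}" not recognized in dictionary')
        output += 2 ** dictionary[keyword]
    return output

print(string_to_bitflag(' De tection ', {'detection': 6}))  # 64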

6: 'detection', # creates a SourceList from a subtraction Image
7: 'cutting', # creates Cutouts from a subtraction Image
8: 'measuring', # creates Measurements from Cutouts
# TODO: add R/B scores and maybe an extra step for finalizing a report
@rknop (Contributor):

That extra step in the TODO would also include alert production.

(My plan is not to save the full alert text, because we can reconstruct them from what's already in the database. What's more, we'll probably have something like kowalski running which will itself be saving all the alert text.)

@rknop (Contributor):

(I view alerts as transitory things (as the name suggests), but I have heard a lot of people talk about alerts as if they are a database.)

@guynir42 (Member, Author):

Yeah, we've had this discussion lately. I think users expect to be able to search through alerts in a database; it's just not clear who will be managing it. I don't think an alerts database will cost us much, since we already keep all the data on the objects and measurements, and the alerts will just point to those objects.

8: 'detections',
9: 'cutouts',
10: 'measurements',
# 11: 'rb_scores',
@rknop (Contributor):

Should ref_image be in here too?

@guynir42 (Member, Author):

I don't think so. I was thinking about this list as a progress bar that fills up with things made by the pipeline. The reference would be a fixture that is needed before you even start.
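To illustrate the progress-bar reading (a hypothetical snippet: only the bit positions come from the diff above, the helper is made up):

pipeline_products = {
    8: 'detections',
    9: 'cutouts',
    10: 'measurements',
}

def bitflag_to_products(value):
    # read the stored integer back as a checklist of created products
    return [name for bit, name in pipeline_products.items() if value & (1 << bit)]

flag = 0
flag |= 1 << 8   # detections instantiated
flag |= 1 << 9   # cutouts instantiated
print(bitflag_to_products(flag))  # ['detections', 'cutouts']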

models/exposure.py: comment thread resolved (outdated)
@@ -338,6 +340,32 @@ def get_code_version(cls, session=None):
         code_version = session.scalars(sa.select(CodeVersion).order_by(CodeVersion.id.desc())).first()
         return code_version

+    def merge_concurrent(self, session=None, commit=True):
@rknop (Contributor):

I'm assuming there's no updating like happens with the Exposure merge because a provenance with a given ID will always be identical (by construction).

@guynir42 (Member, Author):

Yes, in principle it would be the same, although it might have a different is_testing value; but I don't think that would differ between instances.
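For reference, a rough sketch of the merge-or-reload pattern being discussed (illustrative, not the actual implementation): because a provenance's ID is a hash of its contents, losing an insert race just means reloading the identical existing row.

import sqlalchemy as sa
from sqlalchemy.exc import IntegrityError

def merge_concurrent(obj, session, commit=True):
    # try to insert; if another process committed the same ID first,
    # roll back and return the already-present (identical) row instead
    try:
        session.add(obj)
        if commit:
            session.commit()
        return obj
    except IntegrityError:
        session.rollback()
        return session.scalars(
            sa.select(type(obj)).where(type(obj).id == obj.id)
        ).first()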

pipeline/measuring.py: comment thread resolved
@@ -121,6 +123,31 @@ def __init__(self, **kwargs):
             "verbose", 0, int, "Level of verbosity (0=quiet).", critical=False
         )

+        self.inject_warnings = self.add_par(
@rknop (Contributor):

These, and do_warning_exception_hangup_injection_here, are just for testing purposes, yes?

(I have failed to come up with a reason why we might ever want to use this in production.)

@guynir42 (Member, Author):

Just for testing.
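For context, a minimal sketch of what such a testing-only hook might look like (the shape is assumed: inject_warnings comes from the diff, while inject_exceptions and inject_hangups are guessed sibling parameters):

import time
import warnings

def do_warning_exception_hangup_injection_here(pars, process_name):
    # testing-only: deliberately produce the three failure modes the
    # Report must be able to record
    if getattr(pars, 'inject_warnings', False):
        warnings.warn(f'Warning injected by pipeline parameters in process "{process_name}".')
    if getattr(pars, 'inject_exceptions', False):
        raise RuntimeError(f'Exception injected by pipeline parameters in process "{process_name}".')
    if getattr(pars, 'inject_hangups', False):
        while True:  # simulate a hung pipeline step
            time.sleep(1)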

raise RuntimeError('Failed to create the provenance tree!') from e

try: # must make sure the report is on the DB
report = Report(exposure=ds.exposure, section_id=ds.section_id)
@rknop (Contributor):

So, if I understand correctly, every time we run the top level pipeline, a new report gets created, yes?

(This sounds like a reasonable way to do it.)

@guynir42 (Member, Author):

Yes.
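Schematically (a simplified sketch reusing the diff's names; the exact session handling is assumed, not copied from the repo):

import datetime

# commit a fresh Report before any processing, so even a run that later
# hangs or crashes leaves a row whose null end time flags it for follow-up
try:  # must make sure the report is on the DB
    report = Report(exposure=ds.exposure, section_id=ds.section_id)
    report.start_time = datetime.datetime.utcnow()
    session.add(report)
    session.commit()
except Exception as e:
    raise RuntimeError('Failed to save the report to the database!') from e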

pipeline/top_level.py: comment thread resolved (outdated)
requirements.txt: comment thread resolved
@guynir42 requested a review from rknop May 23, 2024 13:36
@rknop (Contributor) left a comment:

This looks good.

@@ -59,4 +59,5 @@ jobs:

       - name: run test
         run: |
-          TEST_SUBFOLDER=models docker compose run runtests
+          shopt -s nullglob
+          TEST_SUBFOLDER=$(ls tests/models/test_{a..l}*.py) docker compose run runtests
@rknop (Contributor):

I'm guessing this split is because the model tests have been slow?

I think the reason is the NOIRLab server; I've been seeing hangs trying to get images from there. One thing we might want to do is separate out the tests that pull exposures from that server, and have them run only when an env var is set; then, by default, disable that env var, so the GitHub tests won't hang when the NOIRLab server is slow. The exposure(s) we need for the bulk of the DECam tests can go to one of our archives where we store our other test images and such.

@guynir42 (Member, Author):

I don't think I'm downloading anything from NOIRLab anymore. The tests are slow because there are a lot of them and many of them produce an entire datastore for the test. I might look into shortening that.

@@ -311,7 +319,7 @@ def scan_datastore(self, ds, process_step, session=None):
         if self.warnings is None or self.warnings == '':
             self.warnings = new_string
         else:
-            self.warnings += ', ' + new_string
+            self.warnings += '\n***|***|***\n' + new_string
@rknop (Contributor):

Now I understand what you meant in that comment. At first I thought "***" meant a warning rather than a literal "***", and I couldn't see why there would be three warnings separated by |, then a newline. This makes much more sense.

@guynir42 (Member, Author):

Yeah, I didn't think I made it very clear, but I think this will be good.
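A quick self-contained illustration of why the unusual separator helps (hypothetical snippet): the old ', ' separator could appear inside a warning's own text, while the odd token almost certainly cannot, so the accumulated string splits back into clean entries.

SEPARATOR = '\n***|***|***\n'

accumulated = 'first warning' + SEPARATOR + 'second warning, with a comma'
print(accumulated.split(SEPARATOR))
# ['first warning', 'second warning, with a comma']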

@rknop (Contributor), commenting again on the CI diff above:

I'm guessing this is because the model tests have been slow? I think it may be because the NOIRLab server has been having problems. It's been slow for the last several days; the problems we had last week with the Gaia stars may have been all part of the same thing.

I made Issue #289 suggesting that we remove the NOIRLab dependence from most of the tests, and hide the ones that still have it behind either an env var that skips them by default, or a mark indicating that we expect them to fail sometimes.
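A hypothetical sketch of that gating (the env var name here is made up for illustration):

import os
import pytest

RUN_NOIRLAB = os.getenv('RUN_NOIRLAB_DOWNLOAD_TESTS', '0') == '1'

@pytest.mark.skipif(
    not RUN_NOIRLAB,
    reason='set RUN_NOIRLAB_DOWNLOAD_TESTS=1 to run tests that hit the NOIRLab server',
)
def test_download_decam_exposure():
    ...  # would pull an exposure from the NOIRLab server here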

@guynir42 merged commit 7b625fd into c3-time-domain:main May 23, 2024
6 checks passed