Report model and DataStore failure modes #268

Merged: 24 commits into c3-time-domain:main, May 23, 2024

Conversation

@guynir42 (Member) commented May 9, 2024

We add the Report model that tracks the progress of each pipeline run on a single section of an Exposure.

The report first and foremost tracks the start and (optional) end time of the pipeline run.
If the pipeline didn't finish (because it is still running, or hung), the report will have a null end time.
If the pipeline raised an exception, the end time should still be recorded.

There is also room to track any error that stopped the pipeline, warnings that come up, and the memory/runtime of each step.

We also use bitflags to record which processing steps were done, which pipeline products were already instantiated when the pipeline was finished, and which were successfully saved to disk/database (this one is a little tricky!).

To get this to work, we also need to change how pipeline objects handle errors, making sure they always return a DataStore with appended warnings and errors, so those can be saved to the report.
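As a rough illustration of the bitflag bookkeeping (a hypothetical sketch: the step names echo the diff discussed below, but this class is illustrative, not the actual Report model):

PROCESS_STEPS = {
    6: 'detection',
    7: 'cutting',
    8: 'measuring',
}
STEP_BITS = {name: bit for bit, name in PROCESS_STEPS.items()}

class ReportSketch:
    def __init__(self):
        self.progress_bitflag = 0  # one bit per completed processing step

    def append_step(self, step_name):
        self.progress_bitflag |= 1 << STEP_BITS[step_name]

    def completed_steps(self):
        return [name for name, bit in STEP_BITS.items()
                if self.progress_bitflag & (1 << bit)]

report = ReportSketch()
report.append_step('detection')
report.append_step('cutting')
print(report.completed_steps())  # ['detection', 'cutting']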

@guynir42 marked this pull request as ready for review May 16, 2024 17:50
@guynir42 requested a review from rknop May 16, 2024 17:50
@rknop (Contributor) left a comment:

This looks good. There are a lot of small comments -- some are questions/musings, a few are things that should be fixed. There is some merge work to do to make it consistent with the merge from the stress test (esp. with regard to reference finding).

I have a couple of bigger questions/comments, mostly about what we expect the Report class to be for.

@@ -320,7 +320,7 @@ def string_to_bitflag(value, dictionary):
         original_keyword = keyword
         keyword = EnumConverter.c(keyword)
         if keyword not in dictionary:
-            raise ValueError(f'Keyword "{original_keyword}" not recognized in dictionary')
+            raise ValueError(f'Keyword "{original_keyword.strip()}" not recognized in dictionary')
@rknop (Contributor):

For the error, perhaps we don't want to strip the spaces, just in case it's spurious leading/trailing spaces that are the problem.

@guynir42 (Member, Author):

Yeah, that's what I was thinking, but we also remove all spaces (leading or otherwise) in the convert method. So leaving the spaces in the error message is actually going to lead you down the wrong path (as it did for me).

@rknop (Contributor):

I don't see the .strip() in convert. (I looked for it before making this comment.) What am I missing?

@guynir42 (Member, Author):

There is a replace(' ', '') call in the c function.
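For readers following along, a rough reconstruction of the two helpers (the body of c is inferred from this thread, not copied verbatim, and the dictionary is assumed to map keyword to bit position):

class EnumConverter:
    @classmethod
    def c(cls, keyword):
        # inferred behavior: lowercase and remove ALL spaces, interior ones included
        return keyword.lower().replace(' ', '')

def string_to_bitflag(value, dictionary):
    output = 0
    for keyword in value.split(','):
        original_keyword = keyword
        keyword = EnumConverter.c(keyword)
        if keyword not in dictionary:
            # leading/trailing spaces are removed by c(), so they can never be
            # the cause of the failure; stripping them from the echoed keyword
            # avoids pointing the user at the wrong problem
            raise ValueError(f'Keyword "{original_keyword.strip()}" not recognized in dictionary')
        output += 2 ** dictionary[keyword]
    return output

print(string_to_bitflag(' De tection ', {'detection': 6}))  # 64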

6: 'detection', # creates a SourceList from a subtraction Image
7: 'cutting', # creates Cutouts from a subtraction Image
8: 'measuring', # creates Measurements from Cutouts
# TODO: add R/B scores and maybe an extra step for finalizing a report
@rknop (Contributor):

That extra step in the TODO would also include alert production.

(My plan is not to save the full alert text, because we can reconstruct them from what's already in the database. What's more, we'll probably have something like kowalski running which will itself be saving all the alert text.)

@rknop (Contributor):

(I view alerts as transitory things (as the name suggests), but I have heard a lot of people talk about alerts as if they are a database.)

@guynir42 (Member, Author):

Yeah, we've had this discussion lately. I think users expect to be able to search through alerts in a database; it's just not clear who will be managing it. I don't think an alerts database will cost us much, since we already keep all the data on the objects and measurements, and the alerts will just point to those objects.

8: 'detections',
9: 'cutouts',
10: 'measurements',
# 11: 'rb_scores',
@rknop (Contributor):

Should ref_image be in here too?

@guynir42 (Member, Author):

I don't think so. I was thinking about this list as a progress bar that fills up with things made by the pipeline. The reference would be a fixture that is needed before you even start.
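To illustrate the progress-bar reading (a hypothetical snippet: only the bit positions come from the diff above, the helper is made up):

pipeline_products = {
    8: 'detections',
    9: 'cutouts',
    10: 'measurements',
}

def bitflag_to_products(value):
    # read the stored integer back as a checklist of created products
    return [name for bit, name in pipeline_products.items() if value & (1 << bit)]

flag = 0
flag |= 1 << 8   # detections instantiated
flag |= 1 << 9   # cutouts instantiated
print(bitflag_to_products(flag))  # ['detections', 'cutouts']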

models/exposure.py: comment thread resolved (outdated)
@@ -338,6 +340,32 @@ def get_code_version(cls, session=None):
         code_version = session.scalars(sa.select(CodeVersion).order_by(CodeVersion.id.desc())).first()
         return code_version

+    def merge_concurrent(self, session=None, commit=True):
@rknop (Contributor):

I'm assuming there's no updating like happens with the Exposure merge because a provenance with a given ID will always be identical (by construction).

@guynir42 (Member, Author):

Yes, in principle it would be the same, although it might have a different is_testing value; but I don't think that would differ between instances.
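For reference, a rough sketch of the merge-or-reload pattern being discussed (illustrative, not the actual implementation): because a provenance's ID is a hash of its contents, losing an insert race just means reloading the identical existing row.

import sqlalchemy as sa
from sqlalchemy.exc import IntegrityError

def merge_concurrent(obj, session, commit=True):
    # try to insert; if another process committed the same ID first,
    # roll back and return the already-present (identical) row instead
    try:
        session.add(obj)
        if commit:
            session.commit()
        return obj
    except IntegrityError:
        session.rollback()
        return session.scalars(
            sa.select(type(obj)).where(type(obj).id == obj.id)
        ).first()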

pipeline/measuring.py: comment thread resolved
@@ -121,6 +123,31 @@ def __init__(self, **kwargs):
             "verbose", 0, int, "Level of verbosity (0=quiet).", critical=False
         )

+        self.inject_warnings = self.add_par(
@rknop (Contributor):

These, and do_warning_exception_hangup_injection_here, are just for testing purposes, yes?

(I have failed to come up with a reason why we might ever want to use this in production.)

@guynir42 (Member, Author):

Just for testing.
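For context, a minimal sketch of what such a testing-only hook might look like (the shape is assumed: inject_warnings comes from the diff, while inject_exceptions and inject_hangups are guessed sibling parameters):

import time
import warnings

def do_warning_exception_hangup_injection_here(pars, process_name):
    # testing-only: deliberately produce the three failure modes the
    # Report must be able to record
    if getattr(pars, 'inject_warnings', False):
        warnings.warn(f'Warning injected by pipeline parameters in process "{process_name}".')
    if getattr(pars, 'inject_exceptions', False):
        raise RuntimeError(f'Exception injected by pipeline parameters in process "{process_name}".')
    if getattr(pars, 'inject_hangups', False):
        while True:  # simulate a hung pipeline step
            time.sleep(1)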

raise RuntimeError('Failed to create the provenance tree!') from e

try: # must make sure the report is on the DB
report = Report(exposure=ds.exposure, section_id=ds.section_id)
@rknop (Contributor):

So, if I understand correctly, every time we run the top level pipeline, a new report gets created, yes?

(This sounds like a reasonable way to do it.)

@guynir42 (Member, Author):

Yes.
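Schematically (a simplified sketch reusing the diff's names; the exact session handling is assumed, not copied from the repo):

import datetime

# commit a fresh Report before any processing, so even a run that later
# hangs or crashes leaves a row whose null end time flags it for follow-up
try:  # must make sure the report is on the DB
    report = Report(exposure=ds.exposure, section_id=ds.section_id)
    report.start_time = datetime.datetime.utcnow()
    session.add(report)
    session.commit()
except Exception as e:
    raise RuntimeError('Failed to save the report to the database!') from e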

pipeline/top_level.py: comment thread resolved (outdated)
requirements.txt: comment thread resolved
@guynir42 requested a review from rknop May 23, 2024 13:36
@rknop (Contributor) left a comment:

This looks good.

@@ -59,4 +59,5 @@ jobs:

       - name: run test
         run: |
-          TEST_SUBFOLDER=models docker compose run runtests
+          shopt -s nullglob
+          TEST_SUBFOLDER=$(ls tests/models/test_{a..l}*.py) docker compose run runtests
@rknop (Contributor):

I'm guessing this split is because the model tests have been slow?

I think the reason is the NOIRLab server; I've been seeing hangs trying to get images from there. One thing we might want to do is separate out the tests that pull exposures from that server, and have them run only when an env var is set; then, by default, disable that env var, so the GitHub tests won't hang when the NOIRLab server is slow. The exposure(s) we need for the bulk of the DECam tests can go to one of our archives where we store our other test images and such.

@guynir42 (Member, Author):

I don't think I'm downloading anything from NOIRLab anymore. The tests are slow because there are a lot of them and many of them produce an entire datastore for the test. I might look into shortening that.

@@ -311,7 +319,7 @@ def scan_datastore(self, ds, process_step, session=None):
         if self.warnings is None or self.warnings == '':
             self.warnings = new_string
         else:
-            self.warnings += ', ' + new_string
+            self.warnings += '\n***|***|***\n' + new_string
@rknop (Contributor):

Now I understand what you meant in that comment. At first I thought "***" meant a warning rather than a literal "***", and I couldn't see why there would be three warnings separated by |, then a newline. This makes much more sense.

@guynir42 (Member, Author):

Yeah, I didn't think I made it very clear, but I think this will be good.
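A quick self-contained illustration of why the unusual separator helps (hypothetical snippet): the old ', ' separator could appear inside a warning's own text, while the odd token almost certainly cannot, so the accumulated string splits back into clean entries.

SEPARATOR = '\n***|***|***\n'

accumulated = 'first warning' + SEPARATOR + 'second warning, with a comma'
print(accumulated.split(SEPARATOR))
# ['first warning', 'second warning, with a comma']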

@rknop (Contributor), commenting again on the CI diff above:

I'm guessing this is because the model tests have been slow? I think it may be because the NOIRLab server has been having problems. It's been slow for the last several days; the problems we had last week with the Gaia stars may have been all part of the same thing.

I made Issue #289 suggesting that we remove the NOIRLab dependence from most of the tests, and hide the ones that still have it behind either an env var that skips them by default, or a mark indicating that we expect them to fail sometimes.
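A hypothetical sketch of that gating (the env var name here is made up for illustration):

import os
import pytest

RUN_NOIRLAB = os.getenv('RUN_NOIRLAB_DOWNLOAD_TESTS', '0') == '1'

@pytest.mark.skipif(
    not RUN_NOIRLAB,
    reason='set RUN_NOIRLAB_DOWNLOAD_TESTS=1 to run tests that hit the NOIRLab server',
)
def test_download_decam_exposure():
    ...  # would pull an exposure from the NOIRLab server here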

@guynir42 merged commit 7b625fd into c3-time-domain:main May 23, 2024
6 checks passed