ci: Remember failed workflows/td files and run them first in CI #31263

def- · 2025-01-31T17:47:02Z

We have three priorities to determine the order of parts (workflows, td/slt files):

2: Parts that failed in this PR in previous runs
1: Parts that failed before on main or any PR
0: Parts that haven't failed
When fetching the order fails (timeout 15 seconds), we just run in the same order as before.

This uses the Materialize Production Analytics database, so also another nice use case for dog fooding

Fixes: https://github.com/MaterializeInc/database-issues/issues/8935

Verification runs:
https://buildkite.com/materialize/test/builds/98151
https://buildkite.com/materialize/nightly/builds/11053
https://buildkite.com/materialize/nightly/builds/11051
https://buildkite.com/materialize/release-qualification/builds/739

Checklist

This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

def- · 2025-02-05T09:57:14Z

misc/python/materialize/mzcompose/composition.py

+                    # raise
+                    # We could also keep running, but then runtime is still
+                    # slow when a test fails, and the annotation only shows up
+                    # after the test finished:
+                    exceptions.append(e)


This part is what I'm unsure about in this PR. We collect the test failures at the end of the test run and only then mark the test as failed and write the annotation. Changing this would be a really major change.

So we currently have two options:

Stop the run after the first part (workflow, td/slt file) failed, and then immediately get feedback on that one, but not on the remaining parts.

Keep running all parts, and get later feedback with all.
I'm currently going with the second approach, but that means you only notice an early failure if you check the logs manually.

So I don't see a way around changing the annotation logic to be able to annotate during runs already, if we want faster feedback and still run all parts after failures. Opinions?

I think it's worth running all tests, to avoid death by a thousand cuts.

If someone cancels a workflow, can we still collect any failures that happened so far?

Okay, perfect. I already tend to watch CI for early failures and can cancel as necessary, so this seems like the best option. Nice!

mgree

I can't really comment on the code quality of all the runner stuff, but the policy described seems really sensible to me. Thank you for the quick turnaround!

def- · 2025-02-05T17:00:57Z

We can give merging this a try and see if it works well, will merge it tomorrow if there are no complaints. Another test run meanwhile:
https://buildkite.com/materialize/test/builds/98168
https://buildkite.com/materialize/nightly/builds/11057

to be able to run them first in the next run on a PR

def- force-pushed the pr-faster-turnaround branch 17 times, most recently from b09a6c1 to cf37060 Compare February 5, 2025 09:48

def- changed the title ~~ci: Remember failed workflows/td files~~ ci: Remember failed workflows/td files and run them first in CI Feb 5, 2025

def- marked this pull request as ready for review February 5, 2025 09:51

def- requested a review from a team as a code owner February 5, 2025 09:51

def- requested review from mgree, pH14, teskje and bobbyiliev February 5, 2025 09:52

def- commented Feb 5, 2025

View reviewed changes

def- force-pushed the pr-faster-turnaround branch from cf37060 to 7558f85 Compare February 5, 2025 10:47

mgree approved these changes Feb 5, 2025

View reviewed changes

ci: Remember failed workflows/td files

a7e3622

to be able to run them first in the next run on a PR

def- force-pushed the pr-faster-turnaround branch from 7558f85 to a7e3622 Compare February 5, 2025 17:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci: Remember failed workflows/td files and run them first in CI #31263

ci: Remember failed workflows/td files and run them first in CI #31263

def- commented Jan 31, 2025 •

edited

Loading

def- Feb 5, 2025

mgree Feb 5, 2025

def- Feb 5, 2025 •

edited

Loading

mgree Feb 5, 2025

mgree left a comment

def- commented Feb 5, 2025 •

edited

Loading

ci: Remember failed workflows/td files and run them first in CI #31263

Are you sure you want to change the base?

ci: Remember failed workflows/td files and run them first in CI #31263

Conversation

def- commented Jan 31, 2025 • edited Loading

Checklist

def- Feb 5, 2025

Choose a reason for hiding this comment

mgree Feb 5, 2025

Choose a reason for hiding this comment

def- Feb 5, 2025 • edited Loading

Choose a reason for hiding this comment

mgree Feb 5, 2025

Choose a reason for hiding this comment

mgree left a comment

Choose a reason for hiding this comment

def- commented Feb 5, 2025 • edited Loading

def- commented Jan 31, 2025 •

edited

Loading

def- Feb 5, 2025 •

edited

Loading

def- commented Feb 5, 2025 •

edited

Loading