Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: Remember failed workflows/td files and run them first in CI #31263

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

def-
Copy link
Contributor

@def- def- commented Jan 31, 2025

We have three priorities to determine the order of parts (workflows, td/slt files):

  • 2: Parts that failed in this PR in previous runs
  • 1: Parts that failed before on main or any PR
  • 0: Parts that haven't failed
    When fetching the order fails (timeout 15 seconds), we just run in the same order as before.

This uses the Materialize Production Analytics database, so also another nice use case for dog fooding

Fixes: https://github.com/MaterializeInc/database-issues/issues/8935

Verification runs:
https://buildkite.com/materialize/test/builds/98151
https://buildkite.com/materialize/nightly/builds/11053
https://buildkite.com/materialize/nightly/builds/11051
https://buildkite.com/materialize/release-qualification/builds/739

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@def- def- force-pushed the pr-faster-turnaround branch 17 times, most recently from b09a6c1 to cf37060 Compare February 5, 2025 09:48
@def- def- changed the title ci: Remember failed workflows/td files ci: Remember failed workflows/td files and run them first in CI Feb 5, 2025
@def- def- marked this pull request as ready for review February 5, 2025 09:51
@def- def- requested a review from a team as a code owner February 5, 2025 09:51
@def- def- requested review from mgree, pH14, teskje and bobbyiliev February 5, 2025 09:52
Comment on lines +1602 to +1606
# raise
# We could also keep running, but then runtime is still
# slow when a test fails, and the annotation only shows up
# after the test finished:
exceptions.append(e)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part is what I'm unsure about in this PR. We collect the test failures at the end of the test run and only then mark the test as failed and write the annotation. Changing this would be a really major change.

So we currently have two options:

  • Stop the run after the first part (workflow, td/slt file) failed, and then immediately get feedback on that one, but not on the remaining parts.
  • Keep running all parts, and get later feedback with all.
    I'm currently going with the second approach, but that means you only notice an early failure if you check the logs manually.

So I don't see a way around changing the annotation logic to be able to annotate during runs already, if we want faster feedback and still run all parts after failures. Opinions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth running all tests, to avoid death by a thousand cuts.

If someone cancels a workflow, can we still collect any failures that happened so far?

Copy link
Contributor Author

@def- def- Feb 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, perfect. I already tend to watch CI for early failures and can cancel as necessary, so this seems like the best option. Nice!

@def- def- force-pushed the pr-faster-turnaround branch from cf37060 to 7558f85 Compare February 5, 2025 10:47
Copy link
Contributor

@mgree mgree left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't really comment on the code quality of all the runner stuff, but the policy described seems really sensible to me. Thank you for the quick turnaround!

@def-
Copy link
Contributor Author

def- commented Feb 5, 2025

We can give merging this a try and see if it works well, will merge it tomorrow if there are no complaints. Another test run meanwhile:
https://buildkite.com/materialize/test/builds/98168
https://buildkite.com/materialize/nightly/builds/11057

to be able to run them first in the next run on a PR
@def- def- force-pushed the pr-faster-turnaround branch from 7558f85 to a7e3622 Compare February 5, 2025 17:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants