Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make a new Github issue when a nightly run fails #6689

Open
danielhollas opened this issue Jan 9, 2025 · 9 comments
Open

Make a new Github issue when a nightly run fails #6689

danielhollas opened this issue Jan 9, 2025 · 9 comments

Comments

@danielhollas
Copy link
Collaborator

danielhollas commented Jan 9, 2025

We're running nightly tests periodically once per day:

- cron: 0 0 * * * # Run every day at midnight

When the workflow fails, the failure is posted in the aiida-core-dev Slack channel.

There are several issues with this:

  1. The Slack channel is not public and easy to miss (as we discussed with @GeigerJ2 during coding week, people were generally not aware of it).
  2. It's hard to track, discuss and resolve the failures in a consistent manner.

Instead of posting to a Slack channel, I would propose that failing workflow would automatically create Github issue. That might be noisy at fist, but it would force us to deal with the issues transparently.

The implementation of this is actually quite simple, I took this idea from the ruff repository:
https://github.com/astral-sh/ruff/blob/d0b2bbd55ee6435bc3dad8db2898aec216d85121/.github/workflows/daily_fuzz.yaml#L60

  create-issue-on-failure:
    name: Create an issue if the daily fuzz surfaced any bugs
    runs-on: ubuntu-latest
    needs: fuzz
    if: ${{ github.repository == 'astral-sh/ruff' && always() && github.event_name == 'schedule' && needs.fuzz.result == 'failure' }}
    permissions:
      issues: write
    steps:
      - uses: actions/github-script@v7
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          script: |
            await github.rest.issues.create({
              owner: "astral-sh",
              repo: "ruff",
              title: `Daily parser fuzz failed on ${new Date().toDateString()}`,
              body: "Run listed here: https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}",
              labels: ["bug", "parser", "fuzzer"],
            })

CC @unkcpz @agoscinski

@unkcpz
Copy link
Member

unkcpz commented Jan 12, 2025

@danielhollas this would be helpful. I am testing which slow tests can goes to nightly without lower the test coverage in #6701. Hope these two changes can combine to make our life more efficient.

@GeigerJ2
Copy link
Contributor

Thanks for looking into this, @danielhollas! I was quite surprised when I heard the previous way was via Slack channel notifications ^^ So I'm in favor of this! More transparent, and better to deal with it directly on GitHub.
Just wondering what happens if the nightly build fails more than once because of the same problem. Judging from the YAML, I assume we would end up with one issue per day when nightly fails.

@danielhollas
Copy link
Collaborator Author

Just wondering what happens if the nightly build fails more than once because of the same problem. Judging from the YAML, I assume we would end up with one issue per day when nightly fails.

Good questions. Indeed, if there is a failing test that fails reproducibly, it will generate a new issue every day. A good reminder to fix it! :-D I think a more common situation will be a flaky test that fails from time to time. In the case the new tickets should be marked as duplicates and closed. WDYT? (btw: This is how the slack messages behave already).

@GeigerJ2
Copy link
Contributor

I think it's fine to try this out now. If it becomes too noisy, we can still always easily revert it. Being forced to quickly fix reliably failing tests is also good, I agree ^^
Though, the other situation you described seems more annoying... it's still a mystery to me why flaky tests sometimes fail. Just when I fixed the failing du test in #6702, some test-amd64 started failing for no reason, and re-running by @unkcpz solved it. What are the reasons tests fail in such an unpredictable manner? I'm aware of, e.g., GHA runners being slow sometimes and tests timing out, anything else comes to mind? (just asking out of personal curiosity)

@unkcpz
Copy link
Member

unkcpz commented Jan 15, 2025

Just when I fixed the failing du test in #6702, some test-amd64 started failing for no reason, and re-running by @unkcpz solved it.

docker test can fail for many reasons, but followings reasons are out of our hands and I think we can just rerun

  • when we reach the rate limit on dockerhub registry. I think it happened last week for aiidateam because we run many CI actions. "the pull limit is 100 pulls per 21600 seconds (6 hours)," (https://docs.docker.com/docker-hub/download-rate-limit/)
  • problem from github, I think when we had yesterday after your PR is problem of github, since for other CI test and readthedocs build failed with exactly the same reason.

For other reasons, we need take a look at the changes.

@danielhollas
Copy link
Collaborator Author

Looks like we're in agreement here, but I'd wait before the CI stabilizes a bit, we've been getting really unlucky with various issues for the past weeks. 😅

@sphuber
Copy link
Contributor

sphuber commented Jan 18, 2025

Talking about failing nightlies, the nightly has been failing for a while now: https://github.com/aiidateam/aiida-core/commit/b43261143a136a58afddcfa1438036c99b39f8b7/checks
Apparently the uv lock --check check fails and the lockfile needs to be updated. I tried running uv lock locally but that didn't change anything, so not sure what is going wrong. @danielhollas I think you added uv, could you have a look?

@sphuber
Copy link
Contributor

sphuber commented Jan 18, 2025

Thanks for looking into this, @danielhollas! I was quite surprised when I heard the previous way was via Slack channel notifications ^^ So I'm in favor of this! More transparent, and better to deal with it directly on GitHub.

Bit surprising since you are part of the channel where these are published. Can't you see those messages?

I assume we would end up with one issue per day when nightly fails.

This would be my main gripe. It would ultra annoying to have N duplicate issues. Sure, the intention to have that motivate us to address it quickly is admirable, but let's be honest, I don't think that is going to happen :D

@danielhollas
Copy link
Collaborator Author

Apparently the uv lock --check check fails and the lockfile needs to be updated. I tried running uv lock locally but that didn't change anything, so not sure what is going wrong. @danielhollas I think you added uv, could you have a look?

I think this should now be fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants