"Re-run failed jobs" will not work with a parallel workflow #574

Closed
mellis481 opened this issue Jun 22, 2022 · 17 comments

@mellis481

mellis481 commented Jun 22, 2022

After implementing a custom build ID so that I could re-run workflows that are integrated with Cypress Dashboard and configured to run in parallel, I've run into an issue (oddly different from this one).

Here is my job FWIW:

- id: cypress-mocked-api-tests
  uses: cypress-io/github-action@v2
  with:
    wait-on: 'https://localhost:9001/index.js'
    start: npm run start:${{ env.NODE_ENV }}
    config-file: cypress/config/${{ env.NODE_ENV }}.json
    config: video=true,videoUploadOnPasses=false
    spec: '**/*.spec.ts'
    install: false
    record: true
    parallel: true
    group: 'Mocked-API'
    ci-build-id: ${{ needs.prepare.outputs.uuid }}

This job will load balance all my spec files across five containers under a "Mocked-API" group. This works great and I can re-run all jobs without issue.
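
For readers who have not seen the rest of the workflow, the setup described above can be sketched roughly as follows. This is a reconstruction under assumptions, not the poster's actual file: the job name `cypress-mocked-api`, the `containers` matrix, and the way the uuid is built are illustrative, modeled on the "custom build id" pattern the step's `needs.prepare.outputs.uuid` reference implies.

```yaml
jobs:
  # Runs once per workflow attempt and produces the shared build id.
  prepare:
    runs-on: ubuntu-latest
    outputs:
      uuid: ${{ steps.uuid.outputs.value }}
    steps:
      - id: uuid
        # Commit SHA plus timestamp; the exact format is an assumption.
        run: echo "value=sha-$GITHUB_SHA-time-$(date +%s)" >> "$GITHUB_OUTPUT"

  # Hypothetical job name; five matrix entries give five containers.
  cypress-mocked-api:
    needs: prepare
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        containers: [1, 2, 3, 4, 5]
    steps:
      - uses: actions/checkout@v3
      - id: cypress-mocked-api-tests
        uses: cypress-io/github-action@v2
        with:
          record: true
          parallel: true
          group: 'Mocked-API'
          # Every container must report the same id to be load balanced together.
          ci-build-id: ${{ needs.prepare.outputs.uuid }}
        env:
          CYPRESS_RECORD_KEY: ${{ secrets.CYPRESS_RECORD_KEY }}
```

The point of the sketch is that the build id comes from a separate job, which matters for how partial re-runs behave.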

On a recent run, one of the five containers failed because one test failed. I thought I'd test how "Re-run failed jobs" worked on just the failed container job. My hope/expectation was that it would be smart enough to know which spec files that container ran when the entire workflow originally executed (six spec files containing 22 tests) and run those again. Instead it ran zero spec files and completed successfully. It seems the matrix-level orchestration that is needed does not occur when only a failed container job is re-run. It looks like someone else has run into this issue too and is trying to solve it by disabling the "Re-run failed jobs" option in GitHub (which doesn't seem possible).

This is a fairly big problem because the group (which I've configured as a status check in my trunk branch protection rule) passed, so the PR could be merged even though the full test suite had never run successfully.

@conversaShawn
Contributor

@mellis481 We recommend passing the GITHUB_TOKEN secret (created by the GH Action automatically) as an environment variable. This will allow correctly identifying every build and avoid confusion when re-running a build.

You can find an example here: https://github.com/cypress-io/github-action#record-test-results-on-cypress-dashboard
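
Applied to the step from the original post, the recommendation would look roughly like this. It is a minimal sketch: only the `env` block is new, and the `CYPRESS_RECORD_KEY` secret name is an assumption about how the record key is stored.

```yaml
- id: cypress-mocked-api-tests
  uses: cypress-io/github-action@v2
  with:
    record: true
    parallel: true
    group: 'Mocked-API'
    ci-build-id: ${{ needs.prepare.outputs.uuid }}
  env:
    # Assumed secret name for the Cypress record key.
    CYPRESS_RECORD_KEY: ${{ secrets.CYPRESS_RECORD_KEY }}
    # Created automatically by GitHub Actions for every workflow run.
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```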

@mellis481
Author

mellis481 commented Jun 28, 2022

@conversaShawn That did nothing. This is what happened:

  • I added the GITHUB_TOKEN as an env variable to my cypress-io/github-action@v2 job in my PR workflow.
  • I added a failing test to my suite.
  • I ran the workflow which is configured to run in parallel using 5 containers. The test failed on Machine 5.
  • I executed "Re-run failed jobs".
  • On the second workflow run, Machine 5 executed 0 tests and passed.

@ilovegithub2

We are seeing exactly the same issue: re-running only the failed jobs does not run any tests, yet marks each job as passed.

Here is our configuration:

      - name: Run integration tests
        timeout-minutes: 20
        uses: cypress-io/github-action@v4
        env:
          CYPRESS_RECORD_KEY: ${{ secrets.CYPRESS_RECORD_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          ci-build-id: ${{ needs.prepare.outputs.uuid }}
          config: baseUrl=${{ format('https://pr{0}-www.build.{1}', github.event.number, env.CBR_PROJECT_DOMAIN) }}
          wait-on: ${{ format('https://pr{0}-www.build.{1}', github.event.number, env.CBR_PROJECT_DOMAIN) }}
          wait-on-timeout: 120
          browser: chrome
          record: true
          parallel: true
          group: merge
          install: false
          working-directory: tests/web

@mgambati

Same case here.
Tests pass without execution after retrying failed jobs.

@modern-sapien

@admah
Contributor

admah commented Sep 13, 2022

There were recently some changes in our services repo that may have taken care of this issue. Can someone retest with 10.7.0 or later and post results? Thanks!

@mellis481
Author

> There were recently some changes in our services repo that may have taken care of this issue. Can someone retest with 10.7.0 or later and post results? Thanks!

@admah Thanks for contributing to this thread! I just tested with 10.8.0 and it did NOT work correctly. What I'm seeing now is different and not nearly as problematic as the initially-reported issue (the most egregious part of which was passing a workflow after re-running a workflow with a failing Cypress test), but still incorrect. To provide more details...

I added a failing test to my repo that is currently configured to balance my 39 Cypress spec files across five containers. As expected, the job for the container that had the new failing test failed while all other jobs completed successfully.
image

I then selected "Re-run failed jobs". This created a new workflow run which essentially copied the jobs that completed successfully in the first run and started re-running the one failing job. When I went into Cypress Dashboard to inspect this re-run further, I found that it was running specs in only one container (good), but it was running all 39 specs in that container (bad/wacky).
image

It should have re-run only the specs that container originally ran in the first run (in my case 7 specs). The failing test in this workflow run did end up failing the job and, subsequently, the workflow as desired, but it is, of course, undesirable for "Re-run failed jobs" to re-run all Cypress specs. It's not re-running failed (Cypress) jobs at that point; it's re-running all Cypress tests using the number of containers that failed in the original run.

@admah
Contributor

admah commented Sep 14, 2022

@mellis481 thanks for the screenshots and additional context. That's very helpful. I was able to get some more clarity on this from our Cloud team.

Here is the current status:

  • Before, there was an issue where all re-runs got a PASS, regardless of actual status. This issue has been fixed.
  • Currently, if a re-run is initiated, all specs get run on the machines available. That is not optimal. The Cloud team is looking into the connection between GH Actions and Cypress in order to set up re-runs to be accurate and efficient.

@mellis481
Author

mellis481 commented Sep 14, 2022

@admah I'm glad the update from your Cloud team matches my findings (in many fewer words 😄).

Hoping additional info on the second bullet will be shared in this thread when available.

@admah
Contributor

admah commented Sep 14, 2022

@mellis481 yes, I will be providing updates as they're available.

I will be closing this and updating in #531 since this is a duplicate of that issue.

@admah
Contributor

admah commented Sep 14, 2022

Duplicate of #531

@cgraham-rs

I do not believe this ticket is a dupe of #531

This issue documents a scenario where using Re-run failed jobs runs exactly 0 tests and then emits a false pass.

#531 documents a scenario where re-run executes ALL tests in the suite instead of JUST the failed tests. This is a completely separate issue.

@jennifer-shehane
Member

@cgraham-rs If you're experiencing this behavior, it is because there is not a unique ci-build-id associated with the rerun. We try to interpret unique buildIds on our side, but if you're encountering this you can pass a unique buildId via the ci-build-id flag: https://docs.cypress.io/cloud/features/smart-orchestration/parallelization#Linking-CI-machines-for-parallelization-or-grouping

@cgraham-rs

@jennifer-shehane As I mentioned in #431, the "Robust custom build id" documentation instructs us to generate the ci-build-id in a job separate from the test job. It notes "...If the user re-runs the workflow a new unique build id is generated...", which is true. But that causes problems when a user chooses "Re-run failed jobs", which re-runs ONLY the test job and not the setup job, so the test job receives the same ci-build-id as the previous run.
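
One way to avoid reusing a stale id, sketched here as an untested assumption rather than anything the maintainers recommend in this thread, is to derive the ci-build-id inside the test job from the `github.run_id` and `github.run_attempt` contexts. Both are standard GitHub Actions context values; `run_attempt` increments on every re-run, including "Re-run failed jobs". The `group` value and secret name are taken from the configuration posted earlier in the thread.

```yaml
- uses: cypress-io/github-action@v4
  with:
    record: true
    parallel: true
    group: merge
    # Same value for all containers within one attempt,
    # a different value for every re-run attempt.
    ci-build-id: ${{ github.run_id }}-${{ github.run_attempt }}
  env:
    CYPRESS_RECORD_KEY: ${{ secrets.CYPRESS_RECORD_KEY }}
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

Because a partial re-run then looks like a brand-new build, the re-run containers would be expected to split all specs among themselves, which matches the "all 39 specs in one container" behavior reported above. It trades the false pass for a slower re-run; it does not restore the original spec distribution.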

@jennifer-shehane
Member

@cgraham-rs This is an issue on our radar - that re-running failed jobs has a less than ideal experience in most CIs. We intend to look more into addressing this.

@cgraham-rs

@jennifer-shehane A false pass is a really serious issue. AFAIK there's currently no open ticket tracking this specific problem. My suggestion is that this ticket be re-opened until GitHub's "Re-run failed jobs" can no longer run 0 tests and report a false pass.
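
Until that is addressed upstream, a workflow could guard against the false pass itself. The sketch below is an assumption, not a documented feature: it relies on outputs of jobs that are not re-run being carried over unchanged, as this thread reports for the uuid. The `attempt` output, the job name `cypress`, and the step name are hypothetical.

```yaml
jobs:
  prepare:
    runs-on: ubuntu-latest
    outputs:
      uuid: ${{ steps.uuid.outputs.value }}
      # Attempt number at the time this job actually executed.
      attempt: ${{ github.run_attempt }}
    steps:
      - id: uuid
        run: echo "value=sha-$GITHUB_SHA-time-$(date +%s)" >> "$GITHUB_OUTPUT"

  cypress:
    needs: prepare
    runs-on: ubuntu-latest
    steps:
      # If prepare was not re-executed in this attempt, its uuid is stale.
      - name: Fail on stale build id
        if: ${{ needs.prepare.outputs.attempt != github.run_attempt }}
        run: |
          echo "::error::ci-build-id comes from an earlier attempt; use 'Re-run all jobs' instead."
          exit 1
      # Checkout and the cypress-io/github-action step would follow here.
```

With this guard, "Re-run all jobs" still works because `prepare` runs again and the attempt numbers match, while "Re-run failed jobs" fails loudly instead of passing with zero specs.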

@danjohansenconsulting

@cgraham-rs We have an initiative in Cypress Cloud that is a precursor to supporting "Re-run failed jobs". That initial work is scheduled for Q1. Once a solution launches, we will announce its release in Cloud.
