Optimize gradle-check CI Pipeline to Handle Rapid PR Updates #5008

Closed
5 tasks done
rishabh6788 opened this issue Sep 10, 2024 · 5 comments

@rishabh6788
Collaborator

rishabh6788 commented Sep 10, 2024

Description:

We're experiencing inefficiencies in our gradle-check CI pipeline because rapid PR updates trigger multiple workflow runs. This results in redundant GitHub Actions runs and Jenkins job executions, which cause unnecessary resource consumption and queue exhaustion.

Current Behavior:

  1. Each new commit to a PR triggers a GitHub Actions workflow.
  2. The workflow submits a Jenkins job via generic webhook trigger.
  3. GitHub Actions waits for Jenkins job completion (SUCCESS/FAILURE).
  4. Multiple commits in quick succession lead to parallel, redundant workflow/job executions, resulting in queue exhaustion because only 30 runs can execute in parallel.
  5. Multiple force-pushes in quick succession also cause git ref failures in gradle-check while checking out the code.
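
To make the queue-exhaustion point concrete, here is a rough Python model of peak concurrent runs with and without cancelling stale runs. It is only a sketch: the 60-minute run duration, the push schedule, and the function name `peak_concurrent_runs` are illustrative assumptions; only the limit of 30 parallel runs comes from the description above.

```python
from typing import Dict, List, Tuple

RUN_MINUTES = 60  # assumed duration of one gradle-check run (hypothetical)


def peak_concurrent_runs(pushes: List[Tuple[int, int]], cancel_stale: bool) -> int:
    """pushes is a list of (pr_number, minute_pushed); returns the peak number of active runs."""
    intervals: List[Tuple[int, int]] = []  # (start, end) of every run
    latest_for_pr: Dict[int, int] = {}     # PR -> index of its newest run
    for pr, start in sorted(pushes, key=lambda p: p[1]):
        if cancel_stale and pr in latest_for_pr:
            # A new push ends the previous run for the same PR immediately.
            idx = latest_for_pr[pr]
            prev_start, prev_end = intervals[idx]
            intervals[idx] = (prev_start, min(prev_end, start))
        latest_for_pr[pr] = len(intervals)
        intervals.append((start, start + RUN_MINUTES))

    # Sweep over start/end events; ends sort before starts at the same minute.
    events = sorted([(s, 1) for s, _ in intervals] + [(e, -1) for _, e in intervals])
    peak = current = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


# Hypothetical load: 10 PRs, each pushed 4 times, 5 minutes apart.
pushes = [(pr, 5 * i) for pr in range(10) for i in range(4)]
print(peak_concurrent_runs(pushes, cancel_stale=False))  # 40, above the limit of 30
print(peak_concurrent_runs(pushes, cancel_stale=True))   # 10
```

Under these assumed numbers, cancelling stale runs keeps the peak at one run per PR.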

Problem:

  • Resource wastage due to multiple concurrent runs.
  • Only the latest run's results are relevant.
  • Potential for confusing or conflicting CI feedback.

Desired Behavior:

  • Only the most recent commit should trigger a full gradle-check CI pipeline run.
  • Earlier runs should be cancelled/aborted.

Proposed Solutions:

  1. GitHub Actions: Implement workflow concurrency to cancel outdated runs at the individual PR level. GitHub provides a mechanism to control concurrency at the workflow level. For example, in our case, if a gradle-check is triggered by a commit push and a few minutes later a new commit is force-pushed to the same PR, a new instance of the gradle-check workflow starts.
    At this point we are no longer interested in the outcome of the previous instance of the running workflow. GitHub provides an out-of-the-box way to cancel all previously running instances of the same workflow.
    In the example below, the workflow has a concurrency check: if a GitHub Actions workflow is already running for a particular PR and a new one is submitted, the former is cancelled automatically.
concurrency:
  # For a given workflow, if we push to the same PR, cancel all previous builds on that PR.
  group: "${{ github.workflow }}-${{ github.event.pull_request.number}}"
  cancel-in-progress: true
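
The effect of that concurrency group can be modelled in a few lines of Python. This is a simplified sketch, not GitHub's scheduler: the `Scheduler` class, the workflow name, and the PR numbers are hypothetical, and pending-run behaviour is ignored. One thing the model makes visible: as far as I know, `github.event.pull_request.number` is empty for non-PR events, so such runs would all share a single group.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Run:
    run_id: int
    workflow: str
    pr_number: Optional[int]
    status: str = "in_progress"


def group_key(workflow: str, pr_number: Optional[int]) -> str:
    # Mirrors "${{ github.workflow }}-${{ github.event.pull_request.number }}".
    # A missing PR number becomes an empty string, so non-PR runs share one group.
    return f"{workflow}-{'' if pr_number is None else pr_number}"


class Scheduler:
    """Toy model of cancel-in-progress: one active run per concurrency group."""

    def __init__(self) -> None:
        self.active: Dict[str, Run] = {}
        self._next_id = 1

    def submit(self, workflow: str, pr_number: Optional[int]) -> Run:
        key = group_key(workflow, pr_number)
        previous = self.active.get(key)
        if previous is not None and previous.status == "in_progress":
            previous.status = "cancelled"  # the older run in the group is cancelled
        run = Run(self._next_id, workflow, pr_number)
        self._next_id += 1
        self.active[key] = run
        return run


scheduler = Scheduler()
first = scheduler.submit("gradle-check", 101)   # hypothetical PR numbers
other = scheduler.submit("gradle-check", 102)
second = scheduler.submit("gradle-check", 101)  # force-push to the same PR
print(first.status, other.status, second.status)  # cancelled in_progress in_progress
```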
  2. Jenkins: Modify the pipeline to abort previous builds of the same job based on the PR (for example, via the build description). Even if we auto-cancel the GitHub Actions workflow when a new one is submitted for the same PR, that does not cancel/abort the Jenkins job that was already submitted. To handle this we need a custom solution, in addition to the above, to clean up stale/outdated gradle-check runs.
    I am proposing an extra stage in the gradle-check Jenkinsfile that iterates over the builds of the job, checks whether each one is for the same PR (possibly based on the build description), and aborts all previously running builds for that PR.
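pipeline {
    agent any

    stages {
        stage('Abort previous builds') {
            steps {
                script {
                    // Note: Jenkins.instance access needs script approval when run in the sandbox.
                    def jobName = env.JOB_NAME
                    def buildNumber = env.BUILD_NUMBER.toInteger()
                    def currentJob = Jenkins.instance.getItemByFullName(jobName)
                    // Assumption: the build description identifies the PR and is set before this stage.
                    def prDescription = currentBuild.description

                    for (build in currentJob.builds) {
                        // Skip builds that have already finished.
                        if (!build.isBuilding()) { continue }
                        // Skip this build and any newer ones; only older builds are stale.
                        if (build.getNumber() >= buildNumber) { continue }
                        // Skip builds for other PRs (and abort nothing if no description is set).
                        if (!prDescription || build.getDescription() != prDescription) { continue }

                        build.doStop()
                        echo "Aborted build #${build.getNumber()}"
                    }
                }
            }
        }

        stage('Your actual job steps') {
            steps {
                echo 'Your job steps go here'
                // Add your actual job steps
            }
        }
    }
}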

This can be a generic Groovy script that takes the job name as an argument, so it can be easily extended to other jobs if needed. A common script could also handle other operational tasks or generic checks that are useful to multiple jobs.
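
The selection rule such a generic script would need can be written down independently of Jenkins. The Python sketch below assumes each build exposes a number, a description that identifies the PR, and a running flag; the `Build` record and the `stale_builds` helper are hypothetical names, not Jenkins APIs.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Build:
    number: int
    description: str  # assumed to identify the PR, e.g. "PR #1"
    building: bool    # True while the build is still running


def stale_builds(builds: List[Build], current_number: int) -> List[int]:
    """Return the numbers of running builds that are older than the current build and for the same PR."""
    current = next(b for b in builds if b.number == current_number)
    return sorted(
        b.number
        for b in builds
        if b.building
        and b.number < current.number
        and b.description == current.description
    )


# Hypothetical job state when build #13 reaches the abort stage.
builds = [
    Build(10, "PR #1", True),   # older, same PR, still running -> stale
    Build(11, "PR #2", True),   # different PR -> keep
    Build(12, "PR #1", False),  # already finished -> nothing to abort
    Build(13, "PR #1", True),   # the current build
    Build(14, "PR #1", True),   # newer than the current build -> keep
]
print(stale_builds(builds, current_number=13))  # [10]
```

Keeping the rule as a pure function like this also makes it easy to unit-test before wiring it to the job's real build list.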

Action Items:

  • Research GitHub Actions concurrency controls.
  • Investigate Jenkins job management via API or pipeline scripts.
  • Prototype solution for cancelling outdated runs.
  • Update documentation for new CI/CD behavior.
  • Plan testing strategy for the optimized pipeline.

Additional Considerations:

  • Impact on PR checks and status reporting.
  • Performance overhead of iterating over several builds of the job and aborting them.
  • Potential race conditions in cancellation logic.
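
One race worth checking: if two builds for the same PR reach the abort stage at nearly the same time and each aborts every other running build, both can end up aborted. The Python sketch below compares that rule with an "older builds only" rule. It is a worst-case model in which every build decides from the same snapshot; the function names and build numbers are made up for illustration.

```python
from typing import Callable, List


def survivors(running: List[int], should_abort: Callable[[int, int], bool]) -> List[int]:
    """Every running build applies should_abort(me, other) against the same snapshot of builds."""
    aborted = set()
    for me in running:
        for other in running:
            if other != me and should_abort(me, other):
                aborted.add(other)
    return sorted(set(running) - aborted)


def abort_all_others(me: int, other: int) -> bool:
    return True


def abort_older_only(me: int, other: int) -> bool:
    return other < me


print(survivors([41, 42], abort_all_others))  # [] since both builds abort each other
print(survivors([41, 42], abort_older_only))  # [42], the newest build survives
```

Ordering by build number gives the cancellation a single winner regardless of which build runs its abort stage first.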

We welcome input from the team on these proposed solutions and any additional ideas to streamline our CI/CD process.

@rishabh6788
Collaborator Author

rishabh6788 commented Sep 10, 2024

@peterzhuamazon @prudhvigodithi @dblock @getsaurabh02 @gaiksaya Appreciate your feedback and comments.

@reta
Contributor

reta commented Sep 10, 2024

Thanks a lot @rishabh6788 it totally makes sense to cancel such workflows. It seems cancelling the Jenkins job is the most needed and at the same time, most difficult part of that. However, if we take a step back and tackle splitting Gradle check into isolated tasks / phases, I suspect some of those could be moved from Jenkins to "native" Github action (so we would get the cancellation etc. out of the box).

[1] #5010

@peterzhuamazon
Member

Yeah @rishabh6788, it is totally worth investing in this so we can reduce duplicate runs and save resources.

The GitHub Actions concurrency settings work out of the box, and I am OK with the custom solution in the Jenkinsfile.

Assuming nothing else needs to be touched, this should be the quickest way to address this issue for now. Thanks.

@rishabh6788
Collaborator Author

> Thanks a lot @rishabh6788 it totally makes sense to cancel such workflows. It seems cancelling the Jenkins job is the most needed and at the same time, most difficult part of that. However, if we take a step back and tackle splitting Gradle check into isolated tasks / phases, I suspect some of those could be moved from Jenkins to "native" Github action (so we would get the cancellation etc. out of the box).
>
> [1] #5010

100% agree on splitting gradle-check into multiple CI workflows. We can start by just splitting integ-tests and bwc-tests and build upon that. We will add this to our sprint and try to prioritize it.

Meanwhile, the stale/outdated runs remain a significant problem and cause unnecessary resource consumption. Until we have a clear way forward on the gradle-check split, I will try to solve this problem; once we have a clear vision for the split, this can be easily cleaned up.

@gaiksaya
Member

Hi @rishabh6788
I came across this: https://archive.kabisa.nl/tech/building-github-pull-requests-with-jenkins/
Instead of webhooks, maybe we can refactor to run CI directly? We won't need to integrate Jenkins and GHA then.

@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in Engineering Effectiveness Board Oct 15, 2024