[ci.jenkins.io] Move ACI agents to ephemeral Windows containers to AWS #4318

Closed
6 tasks done
Tracked by #4313
dduportal opened this issue Sep 28, 2024 · 8 comments

Comments

@dduportal
Contributor

dduportal commented Sep 28, 2024

Requires #4319

Goal: stop running Windows container agents in ACI and use Kubernetes instead

(edit) Given we chose to use EKS with Karpenter, the following tasks are required, following what is described in https://docs.aws.amazon.com/eks/latest/userguide/windows-support.html

@dduportal
Contributor Author

Adding this issue to the milestone:

@dduportal
Contributor Author

Update: The initial EKS requirements have been successfully implemented. A single Pod, extracted from a local Jenkins controller using the upcoming JCasC setup definition, has been tested:

apiVersion: v1
kind: Pod
metadata:
  labels:
    jenkins/label: test-ddu
  name: jnlp-maven-21-windows-test-ddu
  namespace: jenkins-agents
spec:
  containers:
  - command:
    - pwsh.exe
    - -f
    - C:/ProgramData/Jenkins/jenkins-agent.ps1
    env:
    - name: PATH
      value: C:/tools/jdk-21/bin;$PATH
    - name: JENKINS_SECRET
      value: SuperSecret
    - name: ARTIFACT_CACHING_PROXY_SERVERID
      value: http://127.0.0.1:8080/
    - name: JAVA_HOME
      value: C:/tools/jdk-21
    - name: JENKINS_DIRECT_CONNECTION
      value: ci.jenkins.io:50000
    - name: JENKINS_AGENT_WORKDIR
      value: C:/Jenkins/agent
    - name: JENKINS_JAVA_OPTS
      value: -XX:+PrintCommandLineFlags
    - name: JENKINS_JAVA_BIN
      value: C:/tools/jdk-17/bin/java
    - name: JENKINS_PROTOCOLS
      value: JNLP4-connect
    - name: JENKINS_AGENT_NAME
      value: jnlp-maven-21-windows-test-ddu
    - name: TEMP
      value: C:/Windows/Temp
    - name: TMP
      value: C:/Windows/Temp
    - name: REMOTING_OPTS
      value: -noReconnectAfter 1d
    - name: JENKINS_INSTANCE_IDENTITY
      value: Secret
    - name: JENKINS_NAME
      value: jnlp-maven-21-windows-2llr2
    image: jenkinsciinfra/inbound-agent-maven:jdk21-nanoserver@sha256:302f2003cc52011cac1b9769b983dfd1490a520d8cdca7b0519c9f26e07c3802
    # image: jenkins/inbound-agent:3283.v92c105e0f819-8-jdk17-nanoserver-ltsc2022
    imagePullPolicy: IfNotPresent
    name: jnlp
    resources:
      limits:
        cpu: "2"
        memory: 4G
      requests:
        cpu: "2"
        memory: 4G
    volumeMounts:
    - mountPath: C:/Users/Administrator/.m2/repository
      name: volume-1
    - mountPath: C:/Windows/Temp
      name: volume-0
    - mountPath: C:/Jenkins/agent
      name: workspace-volume
    workingDir: C:/Jenkins/agent
  nodeSelector:
    kubernetes.io/arch: amd64
    kubernetes.io/os: windows
  restartPolicy: Never
  serviceAccount: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: ci.jenkins.io/agents
    operator: Equal
    value: "true"
  - effect: NoSchedule
    key: ci.jenkins.io/windows-2019
    operator: Equal
    value: "true"
  volumes:
  - emptyDir:
      medium: Memory
    name: volume-0
  - emptyDir: {}
    name: volume-1
  - emptyDir: {}
    name: workspace-volume
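
The manifest above can only land on a node whose taints are all covered by the pod's tolerations. As a minimal sketch of the Kubernetes toleration-matching rule (not the scheduler's actual code), the snippet below checks the two tolerations from the manifest against a hypothetical pair of node taints. The node taints are an assumption: they are chosen to mirror the tolerations, since the thread does not list the real ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"
    value: str = ""
    effect: str = ""  # an empty effect matches every taint effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if the toleration matches the taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.operator == "Exists":
        # An empty key combined with Exists tolerates every taint.
        return toleration.key in ("", taint.key)
    return toleration.key == taint.key and toleration.value == taint.value


def untolerated(taints, tolerations):
    """List the taints that none of the tolerations covers."""
    return [t for t in taints if not any(tolerates(tol, t) for tol in tolerations)]


# Hypothetical node taints, assumed to mirror the tolerations in the manifest.
NODE_TAINTS = [
    Taint("ci.jenkins.io/agents", "true", "NoSchedule"),
    Taint("ci.jenkins.io/windows-2019", "true", "NoSchedule"),
]

# Tolerations copied from the pod manifest.
POD_TOLERATIONS = [
    Toleration("ci.jenkins.io/agents", "Equal", "true", "NoSchedule"),
    Toleration("ci.jenkins.io/windows-2019", "Equal", "true", "NoSchedule"),
]

if __name__ == "__main__":
    # With both tolerations, no taint is left uncovered.
    print(untolerated(NODE_TAINTS, POD_TOLERATIONS))
    # Dropping the "ci.jenkins.io/agents" toleration leaves that taint uncovered.
    print(untolerated(NODE_TAINTS, POD_TOLERATIONS[1:]))
```

Any taint returned by `untolerated` with the `NoSchedule` effect would keep the pod off that node, regardless of the `nodeSelector`.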

@dduportal
Contributor Author

Update: Real-life tests are in progress. The first set of pod templates is set up with "test" labels, as they need a bit more testing.

  • The initial test showed pods not being scheduled because there were no node pools with the proper taints and labels. I forgot to set up the "ci.jenkins.io/agents" toleration.

  • Then, pods are scheduled but fail with the error below. Scheduling (both node and pod) works, and so does the default command (e.g. container entrypoint), but the inbound agent script hits a setup error:

    Start-Process: C:\ProgramData\Jenkins\jenkins-agent.ps1:162
    Line |
     162 |      Start-Process -FilePath $JAVA_BIN -Wait -NoNewWindow -ArgumentLis …
         |      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
         | This command cannot be run due to the error: The system cannot find the
         | file specified.
    
  • With the Agent Java Bin fixed, new startup error:

    Unrecognized VM option ''
    Error: Could not create the Java Virtual Machine.
    Error: A fatal exception has occurred. Program will exit.
    
    • Need to check the value passed to JENKINS_JAVA_OPTS. Currently the pod specifies:
      - name: JENKINS_JAVA_OPTS
        value: -XX:+PrintCommandLineFlags
    • Manual "hack" is to remove the env var and continue (to make sure all errors are caught). Gotta work on the real fix.
  • Then, with the correct Agent Java bin and without any JENKINS_JAVA_OPTS env. var, we have the following error:

    INFO: Connecting to ci.jenkins.io:50000
    Feb 17, 2025 6:06:39 PM hudson.remoting.Launcher$CuiListener error
    SEVERE: null
    java.nio.channels.UnresolvedAddressException
            at java.base/sun.nio.ch.Net.checkAddress(Net.java:149)
            at java.base/sun.nio.ch.Net.checkAddress(Net.java:157)
            at java.base/sun.nio.ch.SocketChannelImpl.checkRemote(SocketChannelImpl.java:816)
            at java.base/sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:839)
            at java.base/java.nio.channels.SocketChannel.open(SocketChannel.java:285)
            at org.jenkinsci.remoting.engine.JnlpAgentEndpoint.open(JnlpAgentEndpoint.java:231)
            at hudson.remoting.Engine.connectTcp(Engine.java:1131)
            at hudson.remoting.Engine.innerRun(Engine.java:999)
            at hudson.remoting.Engine.run(Engine.java:586)
    
    • This is a DNS resolution error. It's weird because the instance profile shows that the eks:kube-proxy-windows IAM permission (required for DNS resolution) is present. We'll have to diagnose this further.
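
The `UnresolvedAddressException` in the last trace means the hostname never resolved to an address before the connection attempt. A quick way to reproduce that symptom outside the agent is a plain resolver check. The sketch below is illustrative only: it assumes a Python interpreter is available in an environment sharing the pod's DNS configuration, and the helper names `split_endpoint` and `can_resolve` are hypothetical.

```python
import socket


def split_endpoint(endpoint):
    """Split a 'host:port' string such as the JENKINS_DIRECT_CONNECTION value."""
    host, _, port = endpoint.rpartition(":")
    return host, int(port)


def can_resolve(host, port, resolver=socket.getaddrinfo):
    """Return True if the host resolves to at least one address.

    The resolver is injectable so the check can be exercised without a network.
    """
    try:
        return len(resolver(host, port, type=socket.SOCK_STREAM)) > 0
    except socket.gaierror:
        # Name resolution failed: the same condition the agent trips over.
        return False


if __name__ == "__main__":
    host, port = split_endpoint("ci.jenkins.io:50000")
    print(f"{host}:{port} resolvable: {can_resolve(host, port)}")
```

A `False` result here would point at the cluster DNS setup rather than at the agent image or the Jenkins controller.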

@dduportal
Contributor Author

Update:

DNS resolution error.

This one was tricky. Fixed in jenkins-infra/terraform-aws-sponsorship#138.
TL;DR;

Next steps:

  • Fix the "JENKINS_JAVA_OPTS" issue (see comment above)
  • Set up the PATH for Windows containers in Puppet
  • Validate the existing 4 templates
  • Start using ECR for the 4 templates and validate them again
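
Before validating the templates with real builds, a static pre-check of each container's `env` list can catch two things visible in the test manifest earlier in this thread: `JENKINS_JAVA_BIN` points at `jdk-17` while `JAVA_HOME` points at `jdk-21`, and the `PATH` value contains a literal `$PATH`, which Kubernetes does not expand (it only expands `$(VAR_NAME)` references). The sketch below is a heuristic under those assumptions. Whether either item was the actual cause of the errors is not stated in the thread, and `lint_env` is a hypothetical helper, not part of any Jenkins tooling.

```python
import re


def lint_env(env):
    """Return a list of warnings for a container env list of name/value dicts."""
    values = {item["name"]: item.get("value", "") for item in env}
    warnings = []

    java_home = values.get("JAVA_HOME", "").rstrip("/")
    java_bin = values.get("JENKINS_JAVA_BIN", "")
    # Heuristic only: running the agent on a different JDK may be intentional.
    if java_home and java_bin and not java_bin.startswith(java_home + "/"):
        warnings.append(
            f"JENKINS_JAVA_BIN ({java_bin}) is outside JAVA_HOME ({java_home})"
        )

    for name, value in values.items():
        # Kubernetes expands $(VAR) references only; $VAR stays literal.
        if re.search(r"\$[A-Za-z_]", value):
            warnings.append(
                f"{name} contains a reference Kubernetes will not expand: {value}"
            )
    return warnings


if __name__ == "__main__":
    # Subset of the env entries from the test manifest.
    manifest_env = [
        {"name": "PATH", "value": "C:/tools/jdk-21/bin;$PATH"},
        {"name": "JAVA_HOME", "value": "C:/tools/jdk-21"},
        {"name": "JENKINS_JAVA_BIN", "value": "C:/tools/jdk-17/bin/java"},
    ]
    for warning in lint_env(manifest_env):
        print(warning)
```

The comparison is case-sensitive and expects forward slashes, which matches the manifest but would need loosening for general Windows paths.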

Future optimizations (might need another issue) from AWS:

@dduportal
Contributor Author

Update:

@dduportal
Contributor Author

Real life test in progress with the Remoting build:

  • https://ci.jenkins.io/job/Core/job/remoting/job/master/792 is a manual replay forced to only build on Windows (JDK17) with the platform set to the test agent label maven-17-windows-test.
    • Node is already up with the JDK17 container image already pulled: agent allocation was fast (of course).

=> Let's compare with the previous build 791, whose tests timed out around 15 min after the build started.

@dduportal
Contributor Author

Test was successful! Let's roll!

@dduportal
Contributor Author

Update:

Tested successfully one last time on both the remoting and infra/acceptance tests.
