
feat: use 4h timeout default rh-advisories and rh-push-to-registry-redhat-io pipelines #792

Open
wants to merge 1 commit into base: development

Conversation


creydr commented Jan 28, 2025

By default, the task timeout in the release pipelines is 2h. This is especially short for larger applications without RELEASE-1291 in place.
This PR sets the timeout for the tasks in the rh-advisories pipeline to 4h (which is what we did for the OpenShift Serverless release).
It also updates rh-push-to-registry-redhat-io, since you try to keep those pipelines in sync.

An alternative would be to set the 4h timeout globally for all tasks in https://github.com/redhat-appstudio/infra-deployments
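
For reference, the change under discussion is the per-task timeout field in the Tekton pipeline definitions. A minimal sketch of one task entry, assuming standard Tekton syntax; the task name is taken from this thread, and the surrounding fields are illustrative rather than the actual diff:

  apiVersion: tekton.dev/v1
  kind: Pipeline
  metadata:
    name: rh-advisories
  spec:
    tasks:
      - name: verify-access-to-resources
        timeout: "4h0m0s"   # explicit per-task timeout; overrides the cluster default of (currently) 2h
        taskRef:
          name: verify-access-to-resources
      # ...the same timeout line repeated on each task entry

Tekton accepts Go-style duration strings here, so "4h0m0s" and "240m" are equivalent.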

creydr requested a review from a team as a code owner on January 28, 2025 at 08:00

openshift-ci bot commented Jan 28, 2025

Hi @creydr. Thanks for your PR.

I'm waiting for a konflux-ci member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

creydr (Author) commented Jan 28, 2025

/cc @ralphbean

openshift-ci bot requested a review from ralphbean on January 28, 2025 at 08:00
mmalina (Contributor) commented Jan 28, 2025

/ok-to-test

mmalina (Contributor) commented Jan 28, 2025

> This is especially short for larger applications with having RELEASE-1291 in place.

You mean "without having RELEASE-1291 in place", right?

It's fine with me, let's see what Johnny thinks. But could you please do the same for the push-to-registry-redhat-io pipeline? We try to keep them in sync with the exception of having/not-having the advisory stuff.

Also, you'll need to fix your commit message - see here: https://github.com/konflux-ci/release-service-catalog/blob/development/CONTRIBUTING.md#commit-message-formatting-and-standards

creydr force-pushed the use-4h-timeout-default-rh-advisories-pipeline branch from 3235f13 to b051eb7 on January 28, 2025 at 11:07
creydr (Author) commented Jan 28, 2025

> > This is especially short for larger applications with having RELEASE-1291 in place.
>
> You mean "without having RELEASE-1291 in place", right?

Oh, yes. Updated the PR description

> It's fine with me, let's see what Johnny thinks. But could you please do the same for the push-to-registry-redhat-io pipeline? We try to keep them in sync with the exception of having/not-having the advisory stuff.

Updated rh-push-to-registry-redhat-io as well

> Also, you'll need to fix your commit message - see here: https://github.com/konflux-ci/release-service-catalog/blob/development/CONTRIBUTING.md#commit-message-formatting-and-standards

Updated

creydr changed the title from "Use 4h timeout default rh-advisories pipeline" to "feat: Use 4h timeout default rh-advisories and rh-push-to-registry-redhat-io pipelines" on Jan 28, 2025
ralphbean (Member) commented

/ok-to-test

johnbieren (Collaborator) commented

I don't really get doing this for all tasks. Can you explain? For example, I would never expect verify-access-to-resources to not pass after 5 mins but pass after 200 minutes. Also, gitlint is failing. I think it is due to the capital U in Use in the commit title

johnbieren force-pushed the use-4h-timeout-default-rh-advisories-pipeline branch from b051eb7 to 3142d3d on January 28, 2025 at 16:19
davidmogar (Collaborator) commented

Agree with Johnny. Having a general timeout sounds bad to me to be honest. Maybe timeouts have to be raised but they shouldn't be treated in the same way.

mmalina (Contributor) commented Jan 28, 2025

> I don't really get doing this for all tasks. Can you explain? For example, I would never expect verify-access-to-resources to not pass after 5 mins but pass after 200 minutes. Also, gitlint is failing. I think it is due to the capital U in Use in the commit title

The current default is 2 hours. You could say the same today: 2 hours doesn't make sense in some cases either. Raising all of them is just the easiest way to improve the situation; otherwise it's a lot more work to analyse how much is reasonable for each task.

Besides, as was discussed in Slack recently, the main issue currently is that taskruns can wait a long time for the PV to be freed up, and that waiting eats away at the timeout value. So it doesn't matter that a task itself never takes more than 5 minutes: it can still time out if some other task blocks it by holding the shared PV. All of this is meant as a temporary measure until that issue is addressed.

Of course, in the case of verify-access-to-resources you could argue that it runs before any of the time-consuming tasks start, so it's unlikely that anything else will ever block it. But again, deciding where the longer timeout makes sense or not would mean analysing all the tasks.
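
To make the blocking scenario concrete: a sketch of a PipelineRun whose tasks share a single PVC-backed workspace, assuming a ReadWriteOnce volume; the workspace name is illustrative, not the catalog's actual definition. A TaskRun queued behind the volume still consumes its own timeout while it waits:

  apiVersion: tekton.dev/v1
  kind: PipelineRun
  metadata:
    generateName: rh-advisories-run-    # illustrative
  spec:
    pipelineRef:
      name: rh-advisories
    workspaces:
      - name: release-workspace         # assumed name; shared by the pipeline's tasks
        volumeClaimTemplate:
          spec:
            accessModes: ["ReadWriteOnce"]   # one consumer at a time; other tasks wait
            resources:
              requests:
                storage: 1Gi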

mmalina previously approved these changes Jan 28, 2025
mmalina (Contributor) commented Jan 28, 2025

/ok-to-test

johnbieren (Collaborator) commented

> > I don't really get doing this for all tasks. Can you explain? For example, I would never expect verify-access-to-resources to not pass after 5 mins but pass after 200 minutes. Also, gitlint is failing. I think it is due to the capital U in Use in the commit title

> The current default is 2 hours. You could say the same today: 2 hours doesn't make sense in some cases either. Raising all of them is just the easiest way to improve the situation; otherwise it's a lot more work to analyse how much is reasonable for each task. Besides, as was discussed in Slack recently, the main issue currently is that taskruns can wait a long time for the PV to be freed up, and that waiting eats away at the timeout value. So it doesn't matter that a task itself never takes more than 5 minutes: it can still time out if some other task blocks it by holding the shared PV. All of this is meant as a temporary measure until that issue is addressed.

We do not have 2 hour timeouts set for every task in our pipeline definitions. If we did, then the diff would just be changing a 2 to a 4. So I disagree on that. But I do agree that setting a 2 hour timeout for most of the tasks makes no sense. Nowhere in the commit or PR does it say this is a temporary workaround. So, for those reasons, I did not approve.

konflux-ci-qe-bot commented

@creydr: The following test has Failed, say /retest to rerun failed tests.

PipelineRun Name: konflux-e2e-tests-catalog-44tht
Status: Failed
Rerun command: /retest
Build Log: View Pipeline Log
Test Log: View Test Logs

Inspecting Test Artifacts

To inspect your test artifacts, follow these steps:

  1. Install ORAS (see the ORAS installation guide).
  2. Download artifacts with the following commands:
mkdir -p oras-artifacts
cd oras-artifacts
oras pull quay.io/konflux-test-storage/konflux-team/release-service-catalog:konflux-e2e-tests-catalog-44tht

Test results analysis

🚨 Failed to provision a cluster, see the log for more details:

INFO: Log in to your Red Hat account...
INFO: Configure AWS Credentials...
WARN: The current version (1.2.47) is not up to date with latest rosa cli released version (1.2.49).
WARN: It is recommended that you update to the latest version.
INFO: Logged in as 'konflux-ci-418295695583' on 'https://api.openshift.com'
INFO: Create ROSA with HCP cluster...
WARN: The current version (1.2.47) is not up to date with latest rosa cli released version (1.2.49).
WARN: It is recommended that you update to the latest version.
INFO: Creating cluster 'kx-d0b8565e81'
INFO: To view a list of clusters and their status, run 'rosa list clusters'
INFO: Cluster 'kx-d0b8565e81' has been created.
INFO: Once the cluster is installed you will need to add an Identity Provider before you can login into the cluster. See 'rosa create idp --help' for more information.

Name: kx-d0b8565e81
Domain Prefix: kx-d0b8565e81
Display Name: kx-d0b8565e81
ID: 2gjeqaouo8nd0q86u8e0hi6s5l6p4qk6
External ID: caf1922d-5bdb-458b-97f1-f6abf53f9c46
Control Plane: ROSA Service Hosted
OpenShift Version: 4.15.43
Channel Group: stable
DNS: Not ready
AWS Account: 418295695583
AWS Billing Account: 418295695583
API URL:
Console URL:
Region: us-east-1
Availability:

  • Control Plane: MultiAZ
  • Data Plane: SingleAZ

Nodes:

  • Compute (desired): 3
  • Compute (current): 0
    Network:
  • Type: OVNKubernetes
  • Service CIDR: 172.30.0.0/16
  • Machine CIDR: 10.0.0.0/16
  • Pod CIDR: 10.128.0.0/14
  • Host Prefix: /23
  • Subnets: subnet-05b9daa0609597f68, subnet-04cf6376374bf9e09
    EC2 Metadata Http Tokens: optional
    Role (STS) ARN: arn:aws:iam::418295695583:role/ManagedOpenShift-HCP-ROSA-Installer-Role
    Support Role ARN: arn:aws:iam::418295695583:role/ManagedOpenShift-HCP-ROSA-Support-Role
    Instance IAM Roles:
  • Worker: arn:aws:iam::418295695583:role/ManagedOpenShift-HCP-ROSA-Worker-Role
    Operator IAM Roles:
  • arn:aws:iam::418295695583:role/rosa-hcp-openshift-image-registry-installer-cloud-credentials
  • arn:aws:iam::418295695583:role/rosa-hcp-openshift-ingress-operator-cloud-credentials
  • arn:aws:iam::418295695583:role/rosa-hcp-kube-system-kms-provider
  • arn:aws:iam::418295695583:role/rosa-hcp-kube-system-kube-controller-manager
  • arn:aws:iam::418295695583:role/rosa-hcp-kube-system-capa-controller-manager
  • arn:aws:iam::418295695583:role/rosa-hcp-kube-system-control-plane-operator
  • arn:aws:iam::418295695583:role/rosa-hcp-openshift-cluster-csi-drivers-ebs-cloud-credentials
  • arn:aws:iam::418295695583:role/rosa-hcp-openshift-cloud-network-config-controller-cloud-credent
    Managed Policies: Yes
    State: waiting (Waiting for user action)
    Private: No
    Delete Protection: Disabled
    Created: Jan 28 2025 20:16:48 UTC
    User Workload Monitoring: Enabled
    Details Page: https://console.redhat.com/openshift/details/s/2sGwqKW71ErCwf7CPvomtShzu0g
    OIDC Endpoint URL: https://oidc.op1.openshiftapps.com/2du11g36ejmoo4624pofphlrgf4r9tf3 (Managed)
    Etcd Encryption: Disabled
    Audit Log Forwarding: Disabled
    External Authentication: Disabled
    Zero Egress: Disabled

INFO: Preparing to create operator roles.
INFO: Operator Roles already exists
INFO: Preparing to create OIDC Provider.
INFO: OIDC provider already exists
INFO: To determine when your cluster is Ready, run 'rosa describe cluster -c kx-d0b8565e81'.
INFO: To watch your cluster installation logs, run 'rosa logs install -c kx-d0b8565e81 --watch'.
INFO: Track the progress of the cluster creation...
WARN: The current version (1.2.47) is not up to date with latest rosa cli released version (1.2.49).
WARN: It is recommended that you update to the latest version.
W: Region flag will be removed from this command in future versions
INFO: Cluster 'kx-d0b8565e81' is in waiting state waiting for installation to begin. Logs will show up within 5 minutes
0001-01-01 00:00:00 +0000 UTC hostedclusters kx-d0b8565e81 Version
2025-01-28 20:21:47 +0000 UTC hostedclusters kx-d0b8565e81 ValidAWSIdentityProvider StatusUnknown
2025-01-28 20:21:50 +0000 UTC hostedclusters kx-d0b8565e81 Condition not found in the CVO.
2025-01-28 20:21:50 +0000 UTC hostedclusters kx-d0b8565e81 Condition not found in the CVO.
2025-01-28 20:21:50 +0000 UTC hostedclusters kx-d0b8565e81 The hosted control plane is not found
2025-01-28 20:21:50 +0000 UTC hostedclusters kx-d0b8565e81 The hosted control plane is not found
2025-01-28 20:21:50 +0000 UTC hostedclusters kx-d0b8565e81 The hosted control plane is not found
2025-01-28 20:21:50 +0000 UTC hostedclusters kx-d0b8565e81 The hosted control plane is not found
2025-01-28 20:21:50 +0000 UTC hostedclusters kx-d0b8565e81 The hosted control plane is not found
2025-01-28 20:21:50 +0000 UTC hostedclusters kx-d0b8565e81 Condition not found in the CVO.
2025-01-28 20:21:50 +0000 UTC hostedclusters kx-d0b8565e81 Waiting for hosted control plane to be healthy
2025-01-28 20:21:50 +0000 UTC hostedclusters kx-d0b8565e81 Condition not found in the CVO.
2025-01-28 20:21:50 +0000 UTC hostedclusters kx-d0b8565e81 Condition not found in the CVO.
2025-01-28 20:21:50 +0000 UTC hostedclusters kx-d0b8565e81 The hosted control plane is not found
2025-01-28 20:21:50 +0000 UTC hostedclusters kx-d0b8565e81 Ignition server deployment not found
2025-01-28 20:21:50 +0000 UTC hostedclusters kx-d0b8565e81 Configuration passes validation
2025-01-28 20:21:50 +0000 UTC hostedclusters kx-d0b8565e81 HostedCluster is supported by operator configuration
2025-01-28 20:21:50 +0000 UTC hostedclusters kx-d0b8565e81 Release image is valid
2025-01-28 20:21:50 +0000 UTC hostedclusters kx-d0b8565e81 The hosted control plane is not found
2025-01-28 20:21:50 +0000 UTC hostedclusters kx-d0b8565e81 Reconciliation active on resource
2025-01-28 20:21:52 +0000 UTC hostedclusters kx-d0b8565e81 Required platform credentials are found
2025-01-28 20:21:52 +0000 UTC hostedclusters kx-d0b8565e81 failed to get referenced secret ocm-production-2gjeqaouo8nd0q86u8e0hi6s5l6p4qk6/cluster-api-cert: Secret "cluster-api-cert" not found
2025-01-28 20:21:52 +0000 UTC hostedclusters kx-d0b8565e81 HostedCluster is at expected version
2025-01-28 20:23:27 +0000 UTC hostedclusters kx-d0b8565e81 OIDC configuration is valid
2025-01-28 20:23:27 +0000 UTC hostedclusters kx-d0b8565e81 Reconciliation completed successfully
2025-01-28 20:23:28 +0000 UTC hostedclusters kx-d0b8565e81 WebIdentityErr
2025-01-28 20:23:29 +0000 UTC hostedclusters kx-d0b8565e81 All is well
2025-01-28 20:23:29 +0000 UTC hostedclusters kx-d0b8565e81 lookup api.kx-d0b8565e81.4we6.p3.openshiftapps.com on 172.30.0.10:53: no such host
2025-01-28 20:23:29 +0000 UTC hostedclusters kx-d0b8565e81 capi-provider deployment has 1 unavailable replicas
2025-01-28 20:23:29 +0000 UTC hostedclusters kx-d0b8565e81 Configuration passes validation
2025-01-28 20:23:29 +0000 UTC hostedclusters kx-d0b8565e81 AWS KMS is not configured
2025-01-28 20:23:29 +0000 UTC hostedclusters kx-d0b8565e81 EtcdAvailable StatefulSetNotFound
2025-01-28 20:23:29 +0000 UTC hostedclusters kx-d0b8565e81 Kube APIServer deployment not found
2025-01-28 20:23:37 +0000 UTC hostedclusters kx-d0b8565e81 All is well
2025-01-28 20:24:37 +0000 UTC hostedclusters kx-d0b8565e81 EtcdAvailable QuorumAvailable
2025-01-28 20:25:41 +0000 UTC hostedclusters kx-d0b8565e81 Kube APIServer deployment is available
2025-01-28 20:25:49 +0000 UTC hostedclusters kx-d0b8565e81 All is well
2025-01-28 20:26:28 +0000 UTC hostedclusters kx-d0b8565e81 The hosted control plane is available
INFO: Cluster 'kx-d0b8565e81' is now ready
INFO: ROSA with HCP cluster is ready, create a cluster admin account for accessing the cluster
WARN: The current version (1.2.47) is not up to date with latest rosa cli released version (1.2.49).
WARN: It is recommended that you update to the latest version.
INFO: Storing login command...
INFO: Check if it's able to login to OCP cluster...
Retried 1 times...
Retried 2 times...
Retried 3 times...
INFO: Check if apiserver is ready...
[INFO] Checking cluster operators' status...
[INFO] Attempt 1/10
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
console
csi-snapshot-controller 4.15.43 True False False 4m39s
dns 4.15.43 False False True 4m39s DNS "default" is unavailable.
image-registry False True True 4m Available: The deployment does not have available replicas...
ingress False True True 4m5s The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.)
insights
kube-apiserver 4.15.43 True False False 4m28s
kube-controller-manager 4.15.43 True False False 4m28s
kube-scheduler 4.15.43 True False False 4m28s
kube-storage-version-migrator
monitoring
network 4.15.43 True True False 4m13s DaemonSet "/openshift-multus/multus-additional-cni-plugins" is not available (awaiting 2 nodes)...
node-tuning 4.15.43 True True False 35s Waiting for 2/3 Profiles to be applied
openshift-apiserver 4.15.43 True False False 4m28s
openshift-controller-manager 4.15.43 True False False 4m28s
openshift-samples
operator-lifecycle-manager 4.15.43 True False False 4m30s
operator-lifecycle-manager-catalog 4.15.43 True False False 4m25s
operator-lifecycle-manager-packageserver 4.15.43 True False False 4m28s
service-ca
storage 4.15.43 True False False 23s
[INFO] Cluster operators are accessible.
[INFO] Waiting for cluster operators to be in 'Progressing=false' state...
clusteroperator.config.openshift.io/console condition met
clusteroperator.config.openshift.io/csi-snapshot-controller condition met
clusteroperator.config.openshift.io/dns condition met
clusteroperator.config.openshift.io/image-registry condition met
clusteroperator.config.openshift.io/ingress condition met
clusteroperator.config.openshift.io/insights condition met
clusteroperator.config.openshift.io/kube-apiserver condition met
clusteroperator.config.openshift.io/kube-controller-manager condition met
clusteroperator.config.openshift.io/kube-scheduler condition met
clusteroperator.config.openshift.io/kube-storage-version-migrator condition met
clusteroperator.config.openshift.io/monitoring condition met
clusteroperator.config.openshift.io/network condition met
clusteroperator.config.openshift.io/node-tuning condition met
clusteroperator.config.openshift.io/openshift-apiserver condition met
clusteroperator.config.openshift.io/openshift-controller-manager condition met
clusteroperator.config.openshift.io/openshift-samples condition met
clusteroperator.config.openshift.io/operator-lifecycle-manager condition met
clusteroperator.config.openshift.io/operator-lifecycle-manager-catalog condition met
clusteroperator.config.openshift.io/operator-lifecycle-manager-packageserver condition met
clusteroperator.config.openshift.io/service-ca condition met
clusteroperator.config.openshift.io/storage condition met


…egistry-rh-io pipelines

Explicitly set the timeout for taskruns in the rh-advisories and
rh-push-to-registry-redhat-io pipelines to 4h and thus override the
cluster default of (currently) 2h.
This is especially helpful for larger components which are running
into issues related to RELEASE-1291.

Signed-off-by: Christoph Stäbler <[email protected]>
creydr force-pushed the use-4h-timeout-default-rh-advisories-pipeline branch from 3142d3d to 4f17876 on January 29, 2025 at 07:22
openshift-ci bot removed the lgtm label on Jan 29, 2025

openshift-ci bot commented Jan 29, 2025

New changes are detected. LGTM label has been removed.

creydr changed the title from "feat: Use 4h timeout default rh-advisories and rh-push-to-registry-redhat-io pipelines" to "feat: use 4h timeout default rh-advisories and rh-push-to-registry-redhat-io pipelines" on Jan 29, 2025
mmalina (Contributor) commented Jan 29, 2025

> We do not have 2 hour timeouts set for every task in our pipeline definitions. If we did, then the diff would just be changing a 2 to a 4. So I disagree on that. But I do agree that setting a 2 hour timeout for most of the tasks makes no sense.

What I meant is that 2 hours is the cluster default, so we currently have 2 hours even for tasks that are not expected to take more than a minute. But I guess that won't change your stance :)

> Nowhere in the commit or PR does it say this is a temporary workaround. So, for those reasons, I did not approve.

I think it's implied by mentioning that this is needed because RELEASE-1291 is not fixed. But I might be wrong.
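
For context on the cluster default mentioned above: in upstream Tekton the fallback timeout comes from the config-defaults ConfigMap, so the global alternative from the PR description would look roughly like the following. This is a sketch under assumptions: default-timeout-minutes is upstream Tekton's key, but the namespace, and whether infra-deployments manages the setting through this object, are not confirmed by this thread:

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: config-defaults
    namespace: tekton-pipelines    # assumption; often openshift-pipelines on OpenShift
  data:
    default-timeout-minutes: "240"   # raise the fallback from 120 (2h) to 240 (4h)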
