vSphere Session expires after 86400 seconds #2781

hzproe · 2024-02-28T14:00:09Z

/kind bug

What steps did you take and what happened:
We upgraded CAPV from 1.6.1 to 1.9.0.
The connection to vCenter fails after 24 hours (86400 s). The only hint we can find is in the informational logs of the CAPV provider.

Our assumption is that the session handler does not re-establish the connection once it fails after one day.

2024-02-27 08:56:59.635 | {"logtag":"F","logstash_prefix":"logstream-*********-capv-system","message":"I0227 07:56:59.635697 1 session.go:298] \"REST client session expired, clearing session\" controller=\"vspherevm\" controllerGroup=\"infrastructure.cluster.x-k8s.io\" controllerKind=\"VSphereVM\" VSphereVM=\"z1

What did you expect to happen:
Reconciliation of all vSphere machines stops because the connection to vCenter is never re-established. We expected the session to be renewed so that reconciliation continues.

Anything else you would like to add:
We upgraded from 1.6.1, where this was working, to 1.9.0, where we first observed the error.
We tried downgrading to 1.8.7, with no improvement.
We tried upgrading to 1.9.1, with no improvement.

Environment:

Cluster-api-provider-vsphere version: 1.9.1
Kubernetes version: (use kubectl version): Server Version: v1.28.6
OS (e.g. from /etc/os-release):
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.3 LTS"
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"

chrischdi · 2024-02-28T14:08:53Z

Normally, the package responsible for sessions should create a new session (log in again) whenever a client is requested.

As this is just an informational message from the keepalive handler, are there other messages which indicate that CAPV does not work anymore?

Do you still get reconciliation log messages?

Just to be sure: are you using v1.9.1 or v1.9.0? (the issue states both)
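
For illustration, the "create a new session when a client gets requested" behaviour could look roughly like the sketch below. It is not CAPV's or govmomi's actual code: `session`, `sessionCache`, `fakeClock` and `newDemoCache` are hypothetical names, the login is faked, and the 24-hour lifetime is taken from the 86400 s reported in this issue.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// session is a hypothetical stand-in for an authenticated vCenter session.
type session struct {
	id        int
	expiresAt time.Time
}

// fakeClock lets the demo move time forward without sleeping.
type fakeClock struct{ t time.Time }

func (f *fakeClock) now() time.Time          { return f.t }
func (f *fakeClock) advance(d time.Duration) { f.t = f.t.Add(d) }

// sessionCache returns the cached session and logs in again once it has expired.
type sessionCache struct {
	mu      sync.Mutex
	current *session
	login   func() (*session, error) // hypothetical login hook
	now     func() time.Time
}

// get returns a usable session, creating a new one if none is cached
// or the cached one is past its expiry time.
func (c *sessionCache) get() (*session, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if c.current != nil && c.now().Before(c.current.expiresAt) {
		return c.current, nil
	}
	// Drop the stale session first so a failed login never leaves it cached.
	c.current = nil
	s, err := c.login()
	if err != nil {
		return nil, fmt.Errorf("re-login failed: %w", err)
	}
	c.current = s
	return s, nil
}

// newDemoCache wires a cache to a fake login that issues sessions valid for 24 hours.
func newDemoCache() (*sessionCache, *fakeClock) {
	clk := &fakeClock{t: time.Date(2024, 2, 27, 8, 0, 0, 0, time.UTC)}
	next := 0
	c := &sessionCache{now: clk.now}
	c.login = func() (*session, error) {
		next++
		return &session{id: next, expiresAt: clk.now().Add(24 * time.Hour)}, nil
	}
	return c, clk
}

func main() {
	c, clk := newDemoCache()
	first, _ := c.get()
	clk.advance(24*time.Hour + time.Minute) // just past the 86400 s lifetime
	second, _ := c.get()
	fmt.Println("session ids:", first.id, second.id)
}
```

If the real code path behaves like `get` above, an expired session alone should not stop reconciliation, which is why further log messages would help narrow this down.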

hzproe · 2024-02-28T14:15:53Z

Hi, we have no further context around it. We just see the "REST client session expired, clearing session" message twice, and that's it.

Afterwards we see the message below repeated, but no further activity in the log that contacts vCenter.

In a TCP dump we can see that the keepalive is still happening, but the session to vCenter is already dead and can't be re-established.

2024-02-28 08:33:01.870 | {"message":"I0228 07:33:01.870579 1 vimmachine.go:385] \"Updated VSphereVM\" controller=\"vspheremachine\" controllerGroup=\"infrastructure.cluster.x-k8s.io\" controllerKind=\"VSphereMachine

We are currently running on 1.9.1
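
To check how often the expiry message shows up in collected logs, a small scan along these lines could help. This is only a sketch: `countExpired` and `countExpiredString` are made-up helper names, and it assumes the logs are available as plain text with one entry per line. Only the message text itself comes from the log line quoted in this issue.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strings"
)

// marker is the message text from the CAPV log line quoted in this issue.
const marker = "REST client session expired, clearing session"

// countExpired counts the lines in r that contain the marker text.
func countExpired(r io.Reader) (int, error) {
	sc := bufio.NewScanner(r)
	// Allow long JSON log lines (up to 1 MiB) instead of the 64 KiB default.
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	n := 0
	for sc.Scan() {
		if strings.Contains(sc.Text(), marker) {
			n++
		}
	}
	return n, sc.Err()
}

// countExpiredString is a convenience wrapper for in-memory log text.
func countExpiredString(s string) (int, error) {
	return countExpired(strings.NewReader(s))
}

func main() {
	// Reads log lines from standard input, e.g. piped from a log export.
	n, err := countExpired(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, "reading logs:", err)
		os.Exit(1)
	}
	fmt.Println("session-expired messages:", n)
}
```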

sempex · 2024-02-28T15:25:05Z

I face a similar issue: when I try to delete a cluster, nothing happens until I use clusterctl to delete the vSphere provider and reinstall it. I use v1.9.1.

hzproe · 2024-02-28T15:27:01Z

> I face a similar issue: when I try to delete a cluster, nothing happens until I use clusterctl to delete the vSphere provider and reinstall it. I use v1.9.1.

For us it is good enough to restart the CAPV controller, as this re-establishes the connection.

sempex · 2024-02-28T15:28:41Z

> > I face a similar issue: when I try to delete a cluster, nothing happens until I use clusterctl to delete the vSphere provider and reinstall it. I use v1.9.1.

> For us it is good enough to restart the CAPV controller, as this re-establishes the connection.

Hi, thanks for your fast response. So you just delete the pod?

hzproe · 2024-02-28T15:32:44Z

> > > I face a similar issue: when I try to delete a cluster, nothing happens until I use clusterctl to delete the vSphere provider and reinstall it. I use v1.9.1.

> > For us it is good enough to restart the CAPV controller, as this re-establishes the connection.

> Hi, thanks for your fast response. So you just delete the pod?

Yes, just delete the pod or restart the deployment.

sempex · 2024-02-28T15:34:24Z

I will try that. I'm going to dig through the logs and check whether I can gain more insights.

chrischdi · 2024-02-28T15:51:51Z

To figure out if it is the keepalive which breaks the functionality for both of you:

You could try to disable the keepalive handler by adding the flag --enable-keep-alive=false.
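
One detail worth noting is the `=false` form. The sketch below uses Go's standard `flag` package purely for illustration; it is not CAPV's `main.go`, which may use a different flag library, and `parseKeepAlive` is a hypothetical helper. With the standard package, a boolean flag only takes an explicit value in the `--name=value` form; in the space-separated form the `false` is treated as a positional argument and the flag ends up `true`.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"os"
)

// parseKeepAlive parses args and returns the resulting keepalive setting.
// defaultValue is a parameter because the default differs between releases.
func parseKeepAlive(args []string, defaultValue bool) (bool, error) {
	fs := flag.NewFlagSet("manager", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep parse errors out of stderr; they are returned instead
	enable := fs.Bool("enable-keep-alive", defaultValue, "enable the session keepalive handler")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return *enable, nil
}

func main() {
	// The default of true is an assumption for this demo.
	enabled, err := parseKeepAlive(os.Args[1:], true)
	if err != nil {
		fmt.Fprintln(os.Stderr, "flag error:", err)
		os.Exit(2)
	}
	fmt.Println("keepalive enabled:", enabled)
}
```

In a deployment manifest this means the container argument should be the single string `--enable-keep-alive=false`.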

chrischdi · 2024-02-28T15:58:34Z

Probably related change:

🐛 Force logout of sessions in case of errors #2530

It was backported to >= v1.8.5 and >= v1.7.5.

hzproe · 2024-02-28T15:59:05Z

> To figure out if it is the keepalive which breaks the functionality for both of you:

> You could try to disable the keepalive handler by adding the flag --enable-keep-alive=false.

Thanks, will try that. It seems obvious, but we were going in the other direction with a more aggressive keepalive (from 5 minutes down to 3 minutes), which didn't help. Disabling it might be a better solution.

We are trying it now and will report back after 24 h + 1 minute.

Thank you so much

zhanggbj · 2024-02-29T07:44:39Z

The session logout issue is caused by the underlying govmomi package; more details are in vmware/govmomi#3240.

In CAPV, fixes to mitigate the impact are being discussed in an ongoing PR, #2601.

The current workaround is to restart CAPV when it gets stuck, or to disable keepalive.

chrischdi · 2024-02-29T07:51:15Z

> The session logout issue is caused by the underlying govmomi package; more details are in vmware/govmomi#3240.

> In CAPV, fixes to mitigate the impact are being discussed in an ongoing PR, #2601.

> The current workaround is to restart CAPV when it gets stuck, or to disable keepalive.

Totally forgot about that one. Thanks for linking it!

hzproe · 2024-03-01T07:26:50Z

Hi @chrischdi @zhanggbj you were correct, this workaround fixed the issue for us. Thank you very much.

sempex · 2024-03-01T07:43:46Z

> Hi @chrischdi @zhanggbj you were correct, this workaround fixed the issue for us. Thank you very much.

Works for me too, thank you very much!

chrischdi · 2024-03-01T09:46:37Z

I'm still curious how this happened, though (in order to reproduce it).

My local installation does not hit the issue (it has only a single workload cluster and runs CAPV v1.9.1).

hzproe · 2024-03-07T11:59:28Z

Hi, we tried a demo install (kind cluster + CAPV 1.9.1) and observed the same issue when the --enable-keep-alive flag is set in the deployment.

You get the session expired message and can then no longer create or delete machines: they stay stuck indefinitely until someone re-establishes the session by restarting the CAPV controller.

BR
Heinz

erkanerol · 2024-06-07T09:31:24Z

Summary

Setting --enable-keep-alive=false solves this issue.
The flag is true by default in 1.9.3 and the previous versions. See https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/v1.9.3/pkg/constants/constants.go#L56
The flag is false by default in 1.10.0. See https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/v1.10.0/pkg/constants/constants.go#L56
The flag is deprecated in 1.10.0. See https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/v1.10.0/main.go#L151
The flag has been removed from the master branch (commit a696b5b) and will not exist in upcoming versions.

k8s-triage-robot · 2024-09-05T10:03:31Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2024-10-05T10:39:22Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · 2024-11-04T11:23:57Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot · 2024-11-04T11:24:02Z

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied

After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied

After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen

Mark this issue as fresh with /remove-lifecycle rotten

Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 28, 2024

sbueringer added the area/govmomi Issues or PRs related to the govmomi mode label Apr 18, 2024

erkanerol mentioned this issue Jun 7, 2024 in giantswarm/cluster-api-provider-vsphere-app#99, "Disable keep alive flag to fix session expiration bug" (merged)

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 5, 2024

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 5, 2024

k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 4, 2024
