GKE storageClass Multi-Attach error #390

Open
D1StrX opened this issue Oct 18, 2024 · 10 comments · Fixed by #401
Labels
bug Something isn't working

Comments

@D1StrX
Contributor

D1StrX commented Oct 18, 2024

The Helm chart version

5.0.0-beta.112

Environment Versions

Kubernetes 1.29
GKE

Custom chart values

persistence:
  enabled: true
  storageClass: "rwo"
  subPath: ""
  accessMode: ReadWriteOnce
  size: 1Gi
  existingClaim: ""
  annotations: {}

Current Behavior & Steps to Reproduce

Regarding the storage behavior mentioned in #357, I took a deeper look into the issue. Even RWO is a problem for us when you run a cluster with multiple K8s worker nodes: the PVC, NetBox, and NetBox-worker must all reside on the same worker node, otherwise you get Multi-Attach error for volume <pvc> Volume is already used by pod(s) netbox-worker-xxx. RWX isn't available on GKE, because pd.csi.storage.gke.io doesn't support it.
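
For reference, a stop-gap is to pin the worker pods next to the main NetBox pod with pod affinity, so the RWO volume only ever attaches to one node. A minimal sketch for the worker's pod spec (plain Kubernetes podAffinity; the label selector is an assumption, adapt it to the labels the chart actually sets):

affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      # Assumed labels; check the labels on the running NetBox pod.
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: netbox
        topologyKey: kubernetes.io/hostname

This only hides the symptom, though: every pod that mounts the volume is forced onto one node, which defeats multi-node scheduling.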

And why does Netbox-worker need access to Netbox-media?

Expected Behavior

An alternative or perhaps improved documentation.

NetBox Logs

No response

@D1StrX D1StrX added the bug Something isn't working label Oct 18, 2024
@LeoColomb
Member

Thanks for filing this issue, @D1StrX.

And why does Netbox-worker need access to Netbox-media?

That's a good point.
Would removing this mount resolve the issue you're facing?

@RangerRick
Contributor

Wondering this as well. I'd think you'd still have problems if you're running multiple NetBox replicas and they end up on different nodes.

@D1StrX
Contributor Author

D1StrX commented Oct 23, 2024

As long as we don't use scriptsPersistence and reportsPersistence ... but this wouldn't resolve the main issue: scaling up the replicas would indeed recreate the same problem. A couple of solutions/directions I can think of:

  • Filestore, which is way too expensive for what it needs to achieve here.
  • (S3/cloud) buckets. NetBox already leverages this option for media and scripts; perhaps extend it to reports. This makes the NetBox installation harder to set up and maybe more expensive (see the sketch after this list).
  • RWX is nearly essential with the current setup, while RWO is only suitable when binding to a single worker node.
  • Tying NetBox dynamically to a worker node via tolerations/affinity... still doesn't support multiple replicas.
  • Removing the mount for NetBox-worker wouldn't resolve the issue, since multiple replicas reintroduce the problem.
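
To sketch the bucket option: NetBox ships with django-storages, so media can live in an S3-compatible bucket and the shared volume disappears entirely. Rough chart values, assuming the chart exposes storageBackend/storageConfig settings that map to NetBox's STORAGE_BACKEND/STORAGE_CONFIG (the keys below are standard django-storages settings; the exact chart layout is not verified against this beta):

# Hypothetical values sketch: move media to S3 instead of a PVC.
storageBackend: storages.backends.s3boto3.S3Boto3Storage
storageConfig:
  AWS_STORAGE_BUCKET_NAME: netbox-media   # illustrative bucket name
  AWS_S3_REGION_NAME: europe-west4        # illustrative region
  AWS_ACCESS_KEY_ID: "<key-id>"
  AWS_SECRET_ACCESS_KEY: "<secret>"
persistence:
  enabled: false                          # no shared media volume needed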

@LeoColomb
Member

Makes sense.
Then what's blocking you from using the proper ReadWriteMany access mode?
That would be the exact use case for this.

@D1StrX
Contributor Author

D1StrX commented Oct 23, 2024

GKE doesn't support RWX at all, and trying to get this to work is not succeeding: #394.

@LeoColomb
Member

As far as I understand, it does, just not when using Compute Engine disks.
I might be wrong, but if so, do you have any reference?

@D1StrX
Contributor Author

D1StrX commented Oct 28, 2024

Several reference points:

  1. Error I am getting: failed to provision volume with StorageClass "<storageclass>": rpc error: code = InvalidArgument desc = VolumeCapabilities is invalid: specified multi writer with mount access type

  2. In Google Cloud Platform, the default storage class uses GCE persistent disk as the provisioner. However, GCE persistent disks do not allow RWX mode: by default, a gcePersistentDisk volume only permits read-only access for multiple consumers.

  3. https://www.googlecloudcommunity.com/gc/Google-Kubernetes-Engine-GKE/pod-failed-to-use-pvc-with-standard-rwx-storageclass/m-p/796156, where the solution goes straight to Filestore.
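
For completeness, the first-party RWX path on GKE is the Filestore CSI driver, which is exactly what makes it expensive. A minimal StorageClass sketch, assuming the Filestore CSI addon is enabled on the cluster:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: netbox-rwx
provisioner: filestore.csi.storage.gke.io
parameters:
  tier: standard      # cheapest Filestore tier, still a 1 TiB minimum instance
  network: default    # VPC the Filestore instance attaches to
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

A PVC requesting accessModes: [ReadWriteMany] against this class would then provision a Filestore share instead of a Compute Engine disk.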

@LeoColomb
Member

I honestly don't know what can be done in this repository for this case.
The fact that GKE doesn't support ReadWriteMany volumes in some contexts is quite outside our expertise/ability to fix.
And NetBox is not even special in that case: as an app with a database backend, all the others I've checked define the exact same behavior.

I'm not even sure I see the use case where ReadWriteOnce, or even ReadOnlyMany, wouldn't cover the needs.
"Active" replication should be left to the database only; I'm not sure NetBox would be very suitable for full active replicas across nodes.

@D1StrX
Contributor Author

D1StrX commented Oct 30, 2024

Databases are not relevant in this context: "external" databases use StatefulSets, where each pod has its own PVC/PV.
If only the NetBox container (not the worker or housekeeping) attached to the media, scripts, and reports PVCs, the issue would be fixed. When you want to run NetBox HA, use external data sources like Git or S3 for media, scripts, and reports.

@LeoColomb
Member

Please give version 5.0.0-beta.137 (or above) a try.
A new option has been added to allow read-only volume mounts (housekeeping.readOnlyPersistence, worker.readOnlyPersistence).
I believe ReadOnlyMany should then be an adequate option.
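
For reference, a values sketch of what that could look like (keys assumed from the option names above, untested):

# Assumed layout, derived from the option names mentioned above.
persistence:
  enabled: true
  accessMode: ReadOnlyMany   # per the suggestion above
worker:
  readOnlyPersistence: true  # mount media read-only in the worker
housekeeping:
  readOnlyPersistence: true  # and in housekeeping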
