VFs are not created in v1.4.0 #786

Open
jslouisyou opened this issue Oct 8, 2024 · 3 comments

Comments

@jslouisyou

Hi, I'm facing an issue while creating VFs in v1.4.0: the IB devices disappear at the end of VF creation (it works in v1.3.0, btw).

I used the same configuration (e.g. SriovNetworkNodePolicy) for creating VFs in both versions.

Here's the SriovNetworkNodePolicy that I used:

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-gpu2-ib2
  namespace: sriov-network-operator
spec:
  isRdma: true
  linkType: ib
  nicSelector:
    deviceID: "1021"
    pfNames:
    - ibp157s0
    vendor: 15b3
  nodeSelector:
    node-role.kubernetes.io/gpu: ""
  numVfs: 8
  priority: 10
  resourceName: gpu2_mlnx_ib2
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-gpu2-ib3
  namespace: sriov-network-operator
spec:
  isRdma: true
  linkType: ib
  nicSelector:
    deviceID: "1021"
    pfNames:
    - ibp211s0
    vendor: 15b3
  nodeSelector:
    node-role.kubernetes.io/gpu: ""
  numVfs: 8
  priority: 10
  resourceName: gpu2_mlnx_ib3

I'm using an H100 node with ConnectX-7 IB adapters:

$ mst status -v
MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module loaded
PCI devices:
------------
DEVICE_TYPE             MST                           PCI       RDMA            NET                       NUMA  
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf5      e5:00.0   mlx5_5          net-ibp229s0              1  
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf4      d3:00.0   mlx5_4          net-ibp211s0              1  
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf3      c1:00.0   mlx5_3          net-ibp193s0              1  
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf2      9d:00.0   mlx5_2          net-ibp157s0              1  
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf1      54:00.0   mlx5_1          net-ibp84s0               0   
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf0      41:00.0   mlx5_0          net-ibp65s0               0

$ lspci -s 41:00.0 -vvn
41:00.0 0207: 15b3:1021
	Subsystem: 15b3:0041
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 18
	NUMA node: 0
	Region 0: Memory at 23e044000000 (64-bit, prefetchable) [size=32M]
	Expansion ROM at <ignored> [disabled]
	Capabilities: [60] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 25.000W
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 256 bytes, MaxReadReq 4096 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 32GT/s, Width x16, ASPM not supported
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 32GT/s (ok), Width x16 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABC, TimeoutDis+ NROPrPrP- LTR-
			 10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- TPHComp- ExtTPHComp-
			 AtomicOpsCap: 32bit+ 64bit+ 128bitCAS+
		DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- LTR- OBFF Disabled,
			 AtomicOpsCtl: ReqEn+
		LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
		LnkCtl2: Target Link Speed: 32GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
			 EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [48] Vital Product Data
		Product Name: Nvidia ConnectX-7 Single Port Infiniband NDR OSFP Adapter
		Read-only fields:
			[PN] Part number: 0RYMTY
			[EC] Engineering changes: A02
			[MN] Manufacture ID: 1028
			[SN] Serial number: IN0RYMTYJBNM43BRJ4KF
			[VA] Vendor specific: DSV1028VPDR.VER2.1
			[VB] Vendor specific: FFV28.39.10.02
			[VC] Vendor specific: NPY1
			[VD] Vendor specific: PMTD
			[VE] Vendor specific: NMVNvidia, Inc.
			[VH] Vendor specific: L1D0
			[VU] Vendor specific: IN0RYMTYJBNM43BRJ4KFMLNXS0D0F0 
			[RV] Reserved: checksum good, 0 byte(s) reserved
		End
	Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00003000
	Capabilities: [c0] Vendor Specific Information: Len=18 <?>
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable+ DSel=0 DScale=0 PME-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		CEMsk:	RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
		AERCap:	First Error Pointer: 04, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 0
		ARICtl:	MFVC- ACS-, Function Group: 0
	Capabilities: [1c0 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Capabilities: [320 v1] Lane Margining at the Receiver <?>
	Capabilities: [370 v1] Physical Layer 16.0 GT/s <?>
	Capabilities: [3b0 v1] Extended Capability ID 0x2a
	Capabilities: [420 v1] Data Link Feature <?>
	Kernel driver in use: mlx5_core
	Kernel modules: mlx5_core

I pulled the v1.3.0 and v1.4.0 Helm charts from oci://ghcr.io/k8snetworkplumbingwg/sriov-network-operator-chart, and their image tags differ:

  1. v1.3.0
images:
  operator: ghcr.io/k8snetworkplumbingwg/sriov-network-operator:v1.3.0
  sriovConfigDaemon: ghcr.io/k8snetworkplumbingwg/sriov-network-operator-config-daemon:v1.3.0
  sriovCni: ghcr.io/k8snetworkplumbingwg/sriov-cni:v2.8.0
  ibSriovCni: ghcr.io/k8snetworkplumbingwg/ib-sriov-cni:v1.1.1
  ovsCni: ghcr.io/k8snetworkplumbingwg/ovs-cni-plugin:v0.34.0
  sriovDevicePlugin: ghcr.io/k8snetworkplumbingwg/sriov-network-device-plugin:v3.7.0
  resourcesInjector: ghcr.io/k8snetworkplumbingwg/network-resources-injector:v1.6.0
  webhook: ghcr.io/k8snetworkplumbingwg/sriov-network-operator-webhook:v1.3.0
  2. v1.4.0
images:
  operator: ghcr.io/k8snetworkplumbingwg/sriov-network-operator:v1.4.0
  sriovConfigDaemon: ghcr.io/k8snetworkplumbingwg/sriov-network-operator-config-daemon:v1.4.0
  sriovCni: ghcr.io/k8snetworkplumbingwg/sriov-cni:v2.8.1
  ibSriovCni: ghcr.io/k8snetworkplumbingwg/ib-sriov-cni:v1.1.1
  ovsCni: ghcr.io/k8snetworkplumbingwg/ovs-cni-plugin:v0.34.2
  rdmaCni: ghcr.io/k8snetworkplumbingwg/rdma-cni:v1.2.0
  sriovDevicePlugin: ghcr.io/k8snetworkplumbingwg/sriov-network-device-plugin:v3.7.0
  resourcesInjector: ghcr.io/k8snetworkplumbingwg/network-resources-injector:v1.6.0
  webhook: ghcr.io/k8snetworkplumbingwg/sriov-network-operator-webhook:v1.4.0
  metricsExporter: ghcr.io/k8snetworkplumbingwg/sriov-network-metrics-exporter:v1.1.0
  metricsExporterKubeRbacProxy: gcr.io/kubebuilder/kube-rbac-proxy:v0.15.0
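
To make the difference between the two image lists easier to see, here is a minimal Python sketch that diffs them. The two dictionaries are copied from the chart values above; the helper names `split_tag` and `diff_images` are made up for this illustration and are not part of the operator or the chart.

```python
# Image maps copied from the v1.3.0 and v1.4.0 chart values listed above.
BASE = "ghcr.io/k8snetworkplumbingwg/"

V1_3_0 = {
    "operator": BASE + "sriov-network-operator:v1.3.0",
    "sriovConfigDaemon": BASE + "sriov-network-operator-config-daemon:v1.3.0",
    "sriovCni": BASE + "sriov-cni:v2.8.0",
    "ibSriovCni": BASE + "ib-sriov-cni:v1.1.1",
    "ovsCni": BASE + "ovs-cni-plugin:v0.34.0",
    "sriovDevicePlugin": BASE + "sriov-network-device-plugin:v3.7.0",
    "resourcesInjector": BASE + "network-resources-injector:v1.6.0",
    "webhook": BASE + "sriov-network-operator-webhook:v1.3.0",
}

V1_4_0 = {
    "operator": BASE + "sriov-network-operator:v1.4.0",
    "sriovConfigDaemon": BASE + "sriov-network-operator-config-daemon:v1.4.0",
    "sriovCni": BASE + "sriov-cni:v2.8.1",
    "ibSriovCni": BASE + "ib-sriov-cni:v1.1.1",
    "ovsCni": BASE + "ovs-cni-plugin:v0.34.2",
    "rdmaCni": BASE + "rdma-cni:v1.2.0",
    "sriovDevicePlugin": BASE + "sriov-network-device-plugin:v3.7.0",
    "resourcesInjector": BASE + "network-resources-injector:v1.6.0",
    "webhook": BASE + "sriov-network-operator-webhook:v1.4.0",
    "metricsExporter": BASE + "sriov-network-metrics-exporter:v1.1.0",
    "metricsExporterKubeRbacProxy": "gcr.io/kubebuilder/kube-rbac-proxy:v0.15.0",
}


def split_tag(image):
    """Split 'repo:tag' into (repo, tag) at the last colon."""
    repo, _, tag = image.rpartition(":")
    return repo, tag


def diff_images(old, new):
    """Return (added keys, removed keys, {key: (old tag, new tag)})."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = {
        name: (split_tag(old[name])[1], split_tag(new[name])[1])
        for name in sorted(set(old) & set(new))
        if old[name] != new[name]
    }
    return added, removed, changed


if __name__ == "__main__":
    added, removed, changed = diff_images(V1_3_0, V1_4_0)
    print("added:  ", added)
    print("removed:", removed)
    for name, (old_tag, new_tag) in changed.items():
        print(f"changed: {name}: {old_tag} -> {new_tag}")
```

The ib-sriov-cni and sriov-network-device-plugin tags are identical in both charts, so the differences are limited to the operator, config daemon, and webhook images, the sriov-cni and ovs-cni bumps, and the three newly added images.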

As you know, sriov-device-plugin pods are created when a SriovNetworkNodePolicy is deployed.
After that, my H100 nodes' state changes from sriovnetwork.openshift.io/state: Idle to sriovnetwork.openshift.io/state: Reboot_Required, and the nodes reboot after some time.
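
For anyone who wants to follow that state transition themselves, here is a small Python sketch that pulls the sriovnetwork.openshift.io/state value out of a node object as printed by `kubectl get node <node> -o json`. It assumes the key is stored under the node's metadata.annotations and falls back to metadata.labels in case that assumption is wrong; the function name `sriov_state` and the node name in the sample are hypothetical.

```python
import json

STATE_KEY = "sriovnetwork.openshift.io/state"

# Hypothetical, trimmed-down node object for illustration only.
SAMPLE_NODE = json.dumps({
    "kind": "Node",
    "metadata": {
        "name": "gpu-node-1",
        "annotations": {STATE_KEY: "Reboot_Required"},
    },
})


def sriov_state(node_json):
    """Return the SR-IOV state string of a node, or None if it is not set.

    Assumption: the key lives in metadata.annotations; metadata.labels is
    checked as a fallback.
    """
    meta = json.loads(node_json).get("metadata", {})
    for field in ("annotations", "labels"):
        value = (meta.get(field) or {}).get(STATE_KEY)
        if value is not None:
            return value
    return None


if __name__ == "__main__":
    print(sriov_state(SAMPLE_NODE))
```

In practice the JSON string would come from the kubectl output rather than from the embedded sample.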

But in v1.4.0, it seems that the VFs were created but eventually stopped showing up, and even the PF disappeared. Here are the logs from dmesg:

[  115.692158] pci 0000:41:00.1: [15b3:101e] type 00 class 0x020700
[  115.692321] pci 0000:41:00.1: enabling Extended Tags
[  115.694112] mlx5_core 0000:41:00.1: enabling device (0000 -> 0002)
[  115.694789] mlx5_core 0000:41:00.1: firmware version: 28.39.1002
[  115.867939] mlx5_core 0000:41:00.1: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[  115.867943] mlx5_core 0000:41:00.1: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[  115.892812] pci 0000:41:00.2: [15b3:101e] type 00 class 0x020700
[  115.892967] pci 0000:41:00.2: enabling Extended Tags
[  115.894706] mlx5_core 0000:41:00.2: enabling device (0000 -> 0002)
[  115.895344] mlx5_core 0000:41:00.2: firmware version: 28.39.1002
[  115.895423] mlx5_core 0000:41:00.1 ibp65s0v0: renamed from ib0
[  116.065557] mlx5_core 0000:41:00.2: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[  116.065561] mlx5_core 0000:41:00.2: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[  116.090478] pci 0000:41:00.3: [15b3:101e] type 00 class 0x020700
[  116.090634] pci 0000:41:00.3: enabling Extended Tags
[  116.093559] mlx5_core 0000:41:00.3: enabling device (0000 -> 0002)
[  116.093993] mlx5_core 0000:41:00.2 ibp65s0v1: renamed from ib0
[  116.094189] mlx5_core 0000:41:00.3: firmware version: 28.39.1002
[  116.293582] mlx5_core 0000:41:00.3: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[  116.293587] mlx5_core 0000:41:00.3: MLX5E: StrdRq(0) RqSz(1024) StrdSz(256) RxCqeCmprss(0)
[  116.318209] pci 0000:41:00.4: [15b3:101e] type 00 class 0x020700
[  116.318368] pci 0000:41:00.4: enabling Extended Tags
[  116.320079] mlx5_core 0000:41:00.4: enabling device (0000 -> 0002)
[  116.320712] mlx5_core 0000:41:00.4: firmware version: 28.39.1002
[  116.320871] mlx5_core 0000:41:00.3 ibp65s0v2: renamed from ib0
.....
[  446.036867] mlx5_core 0000:41:01.0 ibp65s0v7: renamed from ib0
[  446.464555] mlx5_core 0000:41:00.0: mlx5_wait_for_pages:898:(pid 6868): Skipping wait for vf pages stage
[  448.848149] mlx5_core 0000:41:00.0: driver left SR-IOV enabled after remove                                               <----------- weird
[  449.108562] mlx5_core 0000:41:00.2: poll_health:955:(pid 0): Fatal error 3 detected
[  449.108602] mlx5_core 0000:41:00.4: poll_health:955:(pid 0): Fatal error 3 detected
[  449.108620] mlx5_core 0000:41:00.2: mlx5_health_try_recover:375:(pid 1478): handling bad device here
[  449.108627] mlx5_core 0000:41:00.2: mlx5_handle_bad_state:326:(pid 1478): starting teardown
[  449.108629] mlx5_core 0000:41:00.2: mlx5_error_sw_reset:277:(pid 1478): start
[  449.108646] mlx5_core 0000:41:00.4: mlx5_health_try_recover:375:(pid 2283): handling bad device here
[  449.108660] mlx5_core 0000:41:00.4: mlx5_handle_bad_state:326:(pid 2283): starting teardown
[  449.108661] mlx5_core 0000:41:00.4: mlx5_error_sw_reset:277:(pid 2283): start
[  449.108672] mlx5_core 0000:41:00.2: mlx5_error_sw_reset:310:(pid 1478): end
[  449.108694] mlx5_core 0000:41:00.4: mlx5_error_sw_reset:310:(pid 2283): end
[  449.876577] mlx5_core 0000:41:00.5: poll_health:955:(pid 0): Fatal error 3 detected
[  449.876642] mlx5_core 0000:41:00.5: mlx5_health_try_recover:375:(pid 1000): handling bad device here
[  449.876649] mlx5_core 0000:41:00.5: mlx5_handle_bad_state:326:(pid 1000): starting teardown
[  449.876651] mlx5_core 0000:41:00.5: mlx5_error_sw_reset:277:(pid 1000): start
[  449.877266] mlx5_core 0000:41:00.5: mlx5_error_sw_reset:310:(pid 1000): end
[  450.381036] mlx5_core 0000:41:00.2: mlx5_health_try_recover:381:(pid 1478): starting health recovery flow

** The messages above are from a run where I selected ibp65s0 for VF creation. Sorry for the confusion. This behavior happens regardless of the PF name.
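
To pick the interesting events out of a long dmesg capture like the one above, a minimal Python sketch could look like this. The sample lines are copied from the log above (with the "weird" annotation stripped); the function name `summarize` and the result layout are made up for this illustration. As far as I know, the "driver left SR-IOV enabled after remove" message is printed by the PCI core when a PF driver is unbound while its VFs are still enabled, but treat that reading as an assumption.

```python
import re

# "[  448.848149] mlx5_core 0000:41:00.0: message" or
# "[  446.036867] mlx5_core 0000:41:01.0 ibp65s0v7: message"
LINE_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+mlx5_core\s+"
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])[: ]\s*(?P<msg>.*)$"
)

SAMPLE = """\
[  446.036867] mlx5_core 0000:41:01.0 ibp65s0v7: renamed from ib0
[  446.464555] mlx5_core 0000:41:00.0: mlx5_wait_for_pages:898:(pid 6868): Skipping wait for vf pages stage
[  448.848149] mlx5_core 0000:41:00.0: driver left SR-IOV enabled after remove
[  449.108562] mlx5_core 0000:41:00.2: poll_health:955:(pid 0): Fatal error 3 detected
[  449.108602] mlx5_core 0000:41:00.4: poll_health:955:(pid 0): Fatal error 3 detected
[  449.876577] mlx5_core 0000:41:00.5: poll_health:955:(pid 0): Fatal error 3 detected
"""


def summarize(log_text):
    """Collect PF-removal warnings and the first fatal error per device."""
    result = {"sriov_left_enabled": [], "fatal": {}}
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # not an mlx5_core line with a PCI address
        ts, bdf, msg = float(match["ts"]), match["bdf"], match["msg"]
        if "driver left SR-IOV enabled after remove" in msg:
            result["sriov_left_enabled"].append((ts, bdf))
        elif "Fatal error" in msg:
            # Keep only the first fatal timestamp seen for each device.
            result["fatal"].setdefault(bdf, ts)
    return result


if __name__ == "__main__":
    summary = summarize(SAMPLE)
    for ts, bdf in summary["sriov_left_enabled"]:
        print(f"{ts:>12.6f}  {bdf}  PF removed with SR-IOV still enabled")
    for bdf, ts in summary["fatal"].items():
        print(f"{ts:>12.6f}  {bdf}  first fatal health error")
```

On the sample, the warning on the PF (0000:41:00.0) comes first, and the fatal health errors on the VFs (0000:41:00.2, .4, .5) follow within about a second.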

After that, when I run mst status -v, the node can't even find the PF itself (the RDMA and NET columns are empty):

$ mst status -v
MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module loaded
PCI devices:
------------
DEVICE_TYPE             MST                           PCI       RDMA            NET                       NUMA  
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf5      e5:00.0   mlx5_5          net-ibp229s0              1     
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf4      d3:00.0                             1                                  <---- it goes empty
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf3      c1:00.0   mlx5_3          net-ibp193s0              1     
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf2      9d:00.0                             1                                  <---- it goes empty     
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf1      54:00.0   mlx5_1          net-ibp84s0               0     
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf0      41:00.0   mlx5_0          net-ibp65s0               0 
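
To compare the two mst status -v listings mechanically, here is a rough Python sketch that parses the device rows and reports which PFs lost their RDMA and NET entries. The sample rows are copied from the two outputs in this report; the function names `parse_mst_status` and `vanished_pfs` are hypothetical, and the whitespace-based parsing is an assumption about the layout, not something taken from the MFT tools.

```python
BEFORE = """\
DEVICE_TYPE             MST                           PCI       RDMA            NET                       NUMA
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf5      e5:00.0   mlx5_5          net-ibp229s0              1
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf4      d3:00.0   mlx5_4          net-ibp211s0              1
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf3      c1:00.0   mlx5_3          net-ibp193s0              1
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf2      9d:00.0   mlx5_2          net-ibp157s0              1
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf1      54:00.0   mlx5_1          net-ibp84s0               0
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf0      41:00.0   mlx5_0          net-ibp65s0               0
"""

AFTER = """\
DEVICE_TYPE             MST                           PCI       RDMA            NET                       NUMA
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf5      e5:00.0   mlx5_5          net-ibp229s0              1
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf4      d3:00.0                             1     <---- it goes empty
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf3      c1:00.0   mlx5_3          net-ibp193s0              1
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf2      9d:00.0                             1     <---- it goes empty
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf1      54:00.0   mlx5_1          net-ibp84s0               0
ConnectX7(rev:0)        /dev/mst/mt4129_pciconf0      41:00.0   mlx5_0          net-ibp65s0               0
"""


def parse_mst_status(text):
    """Parse device rows; RDMA and NET are None when the columns are empty."""
    rows = []
    for line in text.splitlines():
        # Drop hand-written "<----" annotations before splitting.
        tokens = line.split("<-", 1)[0].split()
        if len(tokens) < 4 or not tokens[1].startswith("/dev/mst/"):
            continue  # header, separator, or module status line
        pci, rest = tokens[2], tokens[3:]
        if len(rest) == 3:
            rdma, net, numa = rest
            if net.startswith("net-"):
                net = net[4:]  # "net-ibp211s0" -> "ibp211s0"
        else:
            # Simplification: a short row is treated as missing both columns.
            rdma, net, numa = None, None, rest[-1]
        rows.append({"pci": pci, "rdma": rdma, "net": net, "numa": numa})
    return rows


def vanished_pfs(before_text, after_text):
    """Map interface name -> PCI address for PFs whose NET entry went away."""
    before = {row["pci"]: row for row in parse_mst_status(before_text)}
    gone = {}
    for row in parse_mst_status(after_text):
        old = before.get(row["pci"])
        if old and old["net"] and row["net"] is None:
            gone[old["net"]] = row["pci"]
    return gone


if __name__ == "__main__":
    for name, pci in vanished_pfs(BEFORE, AFTER).items():
        print(f"{name} ({pci}) lost its RDMA/NET devices")
```

On these samples, the two PFs that go empty (d3:00.0 and 9d:00.0) are ibp211s0 and ibp157s0, which are exactly the pfNames selected by the two policies above.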

Do you know anything about this situation? Anything would be very helpful.

Thanks.

@adrianchiris
Collaborator

Maybe it's related to #797?

@SchSeba
Collaborator

SchSeba commented Dec 11, 2024

Hi @jslouisyou,

Can you check with the latest sriov-operator, which contains #797?

@SchSeba
Collaborator

SchSeba commented Dec 31, 2024

Hi @jslouisyou, any update?
