Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Devlink Error message #413

Merged
merged 5 commits into from
Mar 2, 2022

Conversation

Eoghan1232
Copy link
Collaborator

@Eoghan1232 Eoghan1232 commented Feb 28, 2022

This commit fixes an IF condition being met incorrectly.

The GetPfName() in utils package returns the wrong error even when device is non-devlink.
line in utils.go
because it compares underlying netlink provider error message to different string than what actual implementation returns.
line in netlink_provider.go

signed off:
[email protected]

thanks to @ahalimx86 for spotting this.

@Eoghan1232
Copy link
Collaborator Author

I can see the golangci-lint due to "context deadline exceeded". This seems to be occurring on other PR's also.
The "build-test-lint / test" failed due to some dependency issues. I fixed this on my own branch when testing, by adding "go mod tidy" to the Makefile on line 98.

Will discuss in meeting.

@Eoghan1232
Copy link
Collaborator Author

@adrianchiris Could you review?

@adrianchiris
Copy link
Contributor

because it compares underlying netlink provider error message to different string than what actual implementation returns.
line in netlink_provider.go

please note that we embed (sort of) the netlink error with %v

@@ -67,7 +67,7 @@ func GetPfName(pciAddr string) (string, error) {
if pfEswitchMode == "" {
// If device doesn't support eswitch mode query or doesn't have sriov enabled,
// fall back to the default implementation
if err == nil || strings.Contains(strings.ToLower(fmt.Sprint(err)), "no such device") {
if err == nil || strings.Contains(strings.ToLower(fmt.Sprint(err)), "error getting devlink device attributes for net device") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we are intentionally matching on the internal error returned by netlink.

+@almaslennikov

@Eoghan1232 are you hitting an issue with intel NICs not returning "no such device" error when trying to get devlink device attributes ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adrianchiris "no such device" maybe gets returned in real environment but in a test environment with mock sysfs you won't necessarily get this exact error.
For example, in some external code that uses this function and have unit test with mock sysfs they might get "no such file or directory" instead. In that case, it breaks the non-devlink detection logic here.

Since GetPfEswitchMode() returns error string from GetDevLinkDeviceEswitchAttrs() it's safe to rely on that error message only if the underlying netlink fails to find such device in a system - real or mock.

Copy link
Contributor

@adrianchiris adrianchiris Feb 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adrianchiris "no such device" maybe gets returned in real environment but in a test environment with mock sysfs you won't necessarily get this exact error.

if there is a mock sysfs ,cant you also mock GetDevLinkDeviceEswitchAttrs ?

we do not want to fallback to the regular flow if netlink failed for reason other than

  • device doesnt support eswitch mode query
  • device doesnt have sriov enabled

since in this case, if device is in switchdev mode but failed for a reason other than the above, we would want to fail else we risk returning the wrong PF name.

I can think of two things:

  1. adjust tests (netlink provider is an interface)
  2. change this to a list of expected strings and check if one is a substring of the returned error

do you have in house e2e tests that just use mock sysfs ? there are other calls to netlink in general are you not calling them ? (i.e GetLinkAttr)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adrianchiris "no such device" maybe gets returned in real environment but in a test environment with mock sysfs you won't necessarily get this exact error.

if there is a mock sysfs ,cant you also mock GetDevLinkDeviceEswitchAttrs ?

we do not want to fallback to the regular flow if netlink failed for reason other than

  • device doesnt support eswitch mode query
  • device doesnt have sriov enabled

This is where the problem is. It does not come to the 2nd option at all because the first option return the error as with GetPfEswitchMode(pciAddr) returned ("", err). So it evaluates error string comparison and it evaluates to false and return GetPfName() to caller with error. Never get a chance to check sriov enable part. Please review the flow again.

since in this case, if device is in switchdev mode but failed for a reason other than the above, we would want to fail else we risk returning the wrong PF name.

I can think of two things:

  1. adjust tests (netlink provider is an interface)
  2. change this to a list of expected strings and check if one is a substring of the returned error

do you have in house e2e tests that just use mock sysfs ? there are other calls to netlink in general are you not calling them ? (i.e GetLinkAttr)

The NetlinkProvider interface is internal to this package only. External client of this package cannot and should not have to mock any internal interfaces of this package. The utils.GetPfName() function should be testable with a minimal mock sysfs as it used to be.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where the problem is. It does not come to the 2nd option at all because the first option return the error as with GetPfEswitchMode(pciAddr) returned ("", err). So it evaluates error string comparison and it evaluates to false and return GetPfName() to caller with error. Never get a chance to check sriov enable part. Please review the flow again.

Do you mean the call to GetPfAddr() fails in GetPfEswitchMode() and returns error to GetPfName() ?
if GetPfAddr() fails, it means it failed to read symlink /sys/bus/pci/devices/<pci addr>/physfn
so in this case we would fail in GetPfName() even if we fallback to the previous flow since we try to read from the same base path. (unless in the mocked environment /sys/bus/pci/devices/<pci addr>/physfn is not a symlink)

or do you mean, GetPfAddr() call in GetPfEswitchMode() returns the PF addr of the VF (or its own address in case of PF) however the devlink call which follows this fails as the isnt actually a device ?

can you walk me through the flow to help me understand ? what does GetPfName gets for input ? what is the mocked sysfs ?

The NetlinkProvider interface is internal to this package only.

actually its public and you can get and set the implementation, defaulting to defaultNetlinkProvider which implements the NetlinkProvider interface, you can set a mock for it.

function should be testable with a minimal mock sysfs as it used to be.

I do not agree with this statement, i dont want to impose such constraints on implementation.
there are other places where we want to use devlink/netlink or other external libraries.

#399
#387
#306

Copy link
Contributor

@adrianchiris adrianchiris Mar 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Eoghan1232 could you rebase the PR ? there are some changes introduced with #410

i think we can simplify the whole flow flow here:

	pfEswitchMode, err := GetPfEswitchMode(pciAddr)
	if pfEswitchMode == "" || err != nil {
			glog.Infof("Devlink query for eswitch mode is not supported for device %s. err: %v", pciAddr, err)
	} else if pfEswitchMode == eswitchModeSwitchdev {
		name, err := GetSriovnetProvider().GetUplinkRepresentor(pciAddr)
		if err != nil {
			return "", err
		}

		return name, nil
	}

	path := filepath.Join(sysBusPci, pciAddr, "physfn", "net")

which should work with a mocked sysfs test env

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR rebased.

@adrianchiris
Copy link
Contributor

@Eoghan1232

The GetPfName() in utils package returns the wrong error even when device is non-devlink.

the netlink error is embedded within the returned error, what is the full error you are getting.
the issue with netlink go impl is that there is no good way to understand programatically why it failed so we resort to string comparison :(

@adrianchiris
Copy link
Contributor

adrianchiris commented Mar 2, 2022

OK, was trying to remember why we matched on this specific string and not the one returned by netlink provider.

  1. i couldnt find a good reason for doing it. Intel i40e returned this error with devlink, but its safe to use the string proposed here.
  2. its generally not a good idea to rely in code path on error strings especially if those are provided by external libs which are an impl of an interface we use internally.

also we saw some wierd behaviour with netlink, hence the whole "checking for empty eswitch mode":

#398 (comment)

I left another comment on how we can further simplify this, LMK what you think.

@Eoghan1232 Eoghan1232 merged commit 5c77c77 into k8snetworkplumbingwg:master Mar 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants