Delete secondaryNetwork OVS ports correctly after an Agent restart #6853

Open
wants to merge 1 commit into
base: main

Conversation

KMAnju-2021
Contributor

Closes: #6578

Restore the SecondaryNetwork interfaces into the interface store after an Agent restart, so that stale SecondaryNetwork OVS ports are deleted correctly.

@KMAnju-2021 KMAnju-2021 force-pushed the secondary-network-port branch 2 times, most recently from 63e028a to 3b7abe8 Compare December 11, 2024 19:15

@antoninbas antoninbas left a comment

new tests are needed under test/e2e-secondary-network to validate this patch

@KMAnju-2021 KMAnju-2021 force-pushed the secondary-network-port branch 6 times, most recently from 3bf9b47 to 63d1231 Compare December 17, 2024 11:50
@rajnkamr rajnkamr added this to the Antrea v2.3 release milestone Dec 26, 2024
pkg/agent/secondarynetwork/podwatch/controller.go
klog.InfoS("The container interface has been deleted ", "container id", Config.ContainerID)
return
}
event := types.PodUpdate{

Is it intended to re-send the PodUpdate event after an agent restart?

Hmm, I think we should re-construct cniCache using the information in the primary interface store (after its restoration), and then post a deletion event for each Pod that is in the secondary interface store but not in cniCache / the primary interface store.
@wenyingd: thoughts?
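
As a rough illustration of this proposal, here is a minimal Go sketch. The `podKey` and `iface` types and the `rebuildCNICache` and `stalePods` helpers are hypothetical simplifications, not Antrea's actual interface store or cniCache APIs. The cache is rebuilt from the restored primary store, and every Pod that owns a secondary interface but is missing from the cache is collected so that a deletion event can be posted for it.

```go
package main

// podKey identifies a Pod by Namespace and Name (hypothetical type for this sketch).
type podKey struct {
	Namespace, Name string
}

// iface is a simplified stand-in for an interface store entry.
type iface struct {
	Pod         podKey
	ContainerID string
	IFDev       string
}

// rebuildCNICache maps each Pod found in the restored primary store to its container ID.
func rebuildCNICache(primary []iface) map[podKey]string {
	cache := make(map[podKey]string, len(primary))
	for _, p := range primary {
		cache[p.Pod] = p.ContainerID
	}
	return cache
}

// stalePods returns, in first-seen order and without duplicates, the Pods that
// own secondary interfaces but are absent from the rebuilt cache. Each returned
// Pod is a candidate for a deletion event.
func stalePods(cache map[podKey]string, secondary []iface) []podKey {
	seen := make(map[podKey]bool)
	var stale []podKey
	for _, s := range secondary {
		if _, ok := cache[s.Pod]; ok || seen[s.Pod] {
			continue
		}
		seen[s.Pod] = true
		stale = append(stale, s.Pod)
	}
	return stale
}
```

With Pod `default/a` in both stores and Pod `default/b` only in the secondary store, `stalePods` reports `default/b` once, even if that Pod has several secondary interfaces.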

@KMAnju-2021 KMAnju-2021 Jan 6, 2025

@jianjuns, do you mean that we should re-construct cniCache and post add events using the primary interface store in https://github.com/antrea-io/antrea/blob/main/pkg/agent/cniserver/pod_configuration.go#L443, and then post a delete event for each Pod in the secondary interface store (after its restoration) in the reconcileSecondaryInterfaces() function?

Hmm, I think we should re-construct cniCache using the information in the primary interface store (after its restoration), and then post a deletion event for each Pod that is in the secondary interface store but not in cniCache / the primary interface store. @wenyingd: thoughts?

Honestly, it is not easy to sync the primary interface store with the secondary interface store, since the two stores are independent and map to different OVS bridges (OVSDB configurations).

If my understanding is correct, we also record the Pod's Namespace and Name and the container ID in the secondary OVSDB, so the secondary network interface store restore logic should already have loaded the necessary configurations after an agent restart. We can then use these OVSDB configurations to restore the cniCache and diff them against the latest Pods from the apiserver to remove stale data. That way we remove the dependency on the primary interface store and don't need to re-send the PodUpdate event in the restore stage. @KMAnju-2021 can help confirm and correct me.
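
A minimal sketch of this alternative, under stated assumptions: the external ID key names, `cacheEntry`, `entryFromExternalIDs`, and `splitByLivePods` are all illustrative and may not match what Antrea actually writes to the secondary OVSDB. It rebuilds one cache entry per port from the port's external IDs, then diffs the entries against the set of Pods currently known to the apiserver.

```go
package main

import "fmt"

// Hypothetical external ID keys; the keys Antrea really uses may differ.
const (
	keyContainerID  = "container-id"
	keyPodName      = "pod-name"
	keyPodNamespace = "pod-namespace"
)

// cacheEntry is a simplified stand-in for one cniCache record.
type cacheEntry struct {
	PodNamespace, PodName, ContainerID string
}

// entryFromExternalIDs rebuilds a cache entry from one port's external IDs and
// rejects ports that are missing any of the three fields.
func entryFromExternalIDs(ids map[string]string) (cacheEntry, error) {
	e := cacheEntry{ids[keyPodNamespace], ids[keyPodName], ids[keyContainerID]}
	if e.PodNamespace == "" || e.PodName == "" || e.ContainerID == "" {
		return cacheEntry{}, fmt.Errorf("incomplete external IDs: %v", ids)
	}
	return e, nil
}

// splitByLivePods separates restored entries into those whose Pod still exists
// (keyed as "namespace/name") and stale ones that should be cleaned up.
func splitByLivePods(entries []cacheEntry, livePods map[string]bool) (kept, stale []cacheEntry) {
	for _, e := range entries {
		if livePods[e.PodNamespace+"/"+e.PodName] {
			kept = append(kept, e)
		} else {
			stale = append(stale, e)
		}
	}
	return kept, stale
}
```

Note that this diff is keyed by Pod name only, so it trusts whatever container ID was stored with the port.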

In the normal workflow, cniCache is built from updates of the primary interfaces, not from the secondary interfaces. In particular, the ContainerID values saved in the secondary OVSDB may be stale. From the primary InterfaceStore, could we guarantee that the restored ContainerIDs are correct after reconciliation? @wenyingd

@wenyingd wenyingd Jan 7, 2025

From the primary InterfaceStore, could we guarantee that the restored ContainerIDs are correct after reconciliation?

No. The reconcile logic of the primary interface store does not cover the case where the Pod's namespaced name still exists in the apiserver but the sandbox container ID was updated during Antrea's downtime. We just load the known interfaces from OVSDB, which were written when the CmdAdd call from kubelet was received, and check whether the referenced Pod still exists in the apiserver. If not, we remove the interfaces from OVS.

I can't think of a workflow that would produce this case for the primary interface restore logic, i.e. the Pod's namespaced name still exists but the sandbox container ID has changed during agent downtime. For CNI, the single source of truth is the CmdAdd event sent from kubelet. If a Pod with the same name has a different sandbox container, the Pod must have restarted. But while the agent is down, kubelet cannot successfully send the CmdAdd call, so cniserver has no chance to get the new container ID with the existing logic.

@KMAnju-2021 KMAnju-2021 force-pushed the secondary-network-port branch 3 times, most recently from 7904bea to 20ca6f4 Compare January 3, 2025 10:33
@KMAnju-2021 KMAnju-2021 force-pushed the secondary-network-port branch 3 times, most recently from 5e2b8e1 to 5e22752 Compare January 5, 2025 20:16
@luolanzone
Contributor

@KMAnju-2021 I didn't see any new tests for this patch. Re-posting Antonin's comment:

new tests are needed under test/e2e-secondary-network to validate this patch

@KMAnju-2021 KMAnju-2021 force-pushed the secondary-network-port branch from 5e22752 to dd90f3c Compare January 16, 2025 10:26
@KMAnju-2021 KMAnju-2021 force-pushed the secondary-network-port branch from dd90f3c to e61f080 Compare January 20, 2025 11:35
@KMAnju-2021 KMAnju-2021 force-pushed the secondary-network-port branch from e61f080 to d0a9c32 Compare January 20, 2025 11:37

err = pc.initializeSecondaryInterfaceStore()
if err != nil {
klog.ErrorS(err, "Failed to initialize the secondary bridge interface store")

We should return the error and fail the agent startup.

secondary bridge interface store -> secondary interface store

}

if err := pc.reconcileSecondaryInterfaces(pIfaceStore); err != nil {
klog.ErrorS(err, "Failed to restore the cniCache")

We should return the error and fail the agent startup.

restore the cniCache -> restore CNI cache and reconcile secondary interfaces
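
Both "fail the agent startup" suggestions amount to propagating the error instead of only logging it. A minimal sketch, assuming a hypothetical `podController` whose two steps are injected as function fields (these are not the PR's real signatures), with `errSimulated` as a made-up failure for trying it out:

```go
package main

import (
	"errors"
	"fmt"
)

// errSimulated is a sample failure used only to exercise this sketch.
var errSimulated = errors.New("OVSDB connection lost")

// podController is a hypothetical stand-in; the function fields simulate the
// store initialization and reconciliation steps.
type podController struct {
	initStore func() error
	reconcile func() error
}

// start returns the first failure, wrapped with context, instead of only
// logging it, so the caller can abort agent startup.
func (pc *podController) start() error {
	if err := pc.initStore(); err != nil {
		return fmt.Errorf("failed to initialize the secondary interface store: %w", err)
	}
	if err := pc.reconcile(); err != nil {
		return fmt.Errorf("failed to restore CNI cache and reconcile secondary interfaces: %w", err)
	}
	return nil
}
```

Because the errors are wrapped with `%w`, a caller can still match the underlying cause with `errors.Is(err, errSimulated)`.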

return err
}
} else {

Can this change be reverted?

podNamespace := interfaceConfig.PodNamespace
klog.V(1).InfoS("Deleting secondary interface",
"Pod", klog.KRef(podNamespace, podName), "interface", interfaceConfig.IFDev)
// deleteInterfaceAndReleaseResources to delete interface and release resources

to delete interface and release -> deletes a secondary interface and releases

// secondaryInterfaces is the list of interfaces currently in the secondary local cache and delete ports not in the CNI cache.
secondaryInterfaces := pc.interfaceStore.GetInterfacesByType(interfacestore.ContainerInterface)
for _, containerConfig := range secondaryInterfaces {
containerID := containerConfig.ContainerID

It does not seem necessary to add a new temporary variable.

containerID := containerConfig.ContainerID
_, exists := pIfaceStore.GetContainerInterface(containerID)
if !exists || containerConfig.OFPort == -1 {
pc.interfaceStore.DeleteInterface(containerConfig)

Doesn't removeInterfaces() itself try to get the interface from the store (and then delete the interface from the store)?

_, exists := pIfaceStore.GetContainerInterface(containerID)
if !exists || containerConfig.OFPort == -1 {
pc.interfaceStore.DeleteInterface(containerConfig)
err := pc.removeInterfaces(secondaryInterfaces, containerConfig)

Why not add all interfaces to be deleted to a list, and call removeInterfaces() once to remove them together? Then there is no need to change removeInterfaces().
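
A minimal sketch of the batching idea. `ifaceConfig`, `selectStale`, and `reconcileSecondary` are hypothetical, and `removeBatch` stands in for an unchanged removeInterfaces(); the selection condition mirrors the one in the diff above (`!exists || OFPort == -1`).

```go
package main

// ifaceConfig is a simplified stand-in for an interface store entry.
type ifaceConfig struct {
	ContainerID string
	IFDev       string
	OFPort      int32
}

// selectStale gathers every secondary interface whose container is unknown to
// the primary store or whose OVS port is invalid (OFPort == -1).
func selectStale(secondary []ifaceConfig, inPrimary func(containerID string) bool) []ifaceConfig {
	var stale []ifaceConfig
	for _, c := range secondary {
		if !inPrimary(c.ContainerID) || c.OFPort == -1 {
			stale = append(stale, c)
		}
	}
	return stale
}

// reconcileSecondary collects all stale interfaces first and then hands them
// to removeBatch in a single call, leaving the store untouched until then.
func reconcileSecondary(secondary []ifaceConfig, inPrimary func(string) bool,
	removeBatch func([]ifaceConfig) error) error {
	stale := selectStale(secondary, inPrimary)
	if len(stale) == 0 {
		return nil
	}
	return removeBatch(stale)
}
```

Collecting first also avoids deleting an entry from the store before the removal helper has had a chance to look it up.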

@@ -680,6 +680,18 @@ func run(o *Options) error {
return err
}

var secondaryNetworkController *secondarynetwork.Controller
if features.DefaultFeatureGate.Enabled(features.SecondaryNetwork) {

How about moving this block to line 629 to be closer to the CNIServer init? Please also add a comment saying that the secondary network Controller should be created before CNIServer.Run() to make sure no Pod CNI updates are missed.
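
The ordering concern can be shown with a toy event source. This sketch assumes the source does not replay past events to late subscribers; `notifier` is hypothetical and much simpler than whatever mechanism the CNI server actually uses to publish Pod updates.

```go
package main

// notifier is a hypothetical event source that delivers each event only to
// the subscribers registered at the time of the call.
type notifier struct {
	subscribers []func(pod string)
}

// subscribe registers a handler for all future events.
func (n *notifier) subscribe(fn func(pod string)) {
	n.subscribers = append(n.subscribers, fn)
}

// notify delivers the event to the current subscribers; nothing is stored or
// replayed for handlers that subscribe later.
func (n *notifier) notify(pod string) {
	for _, fn := range n.subscribers {
		fn(pod)
	}
}
```

If `notify` runs before `subscribe`, that event is lost to the late subscriber, which is the situation the requested code comment would warn about.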

@KMAnju-2021 KMAnju-2021 force-pushed the secondary-network-port branch 3 times, most recently from 8dac3a0 to af4d62a Compare January 24, 2025 09:19
@KMAnju-2021 KMAnju-2021 force-pushed the secondary-network-port branch from af4d62a to 7027718 Compare January 24, 2025 10:00
Successfully merging this pull request may close these issues.

SecondaryNetwork OVS ports cannot be deleted correctly after an Agent restart
6 participants