-
Notifications
You must be signed in to change notification settings - Fork 376
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Delete secondaryNetwork OVS ports correctly after an Agent restart #6853
base: main
Are you sure you want to change the base?
Conversation
63e028a
to
3b7abe8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new tests are needed under test/e2e-secondary-network
to validate this patch
3bf9b47
to
63d1231
Compare
klog.InfoS("The container interface has been deleted ", "container id", Config.ContainerID) | ||
return | ||
} | ||
event := types.PodUpdate{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this as desired to re-send the PodUpdate event after agent restart?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Humm.. I think we should re-construct cniCache using the information in the primary interface store (after its restoration), and then post a deletion event for each Pod in the secondary interface store, but not in cniCache / the primary interface store.
@wenyingd : thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jianjuns you mean to re-construct cniCache, post add event using primary interfacestore in https://github.com/antrea-io/antrea/blob/main/pkg/agent/cniserver/pod_configuration.go#L443, post delete event for each Pod in the secondary interfacestore( after its restoration) in reconcileSecondaryInterfaces() function ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Humm.. I think we should re-construct cniCache using the information in the primary interface store (after its restoration), and then post a deletion event for each Pod in the secondary interface store, but not in cniCache / the primary interface store. @wenyingd : thoughts?
Honestly, it is not easier to sync between the primary interface store and the secondary interface store, since the two interface store are independent, and mapping to different OVS bridges (OVSDB configurations).
If my understanding is correct, we have also configured the Pod's Namespace and Name, and the container ID in the secondary OVSDB, so we should already loaded the necessary configurations from the secondary network interface store restore logic after agent restart. Then we can use these OVSDB configurations to restore the cniCache and make diffs with the latest Pods from apiserver to remove the stale data, in this way we can remove t the dependency from the primary interface store, and don't need to re-send the PodUpdate event in restore stage. @KMAnju-2021 can help confirm and correct me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the normal workflow, cniCache is built with updates of the primary interfaces, not from the secondary interfaces. Esp. for the ContainerID field, the values saved in the secondary OVSDB may be stale. From the primary InterfaceStore, could we gurrantee the restored ContainerIDs are correct after reconciliation? @wenyingd
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the primary InterfaceStore, could we gurrantee the restored ContainerIDs are correct after reconciliation?
No. For the reconcile logic of the primary interface store, it does not cover the case that Pod's Namespaced name exists in the apiserver, but the sandbox container ID has updated during antrea's down time. We just load the known interfaces from OVSDB which were written when receiving CmdAdd call from kubelet, and check if the referred Pod still exists in apiserver. If no, remove the interfaces from OVS.
I can't think of the workflow to produce the case for the primary interface restore logic that Pod's Namespaced name still exists but the sandbox container ID has changed during agent downtime. For CNI case, the single source of the truth is the cmdAdd event sent from kubelet, For the case that a Pod with the same name has modified the sandbox container, it should be a Pod's restart. But if agent is down, kubelet should not successfully send the CmdAdd call. So cniserver should have no chance to get the new containerID from the existing logic.
7904bea
to
20ca6f4
Compare
klog.InfoS("The container interface has been deleted ", "container id", Config.ContainerID) | ||
return | ||
} | ||
event := types.PodUpdate{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Humm.. I think we should re-construct cniCache using the information in the primary interface store (after its restoration), and then post a deletion event for each Pod in the secondary interface store, but not in cniCache / the primary interface store.
@wenyingd : thoughts?
5e2b8e1
to
5e22752
Compare
@KMAnju-2021 I didn't see any new tests for the patch, re-post Antonin's comment:
|
5e22752
to
dd90f3c
Compare
dd90f3c
to
e61f080
Compare
e61f080
to
d0a9c32
Compare
|
||
err = pc.initializeSecondaryInterfaceStore() | ||
if err != nil { | ||
klog.ErrorS(err, "Failed to initialize the secondary bridge interface store") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should return the error and fail the agent startup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
secondary bridge interface store -> secondary interface store
} | ||
|
||
if err := pc.reconcileSecondaryInterfaces(pIfaceStore); err != nil { | ||
klog.ErrorS(err, "Failed to restore the cniCache") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should return the error and fail the agent startup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
restore the cniCache -> restore CNI cache and reconcile secondary interfaces
return err | ||
} | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change can be reverted?
podNamespace := interfaceConfig.PodNamespace | ||
klog.V(1).InfoS("Deleting secondary interface", | ||
"Pod", klog.KRef(podNamespace, podName), "interface", interfaceConfig.IFDev) | ||
// deleteInterfaceAndReleaseResources to delete interface and release resources |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to delete interface and release -> deletes a secondary interface and releases
// secondaryInterfaces is the list of interfaces currently in the secondary local cache and delete ports not in the CNI cache. | ||
secondaryInterfaces := pc.interfaceStore.GetInterfacesByType(interfacestore.ContainerInterface) | ||
for _, containerConfig := range secondaryInterfaces { | ||
containerID := containerConfig.ContainerID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems not necessary to add a new temporary variable?
containerID := containerConfig.ContainerID | ||
_, exists := pIfaceStore.GetContainerInterface(containerID) | ||
if !exists || containerConfig.OFPort == -1 { | ||
pc.interfaceStore.DeleteInterface(containerConfig) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removeInterfaces() will try to get the interface from the store (and will delete the interface from store)?
_, exists := pIfaceStore.GetContainerInterface(containerID) | ||
if !exists || containerConfig.OFPort == -1 { | ||
pc.interfaceStore.DeleteInterface(containerConfig) | ||
err := pc.removeInterfaces(secondaryInterfaces, containerConfig) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not add all interfaces to be deleted to a list, and call removeInterfaces() to remove them together? Then no need to change removeInterfaces()
cmd/antrea-agent/agent.go
Outdated
@@ -680,6 +680,18 @@ func run(o *Options) error { | |||
return err | |||
} | |||
|
|||
var secondaryNetworkController *secondarynetwork.Controller | |||
if features.DefaultFeatureGate.Enabled(features.SecondaryNetwork) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about moving this block to line 629 to be closer to CNIServer init? Please also add a comment saying secondary network Controller should be created before CNIServer.Run() to make sure no Pod CNI updates will be missed.
8dac3a0
to
af4d62a
Compare
Signed-off-by: KMAnju-2021 <[email protected]>
af4d62a
to
7027718
Compare
Closes: #6578
Store the SecondaryNetwork interface in interfacestore after Agent restart and delete secondaryNetwork OVS ports correctly.