Delay the initialization of ARP/NDP responders #6700
cmd/antrea-agent/agent.go
Outdated
if egressController != nil {
	go egressController.Run(stopCh)
}

if externalIPController != nil {
	go externalIPController.Run(stopCh)
}
My concern with changing the order is that it is a bit arbitrary, we can introduce new unexpected issues, and it limits what we can do in the future. For example, we could in the future want to introduce a dependency of `flowRestoreCompleteWait` on the realization of Egress policies. It would make sense: delay the removal of `flow-restore-wait` until Egress policy flows have been installed, in order to provide a more consistent datapath on (re)start. See #6342 for more context.
However, we know that there is already a dependency of SecondaryNetwork initialization on `flowRestoreCompleteWait`. This dependency is important and AFAIK cannot be broken. So with the change described above, we would end up with a circular dependency: `EgressController` before `flowRestoreCompleteWait` before SecondaryNetwork initialization before `EgressController`.
I would rather avoid "introducing" this new dependency (or rather enforcing this new dependency).
cc @tnqn for his opinion
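To make the circular-dependency concern concrete, here is a minimal sketch that models "A must start before B" as edges in a directed graph and checks for a cycle with a depth-first search. The node names are just the labels used in the comment above, and `hasCycle` and `currentOrdering` are hypothetical helpers, not code from the Antrea agent.

```go
package main

import "fmt"

// hasCycle reports whether the "must start before" graph contains a cycle.
// before[a] lists the components that may only start after a.
func hasCycle(before map[string][]string) bool {
	const (
		unvisited = iota
		inProgress
		done
	)
	state := make(map[string]int)
	var visit func(n string) bool
	visit = func(n string) bool {
		switch state[n] {
		case inProgress:
			return true // back edge: n is reachable from itself
		case done:
			return false
		}
		state[n] = inProgress
		for _, next := range before[n] {
			if visit(next) {
				return true
			}
		}
		state[n] = done
		return false
	}
	for n := range before {
		if visit(n) {
			return true
		}
	}
	return false
}

// currentOrdering holds only the dependency that already exists:
// flowRestoreCompleteWait before SecondaryNetwork initialization.
func currentOrdering() map[string][]string {
	return map[string][]string{
		"flowRestoreCompleteWait": {"SecondaryNetwork"},
	}
}

func main() {
	deps := currentOrdering()
	fmt.Println("cycle today:", hasCycle(deps))

	// Ordering enforced by the reordering under review (assumption for illustration).
	deps["SecondaryNetwork"] = append(deps["SecondaryNetwork"], "EgressController")
	// Hypothetical future dependency discussed in the comment.
	deps["EgressController"] = append(deps["EgressController"], "flowRestoreCompleteWait")
	fmt.Println("cycle with both additions:", hasCycle(deps))
}
```

Either added edge on its own leaves the graph acyclic; only the combination closes the loop, which is why enforcing the first one constrains what can be added later.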
I think this is a valid concern. We can check/watch for interface changes in the responders to avoid the hard dependencies. Waiting for Quan's insights.
Not introducing the dependency makes sense to me. Actually I'm considering something similar (check/watch for interface changes) to support #6547, for which we might add an `externalInterface` configuration. It could happen that not all nodes have the interface, and that is a valid case: users can select certain nodes as egress nodes, so raising an error because the interface doesn't exist on non-egress nodes doesn't make sense. If we can handle interface changes in the egress controller, it would solve two problems.
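As a rough illustration of handling interface changes instead of failing at startup, the sketch below binds lazily: a missing interface just means "not bound yet", and the responder reacts when a link event arrives. Only the `LinkExists` method name comes from the PR; `linkChecker`, `fakeLinks`, and `responder` are hypothetical names, and the socket handling is reduced to a boolean.

```go
package main

import "fmt"

// linkChecker is the subset of the detector used here. LinkExists matches the
// method name in the PR's Interface; everything else is hypothetical.
type linkChecker interface {
	LinkExists(linkName string) bool
}

// fakeLinks is a stand-in detector backed by a map.
type fakeLinks map[string]bool

func (f fakeLinks) LinkExists(name string) bool { return f[name] }

// responder binds lazily, so a node without the interface is not an error.
type responder struct {
	linkName string
	links    linkChecker
	bound    bool
}

// onLinkEvent is meant to be called for every update of the watched link.
func (r *responder) onLinkEvent(string) {
	exists := r.links.LinkExists(r.linkName)
	switch {
	case exists && !r.bound:
		r.bound = true // real code would open the socket here
	case !exists && r.bound:
		r.bound = false // real code would close the socket here
	}
}

func main() {
	links := fakeLinks{}
	r := &responder{linkName: "eth1", links: links}

	r.onLinkEvent("eth1") // interface absent: stays unbound, no error
	fmt.Println("bound:", r.bound)

	links["eth1"] = true // interface shows up later
	r.onLinkEvent("eth1")
	fmt.Println("bound:", r.bound)
}
```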
type Interface interface {
	LinkExists(linkName string) bool

	// Run starts the detector.
	Run(stopCh <-chan struct{})

	// AddEventHandler registers an eventHandler of link updates. It's not thread-safe and should be called before
	// starting the detector.
	AddEventHandler(handler LinkEventHandler, linkName ...string)

	// HasSynced returns true if the cache has been initialized with the existing links.
	HasSynced() bool
}
@tnqn @antoninbas I have added this to the 2.2 release log. Can you prioritize this PR's review as well? We should probably include it, considering it's a bug fix. Let me know if you have a different view. Thanks.
cmd/antrea-agent/agent.go
Outdated
if linkDetector != nil {
	go linkDetector.Run(stopCh)
}
Could we start the link detector before the controllers that consume it (Egress / ServiceExternalIP)? That would make more sense to me.
The motivation is to share the detector instance between the ServiceExternalIP and Egress controllers. Otherwise we may need to create separate instances for the two controllers.
I was just suggesting calling `go linkDetector.Run(stopCh)` before we call `go externalIPController.Run(stopCh)` and `go egressController.Run(stopCh)`. It should not impact the ability to share the instance across both controllers?
// remove this when https://github.com/mdlayher/ndp/pull/32 gets merged
replace github.com/mdlayher/ndp => github.com/xliuxu/ndp v0.0.0-20240926134643-8cf547505092
I assume that the ndp patch is necessary for this change?
I see that you didn't hear back from the maintainer on this. If you can find him in the K8s Slack workspace, it may be worth sending him a DM.
I will try to ping Matt.
	// AddEventHandler registers an eventHandler of link updates. It's not thread-safe and should be called before
	// starting the detector.
	AddEventHandler(handler LinkEventHandler, linkName ...string)
nit: s/linkName/linkNames
	// AddEventHandler registers an eventHandler of link updates. It's not thread-safe and should be called before
	// starting the detector.
The description should include that if no link name is provided, the handler will be called for all link updates.
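A minimal sketch of the behaviour this comment asks to document: a variadic registration where passing no link names subscribes the handler to every link update. The `LinkEventHandler` signature, the `registry` type, the `allLinks` sentinel key, and `notify` are assumptions made for illustration and are not taken from the PR's implementation.

```go
package main

import "fmt"

// LinkEventHandler mirrors the handler type name used in the PR; its exact
// signature is assumed here.
type LinkEventHandler func(linkName string)

// allLinks is a hypothetical sentinel key for handlers registered without
// names. Real interface names are never empty, so it cannot collide.
const allLinks = ""

type registry struct {
	handlers map[string][]LinkEventHandler
}

func newRegistry() *registry {
	return &registry{handlers: make(map[string][]LinkEventHandler)}
}

// AddEventHandler registers handler for the given link names. If no link name
// is provided, the handler is called for all link updates.
func (r *registry) AddEventHandler(handler LinkEventHandler, linkNames ...string) {
	if len(linkNames) == 0 {
		r.handlers[allLinks] = append(r.handlers[allLinks], handler)
		return
	}
	for _, name := range linkNames {
		r.handlers[name] = append(r.handlers[name], handler)
	}
}

// notify dispatches an update for linkName to the name-specific handlers
// first, then to the catch-all handlers.
func (r *registry) notify(linkName string) {
	for _, h := range r.handlers[linkName] {
		h(linkName)
	}
	for _, h := range r.handlers[allLinks] {
		h(linkName)
	}
}

func main() {
	r := newRegistry()
	r.AddEventHandler(func(n string) { fmt.Println("all links handler:", n) })
	r.AddEventHandler(func(n string) { fmt.Println("eth1 handler:", n) }, "eth1")
	r.notify("eth0")
	r.notify("eth1")
}
```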
	return d.cacheSynced
}

func (d *linkDetector) AddEventHandler(handler LinkEventHandler, linkName ...string) {
ditto: s/linkName/linkNames
func NewLinkDetector() *linkDetector {
	return &linkDetector{
		eventHandlers: make(map[string][]LinkEventHandler),
		linkNames:     make(set.Set[string]),
It's better to use https://pkg.go.dev/k8s.io/apimachinery/pkg/util/sets#New, as `make` relies on the implementation detail that Set is just a built-in `map`.
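The point about `make` can be illustrated without pulling in the real package. The sketch below defines a local generic `Set` with a `New` constructor; it is a hypothetical stand-in written for this example, not the `k8s.io/apimachinery/pkg/util/sets` implementation. Callers that go through `New` keep compiling if the representation ever stops being a plain map, while `make(Set[string])` would not.

```go
package main

import "fmt"

// Set is a local stand-in for a generic set type. It happens to be a map,
// but callers should not depend on that.
type Set[T comparable] map[T]struct{}

// New returns a Set containing items; the representation stays hidden
// behind the constructor.
func New[T comparable](items ...T) Set[T] {
	s := make(Set[T], len(items))
	s.Insert(items...)
	return s
}

// Insert adds items to the set.
func (s Set[T]) Insert(items ...T) {
	for _, item := range items {
		s[item] = struct{}{}
	}
}

// Has reports whether item is in the set.
func (s Set[T]) Has(item T) bool {
	_, ok := s[item]
	return ok
}

// Len returns the number of items in the set.
func (s Set[T]) Len() int { return len(s) }

func main() {
	// Preferred: construct through New rather than make(Set[string]).
	linkNames := New[string]()
	linkNames.Insert("eth0", "eth1")
	fmt.Println(linkNames.Has("eth0"), linkNames.Has("eth2"), linkNames.Len())
}
```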
type linkEventHandler struct {
	watchLinkNames     []string
	receivedEvents     []string
	lock               sync.Mutex
	expectedLinkEvents []string
}

func (l *linkEventHandler) onLinkEvent(linkName string) {
	l.lock.Lock()
	defer l.lock.Unlock()
	l.receivedEvents = append(l.receivedEvents, linkName)
}
I would recommend adding a `getReceivedEvents` method, so that the test cases don't need to worry about having to lock the mutex correctly. `getReceivedEvents` should lock the mutex and return a copy of the slice.
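One possible shape for the suggested accessor, as a sketch: the struct is trimmed to the two fields that matter here, and the method body is an assumption about how the suggestion could be implemented, not the code that was eventually merged.

```go
package main

import (
	"fmt"
	"sync"
)

// linkEventHandler is reduced to the fields relevant to this example.
type linkEventHandler struct {
	lock           sync.Mutex
	receivedEvents []string
}

func (l *linkEventHandler) onLinkEvent(linkName string) {
	l.lock.Lock()
	defer l.lock.Unlock()
	l.receivedEvents = append(l.receivedEvents, linkName)
}

// getReceivedEvents locks the mutex and returns a copy of the slice, so
// callers can inspect it without holding the lock or racing with appends.
func (l *linkEventHandler) getReceivedEvents() []string {
	l.lock.Lock()
	defer l.lock.Unlock()
	events := make([]string, len(l.receivedEvents))
	copy(events, l.receivedEvents)
	return events
}

func main() {
	h := &linkEventHandler{}
	var wg sync.WaitGroup
	for _, name := range []string{"eth0", "eth1", "eth2"} {
		wg.Add(1)
		go func(n string) {
			defer wg.Done()
			h.onLinkEvent(n)
		}(name)
	}
	wg.Wait()
	fmt.Println("events received:", len(h.getReceivedEvents()))
}
```

Because the returned slice is a copy, a test can compare or even modify it freely while the detector goroutine keeps appending to the original.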
for _, l := range tt.eventHandlers {
	l.lock.Lock()
	defer l.lock.Unlock()
	if !reflect.DeepEqual(l.expectedLinkEvents, l.receivedEvents) {
Use `assert.EventuallyWithT` and use a testify assertion instead of `reflect.DeepEqual` directly. You can then also remove the `t.Logf` call.
for _, tt := range tests {
	t.Run(tt.name, func(t *testing.T) {
		ctrl := gomock.NewController(t)
		defer ctrl.Finish()
this is no longer required with the gomock version that we use
var (
	// map of transportInterfaceName to ARP responder
	arpResponders = make(map[string]*arpResponder)
	// map of transportInterfaceName to NDP responder
	ndpResponders = make(map[string]*ndpResponder)
)

// NewARPResponder creates a new ARP responder if it does not exist for the given transportInterfaceName.
// This function is not thread-safe.
func NewARPResponder(transportInterfaceName string, linkDetector linkdetector.Interface) *arpResponder {
it feels like this is independent of this change itself
I see this in the PR description:
NDP responder may fail to bind to the new interface due to the Duplicate Address Detection process
I assume it is related, but I am still not clear what the issue was
I do agree that having multiple ARP responders for the same interface seems unnecessary, but was it actually an issue?
This change is not described in the description. Having multiple responders will not cause issues.
The change for "NDP responder may fail to bind to the new interface due to the Duplicate Address Detection process" is to wrap the binding/accepting in `wait.Until` so that binding errors are retried.
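The retry pattern described here can be sketched with the standard library alone. `runUntil` below is a simplified stand-in for the loop shape of `wait.Until` (call a function, wait a period, repeat until the stop channel closes); it is not the Kubernetes implementation, and `bindWithRetry`, `failingBind`, and `retryPeriod` are hypothetical names used only for this example.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryPeriod is an arbitrary interval chosen for the example.
const retryPeriod = 5 * time.Millisecond

// runUntil calls f, waits period, and repeats until stopCh is closed.
func runUntil(f func(), period time.Duration, stopCh <-chan struct{}) {
	for {
		select {
		case <-stopCh:
			return
		default:
		}
		f()
		select {
		case <-stopCh:
			return
		case <-time.After(period):
		}
	}
}

// bindWithRetry keeps calling bind until it succeeds and returns the number
// of attempts. A real responder would go on to serve requests after binding
// and fall back into the retry loop if serving fails.
func bindWithRetry(bind func() error, period time.Duration) int {
	attempts := 0
	stopCh := make(chan struct{})
	runUntil(func() {
		attempts++
		if err := bind(); err != nil {
			return // try again after period
		}
		close(stopCh) // bound successfully: stop retrying
	}, period, stopCh)
	return attempts
}

// failingBind returns a bind function that fails the first n times, standing
// in for an address that is not yet usable while DAD is in progress.
func failingBind(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return errors.New("bind: address not available yet")
		}
		return nil
	}
}

func main() {
	fmt.Println("attempts:", bindWithRetry(failingBind(2), retryPeriod))
}
```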
I started reviewing the code. It may take a few days to get it merged based on the size. One concern right now is that it depends on an upstream change to the ndp library, and it seems that @xliuxu hasn't heard back from the maintainer in a while. We could make a temporary fork of ndp under the antrea-io organization; that would be slightly better than relying on a personal fork.
For secondary-network scenarios, the transport interface can be changed after the agent is started. The ARP/NDP responders should be started after the initialization of secondary-network to bind to the transport interface of the new index.

Besides, this change also addresses the following issues:
- NDP responder may fail to bind to the new interface due to the Duplicate Address Detection process.
- Golang caches the zone index for the interface, which may result in NDP responder binding on the stale interface

Fixes: antrea-io#6623
Signed-off-by: Xu Liu <[email protected]>