
[caclmgrd] Prevent unnecessary iptables updates #5312

Merged
jleveque merged 6 commits into sonic-net:master from caclmgrd_threaded on Oct 19, 2020

Conversation

jleveque (Contributor) commented Sep 3, 2020

- Why I did it

When a large number of changes occur to the ACL table of Config DB, caclmgrd will get flooded with notifications, and previously, it would regenerate and apply the iptables rules for each change, which is unnecessary, as the iptables rules should only get applied once after the last change notification is received. If the ACL table contains a large number of control plane ACL rules, this could cause a large delay in caclmgrd getting the rules applied.

This patch causes caclmgrd to delay updating the iptables rules until it has not received a change notification for at least 0.5 seconds.

Resolves #5275

- How I did it

When an ACL table change notification is received for a particular namespace, we spawn an update thread for that namespace. The thread sits in a loop: it sleeps 0.5 seconds, then checks whether any new change notifications were received while it was asleep. If new notifications were received, it goes back to sleep and repeats the process until a full interval passes with no new notifications. It then generates and applies the iptables rules once.
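
Conceptually, the per-namespace debouncing looks roughly like the sketch below (illustrative only, not the actual caclmgrd code; the class/method names, the "spawn only if not already running" check, and the locking are simplifications/assumptions):

    import threading
    import time

    class ControlPlaneAclManager(object):
        UPDATE_DELAY_SECS = 0.5

        def __init__(self):
            self.num_changes = {}     # namespace -> count of change notifications received
            self.update_thread = {}   # namespace -> currently running update thread, if any
            self.lock = threading.Lock()

        def handle_acl_table_notification(self, namespace):
            # Called once per ACL table change notification from Config DB
            with self.lock:
                self.num_changes[namespace] = self.num_changes.get(namespace, 0) + 1
                thread = self.update_thread.get(namespace)
                if thread is None or not thread.is_alive():
                    thread = threading.Thread(target=self.delayed_update, args=(namespace,))
                    self.update_thread[namespace] = thread
                    thread.start()

        def delayed_update(self, namespace):
            # Keep sleeping until a full interval passes with no new notifications
            num_changes = 0
            while True:
                time.sleep(self.UPDATE_DELAY_SECS)
                with self.lock:
                    if self.num_changes[namespace] > num_changes:
                        # More changes arrived while asleep; note the new count and wait again
                        num_changes = self.num_changes[namespace]
                        continue
                    self.num_changes[namespace] = 0
                    break
            # Generate and apply the iptables rules exactly once
            self.update_control_plane_acls(namespace)

        def update_control_plane_acls(self, namespace):
            pass   # placeholder: build and apply the iptables rules for this namespace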

- How to verify it

Load a large set of control plane ACL rules (see #5275 for an example)
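
For example, a ruleset of that size can be generated with a small script like this (a sketch; the table name, service, and addresses are arbitrary, and the output follows the standard config_db.json ACL_TABLE / ACL_RULE format for control plane ("CTRLPLANE") ACLs):

    import json

    NUM_RULES = 350   # roughly the scale reported in issue #5275

    config = {
        "ACL_TABLE": {
            "SSH_ONLY": {                     # arbitrary table name for this test
                "policy_desc": "SSH_ONLY",
                "type": "CTRLPLANE",
                "services": ["SSH"]
            }
        },
        "ACL_RULE": {}
    }

    for i in range(1, NUM_RULES + 1):
        config["ACL_RULE"]["SSH_ONLY|RULE_{}".format(i)] = {
            "PRIORITY": str(10000 - i),
            "SRC_IP": "10.0.{}.{}/32".format(i // 256, i % 256),
            "PACKET_ACTION": "ACCEPT"
        }

    with open("cacl_test_rules.json", "w") as f:
        json.dump(config, f, indent=4)

    # Load the file on the switch (e.g. "sudo config load cacl_test_rules.json -y") and
    # time how long it takes for the rules to show up in "sudo iptables -L INPUT -n".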

- Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006

This PR will not cleanly cherry-pick backwards. Separate PRs will be necessary.

jleveque (Contributor, Author) commented Sep 3, 2020

@Ezrickd: Please test this patch and verify that it resolves #5275

Ezrickd commented Sep 10, 2020

Hi Joe,
I have tested the new caclmgrd. It creates the additional rules as expected, but it takes ~45 seconds for the rules to be applied. The issue of the fluctuating numbers is indeed solved, but installing 360 rules takes 45 seconds...

When I run the old caclmgrd that I verified against, the install takes ~5 seconds for 345 rules.

Where is the long delay introduced? This creates a security issue, since every time you run config reload the switch is left vulnerable for many seconds.

I supplied the old caclmgrd config that I am using today when I submitted the issue.

Please advise.

/Ezrick

Ezrickd commented Sep 13, 2020

Hi Joe,
Here is the device's full config_db.json file. Thank you for the help.
/Ezrick

4700-2-config_db.zip

jleveque marked this pull request as ready for review September 15, 2020 20:28

jleveque (Contributor, Author) commented:
Retest vsimage please

liat-grozovik (Collaborator) left a comment

The fix has a major performance issue, as can be seen from the comments by Ezrickd. We don't know in which version this started, but as can be seen, both the working and non-working versions of the ACL manager have been provided.

The solution should be something a user can run in a reasonable time; 45 seconds for 350 ACLs is not something we should have.

jleveque (Contributor, Author) commented:
@liat-grozovik: According to @Ezrickd, this patch helps eliminate the fluctuation in the time. The fix prevents unnecessary iptables updates, so I believe it should be merged. However, there is apparently another underlying cause for the delay that @Ezrickd reported. It appears that notifications are "trickling in" to the application from the underlying Redis connection (~10 per second), rather than being received in one large batch. The root cause of the delay appears to be at a lower layer.

lguohan (Collaborator) commented Oct 11, 2020

It appears that notifications are "trickling in" to the application from the underlying Redis connection (~10 per second), rather than being received in one large batch. The root cause of the delay appears to be at a lower layer.

do we have this issue opened somewhere?

jleveque (Contributor, Author) commented:
@lguohan: I have opened an issue here: sonic-net/sonic-swss-common#393

# A brief sleep appears necessary in this loop or any spawned
# update threads will get stuck. Appears to be due to the sel.select() call.
# TODO: Eliminate the need for this sleep.
time.sleep(0.1)

A reviewer (Contributor) commented on the code above:

@jleveque: Do we still need this? Can we update SELECT_TIMEOUT_MS to take care of this sleep?

jleveque (Contributor, Author) replied Oct 17, 2020

I cannot explain the root cause of the need for this sleep; I wish I could. Adjusting SELECT_TIMEOUT_MS doesn't make a difference. It seems to be related to some internal Python context switching in conjunction with the sel.select() call: if we don't sleep a tiny bit here, any spawned update threads make no progress. It's very strange. When I commented out the sel.select() call and mocked it up, the problem subsided, which is why I suspect the sel.select() call is somehow triggering the problem. I was unable to find any similar complaints on the web. Eliminating this sleep will require further, deeper investigation.
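
For reference, the loop in question has roughly this shape (a simplified sketch of a swsscommon select loop, not the actual caclmgrd code):

    from swsscommon import swsscommon
    import time

    SELECT_TIMEOUT_MS = 1000

    sel = swsscommon.Select()
    # A SubscriberStateTable per namespace is assumed to have been added
    # to the selector already via sel.addSelectable(...)

    while True:
        (state, selectable) = sel.select(SELECT_TIMEOUT_MS)

        if state == swsscommon.Select.TIMEOUT:
            continue
        if state != swsscommon.Select.OBJECT:
            continue

        # ...pop the ACL change notification, bump self.num_changes[namespace],
        # and spawn an update thread for that namespace if one is not running...

        # Without this brief sleep, the spawned update threads do not make
        # progress; the effect appears to be tied to the sel.select() call above.
        time.sleep(0.1)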

.format(namespace, new_changes, self.UPDATE_DELAY_SECS))
num_changes = self.num_changes[namespace]
else:
if num_changes == self.num_changes[namespace] and num_changes > 0:

A reviewer (Contributor) commented on the code above:

@jleveque: Do we need the num_changes > 0 check?

jleveque (Contributor, Author) replied Oct 17, 2020

It's basically a safeguard in case num_changes == 0 and self.num_changes[namespace] == 0. We shouldn't get here in that situation, but if we do, we will hit the else case and log an error message, which could be helpful for debugging.
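
Put differently, the safeguard amounts to something like this (a sketch of the branch in question; log_error stands in for the daemon's logging helper):

    if self.num_changes[namespace] > num_changes:
        # New notifications arrived during the sleep; remember the count and wait again
        num_changes = self.num_changes[namespace]
    else:
        if num_changes == self.num_changes[namespace] and num_changes > 0:
            # A full interval passed with no new notifications: apply the rules now
            self.update_control_plane_acls(namespace)
            break
        else:
            # num_changes should never be 0 at this point; if it is, log it to aid debugging
            log_error("Unexpected state in update thread for namespace '{}': num_changes == 0"
                      .format(namespace))
            break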

abdosi (Contributor) left a comment

LGTM

jleveque merged commit edf4971 into sonic-net:master on Oct 19, 2020
jleveque deleted the caclmgrd_threaded branch on October 19, 2020 18:11
abdosi pushed a commit that referenced this pull request Oct 21, 2020
santhosh-kt pushed a commit to santhosh-kt/sonic-buildimage that referenced this pull request Feb 25, 2021

Successfully merging this pull request may close these issues.

COPP with ~350 rules take more than 10 min to install at iptables