Avoid unnecessary DNS queries for FQDN rule of NetworkPolicy #6200

Merged on Apr 10, 2024

tnqn (Member) commented Apr 8, 2024

The agent maintains a DNS cache for FQDNs that match any FQDN rule of a NetworkPolicy in two ways, one of which is to query these FQDNs periodically. However, the current implementation made more queries than needed, for two reasons:

  1. addFQDNRule() always triggered a DNS query immediately, even if the FQDN was already tracked, and podReconciler called the function every time the rule was updated. As a result, whenever an FQDN's resolution was changed by a proactive query or a packet-in event, the FQDN was immediately queried again.
  2. The pre-query timestamp was used as the base when calculating the expiration time of proactively queried records and the delay before the next query. The next query was therefore likely to return the same record with a very small TTL, which eventually led to another unnecessary query. This was also inconsistent with the base timestamp used for records received from packet-in events.

The patch makes the following improvements:

  1. Do not trigger an immediate query if the FQDN added by addFQDNRule() is already tracked.
  2. Do not traverse dnsEntryCache when processing an FQDN matching name in addFQDNSelector().
  3. Consistently use the post-response timestamp as the base of the expiration time.
  4. Consolidate the logs about DNS query failures and make them clearer.
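
The first improvement can be sketched as follows. This is a minimal illustration of the idea and not the code in this patch: `fqdnTracker`, `newFQDNTracker`, and `addRule` are hypothetical names, and the real controller keeps considerably more state per FQDN.

```go
package main

import "fmt"

// fqdnTracker is a hypothetical stand-in for the agent's FQDN bookkeeping.
// It maps each tracked FQDN to the set of rule IDs that reference it.
type fqdnTracker struct {
	tracked map[string]map[string]struct{}
}

func newFQDNTracker() *fqdnTracker {
	return &fqdnTracker{tracked: map[string]map[string]struct{}{}}
}

// addRule associates ruleID with every given FQDN and returns only the FQDNs
// that were not tracked before. Only those need an immediate DNS query;
// already tracked FQDNs are refreshed by the periodic query instead.
func (t *fqdnTracker) addRule(ruleID string, fqdns []string) []string {
	var toQuery []string
	for _, fqdn := range fqdns {
		rules, exists := t.tracked[fqdn]
		if !exists {
			rules = map[string]struct{}{}
			t.tracked[fqdn] = rules
			toQuery = append(toQuery, fqdn)
		}
		rules[ruleID] = struct{}{}
	}
	return toQuery
}

func main() {
	t := newFQDNTracker()
	fqdns := []string{"s3-1.amazonaws.com"}

	// First reconciliation: the FQDN is new, so it is queried immediately.
	fmt.Println(t.addRule("80d1db99cf0e001f", fqdns)) // [s3-1.amazonaws.com]

	// Later reconciliations of the same rule trigger no extra query.
	fmt.Println(t.addRule("80d1db99cf0e001f", fqdns)) // []
}
```

Returning the list of newly tracked FQDNs keeps the decision to query separate from the bookkeeping, so repeated reconciliations of the same rule stay cheap.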

Before:

I0408 16:26:14.259064       1 fqdn.go:696] "Making DNS request" fqdn="s3-1.amazonaws.com" dnsServer="10.96.0.10:53"
E0408 16:26:16.259953       1 fqdn.go:706] "DNS exchange failed" err="read udp 172.18.0.3:55590->10.96.0.10:53: i/o timeout"
E0408 16:26:16.261626       1 fqdn.go:649] "Error syncing FQDN, retrying" err="DNS request failed for at least one of type A or AAAA queries" fqdn="s3-1.amazonaws.com"
I0408 16:26:21.262270       1 fqdn.go:696] "Making DNS request" fqdn="s3-1.amazonaws.com" dnsServer="10.96.0.10:53"
I0408 16:26:21.288231       1 fqdn.go:617] "Received DNS Packet with valid Answer" IPs={...} TTL=5
I0408 16:26:21.288324       1 fqdn.go:496] "Reconciling dirty rule for FQDN address updates" ruleID="80d1db99cf0e001f"
I0408 16:26:21.288454       1 pod_reconciler.go:298] "Reconciling Pod NetworkPolicy rule" rule="80d1db99cf0e001f" policy="AntreaClusterNetworkPolicy:acnp-fqdn-all-foobar"
I0408 16:26:21.288858       1 fqdn.go:696] "Making DNS request" fqdn="s3-1.amazonaws.com" dnsServer="10.96.0.10:53"
I0408 16:26:21.294220       1 fqdn.go:617] "Received DNS Packet with valid Answer" IPs={...} TTL=5
I0408 16:26:26.262911       1 fqdn.go:696] "Making DNS request" fqdn="s3-1.amazonaws.com" dnsServer="10.96.0.10:53"
I0408 16:26:26.264271       1 fqdn.go:617] "Received DNS Packet with valid Answer" IPs={...} TTL=1
I0408 16:26:27.263944       1 fqdn.go:696] "Making DNS request" fqdn="s3-1.amazonaws.com" dnsServer="10.96.0.10:53"
I0408 16:26:27.273226       1 fqdn.go:617] "Received DNS Packet with valid Answer" IPs={...} TTL=5
I0408 16:26:27.273327       1 fqdn.go:496] "Reconciling dirty rule for FQDN address updates" ruleID="80d1db99cf0e001f"
I0408 16:26:27.273448       1 pod_reconciler.go:298] "Reconciling Pod NetworkPolicy rule" rule="80d1db99cf0e001f" policy="AntreaClusterNetworkPolicy:acnp-fqdn-all-foobar"
I0408 16:26:27.273830       1 fqdn.go:696] "Making DNS request" fqdn="s3-1.amazonaws.com" dnsServer="10.96.0.10:53"
I0408 16:26:27.275036       1 fqdn.go:617] "Received DNS Packet with valid Answer" IPs={...} TTL=5
I0408 16:26:32.264899       1 fqdn.go:696] "Making DNS request" fqdn="s3-1.amazonaws.com" dnsServer="10.96.0.10:53"
I0408 16:26:32.269028       1 fqdn.go:617] "Received DNS Packet with valid Answer" IPs={...} TTL=5

After:

I0408 16:21:02.457593       1 fqdn.go:692] "Making DNS request" fqdn="s3-1.amazonaws.com" dnsServer="10.96.0.10:53"
E0408 16:21:04.458702       1 fqdn.go:650] "Error syncing FQDN, retrying" err="DNS request failed for IPv4: read udp 172.18.0.3:39008->10.96.0.10:53: i/o timeout" fqdn="s3-1.amazonaws.com"
I0408 16:21:09.460192       1 fqdn.go:692] "Making DNS request" fqdn="s3-1.amazonaws.com" dnsServer="10.96.0.10:53"
I0408 16:21:09.477573       1 fqdn.go:618] "Received DNS Packet with valid Answer" IPs={...} TTL=5
I0408 16:21:09.477708       1 fqdn.go:497] "Reconciling dirty rule for FQDN address updates" ruleID="80d1db99cf0e001f"
I0408 16:21:09.477855       1 pod_reconciler.go:298] "Reconciling Pod NetworkPolicy rule" rule="80d1db99cf0e001f" policy="AntreaClusterNetworkPolicy:acnp-fqdn-all-foobar"
I0408 16:21:14.478790       1 fqdn.go:692] "Making DNS request" fqdn="s3-1.amazonaws.com" dnsServer="10.96.0.10:53"
I0408 16:21:14.495826       1 fqdn.go:618] "Received DNS Packet with valid Answer" IPs={...} TTL=5
I0408 16:21:14.495974       1 fqdn.go:497] "Reconciling dirty rule for FQDN address updates" ruleID="80d1db99cf0e001f"
I0408 16:21:14.496073       1 pod_reconciler.go:298] "Reconciling Pod NetworkPolicy rule" rule="80d1db99cf0e001f" policy="AntreaClusterNetworkPolicy:acnp-fqdn-all-foobar"
I0408 16:21:19.497120       1 fqdn.go:692] "Making DNS request" fqdn="s3-1.amazonaws.com" dnsServer="10.96.0.10:53"
I0408 16:21:19.511311       1 fqdn.go:618] "Received DNS Packet with valid Answer" IPs={...} TTL=5
I0408 16:21:19.511426       1 fqdn.go:497] "Reconciling dirty rule for FQDN address updates" ruleID="80d1db99cf0e001f"
I0408 16:21:19.511506       1 pod_reconciler.go:298] "Reconciling Pod NetworkPolicy rule" rule="80d1db99cf0e001f" policy="AntreaClusterNetworkPolicy:acnp-fqdn-all-foobar"

Signed-off-by: Quan Tian <[email protected]>
tnqn added the area/network-policy and action/release-note labels Apr 8, 2024
tnqn added this to the Antrea v2.0 release milestone Apr 8, 2024
tnqn requested review from antoninbas and Dyanngg April 8, 2024 16:31

Dyanngg (Contributor) left a comment

LGTM, thanks for the enhancements. I checked the mustCacheResponse implementation, and I think the change would not make the controller miss sending a DNS request that it should send.

tnqn (Member, Author) commented Apr 9, 2024

Thanks @Dyanngg @antoninbas

/test-all

@tnqn tnqn merged commit 29f1d0c into antrea-io:main Apr 10, 2024
53 of 57 checks passed
@tnqn tnqn deleted the reduce-dns-query branch April 10, 2024 05:39