Update OVS pipeline document #5412
Conversation
Force-pushed from `47f1121` to `f660c7e`
Not all tables are finished yet.
Force-pushed from `f660c7e` to `611904f`
Not finished.
Took a quick look, but I feel like it will be easier to keep reviewing once initial comments from other reviewers have been addressed.
Force-pushed from `076fd8d` to `0226d1f`
Force-pushed from `b99a639` to `319b724`
Force-pushed from `46d475f` to `343a7e3`
I feel the explanations on flows are too detailed...
docs/design/ovs-pipeline.md
Outdated
The first flow outputs all unicast packets to the correct port (the port was
resolved by the "dmac" table, [L2ForwardingCalcTable]). IP packets for which
[L2ForwardingCalcTable] did not set bit 16 of NXM_NX_REG0 will be dropped.
- Flow 1 is to match new non-Service connections and commit them to the connection tracking module.
Why don't we use the "-dnat" state in the flows in this table?
We have a ct mark to distinguish Service packets. The ct mark is sufficient.
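The comment above refers to a dedicated ct mark bit rather than conntrack NAT state. As a purely illustrative sketch (the constant names are made up; only the `0x10/0x10` value/mask pair appears in the flows quoted later in this review), a masked ct_mark match works like this:

```python
# Illustrative sketch: how a single ct_mark bit can classify Service traffic.
# In the flows quoted in this PR, `set_field:0x10/0x10->ct_mark` commits a
# "Service" bit; later tables then only need a masked match on that bit.

SERVICE_CT_MARK = 0x10  # value: bit 4 set
SERVICE_CT_MASK = 0x10  # mask: only bit 4 is significant

def is_service_connection(ct_mark: int) -> bool:
    """Masked match, equivalent to OVS `ct_mark=0x10/0x10`."""
    return (ct_mark & SERVICE_CT_MASK) == SERVICE_CT_MARK

# A connection committed with the Service bit matches:
assert is_service_connection(0x10)
# Unrelated bits in ct_mark do not disturb the masked match:
assert is_service_connection(0x13)
assert not is_service_connection(0x03)
```

Because the match is masked, other ct_mark bits (e.g. the connection-source bits moved from NXM_NX_REG0) can coexist in the same field.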
LGTM overall, a few nits
docs/design/ovs-pipeline.md
Outdated
port by rewriting the destination MAC (from the Global Virtual MAC to the
local gateway's MAC).
Antrea-native NetworkPolicy relies on the OVS built-in `conjunction` action to implement policies efficiently. This
enables us to do a conjunctive match across multiple dimensions (source IP, destination IP, port) efficiently without
If so, it's better to add etc.
- enables us to do a conjunctive match across multiple dimensions (source IP, destination IP, port) efficiently without
+ enables us to do a conjunctive match across multiple dimensions (source IP, destination IP, port, etc.) efficiently without
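The semantics of a conjunctive match can be sketched in a few lines. This is a toy model, not Antrea's implementation: in OVS, a packet is treated as matching `conj_id=N` only if it matches at least one flow in every one of the conjunction's dimensions. The dimension contents below are invented for illustration:

```python
# Toy model of the OVS `conjunction(id, k/n)` action: the conjunctive match
# fires only when the packet matches at least one flow in EVERY dimension.
# The allowed sets below are made up for the sketch.

conj_dimensions = {
    1: {"10.10.0.24"},                # dimension 1: allowed source IPs
    2: {"10.10.1.6", "10.10.0.25"},   # dimension 2: allowed destination IPs
    3: {80, 8080},                    # dimension 3: allowed destination ports
}

def conjunction_matches(src_ip, dst_ip, dst_port):
    packet = {1: src_ip, 2: dst_ip, 3: dst_port}
    # All dimensions must be satisfied for the conjunction to fire.
    return all(packet[dim] in allowed for dim, allowed in conj_dimensions.items())

assert conjunction_matches("10.10.0.24", "10.10.1.6", 80)
assert not conjunction_matches("10.10.0.24", "10.10.1.6", 443)  # port dimension fails
```

The efficiency win is that each dimension needs only one flow per element, instead of one flow per element of the cross-product of all dimensions.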
Force-pushed from `c6f8ce7` to `fa194b1`
Resolves antrea-io#5200 Signed-off-by: Hongliang Liu <[email protected]>
Force-pushed from `fa194b1` to `9a2b76c`
Almost done with reviewing, lots of comments but all of them minor so far
docs/design/ovs-pipeline.md
Outdated
This ACNP is applied to all Pods with the label `app: web` in all Namespaces. It allows only HTTP ingress traffic on
port 8080 from Pods with the label `app: client`, limited to the `GET` method and `/api/v2/*` path. Any other HTTP
ingress traffic on port 8080 from Pods the label `app: client` will be dropped.
- ingress traffic on port 8080 from Pods the label `app: client` will be dropped.
+ ingress traffic on port 8080 from Pods with the label `app: client` will be dropped.
docs/design/ovs-pipeline.md
Outdated
Consider the following Egresses as examples for the remainder of this document.
This is an Egress applied to Pods with the label `app: web`. For these Pods, all egress traffic will be SNAT'd on the
Node `k8s-node-control-plane` from which we dumped flows in the document with the Egress IP `192.168.77.112`.
I am suggesting replacing the second sentence here with this for clarity:
For these Pods, all egress traffic (traffic leaving the cluster) will be SNAT'd on the Node `k8s-node-control-plane` using Egress IP `192.168.77.112`. In this context, `k8s-node-control-plane` is known as the "Egress Node" for this Egress resource. Note that the flows presented in the rest of this document were dumped on Node `k8s-node-control-plane`. Egress flows are different on the "source Node" (Node running a workload Pod to which the Egress resource is applied) and on the "Egress Node" (Node enforcing the SNAT policy).
docs/design/ovs-pipeline.md
Outdated
Flow 1 is designed for case 1, matching ARP request packets for the MAC address of a remote Antrea gateway with IP address
`10.10.1.1`. It programs an ARP reply packet and sends it back to the port where the request packet was received. Note
that both the source hardware address and the source MAC address in the ARP reply packet are set with the *Global Virtual
nit: s/set with/set to
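The ARP-responder rewrite discussed above can be sketched as follows. This is only an illustration of the field rewrites, assuming a placeholder Global Virtual MAC (the actual value Antrea uses is not taken from this PR):

```python
# Sketch of the ARP responder behavior described above: turn the request into
# a reply answered with the Global Virtual MAC. The MAC value is a placeholder.

GLOBAL_VIRTUAL_MAC = "aa:bb:cc:dd:ee:ff"  # placeholder, for illustration only

def arp_respond(request: dict) -> dict:
    """Build an ARP reply for a request targeting a remote Antrea gateway IP."""
    return {
        "op": 2,                        # ARP reply
        "sha": GLOBAL_VIRTUAL_MAC,      # source hardware address set to the vMAC
        "spa": request["tpa"],          # answer on behalf of the asked-about IP
        "tha": request["sha"],          # addressed back to the requester
        "tpa": request["spa"],
        "eth_src": GLOBAL_VIRTUAL_MAC,  # Ethernet source MAC also set to the vMAC
        "eth_dst": request["sha"],
    }

reply = arp_respond({"op": 1, "sha": "02:00:00:00:00:01",
                     "spa": "10.10.0.1", "tpa": "10.10.1.1"})
assert reply["op"] == 2
assert reply["sha"] == GLOBAL_VIRTUAL_MAC == reply["eth_src"]
assert reply["spa"] == "10.10.1.1"
```

The key point from the quoted text is visible here: both `sha` and the Ethernet source MAC are set to the Global Virtual MAC, so remote-gateway traffic is later recognizable by that MAC.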
docs/design/ovs-pipeline.md
Outdated
other side of the tunnel.
A similar route is installed on the local Antrea gateway (antrea-gw0) interface every time the Antrea *Node Route Controller*
is notified that a new Node has joined the cluster. The route must be marked as "onlink" since the kernel does not have
a route to the peer gateway `10.10.1.1`. we trick the kernel into believing that `10.10.1.1` is directly connected to
- a route to the peer gateway `10.10.1.1`. we trick the kernel into believing that `10.10.1.1` is directly connected to
+ a route to the peer gateway `10.10.1.1`. We "trick" the kernel into believing that `10.10.1.1` is directly connected to
docs/design/ovs-pipeline.md
Outdated
### ServiceHairpinTable (23)
This table is designed to determine the "category" of packets by matching the ingress port of the packets. It
- This table is designed to determine the "category" of packets by matching the ingress port of the packets. It
+ This table is designed to determine the "category" of IP packets by matching on their ingress port. It
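Classifying packets by ingress port, as this table does, can be sketched with a small lookup. The port numbers and mark names below are assumptions for illustration (the mark names follow the `FromLocalRegMark` naming that appears elsewhere in this review):

```python
# Sketch of the Classifier idea: derive a "packet source" mark from the OVS
# ingress port. Port numbers and mark names are assumptions for illustration.

PORT_TUNNEL = 1    # assumed ofport of the tunnel interface
PORT_GATEWAY = 2   # assumed ofport of the antrea-gw0 interface

def classify(in_port: int) -> str:
    if in_port == PORT_GATEWAY:
        return "FromGatewayRegMark"   # packet entered via the local gateway
    if in_port == PORT_TUNNEL:
        return "FromTunnelRegMark"    # packet arrived from a remote Node
    return "FromLocalRegMark"         # any other port is a local Pod port

assert classify(2) == "FromGatewayRegMark"
assert classify(1) == "FromTunnelRegMark"
assert classify(7) == "FromLocalRegMark"
```

Later tables (e.g. ServiceLB, SNAT) then match on the stored mark instead of re-examining the ingress port.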
docs/design/ovs-pipeline.md
Outdated
Service Endpoint selection has been done.
Flow 8 and flow 9 are for case 6. Flow 8 has the higher priority than that of flow 9, prioritizing matching the first
packet of connection sourced from a local Pod or the local Node with `FromLocalRegMark` loaded in table [Classifier]
s/of connection sourced/of connections sourced
docs/design/ovs-pipeline.md
Outdated
packet of connection sourced from a local Pod or the local Node with `FromLocalRegMark` loaded in table [Classifier]
and destined for the sample [Service with ExternalTrafficPolicy Local]. The target of flow 8 is an OVS group that has
all the Endpoints across the cluster, ensuring accessibility for Service connections originating from local Pods or
Nodes, regardless that `externalTrafficPolicy` of the Service is `Local`. Due to the existence of flow 8, consequently,
- Nodes, regardless that `externalTrafficPolicy` of the Service is `Local`. Due to the existence of flow 8, consequently,
+ Nodes, even though `externalTrafficPolicy` is set to `Local` for the Service. Due to the existence of flow 8, consequently,
docs/design/ovs-pipeline.md
Outdated
### EgressRuleTable (50)
The table implements DNAT for Service connection after Endpoint selection is performed in table [ServiceLB].
s/Service connection/Service connections
```text
1. table=EndpointDNAT, priority=200,reg0=0x4000/0x4000 actions=controller(reason=no_match,id=62373,userdata=04)
2. table=EndpointDNAT, priority=200,tcp,reg3=0xa0a0018,reg4=0x20050/0x7ffff actions=ct(commit,table=AntreaPolicyEgressRule,zone=65520,nat(dst=10.10.0.24:80),exec(set_field:0x10/0x10->ct_mark,move:NXM_NX_REG0[0..3]->NXM_NX_CT_MARK[0..3]))
3. table=EndpointDNAT, priority=200,tcp,reg3=0xa0a0106,reg4=0x20050/0x7ffff actions=ct(commit,table=AntreaPolicyEgressRule,zone=65520,nat(dst=10.10.1.6:80),exec(set_field:0x10/0x10->ct_mark,move:NXM_NX_REG0[0..3]->NXM_NX_CT_MARK[0..3]))
4. table=EndpointDNAT, priority=190,reg4=0x20000/0x70000 actions=set_field:0x10000/0x70000->reg4,resubmit(,ServiceLB)
```
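The register values in flows 2-3 above can be decoded by hand. As a reading of the quoted flows (not an authoritative register layout), `reg3` holds the selected Endpoint IPv4 address, the low 16 bits of `reg4` hold the Endpoint port, and bits [16..18] of `reg4` carry Service-LB state:

```python
import ipaddress

# Decode reg3/reg4 from the EndpointDNAT flows quoted above. This layout is
# inferred from the flow dump and the doc's mention of NXM_NX_REG4[16..18];
# it is a reading of the example, not an authoritative specification.

def decode_endpoint(reg3: int, reg4: int):
    ip = str(ipaddress.IPv4Address(reg3))  # reg3: Endpoint IPv4 as a 32-bit int
    port = reg4 & 0xFFFF                   # reg4 low 16 bits: Endpoint port
    state = (reg4 >> 16) & 0x7             # reg4 bits [16..18]: Service-LB state
    return ip, port, state

# Flow 2: reg3=0xa0a0018, reg4=0x20050 -> Endpoint 10.10.0.24:80
assert decode_endpoint(0x0A0A0018, 0x20050) == ("10.10.0.24", 80, 2)
# Flow 3: reg3=0xa0a0106 -> Endpoint 10.10.1.6:80
assert decode_endpoint(0x0A0A0106, 0x20050) == ("10.10.1.6", 80, 2)
```

This matches the `nat(dst=10.10.0.24:80)` / `nat(dst=10.10.1.6:80)` targets in the same flows.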
I noticed that you didn't describe this flow below. Can you add a quick description?
Will add. I forgot the description of flows 4 and 5 in this table.
```text
2. table=EndpointDNAT, priority=200,tcp,reg3=0xa0a0018,reg4=0x20050/0x7ffff actions=ct(commit,table=AntreaPolicyEgressRule,zone=65520,nat(dst=10.10.0.24:80),exec(set_field:0x10/0x10->ct_mark,move:NXM_NX_REG0[0..3]->NXM_NX_CT_MARK[0..3]))
3. table=EndpointDNAT, priority=200,tcp,reg3=0xa0a0106,reg4=0x20050/0x7ffff actions=ct(commit,table=AntreaPolicyEgressRule,zone=65520,nat(dst=10.10.1.6:80),exec(set_field:0x10/0x10->ct_mark,move:NXM_NX_REG0[0..3]->NXM_NX_CT_MARK[0..3]))
```
The doc says:
nat(type=addrs[:ports][,flag]...)
The src and dst options take the following arguments:
addrs
The IP address addr or range addr1-addr2 from which the translated address should be selected. If only one address is given, then that address will always be selected, otherwise the address selection can be informed by the optional persistent flag as described below. Either IPv4 or IPv6 addresses can be provided, but both addresses must be of the same type, and the datapath behavior is undefined in case of providing IPv4 address range for an IPv6 packet, or IPv6 address range for an IPv4 packet. IPv6 addresses must be bracketed with [ and ] if a port range is also given.
So the answer should be no.
Done with the review.
Please address comments using a new / separate commit, so it's easy for me to do a final round of review later.
docs/design/ovs-pipeline.md
Outdated
For this table, you will need to keep in mind the Antrea-native NetworkPolicy
[specification](#antrea-native-networkpolicy-implementation). Since the sample egress policy resides in the Application
Tier. If you dump the flows of this table, you may see the following:
- Tier. If you dump the flows of this table, you may see the following:
+ Tier, if you dump the flows of this table, you may see the following:
docs/design/ovs-pipeline.md
Outdated
- Flow 7 is used to match packets with the source IP address in set {10.10.0.24}, which is from the Pods selected
by the label `app: web`, constituting the first dimension for `conjunction` with `conj_id` 5.
- Flow 8 is used to match any packets, constituting the second dimension for `conjunction` with `conj_id` 5.
do you know why we even have this flow? Is it because we need at least 2 dimensions for a conjunctive match? If yes, maybe we should indicate it to avoid confusion from the reader. cc @wenyingd
> Is it because we need at least 2 dimensions for a conjunctive match?

That was what I thought, and I didn't think more about it.
let's just add one sentence about it at the end of the bullet point:
This flow, which matches all packets, exists because we need at least 2 dimensions for a conjunctive match.
Will add
[specification](#network-policy-implementation) that we are using. We have 2
Pods running on the same Node, with IP addresses 10.10.1.2 and 10.10.1.3. They
are allowed to talk to each other using TCP on port 80, but nothing else.
### AntreaPolicyIngressRule
please make sure that the comments I left for Egress security tables are also addressed for these tables, when applicable
docs/design/ovs-pipeline.md
Outdated
Flow 1 is for case 1. It matches packets with `L7NPRedirectCTMark` and `OutputToOFPortRegMark`, and then outputs them to
the port `antrea-l7-tap0` specifically created for connecting to an application-aware engine. Notably, these packets are pushed
with an 802.1Q header and loaded with the VLAN ID value persisted in `L7NPRuleVlanIDCTLabel` before being output, due to
the implementation of `L7NetworkPolicy`.
Even though it is a bit orthogonal to the description of this table, I would add the following sentence at the end of the paragraph:
The application-aware engine enforcing L7 policies (e.g., Suricata) can leverage the VLAN ID to determine which set of rules to apply to the packet.
docs/design/ovs-pipeline.md
Outdated
Flow 5 is for case 5, matching packets sent back from an application-aware engine through a specific port and forwarding
them to table [L3Forwarding] to decide the egress port. Like flow 4, the purpose of forwarding the packets to table
[L3Forwarding] is to load tunnel destination IP for packets destined for remote Nodes. `FromTCReturnRegMark` that will
be used in table [TrafficControl] is also loaded to mark the packet source.
Maybe mention that the VLAN tag which was previously added for use by the L7 engine is also removed.
the value stored in `TargetOFPortField`.
Flows 6-7 are for case 6. They match packets by matching `OutputToControllerRegMark` and the value stored in
`PacketInOperationField`, then output them to the OpenFlow controller (Antrea Agent) with corresponding user data.
Add the following at the end of the paragraph:
In practice, you will see additional flows similar to these ones to accommodate different scenarios (different `PacketInOperationField` values). Note that packets sent to controller are metered to avoid overrunning the Antrea Agent and using too many resources.
@antoninbas I have not yet addressed all review comments. Will continue several hours later.
Signed-off-by: Hongliang Liu <[email protected]>
Force-pushed from `d738ce2` to `e2ef40f`
Signed-off-by: Hongliang Liu <[email protected]>
@@ -1078,15 +1093,19 @@ Some bits of ct mark are persisted:
- The value of `PktSourceField` is persisted to `ConnSourceCTMarkField`, storing the source of the connection for the
current packet and subsequent packets of the connection.
Flow 4 is to resubmit the packets which are not matched by flows 1-3 back to table [ServiceLB] to select Endpoint again.
We may not have to add more information now as we are close to release time, but is this in case of an inconsistency between the OVS groups used for Endpoint selection and this table?
> but is this in case of an inconsistency between the OVS groups used for Endpoint selection and this table?

I guess this flow should rarely be used, except in the case you mentioned. Furthermore, it seems that this flow has always existed alongside AntreaProxy. I am not sure if there are other reasons for this flow.
docs/design/ovs-pipeline.md
Outdated
- will first go through the gateway, get load-balanced by the kube-proxy data path (DNAT) then re-enter through the
- gateway. Then the packets are received on the gateway port with a source IP belonging to a local Pod.
+ will first go through the gateway port, get load-balanced by the kube-proxy data path (undergo DNAT with a local Endpoint
+ selected by the kube-proxy) then re-enter through the gateway port. Then the packets are received on the gateway port
I may be wrong but I think that is also the case when kube-proxy selects a non-local Endpoint: I don't think kube-proxy will masquerade the traffic and the packet will go back through the gateway to be sent to the remote Pod over the appropriate tunnel. The only exception is the hairpin case.
So maybe replace "undergo DNAT with a local Endpoint selected by the kube-proxy" with "undergoes DNAT", unless I am missing something?
You reminded me. When kube-proxy selects a non-local Endpoint, after DNAT, the packets will be forwarded to antrea-gw0 by the onlink routing added by Antrea NodeController.
Added something to remind the reader like this in bold:
When Antrea is deployed with kube-proxy, and `AntreaProxy` is not enabled, packets from local Pods destined for Services will first go through the gateway port, get load-balanced by the kube-proxy data path (undergoes DNAT) then re-enter the OVS pipeline through the gateway port (through a "onlink" route, installed by Antrea, directing the DNAT'd packets to the gateway port).
docs/design/ovs-pipeline.md
Outdated
- Services will get load-balanced by the kube-proxy data path (DNAT) and then routed to OVS through the gateway without SNAT.
+ network destined for Services will be routed to OVS through the gateway port without changing the source IP.
- When Antrea is deployed with kube-proxy, packets from the external network destined for Services whose
`externalTrafficPolicy` is set to `Local` will get load-balanced by the kube-proxy data path (undergo DNAT with a
s/undergo/undergoes
docs/design/ovs-pipeline.md
Outdated
@@ -1350,30 +1374,31 @@ If you dump the flows of this table, you may see the following:
- Flows 1-2 match packets originating from local Pods and destined for the transport IP of remote Nodes, and then forward
- them to table [L2ForwardingCalc] to skip Egress SNAT. `ToGatewayRegMark` is loaded, indicating that the output port of
- the packets is the local Antrea gateway.
+ them to table [L2ForwardingCalc] to bypass the Pod-to-Node traffic from Egress SNAT. `ToGatewayRegMark` is loaded,
"to bypass the Pod-to-Node traffic from Egress SNAT" should be replaced with "to bypass Egress SNAT for Pod-to-Node traffic"
docs/design/ovs-pipeline.md
Outdated
The packets, matched by flows 1-3, are forwarded to this table by flow 8 in table [L3Forwarding], as they are classified
as part of traffic destined for the external network. However, these packets are not intended to undergo Egress SNAT.
Consequently, flows 1-3 are used to bypass Egress SNAT for these packets.
Great clarification, thanks
Force-pushed from `7ab3cf9` to `5b4ec04`
Signed-off-by: Hongliang Liu <[email protected]>
Force-pushed from `5b4ec04` to `1354a31`
LGTM
LGTM, just one comment
docs/design/ovs-pipeline.md
Outdated
bits [16..18] in NXM_NX_REG4 is set in [SessionAffinityTable] by flow `table=40, priority=0 actions=load:0x1->NXM_NX_REG4[16..18]`.
For reply packets from SNAT'd connections, whose destination IP is the translated SNAT IP, after invoking action `ct`,
the destination IP of the packets will be restored to the original IP before SNAT, stored in the connection tracking
field `ct_nw_dst` before SNAT.
- field `ct_nw_dst` before SNAT.
+ field `ct_nw_dst`.
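The restore behavior described here (reply packets to the SNAT IP getting their addresses un-translated by conntrack) can be modeled with a toy conntrack table. The addresses are invented for the sketch; the real work is done by the kernel conntrack module, not by an explicit lookup like this:

```python
# Toy model of conntrack un-SNAT for reply packets: the forward direction
# records the original source at commit time, and `ct` on the reply direction
# restores it. Addresses below are invented for the sketch.

conntrack = {}

def ct_commit_snat(orig_src, dst, snat_ip):
    # Forward direction: rewrite the source to the SNAT IP, remember orig_src.
    conntrack[(dst, snat_ip)] = orig_src
    return snat_ip, dst

def ct_reply(src, dst):
    # Reply direction: dst is the SNAT IP; restore the original source address.
    orig = conntrack.get((src, dst))
    return (src, orig) if orig else (src, dst)

fwd = ct_commit_snat("10.10.0.2", "8.8.8.8", "192.168.77.112")
assert fwd == ("192.168.77.112", "8.8.8.8")
# The external peer replies to the SNAT IP; conntrack restores the Pod IP:
assert ct_reply("8.8.8.8", "192.168.77.112") == ("8.8.8.8", "10.10.0.2")
```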
Signed-off-by: Quan Tian <[email protected]>
Thank you guys for reviewing this big PR!
Thanks @hongliangl for working on this massive documentation PR!
echo @antoninbas |
Resolves #5200