Ingress host/flow isolation difficulties #145
On 1 Nov, 2020, at 9:07 pm, xnoreq ***@***.***> wrote:
I have a router with
• eth0.2 ... connects to WAN
• eth0.1, wifi0 ... bridged to br-lan ... bridge for LAN
eth0.2 has an ingress qdisc and a tc filter with action mirred egress redirect dev ifb0.
ifb0 has a cake qdisc with a 100 Mbps bandwidth limit and the options ingress dual-dsthost nat besteffort.
Now, having a heavy UDP flow (one that consumes all available bandwidth) and a light one (a few kB/s) from a single LAN host to two different hosts on the Internet results in packet drops in the light flow.
You don't mention having any *egress* qdisc on eth0.2 - that is what would control traffic in that direction. That's what you would get with "tc qdisc replace dev eth0.2 root cake …"
- Jonathan Morton
Sorry, my wording was bad. The flows are downloads and in my example only ingress is relevant.
To clarify again: two flows which are basically "downloads". I'm looking at ingress only because that's where the packets are discarded when bandwidth limits are hit. (Of course there's also egress and I have another cake instance set up there as well, but that is irrelevant for this example because data is sent well below the egress limit.) One "connection" is greedy in the sense that it will use the full 100 Mbps if available, while the other is fairly constant and uses less than ~0.5 Mbps. I am seeing packets getting dropped from the <0.5 Mbps connection. Why is this happening?
Looking through the code, I do have some questions. Thanks.
xnoreq <[email protected]> writes:
Looking through the code, I do have some questions.
For example, why is there no difference between one or zero bulk flows
in terms of deficit?
Eh? What do you mean?
On ack drop with CAKE_FLAG_INGRESS, why is deficit not decremented?
Same question for cake_drop function.
You have this backwards: deficit and shaper are only advanced on drop if
the ingress flag is set. The counters count how much data the flow
actually transmits, and the logic is that on ingress, the packet has
already traversed the bottleneck, so it's counted against what the flow
has transmitted even though we drop it.
Why is deficit only added to in case it was negative and not for every
flow for every time we dequeue/drop? Doesn't this cause unfairness?
No, it doesn't. The refilling of deficit is done on each rotation of
the queue order; this is how DRR scheduling works. See
https://en.wikipedia.org/wiki/Deficit_round_robin
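(In pseudo-C, the DRR step looks roughly like this; a simplified sketch with made-up names, not the actual sch_cake code:)
```
/* Simplified DRR dequeue step (illustrative names, not sch_cake's):
 * a flow only earns a new quantum once its deficit is exhausted, and
 * is then sent to the back of the old-flows rotation, so the refill
 * happens exactly once per rotation.
 */
begin:
	flow = list_first_entry(head, struct flow, flowchain);

	if (flow->deficit <= 0) {
		flow->deficit += quantum;
		list_move_tail(&flow->flowchain, &old_flows);
		goto begin;		/* try the next flow */
	}

	skb = dequeue_one(flow);
	if (skb)
		flow->deficit -= pkt_len(skb);	/* charge the packet */
```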
Could a bursty flow constantly go into decaying state or be removed
and re-added as a new sparse flow, effectively ruining
fairness/isolation? If that is happening then it also doesn't help
that the deficit is reset for every "new" flow.
Yes, that's the idea. Each time a flow goes away it'll be 'sparse' the
next time it comes around. This is a feature; and the decay makes sure
that a flow can only do this if it uses *less* than its fair share of
the link, so it doesn't ruin fairness.
xnoreq <[email protected]> writes:
>> xnoreq ***@***.***> writes:
>> Looking through the code, I do have some questions. For example, why is there no difference between one or zero bulk flows in terms of deficit?
> Eh? What do you mean?
```
host_load = 1;
host_load = max(host_load, dsthost->dsthost_bulk_flow_count);
flow->deficit += (b->flow_quantum * quantum_div[host_load] + (prandom_u32() >> 16)) >> 16;
```
Therefore zero and one bulk_flow_count result in the same host_load.
Why not start from zero and change the initialization code of quantum_div from:
https://github.com/dtaht/sch_cake/blob/48979385757f3408c3427b3ebbf5963efdada5aa/sch_cake.c#L2894-L2896
to something like this:
```
for (i = 0; i <= CAKE_QUEUES; i++)
quantum_div[i] = 65535 / (i+1);
```
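(For comparison, my reading of the referenced original initialization is roughly the following; the table is filled from index 1, and index 0 is only skipped because host_load is clamped to at least 1 before the lookup:)
```
/* Existing initialization as I read the linked lines (approximate):
 * the table starts at index 1, so host_load values of 0 and 1 end up
 * sharing behaviour only via the max(1, ...) clamp on host_load.
 */
for (i = 1; i <= CAKE_QUEUES; i++)
	quantum_div[i] = 65535 / i;
```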
*shrugs*, because that is not how the code was written?
Also, looking at the access to quantum_div with an index based on
bulk_flow_count, what prevents an out-of-bounds access here if
bulk_flow_count goes above CAKE_QUEUES?
How would bulk_flow_count go above CAKE_QUEUES?
>> On ack drop with CAKE_FLAG_INGRESS, why is deficit not decremented? Same question for cake_drop function.
> You have this backwards: deficit and shaper are only advanced on drop if the ingress flag is set. The counters count how much data the flow actually transmits, and the logic is that on ingress, the packet has already traversed the bottleneck, so it's counted against what the flow has transmitted even though we drop it.
I understand that and I guess that it would be even better if that accounting could be done as soon as the packet is received in ingress mode rather than later when the packet is dropped/transmitted.
But back to my point, where is deficit supposed to be touched here?
https://github.com/dtaht/sch_cake/blob/48979385757f3408c3427b3ebbf5963efdada5aa/sch_cake.c#L1906-L1917
cake_advance_shaper does not touch deficit either, hence why it's decremented separately e.g. in cake_dequeue after calling cake_advance_shaper:
https://github.com/dtaht/sch_cake/blob/48979385757f3408c3427b3ebbf5963efdada5aa/sch_cake.c#L2295-L2300
And it's not touched in cake_drop either:
https://github.com/dtaht/sch_cake/blob/48979385757f3408c3427b3ebbf5963efdada5aa/sch_cake.c#L1650-L1652
My point is that this seems to have been overlooked for the
ACK-filtering.
Why would you combine ingress mode and ACK filtering? That makes no
sense...
Also speaking of ACK-filtering, I assume that there is a good
technical reason why ACK-filtering doesn't happen in the GSO case?
The ACK filter only drops pure ACKs; those are not GSO packets...
>> Why is deficit only added to in case it was negative and not for every flow for every time we dequeue/drop? Doesn't this cause unfairness?
> No, it doesn't. The refilling of deficit is done on each rotation of the queue order; this is how DRR scheduling works. See https://en.wikipedia.org/wiki/Deficit_round_robin
Reading through the code more carefully I think I now get how this is
working. Adding to the deficit counter only after it's become <= 0 is
a smart way to not have to remember whether the deficit was increased
between consecutive visits of the flow, as would be necessary if it
was implemented like the pseudo-code on Wikipedia.
So what's left to ensure for fairness is that all flows are visited in
order, which may or may not be the case. I have to think this through
some more.
Yes, they are.
>> Could a bursty flow constantly go into decaying state or be removed and re-added as a new sparse flow, effectively ruining fairness/isolation? If that is happening then it also doesn't help that the deficit is reset for every "new" flow.
> Yes, that's the idea. Each time a flow goes away it'll be 'sparse' the next time it comes around. This is a feature; and the decay makes sure that a flow can only do this if it uses *less* than its fair share of the link, so it doesn't ruin fairness.
But if that's the case, what's preventing such a bursty flow from
starving a much more well-behaved and lower rate flow? That scenario
would actually fit the behavior that I observed.
The decay is the thing that prevents this, by delaying the removal of
the flow state...
xnoreq <[email protected]> writes:
> How would bulk_flow_count go above CAKE_QUEUES?
... You could have just pointed me to cake_hash and how it caps
src/dsthost_bulk_flow_count by decrementing those counters in case of
hash collisions with bulk queues.
Well, you seemed perfectly capable of looking at the code yourself :)
Which also seems to imply that a host with 1x CAKE_QUEUES flows is not
treated differently from a host with e.g. 5x CAKE_QUEUES flows. Though
as cake seems to be designed for (powerful) home routers this probably
doesn't matter.
Erm, not sure what you mean here...
> Why would you combine ingress mode and ACK filtering? That makes no sense...
I don't and I don't know. But it is implemented partly. Why is it in
the code like that?
Because we didn't think that anyone would turn on both ingress mode and
ack filter at the same time, I suppose?
> Yes, they are.
> The decay is the thing that prevents this, by delaying the removal of the flow state...
Are you talking about CAKE_SET_DECAYING here?
In cake_enqueue, such decaying flows are treated like new (sparse) flows and their deficit is reset. That's also something I was asking about earlier.
I see that when the flow is set to decaying, the `goto begin` restarts
dequeuing, so another flow might get a packet submitted or dropped,
but if a new packet arrives for the decaying flow in the meantime then
it's again processed first. Since the deficit was reset the flow can
now also send a full quantum of bytes again. Doesn't this give an
unfair advantage to such flows?
Yeah, this one is a bit subtle: The thing that prevents starvation is
that the state is not cleared immediately; once a queue runs empty,
it'll be put into the set of decaying flows, and stay there until its
codel 'count' has gone back down to zero. So if another packet for that
flow arrives before that happens, it'll keep the same deficit and won't
be able to gain any advantage.
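(Roughly, from my reading of the empty-queue branch in cake_dequeue; field names approximate, treat as a sketch:)
```
/* Sketch of the empty-queue handling in cake_dequeue (approximate):
 * a queue that runs dry is only forgotten once its COBALT state has
 * quiesced; until then it is parked as DECAYING so a returning packet
 * does not find a completely fresh slot.
 */
if (!skb) {
	if (flow->cvars.p_drop || flow->cvars.count ||
	    ktime_before(now, flow->cvars.drop_next)) {
		/* AQM state still active: keep the flow around */
		list_move_tail(&flow->flowchain, &b->decaying_flows);
		flow->set = CAKE_SET_DECAYING;
	} else {
		/* state fully decayed: remove the flow from the rotation */
		list_del_init(&flow->flowchain);
		flow->set = CAKE_SET_NONE;
	}
	goto begin;
}
```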
@tohojo I think that's correct and that it's possible to construct malicious traffic patterns that cause total starvation, but I was just concerned with a practical case in my application where I saw a low bandwidth flow (e.g. from an online game with periodic updates in the form of small UDP packets) dropping packets even though total bandwidth was not even remotely exhausted.
I do have some other hacks running on an openwrt router, like differentiation of bulk and sparse decaying flows, doing deficit accounting asap (in enqueue) in ingress mode, not resetting the deficit, etc., but as I'm not doing any scientific measurements (wouldn't know where to start) I cannot make concrete statements about the effects of any of those. All I can say is "works better for me on my home router". I'll maybe upload something later in the evening.
I guess that the deficit (counter) is cleared because that's what the DRR algorithm does when a queue is empty. Yeah, I know, that's another one of those "because" answers.
On 10 Nov, 2020, at 11:52 am, Toke Høiland-Jørgensen ***@***.***> wrote:
> I understand what you're saying but I don't see this reflected in the code. Let me repeat:
> When a new packet arrives for a decaying flow, it is treated almost exactly like a new flow.
>
> 1. it's added/moved to the end of `new_flows`
>
> 2. its deficit is reset
>
> 1 & 2 mean that in cake_dequeue it gets into the AQM part first, before any other old_flows. If the flow then sends a packet and keeps decaying the whole process repeats.
> As I wrote before, after the flow is set to decaying, there is a goto that will result in another flow being processed, but this is potentially another such "new" flow
Huh, I think it's actually worse than that: By assumption, a sparse flow doesn't trigger the AQM at all. So a small packet (size < quantum) will get enqueued, flow will go to new_flows, and on dequeue that packet will get dequeued immediately, but the flow will stay at the head of new_flows. On the next dequeue, the flow will actually be empty, so cake_dequeue_one() will return NULL, and the logic will hit the else branch:
https://github.com/dtaht/sch_cake/blob/48979385757f3408c3427b3ebbf5963efdada5aa/sch_cake.c#L2263-L2284
This will take the flow straight back to empty, causing a new arriving packet to move it back to the front. So a sparse flow that times its packets right (to arrive just after the flow state expires), can potentially cause persistent starvation of the other flows.
@chromi any reason why the above is wrong?
If there is time for the state to expire between packets, then the flow *is* sparse, and it is appropriate to prioritise it. The scenario you describe would actually require multiple flows to each have their timing fall in a pretty small window of opportunity, if it exists at all.
I'll freely admit that the logic here is more complex and harder to follow than I'd like. However, it's in a state that works well in practice, and which would require a clean-slate rework to improve appreciably from this point without risking counter-intuitive effects. I would have to prototype that elsewhere in order to establish that a new design worked at least as well in a wide variety of conditions.
As for the decaying mode, I'm not so worried about that clearing the deficit on re-activation. I mean, it's a bit weird (why are we keeping all the other state but clearing the deficit?), but at least it won't cause starvation because that flow has already been through the old flows rotation at least once...
The purpose of the "decaying" state is to ensure that COBALT returns to a quiescent state soon after a flow goes away, rather than requiring more work when a new flow starts up in the same slot. It tries to take time that would normally be used to process bulk flows.
> I took openwrt's sch_cake and did a couple of modifications to the ingress mode. I probably will upload this to a new repo after more testing on a home router. With the modifications I can run in besteffort and without any iptables rules (to classify the traffic into different tins) which significantly reduces CPU load, actually overload, to a small fraction. Plus a couple of other hacks.
Feel free to open pull requests with any changes you want to share.
> One thing that is somewhat of a mystery to me is q->failsafe_next_packet. I see that this was introduced with ingress mode, to help in scenarios where a lot of packets are dropped?
> Wouldn't this just advance time_next_packet a bit further into the future? Could you please explain in which situation this failsafe operates? Thanks.
I think this was added because of some potential starvation issues. @chromi ?
Yes, without this feature, it's possible to get into a state where the AQM is dropping all the packets and making no actual deliveries. That can't happen without ingress mode, because normally a fresh packet would be chosen to fill the same downstream timeslot. But in ingress mode, the assumption is that a packet in the queue has *already* occupied some *upstream* timeslot, which cannot be taken back.
So the failsafe is just that - a way to keep *some* traffic moving in the case where the above might otherwise occur. NB: with ECN you can have congestion control without dropping packets, but there's still a lot of traffic that doesn't understand ECN yet.
- Jonathan Morton
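(A rough sketch of how the two timestamps interact, as far as I can tell from the code; names approximate: the main shaper clock advances for every packet, including ones the AQM drops in ingress mode, while the failsafe clock only advances for packets actually delivered, and dequeue only stalls while both are in the future.)
```
/* Sketch (approximate): the shaper only waits if *both* clocks are in
 * the future.  Since failsafe_next_packet is not advanced for dropped
 * packets, a run of AQM drops cannot push deliveries arbitrarily far out.
 */
if (ktime_after(q->time_next_packet, now) &&
    ktime_after(q->failsafe_next_packet, now)) {
	u64 next = min(ktime_to_ns(q->time_next_packet),
		       ktime_to_ns(q->failsafe_next_packet));

	qdisc_watchdog_schedule_ns(&q->watchdog, next);
	return NULL;
}
```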
Regarding the next_packet timestamps in ingress mode:
On the other hand, if packets are mostly dropped then … The issue I see here is when the situation changes from mostly transmitting for a while to dropping a lot of packets. If failsafe was ahead by a lot (let's say tens of ms) and we start dropping lots of packets, then you are exactly in the same situation that this is supposed to fix. Assuming my understanding of all of this is correct, why is …
I think something like this patch would help on the starvation issue (which I don't think is quite as benign as you make it sound; there's a reason fq_codel safeguards against this):
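A sketch of the idea (hypothetical, modeled on the fq_codel safeguard referenced further down; not necessarily the exact diff): when a flow from the new_flows rotation runs empty, force it through old_flows once instead of clearing its state.
```
/* Hypothetical sketch only (not the actual patch): in the empty-queue
 * branch of cake_dequeue, keep a formerly-sparse flow in the old_flows
 * rotation instead of deleting it, so a packet arriving right after
 * cannot immediately re-enter as a "new" sparse flow.
 */
if (!skb) {
	if (flow->set == CAKE_SET_SPARSE && !list_empty(&b->old_flows)) {
		list_move_tail(&flow->flowchain, &b->old_flows);
		flow->set = CAKE_SET_SPARSE_WAIT;
	} else {
		list_del_init(&flow->flowchain);
		flow->set = CAKE_SET_NONE;
	}
	goto begin;
}
```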
Will test this and submit a proper patch :)
@tohojo CAKE_SET_SPARSE_WAIT is used for "sparse flows in the bulk rotation". If a packet is enqueued for this flow after hitting your code then it will be categorized as a BULK flow. I'm not sure that's what you want for a sparse flow that potentially didn't even enter the bulk rotation.
xnoreq <[email protected]> writes:
@tohojo CAKE_SET_SPARSE_WAIT is used for "sparse flows in the bulk
rotation". If a packet is enqueued for this flow after hitting your
code then it will be categorized as a BULK flow. I'm not sure that's
what you want for a sparse flow that potentially didn't even enter the
bulk rotation.
Yeah, that is what we want: if a packet arrives for that flow while it
is going through the rotation, then it will continue being scheduled as
a bulk flow, because then it *wasn't* really sparse.
BTW, none of the above explains your original issue with packet
drops on the sparse flow, so I'll be interested to know which of your
changes actually fixed that... Feel free to just open a pull request
with a bunch of commits, and we can sort through them and pick out those
which appear to warrant more testing :)
On 11 Nov, 2020, at 1:06 am, Toke Høiland-Jørgensen ***@***.***> wrote:
> @tohojo CAKE_SET_SPARSE_WAIT is used for "sparse flows in the bulk
> rotation". If a packet is enqueued for this flow after hitting your
> code then it will be categorized as a BULK flow. I'm not sure that's
> what you want for a sparse flow that potentially didn't even enter the
> bulk rotation.
Yeah, that is what we want: if a packet arrives for that flow while it
is going through the rotation, then it will continue being scheduled as
a bulk flow, because then it *wasn't* really sparse.
Actually, no. For that to be true, you need the *additional* condition that the deficit has expired. In DRR++, that is what moves a flow from sparse to bulk status. So your suggested fix materially makes the sparseness condition more stringent than intended.
This is exactly the sort of unintended effect that I warned against earlier.
- Jonathan Morton
Jonathan Morton <[email protected]> writes:
> On 11 Nov, 2020, at 1:06 am, Toke Høiland-Jørgensen ***@***.***> wrote:
>
> > @tohojo CAKE_SET_SPARSE_WAIT is used for "sparse flows in the bulk
> > rotation". If a packet is enqueued for this flow after hitting your
> > code then it will be categorized as a BULK flow. I'm not sure that's
> > what you want for a sparse flow that potentially didn't even enter the
> > bulk rotation.
>
> Yeah, that is what we want: if a packet arrives for that flow while it
> is going through the rotation, then it will continue being scheduled as
> a bulk flow, because then it *wasn't* really sparse.
Actually, no. For that to be true, you need the *additional*
condition that the deficit has expired. In DRR++, that is what moves
a flow from sparse to bulk status. So your suggested fix materially
makes the sparseness condition more stringent than intended.
That would only impact sparse flows that send several tiny packets with
a combined size less than the quantum within one rotation. Which I
suppose could be a concern, actually, but it would also be fairly
straight-forward to fix, I think: Just re-promote the flows out from the
old queue rotation on enqueue if there's still a positive quantum.
Whereas with the existing code you can use careful timing to get half of
the link bandwidth regardless of the number of flows, like so:
Assume N bulk flows all using quantum-size packets, constantly
backlogged. Flow S packet comes in, goes into the sparse rotation, is
immediately dequeued. On the next dequeue, the state for flow S is
cleared, and a bulk packet is dequeued and will start transmission.
While that packet is dequeued, flow S sends another packet, which will
be put to the head of the queue and dequeued next.
So with this, flow S can time its packets so that it is serviced on
every other dequeue, which will give it a bandwidth share of s/n where s
is the sparse flow packet size and n is the bulk flow packet size -
regardless of the number of bulk flows.
The only pieces of information needed to do this are the wire time of its
own packet + that of a bulk flow. Which really means all it needs to
know is the physical link (or shaper) rate; not too difficult to guess.
And especially at low rates, the timing doesn't even need to be that
precise (ms granularity).
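(For a rough sense of scale: at 10 Mbit/s a 1500-byte bulk packet occupies 1500 * 8 / 10^7 s = 1.2 ms of wire time, so hitting one window per bulk packet only needs roughly millisecond timing; at 100 Mbit/s the window shrinks to about 120 µs.)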
Going to try and prototype a packet generator that can exploit the
above; if it's as straight forward as I think, we *really* need to fix
it :)
This is exactly the sort of unintended effect that I warned against earlier.
It is not 'unintended', it's a deliberate tradeoff; what I proposed
above is literally what fq_codel does to protect against starvation:
https://elixir.bootlin.com/linux/latest/source/net/sched/sch_fq_codel.c#L308
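(The safeguard at that spot looks roughly like this, quoting from memory:)
```
/* fq_codel_dequeue(), roughly: a new-flows queue that runs empty is
 * forced through the old_flows rotation once instead of having its
 * state dropped, which is exactly what blocks the pattern above.
 */
if (!skb) {
	/* force a pass through old_flows to prevent starvation */
	if ((head == &q->new_flows) && !list_empty(&q->old_flows))
		list_move_tail(&flow->flowchain, &q->old_flows);
	else
		list_del_init(&flow->flowchain);
	goto begin;
}
```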
As I said above, though, I don't think it's too hard to fix even this
issue...
My remark wasn't about moving the flow to the end of old_flows, that's fine. It was only about reusing CAKE_SET_SPARSE_WAIT, which is not the right status. I think that's what Toke referred to as well. It's simple to fix: just add another state, and on enqueue add a case for this new state that sets the state back to SPARSE. This case also needs to be added where empty queues are deleted.
Yeah, as mentioned earlier I think it's also easy to starve old flows: a simple loop over some port or IP range, send a packet to each IP:port at a high enough rate. Repeat. That should do it.
Was the fix ever implemented? BTW is there an easy way to see which packets CAKE actually drops? I'd love to be able to 'tcpdump' the dropped packets. |
Hmm, no, guess not :/
There's a …
Hello,
I have a router with
• eth0.2 ... connects to WAN
• eth0.1, wifi0 ... bridged to br-lan ... bridge for LAN
eth0.2 has an ingress qdisc and a tc filter with action mirred egress redirect dev ifb0. ifb0 has a cake qdisc with a 100 Mbps bandwidth limit and the options ingress dual-dsthost nat besteffort.
Now, having a heavy UDP flow (one that consumes all available bandwidth) and a light one (a few kB/s) from a single LAN host to two different hosts on the Internet results in packet drops in the light flow.
Why? Shouldn't dual-dsthost provide flow isolation and fairness over those two flows, and therefore drop from the heavy flow instead of the light one?
Is there a better workaround other than using iptables hashlimit to detect heavy flows (basically do what cake should do), mark them, and use this with something like diffserv3 and another tc filter to override the tin, to "de-prioritize" those flows?
Or is something broken here?