
Ingress host/flow isolation difficulties #145

Open
xnoreq opened this issue Nov 1, 2020 · 23 comments

Comments

@xnoreq

xnoreq commented Nov 1, 2020

Hello,

I have a router with

  • eth0.2 ... connects to WAN
  • eth0.1, wifi0 ... bridged to br-lan ... bridge for LAN

eth0.2 has an ingress qdisc and a tc filter with action mirred egress redirect dev ifb0.

ifb0 has a cake qdisc with a 100 Mbps bandwidth limit and the options ingress dual-dsthost nat besteffort.

Now, having a heavy UDP flow (one that consumes all available bandwidth) and a light one (a few kB/s) from a single LAN host to two different hosts on the Internet results in packet drops in the light flow.
Why? Shouldn't dual-dsthost provide flow isolation and fairness across those two flows, and therefore drop from the heavy flow instead of the light one?

Is there a better workaround than using iptables hashlimit to detect heavy flows (basically doing what cake should do), marking them, and using the mark with something like diffserv3 and another tc filter to override the tin, in order to "de-prioritize" those flows?

Or is something broken here?

@chromi
Collaborator

chromi commented Nov 1, 2020 via email

@xnoreq
Author

xnoreq commented Nov 1, 2020

Sorry, my wording was bad. The flows are downloads and in my example only ingress is relevant.

@xnoreq
Author

xnoreq commented Nov 2, 2020

To clarify again: 2 flows which are basically "downloads". Looking at ingress only because that's where the packets are discarded because bandwidth limits are hit. (Of course there's also egress and I do have another cake instance set up there as well, but this is irrelevant for the example because data is sent at well below the egress limit.)

One "connection" is greedy in the sense that it will use the full 100 Mbps if available while the other is fairly constant and uses less than ~0.5 Mbps.

I am seeing packets getting dropped from the <0.5 Mbps connection. Why is this happening?

@xnoreq
Author

xnoreq commented Nov 4, 2020

Looking through the code, I do have some questions.
For example, why is there no difference between one or zero bulk flows in terms of deficit?
On ack drop with CAKE_FLAG_INGRESS, why is deficit not decremented? Same question for cake_drop function.
Why is deficit only added to in case it was negative and not for every flow for every time we dequeue/drop? Doesn't this cause unfairness?
Could a bursty flow constantly go into decaying state or be removed and re-added as a new sparse flow, effectively ruining fairness/isolation? If that is happening then it also doesn't help that the deficit is reset for every "new" flow.

Thanks.

@tohojo
Collaborator

tohojo commented Nov 5, 2020 via email

@xnoreq
Author

xnoreq commented Nov 5, 2020

xnoreq [email protected] writes:
Looking through the code, I do have some questions. For example, why is there no difference between one or zero bulk flows in terms of deficit?

Eh? What do you mean?

host_load = 1;
host_load = max(host_load, dsthost->dsthost_bulk_flow_count);
flow->deficit += (b->flow_quantum * quantum_div[host_load] + (prandom_u32() >> 16)) >> 16;

Therefore zero and one bulk_flow_count result in the same host_load.
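To illustrate, here is a self-contained toy extraction of that arithmetic (a sketch only: the CAKE_QUEUES size and the 1514-byte quantum are assumed for the example, and the prandom_u32() dithering term is left out). It shows that zero and one bulk flows yield the same per-visit deficit increment, while two bulk flows halve it:

```c
#include <assert.h>
#include <stdint.h>

#define CAKE_QUEUES 1024	/* assumed value for this sketch */

static uint16_t quantum_div[CAKE_QUEUES + 1];

/* same initialisation scheme as the excerpt above */
static void init_quantum_div(void)
{
	int i;

	quantum_div[0] = ~0;
	for (i = 1; i <= CAKE_QUEUES; i++)
		quantum_div[i] = 65535 / i;
}

/* per-visit deficit increment for a flow, given its host's bulk flow
 * count (randomness omitted for clarity) */
static uint32_t deficit_increment(uint32_t flow_quantum, int bulk_flow_count)
{
	int host_load = 1;

	if (bulk_flow_count > host_load)
		host_load = bulk_flow_count;
	return (flow_quantum * quantum_div[host_load]) >> 16;
}
```

Since host_load is clamped to a minimum of 1, `quantum_div[0]` is never actually indexed; a host with zero bulk flows gets the same full quantum as a host with one.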

Why not start from zero and change the initialization code of quantum_div from:

sch_cake/sch_cake.c

Lines 2894 to 2896 in 4897938

quantum_div[0] = ~0;
for (i = 1; i <= CAKE_QUEUES; i++)
quantum_div[i] = 65535 / i;

to something like this:

	for (i = 0; i <= CAKE_QUEUES; i++)
		quantum_div[i] = 65535 / (i+1);

Also, looking at the access to quantum_div with an index based on bulk_flow_count, what prevents an out-of-bounds access here if bulk_flow_count goes above CAKE_QUEUES?

On ack drop with CAKE_FLAG_INGRESS, why is deficit not decremented? Same question for cake_drop function.

You have this backwards: the deficit and shaper are only advanced on drop if the ingress flag is set. The counters count how much data the flow actually transmits, and the logic is that on ingress, the packet has already traversed the bottleneck, so it's counted against what the flow has transmitted even though we drop it.

I understand that and I guess that it would be even better if that accounting could be done as soon as the packet is received in ingress mode rather than later when the packet is dropped/transmitted.
But back to my point, where is deficit supposed to be touched here?

sch_cake/sch_cake.c

Lines 1906 to 1917 in 4897938

if (ack) {
b->ack_drops++;
sch->qstats.drops++;
b->bytes += qdisc_pkt_len(ack);
len -= qdisc_pkt_len(ack);
q->buffer_used += skb->truesize - ack->truesize;
if (q->rate_flags & CAKE_FLAG_INGRESS)
cake_advance_shaper(q, b, ack, now, true);
qdisc_tree_reduce_backlog(sch, 1, qdisc_pkt_len(ack));
consume_skb(ack);
} else {

cake_advance_shaper does not touch deficit either, hence why it's decremented separately e.g. in cake_dequeue after calling cake_advance_shaper:

sch_cake/sch_cake.c

Lines 2295 to 2300 in 4897938

if (q->rate_flags & CAKE_FLAG_INGRESS) {
len = cake_advance_shaper(q, b, skb,
now, true);
flow->deficit -= len;
b->tin_deficit -= len;
}

And it's not touched in cake_drop either:

sch_cake/sch_cake.c

Lines 1650 to 1652 in 4897938

if (q->rate_flags & CAKE_FLAG_INGRESS)
cake_advance_shaper(q, b, skb, now, true);

My point is that this seems to have been overlooked for the ACK-filtering.

Also speaking of ACK-filtering, I assume that there is a good technical reason why ACK-filtering doesn't happen in the GSO case?

Why is deficit only added to in case it was negative and not for every flow for every time we dequeue/drop? Doesn't this cause unfairness?

No, it doesn't. The refilling of the deficit is done on each rotation of the queue order; this is how DRR scheduling works. See https://en.wikipedia.org/wiki/Deficit_round_robin

Reading through the code more carefully I think I now get how this is working. Adding to the deficit counter only after it's become <= 0 is a smart way to not have to remember whether the deficit was increased between consecutive visits of the flow, as would be necessary if it was implemented like the pseudo-code on Wikipedia.
So what's left to ensure for fairness is that all flows are visited in order, which may or may not be the case. I have to think this through some more.
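As a sanity check on that reading, here is a minimal toy sketch of the refill-on-exhaustion DRR variant (illustrative only, not cake's actual structures: two always-backlogged flows with fixed packet sizes and an assumed 300-byte quantum):

```c
#include <assert.h>

#define NFLOWS  2
#define QUANTUM 300	/* assumed per-flow quantum for this sketch */

struct toy_flow {
	int deficit;
	int pkt_size;	/* fixed packet size; flow is always backlogged */
	long sent;	/* bytes dequeued so far */
};

/* serve one packet from the rotation; returns the index of the flow
 * served. The deficit is topped up only once it has been driven to
 * <= 0, at which point the flow rotates to the back of the list. */
static int toy_dequeue(struct toy_flow *flows, int *head)
{
	for (;;) {
		struct toy_flow *f = &flows[*head];

		if (f->deficit <= 0) {
			f->deficit += QUANTUM;
			*head = (*head + 1) % NFLOWS;
			continue;
		}
		f->deficit -= f->pkt_size;
		f->sent += f->pkt_size;
		return *head;
	}
}
```

Even with one flow sending 100-byte packets and the other 500-byte packets, both converge to the same byte share over many dequeues, because each flow nets exactly one quantum of credit per rotation.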

Could a bursty flow constantly go into decaying state or be removed and re-added as a new sparse flow, effectively ruining fairness/isolation? If that is happening then it also doesn't help that the deficit is reset for every "new" flow.

Yes, that's the idea. Each time a flow goes away it'll be 'sparse' the next time it comes around. This is a feature; and the decay makes sure that a flow can only do this if it uses less than its fair share of the link, so it doesn't ruin fairness.

But if that's the case, what's preventing such a bursty flow from starving a much more well-behaved and lower rate flow? That scenario would actually fit the behavior that I observed.

@tohojo
Collaborator

tohojo commented Nov 5, 2020 via email

@xnoreq
Author

xnoreq commented Nov 5, 2020

How would bulk_flow_count go above CAKE_QUEUES?

... You could have just pointed me to cake_hash and how it caps src/dsthost_bulk_flow_count by decrementing those counters in case of hash collisions with bulk queues.
Which also seems to imply that a host with 1x CAKE_QUEUES flows is not treated differently from a host with e.g. 5x CAKE_QUEUES flows. Though as cake seems to be designed for (powerful) home routers this probably doesn't matter.

Why would you combine ingress mode and ACK filtering? That makes no sense...

I don't, and I don't know. But it is partly implemented, so why is it in the code like that?

The ACK filter only drops pure ACKs; those are not GSO packets...

Of course, thanks.

Yes, they are.
The decay is the thing that prevents this, by delaying the removal of the flow state...

Are you talking about CAKE_SET_DECAYING here?
In cake_enqueue, such decaying flows are treated like new (sparse) flows and their deficit is reset. That's also something I was asking about earlier.
I see that when the flow is set to decaying, the goto begin restarts dequeuing, so another flow might get a packet submitted or dropped, but if a new packet arrives for the decaying flow in the meantime then it's again processed first. Since the deficit was reset the flow can now also send a full quantum of bytes again. Doesn't this give an unfair advantage to such flows?

@tohojo
Collaborator

tohojo commented Nov 6, 2020 via email

@xnoreq
Author

xnoreq commented Nov 10, 2020

Are you talking about CAKE_SET_DECAYING here? In cake_enqueue, such decaying flows are treated like new (sparse) flows and their deficit is reset. That's also something I was asking about earlier. I see that when the flow is set to decaying, the goto begin restarts dequeuing, so another flow might get a packet submitted or dropped, but if a new packet arrives for the decaying flow in the meantime then it's again processed first. Since the deficit was reset the flow can now also send a full quantum of bytes again. Doesn't this give an unfair advantage to such flows?

Yeah, this one is a bit subtle: The thing that prevents starvation is that the state is not cleared immediately; once a queue runs empty, it'll be put into the set of decaying flows, and stay there until its codel 'count' has gone back down to zero. So if another packet for that flow arrives before that happens, it'll keep the same deficit and won't be able to gain any advantage.

I understand what you're saying but I don't see this reflected in the code. Let me repeat:
When a new packet arrives for a decaying flow, it is treated almost exactly like a new flow.

  1. it's added/moved to the end of new_flows
  2. its deficit is reset

1 & 2 mean that in cake_dequeue it gets into the AQM part first, before any other old_flows. If the flow then sends a packet and keeps decaying the whole process repeats.
As I wrote before, after the flow is set to decaying, there is a goto that will result in another flow being processed, but this is potentially another such "new" flow

I took openwrt's sch_cake and made a couple of modifications to the ingress mode. I will probably upload this to a new repo after more testing on a home router. With the modifications I can run in besteffort and without any iptables rules (to classify the traffic into different tins), which reduces the CPU load (actually overload) to a small fraction. Plus a couple of other hacks.

One thing that is somewhat of a mystery to me is q->failsafe_next_packet. I see that this was introduced with ingress mode, to help in scenarios where a lot of packets are dropped?
Wouldn't this just advance time_next_packet a bit further into the future? Could you please explain in which situation this failsafe operates? Thanks.

@tohojo
Collaborator

tohojo commented Nov 10, 2020

I understand what you're saying but I don't see this reflected in the code. Let me repeat:
When a new packet arrives for a decaying flow, it is treated almost exactly like a new flow.

1. it's added/moved to the end of `new_flows`

2. its deficit is reset

1 & 2 mean that in cake_dequeue it gets into the AQM part first, before any other old_flows. If the flow then sends a packet and keeps decaying the whole process repeats.
As I wrote before, after the flow is set to decaying, there is a goto that will result in another flow being processed, but this is potentially another such "new" flow

Huh, I think it's actually worse than that: By assumption, a sparse flow doesn't trigger the AQM at all. So a small packet (size < quantum) will get enqueued, the flow will go to new_flows, and on dequeue that packet will get dequeued immediately, but the flow will stay at the head of new_flows. On the next dequeue, the flow will actually be empty, so cake_dequeue_one() will return NULL, and the logic will hit the else branch:

sch_cake/sch_cake.c

Lines 2263 to 2284 in 4897938

} else {
/* remove empty queue from the flowchain */
list_del_init(&flow->flowchain);
if (flow->set == CAKE_SET_SPARSE ||
flow->set == CAKE_SET_SPARSE_WAIT)
b->sparse_flow_count--;
else if (flow->set == CAKE_SET_BULK) {
b->bulk_flow_count--;
if (cake_dsrc(q->flow_mode))
srchost->srchost_bulk_flow_count--;
if (cake_ddst(q->flow_mode))
dsthost->dsthost_bulk_flow_count--;
} else
b->decaying_flow_count--;
flow->set = CAKE_SET_NONE;
}
goto begin;
}

This will take the flow straight back to empty, causing a new arriving packet to move it back to the front. So a sparse flow that times its packets right (to arrive just after the flow state expires), can potentially cause persistent starvation of the other flows.
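A toy model of that timing pattern (purely illustrative: it assumes an attacker who lands the next small packet exactly while each bulk packet is being transmitted, and it ignores the AQM and shaper entirely) shows the sparse flow getting served on every other dequeue:

```c
#include <assert.h>

/* one always-backlogged bulk flow vs. one "sparse" flow whose state is
 * cleared whenever its queue runs empty, and which re-enters at the
 * head of new_flows the moment its next packet arrives */
static void simulate(int dequeues, long *sparse_bytes, long *bulk_bytes)
{
	const int s = 100, n = 1514;	/* assumed packet sizes */
	int sparse_queued = 1;		/* attacker has a packet waiting */

	*sparse_bytes = 0;
	*bulk_bytes = 0;
	while (dequeues-- > 0) {
		if (sparse_queued) {
			/* flow sits on new_flows, so it is served first */
			*sparse_bytes += s;
			sparse_queued = 0;
		} else {
			/* sparse queue found empty: flow state cleared; a
			 * bulk packet is served, and during that
			 * transmission the attacker lands the next small
			 * packet, re-entering as a sparse flow */
			*bulk_bytes += n;
			sparse_queued = 1;
		}
	}
}
```

In this model the timed flow is serviced on every second dequeue regardless of how many bulk flows are backlogged, matching the s/n-per-bulk-packet share described in the thread.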

@chromi any reason why the above is wrong?

As for the decaying mode, I'm not so worried about that clearing the deficit on re-activation. I mean, it's a bit weird (why are we keeping all the other state but clearing the deficit?), but at least it won't cause starvation because that flow has already been through the old flows rotation at least once...

I took openwrt's sch_cake and did a couple of modifications to the ingress mode. I probably will upload this to a new repo after more testing on a home router. With the modifications I can run in besteffort and without any iptables rules (to classify the traffic into different tins) which significantly reduces CPU load, actually overload, to a small fraction. Plus a couple of other hacks.

Feel free to open pull requests with any changes you want to share.

One thing that is somewhat of a mystery to me is q->failsafe_next_packet. I see that this was introduced with ingress mode, to help in scenarios where a lot of packets are dropped?
Wouldn't this just advance time_next_packet a bit further into the future? Could you please explain in which situation this failsafe operates? Thanks.

I think this was added because of some potential starvation issues. @chromi ?

@xnoreq
Author

xnoreq commented Nov 10, 2020

@tohojo I think that's correct and that it's possible to construct malicious traffic patterns that cause total starvation, but I was just concerned with a practical case in my application where I saw a low bandwidth flow (e.g. from an online game with periodic updates in form of small UDP packets) dropping packets even though total bandwidth was not even remotely exhausted.

I do have some other hacks, like differentiation of bulk or sparse decaying flows, doing deficit accounting as soon as possible (in enqueue) in ingress mode, not resetting the deficit, etc., running on an openwrt router, but as I'm not doing any scientific measurements (I wouldn't know where to start) I cannot make concrete statements about the effects of any of those. All I can say is "works better for me on my home router". I'll maybe upload something later in the evening.

I guess that the deficit (counter) is cleared because that's what the DRR algorithm does when a queue is empty. Yeah, I know, that's another one of those "because" answers.

@chromi
Collaborator

chromi commented Nov 10, 2020 via email

@xnoreq
Author

xnoreq commented Nov 10, 2020

Regarding the next_packet timestamps in ingress mode:
As far as I understand the global time_next_packet is making a step proportional to the bandwidth used (by transmitting or dropping - it's the same in ingress mode). As long as something is queued up, it is not reset or synchronized to the current time. As this is used for scheduling, it should result in dequeuing at a rate that matches the configured global rate.

failsafe_next_packet is making a larger (1.5x) step, but only when packets are transmitted. So if very little or nothing is dropped for a while then failsafe_next_packet will be ahead of time_next_packet. Theoretically this advance could grow to hundreds of ms.

On the other hand, if packets are mostly dropped then time_next_packet will happily step forward, which would mean nothing would get dequeued (transmitted) for a potentially long timespan, but failsafe_next_packet will only make a single step on the final transmission in cake_dequeue.

As min(time_next_packet, failsafe_next_packet) is used for scheduling, this supposedly fixes the problem of nothing getting dequeued for a potentially long timespan.

The issue I see here is that if the situation changes from mostly transmitting for a while to dropping a lot of packets. If failsafe was ahead by a lot (let's say tens of ms) and we start dropping lots of packets then you are exactly in the same situation that this is supposed to fix. min(time_next_packet, failsafe_next_packet) could be tens of ms ahead.

Assuming my understanding of all of this is correct, why is failsafe_next_packet not simply set to now + some max permissible inactivity duration?
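A toy model of the two clocks as described above (illustrative only: the step sizes and the 1.5x factor are taken from this discussion, not from the exact kernel arithmetic) shows both behaviors: a run of drops advances only the main clock, so the failsafe bounds the stall, while a run of transmissions leaves the failsafe ahead, which is the concern raised above.

```c
#include <assert.h>

struct toy_shaper {
	long long time_next_packet;	/* advanced for every packet */
	long long failsafe_next_packet;	/* advanced only on transmission */
};

/* dur_ns is the serialisation time of the packet at the shaped rate;
 * in ingress mode the main clock advances whether the packet was
 * transmitted or dropped */
static void account_packet(struct toy_shaper *s, long long dur_ns,
			   int transmitted)
{
	s->time_next_packet += dur_ns;
	if (transmitted)
		s->failsafe_next_packet += dur_ns + dur_ns / 2; /* 1.5x */
}

/* the dequeue timer is armed at the earlier of the two clocks */
static long long next_wakeup(const struct toy_shaper *s)
{
	return s->time_next_packet < s->failsafe_next_packet ?
	       s->time_next_packet : s->failsafe_next_packet;
}
```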

@tohojo
Collaborator

tohojo commented Nov 10, 2020

If there is time for the state to expire between packets, then the flow is sparse, and it is appropriate to prioritise it. The scenario you describe would actually require multiple flows to each have their timing fall in a pretty small window of opportunity, if it exists at all.

I'll freely admit that the logic here is more complex and harder to follow than I'd like. However, it's in a state that works well in practice, and which would require a clean-slate rework to improve appreciably from this point without risking counter-intuitive effects. I would have to prototype that elsewhere in order to establish that a new design worked at least as well in a wide variety of conditions.

I think something like this patch would help on the starvation issue (which I don't think is quite as benign as you make it sound; there's a reason fq_codel safeguards against this):

diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index 7d37638ee1c7..f79ed99a5166 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -2129,11 +2129,17 @@ static struct sk_buff *cake_dequeue(struct Qdisc *sch)
                                        b->decaying_flow_count++;
                                }
                                flow->set = CAKE_SET_DECAYING;
+                       } else if (flow->set == CAKE_SET_SPARSE) {
+                               /* sparse flows need a round on the old_flows
+                                * list before removal, to avoid the possibility
+                                * of starvation
+                                */
+                               list_move_tail(&flow->flowchain, &b->old_flows);
+                               flow->set = CAKE_SET_SPARSE_WAIT;
                        } else {

Will test this and submit a proper patch :)

@xnoreq
Author

xnoreq commented Nov 10, 2020

@tohojo CAKE_SET_SPARSE_WAIT is used for "sparse flows in the bulk rotation". If a packet is enqueued for this flow after hitting your code then it will be categorized as a BULK flow. I'm not sure that's what you want for a sparse flow that potentially didn't even enter the bulk rotation.

@tohojo
Collaborator

tohojo commented Nov 10, 2020 via email

@chromi
Collaborator

chromi commented Nov 11, 2020 via email

@tohojo
Collaborator

tohojo commented Nov 11, 2020 via email

@xnoreq
Author

xnoreq commented Nov 11, 2020

My remark wasn't about moving the flow to the end of old_flows, that's fine. It was only about reusing CAKE_SET_SPARSE_WAIT which is not the right status. I think that's what Toke referred to as well.

It's simple to fix. Just add another state, and on enqueue add a case for this new state that just sets the state back to SPARSE. This case also needs to be added where empty queues are deleted.
I've added this to my hacks and it seems to work fine.

@xnoreq
Author

xnoreq commented Nov 11, 2020

Jonathan Morton [email protected] writes:
Whereas with the existing code you can use careful timing to get half of
the link bandwidth regardless of the number of flows, like so:

Assume N bulk flows all using quantum-size packets, constantly
backlogged. Flow S packet comes in, goes into the sparse rotation, is
immediately dequeued. On the next dequeue, the state for flow S is
cleared, and a bulk packet is dequeued and will start transmission.
While that packet is dequeued, flow S sends another packet, which will
be put to the head of the queue and dequeued next.

So with this, flow S can time its packets so that it is serviced on
every other dequeue, which will give it a bandwidth share of s/n where s
is the sparse flow packet size and n is the bulk flow packet size -
regardless of the number of bulk flows.

Yeah, as mentioned earlier I think it's also easy to starve old flows: a simple loop over some port or IP range, send a packet to each IP:port at a high enough rate. Repeat. That should do it.

@lynxthecat

Was the fix ever implemented?

BTW is there an easy way to see which packets CAKE actually drops? I'd love to be able to 'tcpdump' the dropped packets.

@tohojo
Collaborator

tohojo commented Aug 25, 2022

Was the fix ever implemented?

Hmm, no, guess not :/

BTW is there an easy way to see which packets CAKE actually drops? I'd love to be able to 'tcpdump' the dropped packets.

There's a kfree_skb trace point you can attach to. Theoretically that should allow you to inspect the packet contents as well, but you may have to write some BPF code to do this. I haven't played around with it much, but it looks like the skbtrace utility here might be able to do this (but not sure if it's feasible to get that to run on an openwrt router): https://github.com/yandex-cloud/skbtrace
