Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added min_tickets to the bulkheads #248

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open

Conversation

michaelkipper
Copy link

What

This PR adds a min_tickets argument to the Semian bulkheads.

Why

At Shopify, our Semian configuration for MySQL sets quota to 0.51 and has this comment:

# Calculate the number of tickets as a ratio of the number of
# Worker processes, rather than using a static count.
# This is the threshold where the bulkhead will activate.
# Note: We add 0.01 to make 2 workers get 2 tickets instead of 1,
# this should prevent overly sensitive bulkheads at small worker counts
# becuase the calculation is tickets = (quota * workers).ceil

In a multiple-outage scenario, this means that we can lose 100% of capacity for the duration of a single client timeout if those resources become unhealthy at roughly the same time (which was the case we saw in a recent service disruption).

We'd like to reduce that 51% to something else, without losing causing the bulkheads to become overly sensitive at small worker counts. At Shopify, our core application runs with 16 workers, so this is likely not an issue except at startup, but still needs a proper solution.

How

We compute the number of available tickets as the minimum of the number of workers, or the desired ticket count (which itself is the maximum of the computed ticket count and the min_tickets argument).

TODO: Documentation.

cc: @Shopify/servcomm

ext/semian/resource.c Outdated Show resolved Hide resolved
Copy link

@thegedge thegedge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main thought is on the need for min_tickets, especially if one is expecting a rather homogeneous setup (in terms of number of workers on a machine). In that case, couldn't one ensure that quota is set such that you wouldn't need a min_tickets?

For example, suppose we have 10 workers, quota set to 0.1, and min_tickets set to 2. Why would we use min_tickets over setting quota to 0.2?

[EDIT]
On the first read, I missed "small worker counts". Do we expect heterogenous setups (in terms of worker counts)? If so, similar question: why not ensure at least one ticket if the worker count is that low?

.gitignore Show resolved Hide resolved
ext/semian/tickets.c Outdated Show resolved Hide resolved
ext/semian/tickets.c Show resolved Hide resolved
ext/semian/util.h Outdated Show resolved Hide resolved

for id in $IPCS_Q; do
ipcrm -q $id;
done
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also tack on an xargs for all of the above:

ipcs -s | egrep "0x[0-9a-f]+ [0-9]+" | grep $ME | cut -f2 -d" " | xargs -n1 ipcrm -s
ipcs -m | egrep "0x[0-9a-f]+ [0-9]+" | grep $ME | cut -f2 -d" " | xargs -n1 ipcrm -m
ipcs -q | egrep "0x[0-9a-f]+ [0-9]+" | grep $ME | cut -f2 -d" " | xargs -n1 ipcrm -q

README.md Outdated Show resolved Hide resolved
@michaelkipper
Copy link
Author

[EDIT] On the first read, I missed "small worker counts". Do we expect heterogenous setups (in terms of worker counts)?

I don't think we can assume anything, but for Shopify Core, we expect pretty homogeneous setups at steady state (except a brief setup period).

If so, similar question: why not ensure at least one ticket if the worker count is that low?

Even in heterogenous setups, we still guarantee at least one ticket, because of the ceil. So even if workers is 1, and quota is 0.01, we'll still give a single ticket.

@thegedge: Did I understand your question correctly?

@thegedge
Copy link

thegedge commented Jul 4, 2019

@thegedge: Did I understand your question correctly?

Yep, so I guess my question is clarification on why we need min_tickets? If I have a small worker count, I'm pretty much guaranteed to have 1 ticket. If I have a larger number of workers, I'm likely going to be sufficiently close to my quota, or can choose a quota value that gets the desired effect.

@michaelkipper
Copy link
Author

This was driven by the comment referenced in the description, which was authored by @dalehamel in https://github.com/Shopify/shopify/commit/35045a4ab3833c43d36f77b1aac4aadbba152646. Maybe he can chime in? Since it's 2+ years old, it predates our move to the cloud, so it might be irrelevant now.

@dalehamel
Copy link
Member

@michaelkipper a little backstory

The quota functionality was actually added because of our move to kubernetes (and thus the cloud). In our data center, we could reliably and predictably determine how many workers would be allocated per host, so tickets being configured statically made sense.

With kubernetes making the orchestration decisions, we wouldn't be able to effectively use static tickets anymore. Eg, if we set 10 tickets for a resource per host before, but in the new model end up with only one worker, it will never actually trip a bulkhead. Likewise, if we ended up with only 10 tickets and 60 workers, there might be too much contention.

The move to a quota of 0.5 is something i've pondered a lot. I figured that it must get out of date at some point, as we continue to scale, our bulkheads become less and less effective. I'm not convinced that the quota strategy is a sustainable one.

You're correct that we always need at least one ticket, in which case the bulkhead is essentially ineffective as there is only one worker, it should always be able to get a ticket. As to the actual math to tune the quota ratio, I imagine we'd want to do it as some ratio of the number of workers to the resource globally across the whole fleet, and with any luck the gods of statistics will favor us, and the workloads will be distributed in an effective way across the fleet.

I suspect that a quota that is near 0.5 will probably be ineffective, we likely want something much lower in many cases now, as our scale has increased by some number of factors since it was originally set. At the time, the 0.5 quota wasn't too dissimilar from many of the static ticket counts we had set.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants