-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
circuit: introduce half_open_resource_timeout #188
Conversation
28d7c3c
to
ce2808f
Compare
lib/semian/circuit_breaker.rb
Outdated
@@ -122,5 +123,19 @@ def log_state_transition(new_state) | |||
def disabled? | |||
ENV['SEMIAN_CIRCUIT_BREAKER_DISABLED'] || ENV['SEMIAN_DISABLED'] | |||
end | |||
|
|||
def maybe_with_half_open_resource_timeout(resource) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the coupling I think is ugly; passing in the resource and figuring out whether it quacks correctly here. When I do a 2nd iteration here with comments, I'll likely refactor to pass the raw resource into the initializer of the bulkhead and circuit—which seems a bit more clean.
ce2808f
to
500c47a
Compare
Ugh... for some reason this didn't surface in local tests, and only fails in one place on CI, but sometimes it won't allow reconfiguring the timeouts for an established client 😖 Found this... brianmario/mysql2#317 does not look promising... Kir worked on this in brianmario/mysql2#955, I'll talk to him and see if we can get this in. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea, and I think it makes sense to only give a small timeout when going into half-open to check if the resource is properly recovered.
However I'm afraid that queries that can't possibly be executed with this reduced timeout might cause the CB to stay open longer than it should.
lib/semian/mysql2.rb
Outdated
@@ -83,6 +83,23 @@ def query(*args) | |||
end | |||
end | |||
|
|||
def with_resource_timeout(temp_timeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we support different timeouts for read / write, or is it assumed this timeout is simply for the recovery phase so there is no need for it to be fine grained?
lib/semian/circuit_breaker.rb
Outdated
@@ -27,7 +28,7 @@ def acquire | |||
|
|||
result = nil | |||
begin | |||
result = yield | |||
result = maybe_with_half_open_resource_timeout(resource) { yield } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/nitpick, pass &block
to safe one stack frame and it's nicer.
I read the MySQL2 source, and we can get the This should be good for final review. |
8d7afb7
to
4834280
Compare
@@ -34,7 +34,7 @@ def test_semian_can_be_disabled | |||
end | |||
|
|||
def test_connection_errors_open_the_circuit | |||
@proxy.downstream(:latency, latency: 1200).apply do | |||
@proxy.downstream(:latency, latency: 2200).apply do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious why you had to change this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I upped the default timeout to 2s, to show that it changes to 1s during the half open state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementation is 👍 to me.
Still unsure about the actual effect in prod, but I'm sure you'll look at those closely so 👍 as well.
lib/semian/mysql2.rb
Outdated
yield | ||
ensure | ||
@read_timeout = prev_read_timeout | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general this pattern makes me a bit uncomfortable when expressed like this. If someone were to add a line of code to the top of this method that might raise, the ensure
could set a nil value. An alternative that more clearly maps to the intention is:
def
prev = current
begin
current = temp
# things
ensure
current = prev
end
end
4834280
to
0da0d7b
Compare
@sirupsen have you looked into how hard it would be to make other adapters (e.g. Redis) to support the same |
@kirs pretty easy, I just decided to prioritize MySQL here because those resources don't generally pose larger infra concerns in terms of occupying capacity because of their lower timeouts. For MySQL, it wouldn't be crazy to have this 25-50x as small. For something like Redis, you want to never really allow any query to take more than ~250ms. You're not going to get an order of magnitude there, but it might be worth it eventually. |
When a circuit is busy (for those at Shopify, see https://github.com/Shopify/shopify/pull/164168) you may reduce a substantial amount of capacity if you have large timeouts (see #145). One way to circumvent this is to have very large
error_threshold
timeouts, as a multiple of your timeout. E.g. if your resource timeout is 30s, then an error_timeout of 30s will make sure you have 50% of capacity available, and 60s will make sure you have 25% available (assuming that the circuits trigger equally over the timeout period, which is not an unreasonable assumption in production). However, this has obvious drawbacks, making recovering from incidents take longer for the affected resource.Instead, we can reduce the resource timeout in the
half_open?
state to achieve something similar. Normally, timeouts are to allow outlier queries to go through—in the face of an incident, we can eat that (we'd want to kill those anyway) and reduce the timeout dramatically whilehalf_open?
on the circuit.The ugly thing about this, is that it couples the circuit with the resource. However, I think it makes more sense for the circuit to know about the resource, than for the resource to know about the circuit—that's why I chose to implement it this way.
This will need to be implemented per adapter, by implementing
with_resource_timeout
.cc @Shopify/servcomm