-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Better handling of problematic sets #119
Conversation
Previously, things like an out of memory error may mean that there is an inconsistent state in L1 after a set operation. On any kind of error during a set operation a delete will be sent to L1 afterwards and the operation will succeed even though L1 had an error.
Did some more local testing on this in the mean time, got some good results for the set operation in l1l2:
|
batch handler looks good too:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When will you make the same changes for Add?
if err == common.ErrKeyNotFound { | ||
metrics.IncCounter(MetricCmdSetL1ErrorDeleteMissesL1) | ||
} else if err != nil { | ||
metrics.IncCounter(MetricCmdSetL1ErrorDeleteErrorsL1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this code branch return err?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it depends on how optimistic we want to be. We could ignore any error here and then wait for the next request to fail, or we can fail requests when e.g. the connection has been severed (memcached has crashed). In the first case, we are returning success for an L2 success regardless of what happens in L1. If the latter case, we fail a request if L1 has catastrophically failed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So right now, we're returning success, but really don't know what's going to happen on the next request.
I'm not sure what the type of errors we can expect on Delete operation. If it's only memcached crash/disconnect, I think returning error here should be safe.
@@ -615,7 +636,33 @@ func (l *L1L2Orca) Get(req common.GetRequest) error { | |||
|
|||
if err != nil { | |||
metrics.IncCounter(MetricCmdGetSetErrorsL1) | |||
return err | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change in Get may not be necessary since a Set failure has this consequences, which seem acceptable: (1) no record in memcached, (2) new record actually there, (3) some other thread concurrently had set the same key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It has a consequence because there may be inconsistency between L1 and L2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not returning an error is the right thing.
However, the Delete operation may not be necessary. The Get operation never modified L2. And presumably L1 get failed because it wasn't there. Inconsistency would only happen if there was some error that would cause the Set (after Get) to fail but in reality succeeded, while another concurrent modify operation had already completed. We don't handle the concurrent modify operation condition anyway, so this really only reacts to a more rare condition.
All this makes me wonder if Add operation should have been used instead of Set after the Get Miss.
if err == common.ErrKeyNotFound { | ||
metrics.IncCounter(MetricsCmdSetReplaceL1ErrorDeleteMissesL1) | ||
} else if err != nil { | ||
metrics.IncCounter(MetricsCmdSetReplaceL1ErrorDeleteErrorsL1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this code branch return err?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same answer as above
Previously, things like an out of memory error may mean that there is an inconsistent state in L1 after a set operation. On any kind of error during a set operation a delete will be sent to L1 afterwards and the operation will succeed even though L1 had an error.
This PR is the beginning of fixing #102