Better handling of problematic sets #119

ScottMansfield · 2017-03-29T23:57:50Z

Previously, an error such as out-of-memory during a set operation could leave L1 in an inconsistent state. With this change, a delete is sent to L1 after any kind of error during a set operation, and the operation succeeds even though L1 had an error.

This PR is the beginning of fixing #102
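A minimal Go sketch of the behavior described above, for illustration only. `Store`, `memStore`, and `setL1L2` are hypothetical names and not rend's handler API, and the L2-then-L1 write order is an assumption of this sketch.

```go
package main

import "errors"

// ErrKeyNotFound plays the role that common.ErrKeyNotFound plays in the diff.
// It is redefined here only to keep the sketch self-contained.
var ErrKeyNotFound = errors.New("key not found")

// Store is a hypothetical stand-in for an L1 or L2 backend.
type Store interface {
	Set(key string, value []byte) error
	Delete(key string) error
}

// memStore is an in-memory Store that can be told to fail every Set,
// which imitates an out-of-memory error from the backend.
type memStore struct {
	data    map[string][]byte
	failSet bool
}

func (m *memStore) Set(key string, value []byte) error {
	if m.failSet {
		return errors.New("out of memory")
	}
	m.data[key] = value
	return nil
}

func (m *memStore) Delete(key string) error {
	if _, ok := m.data[key]; !ok {
		return ErrKeyNotFound
	}
	delete(m.data, key)
	return nil
}

// setL1L2 writes to L2 first and then to L1 (assumed order). If the L1 write
// fails, it sends a delete to L1 so that no stale value is left behind, and
// it still reports success because L2 holds the new value.
func setL1L2(l1, l2 Store, key string, value []byte) error {
	if err := l2.Set(key, value); err != nil {
		return err
	}
	if err := l1.Set(key, value); err != nil {
		// Best-effort cleanup; the delete result is deliberately ignored here.
		_ = l1.Delete(key)
	}
	return nil
}
```

With an L1 that already holds an old value and then fails its Set, the call returns nil, the old L1 entry is gone, and L2 holds the new value.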


ScottMansfield · 2017-03-30T18:10:05Z

Did some more local testing on this in the meantime and got some good results for the set operation in l1l2:

rend_cmd_set_errors_l1                          42046
rend_cmd_set_errors_l2                          0
rend_cmd_set_errors_oom_l1                      0
rend_cmd_set_errors_oom_l2                      0
rend_cmd_set_errors_oom                         0
rend_cmd_set_errors                             0
rend_cmd_set_l1_error_delete_errors_l1          0
rend_cmd_set_l1_error_delete_hits_l1            3
rend_cmd_set_l1_error_delete_l1                 42046
rend_cmd_set_l1_error_delete_misses_l1          42043
rend_cmd_set_l1                                 597973
rend_cmd_set_l2                                 598053
rend_cmd_set_replace_errors_l1                  0
rend_cmd_set_replace_l1_error_delete_errors_l1  0
rend_cmd_set_replace_l1_error_delete_hits_l1    0
rend_cmd_set_replace_l1_error_delete_l1         0
rend_cmd_set_replace_l1_error_delete_misses_l1  0
rend_cmd_set_replace_l1                         0
rend_cmd_set_replace_not_stored_l1              0
rend_cmd_set_replace_stored_l1                  0
rend_cmd_set_success_l1                         597961
rend_cmd_set_success_l2                         597973
rend_cmd_set_success                            597961
rend_cmd_set                                    598053

ScottMansfield · 2017-03-30T18:11:53Z

batch handler looks good too:

rend_cmd_set_errors_l1                          0
rend_cmd_set_errors_l2                          0
rend_cmd_set_errors_oom_l1                      0
rend_cmd_set_errors_oom_l2                      0
rend_cmd_set_errors_oom                         0
rend_cmd_set_errors                             0
rend_cmd_set_l1_error_delete_errors_l1          0
rend_cmd_set_l1_error_delete_hits_l1            0
rend_cmd_set_l1_error_delete_l1                 0
rend_cmd_set_l1_error_delete_misses_l1          0
rend_cmd_set_l1                                 0
rend_cmd_set_l2                                 343217
rend_cmd_set_replace_errors_l1                  1597
rend_cmd_set_replace_l1_error_delete_errors_l1  0
rend_cmd_set_replace_l1_error_delete_hits_l1    0
rend_cmd_set_replace_l1_error_delete_l1         1597
rend_cmd_set_replace_l1_error_delete_misses_l1  1597
rend_cmd_set_replace_l1                         343186
rend_cmd_set_replace_not_stored_l1              341589
rend_cmd_set_replace_stored_l1                  0
rend_cmd_set_success_l1                         0
rend_cmd_set_success_l2                         343186
rend_cmd_set_success                            343186
rend_cmd_set                                    343217
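As a quick sanity check on the two counter dumps above, the cleanup-delete totals can be compared against their hit, miss, and error breakdowns. This sketch assumes that every cleanup delete lands in exactly one of those three counters; `deleteCounters` and `balanced` are hypothetical names, and the numbers are copied from the dumps.

```go
package main

// deleteCounters holds the cleanup-delete counters from one test run.
type deleteCounters struct {
	total, hits, misses, errs int
}

// balanced reports whether hits + misses + errors account for every delete.
func (d deleteCounters) balanced() bool {
	return d.hits+d.misses+d.errs == d.total
}

// Values from the l1l2 run (rend_cmd_set_l1_error_delete_*).
var l1l2Run = deleteCounters{total: 42046, hits: 3, misses: 42043, errs: 0}

// Values from the batch run (rend_cmd_set_replace_l1_error_delete_*).
var batchRun = deleteCounters{total: 1597, hits: 0, misses: 1597, errs: 0}
```

Both runs balance: 3 + 42043 + 0 = 42046 and 0 + 1597 + 0 = 1597.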

vuzilla

When will you make the same changes for Add?

vuzilla · 2017-03-30T20:12:16Z

orcas/l1l2.go

+		if err == common.ErrKeyNotFound {
+			metrics.IncCounter(MetricCmdSetL1ErrorDeleteMissesL1)
+		} else if err != nil {
+			metrics.IncCounter(MetricCmdSetL1ErrorDeleteErrorsL1)


Should this code branch return err?

It depends on how optimistic we want to be. We could ignore any error here and wait for the next request to fail, or we could fail requests when, e.g., the connection has been severed (memcached has crashed). In the first case, we return success for an L2 success regardless of what happens in L1. In the latter case, we fail a request if L1 has catastrophically failed.

So right now, we're returning success, but really don't know what's going to happen on the next request.

I'm not sure what types of errors we can expect from a Delete operation. If it's only a memcached crash or disconnect, I think returning an error here should be safe.

vuzilla · 2017-03-30T20:19:26Z

orcas/l1l2.go

@@ -615,7 +636,33 @@ func (l *L1L2Orca) Get(req common.GetRequest) error {

 					if err != nil {
 						metrics.IncCounter(MetricCmdGetSetErrorsL1)
-						return err
+


The change in Get may not be necessary, since a Set failure has these possible consequences, all of which seem acceptable: (1) no record in memcached, (2) the new record is actually there, (3) some other thread concurrently set the same key.

It does have a consequence, because there may be an inconsistency between L1 and L2.

Not returning an error is the right thing.

However, the Delete operation may not be necessary. The Get operation never modifies L2, and presumably the L1 get failed because the key wasn't there. Inconsistency would only happen if some error caused the Set (after the Get) to report failure when it had in fact succeeded, while another concurrent modify operation had already completed. We don't handle the concurrent-modify condition anyway, so this really only reacts to a rarer condition.

All this makes me wonder if an Add operation should have been used instead of Set after the Get miss.
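To illustrate the Add-versus-Set point, here is a small sketch of a backfill after an L1 miss. It assumes memcached-style `add` semantics (store only if the key is absent); `cache`, `ErrKeyExists`, and `backfill` are hypothetical names and not rend code.

```go
package main

import "errors"

// ErrKeyExists is returned by Add when the key is already present.
var ErrKeyExists = errors.New("key exists")

// cache is a toy in-memory stand-in for L1.
type cache struct {
	data map[string]string
}

// Set stores the value unconditionally.
func (c *cache) Set(key, value string) {
	c.data[key] = value
}

// Add stores the value only if the key is absent.
func (c *cache) Add(key, value string) error {
	if _, ok := c.data[key]; ok {
		return ErrKeyExists
	}
	c.data[key] = value
	return nil
}

// backfill copies a value read from L2 into L1 after an L1 miss. Because it
// uses Add, a value written to L1 by a concurrent request between the miss
// and the backfill is kept rather than overwritten. It reports whether the
// L2 value was stored.
func backfill(l1 *cache, key, fromL2 string) bool {
	err := l1.Add(key, fromL2)
	return !errors.Is(err, ErrKeyExists)
}
```

If another writer sets the key in L1 before the backfill runs, `backfill` returns false and the newer value stays in place. With Set, the older L2 value would overwrite it.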

vuzilla · 2017-03-30T20:22:38Z

orcas/l1l2batch.go

+			if err == common.ErrKeyNotFound {
+				metrics.IncCounter(MetricsCmdSetReplaceL1ErrorDeleteMissesL1)
+			} else if err != nil {
+				metrics.IncCounter(MetricsCmdSetReplaceL1ErrorDeleteErrorsL1)


Should this code branch return err?

same answer as above

ScottMansfield self-assigned this Mar 29, 2017

ScottMansfield requested review from smadappa, vuzilla and senugula March 29, 2017 23:57

ScottMansfield merged commit b275c03 into master Mar 30, 2017

vuzilla reviewed Mar 30, 2017
