Fault injection: get() incorrectly returns not_found #107
Comments
This one is a bit more interesting. It still has a close & open cycle in the middle (as well as a failed open attempt), but it only uses a single key.
The diagnostic output:
I've seen over a dozen variations of this failure. (I have yet to have a 5 minute run succeed.) Each failing case has a close-and-reopen cycle in the middle.
If you are failing reads, it seems to make sense that <<"k">> is sometimes not found. What am I missing?
The trace shows the order of operations: open, put <<"k">>, close, open, close, open (fails), open, get <<"k">> -> not_found. That last operation is definitely bogus. The fault injection isn't causing pread(2) to return invalid data: the fault injection is causing it to return -1 plus a valid errno value. I don't believe that eleveldb should be returning {ok, ...} in such a case.
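To make the -1-plus-errno point concrete, here is a minimal sketch of a read path that keeps "I/O error" separate from "not found". It is not eleveldb or leveldb code: `ReadOutcome`, `ReadResult`, and `read_record` are hypothetical names invented for illustration, and the only real call used is POSIX pread(2).

```cpp
#include <cerrno>
#include <cstddef>
#include <string>
#include <unistd.h>

// Hypothetical result type: a lookup has three outcomes, not two.
enum class ReadOutcome { kFound, kNotFound, kIoError };

struct ReadResult {
    ReadOutcome outcome;
    int saved_errno;   // meaningful only when outcome == kIoError
    std::string data;  // meaningful only when outcome == kFound
};

// Read up to `len` bytes at `offset` from `fd`.
// A pread(2) failure (-1 plus errno) is reported as kIoError and is
// never folded into kNotFound.
ReadResult read_record(int fd, off_t offset, std::size_t len) {
    std::string buf(len, '\0');
    ssize_t n = pread(fd, &buf[0], len, offset);
    if (n < 0) {
        return {ReadOutcome::kIoError, errno, std::string()};
    }
    if (n == 0) {
        // End of file: nothing is stored at this offset.
        return {ReadOutcome::kNotFound, 0, std::string()};
    }
    buf.resize(static_cast<std::size_t>(n));
    return {ReadOutcome::kFound, 0, buf};
}
```

With a three-way result like this, a caller can only answer not_found when the read actually succeeded and found nothing; an injected failure has to surface as an error.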
Hi. I have another case. I'm not sure if it's different or not. The case frequently passes, which is annoying. But it might have something to do with compaction, so I'm wondering if I might open another ticket -- if I run the test case 25 times in a row, it will always fail before the end of the 25th run.
The put_filler step is writing 3,085 keys with a prefix of <<137,112>> and value blobs of 50,170 bytes each (for approximately 150 MBytes of data).
It's also worth noting that the case that I mention from approx. 14 hours ago does not have close & reopen steps in the middle.
I don't know if there's one or more separate bug/trigger/thingies in here, but FWIW I observed a failure last night where
Here's more background info on the original failing test case. I commented out the The failing test case triggers one message to
@slfritchie My theory of the moment is that this is "functions as designed". Set "paranoid_checks" to true in eleveldb … likely advanced.config in your Riak 2.0 environment. Instead of the key being not found, the open operation will abort … forever. If this expected behavior is confirmed, then we do not want to change the code.
Hrm, well, there's no Riak involved in this case. IMHO a series of API-says-it-was-successful operations followed by another API-says-it-was-successful operation that doesn't return the expected value is a data loss bug. If
Current status: still seeing eleveldb commit: 35f681e of branch To recreate, in two cut-and-paste steps. Tested on OS X 10.8.5 and Ubuntu 12.04.2 LTS (GNU/Linux 3.2.0-54-generic x86_64). First:
And second. This one fails for OS X (the Ubuntu case for part 2 is below):
The Ubuntu 12 Linux part 2 is:
Here's an easier-to-read-for-humans formatting of the test case and failing diagnostic output:
I still have neither built a leveldb unit test reproducer nor taken time to set up eqc. That said, paranoid_checks=true and the syslog message point to Recover() log processing eating the error. Specifically, the failed ftruncate call did not send an error return. Also, there is the possibility that even if the error were returned, it would be overwritten. Branch mv-tuning7 in basho/leveldb contains the changes (and a throttle change in version_set.cc that is unrelated).
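For readers following along, the two failure modes described here (an unchecked ftruncate return, and an error that gets overwritten by a later step) can be sketched roughly as follows. This is not the code from mv-tuning7 or from leveldb's Recover(): `SketchStatus`, `TruncateFile`, and `RecoverSketch` are hypothetical names, and the "later step" is a stand-in that always succeeds.

```cpp
#include <cerrno>
#include <unistd.h>

// Hypothetical minimal status: 0 means OK, otherwise a saved errno value.
struct SketchStatus {
    int err = 0;
    bool ok() const { return err == 0; }
    // Keep the first error; a later result never replaces it.
    void Update(const SketchStatus& later) {
        if (ok()) err = later.err;
    }
};

// Wrap ftruncate(2) so that a -1 return becomes a non-OK status
// instead of being silently dropped.
SketchStatus TruncateFile(int fd, off_t size) {
    SketchStatus s;
    if (ftruncate(fd, size) != 0) s.err = errno;
    return s;
}

// Hypothetical recovery sequence: a truncate that may fail, followed by
// a later step that succeeds. The combined status still reports the
// truncate failure because Update() never overwrites an existing error.
SketchStatus RecoverSketch(int fd) {
    SketchStatus result;
    result.Update(TruncateFile(fd, 0));
    result.Update(SketchStatus());  // stand-in for a later, successful step
    return result;
}
```

The design choice is simply "first error wins": assigning each step's status directly to one variable would let the final successful step hide the earlier failure.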
I shouldn't be asking until tomorrow morning, but I'll ask now anyway ... why isn't
No code review. And there may be other minor bugs to fix. "tuning" branches are just that, minor stuff.
@matthewvon Nope, sorry,
Update: Source drift/version control is a pain, but I think I put all of the moving parts in the proper place. Today's
Moving to 2.0.1.
Fault injection parameters:
Failing test case:
The successful fault injection triggers are:
Test case trace of results:
All of the DB mutations have status yes, meaning that the eleveldb API responded affirmatively to each call: i.e., there were no errors returned by the API.
This is probably not a truly minimal failing test case: because of how the random number generator is used, a certain number of commands needed to happen "in the middle" before a lucky 7% event happened at a crucial time. The two close & open cycles in the middle of the test case might be truly necessary, but they might not be.
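As a rough illustration of why the surrounding commands matter, here is a sketch of a probabilistic fault trigger driven by a seeded random number generator. It is not how faulterl is implemented; `FaultInjector` and `CallsUntilFirstFault` are hypothetical names, and the 7% figure is taken from the description above as an assumed per-call probability.

```cpp
#include <cstdint>
#include <random>

// Hypothetical fault injector: each intercepted call fails with a fixed
// percent chance, driven by a seeded generator so runs are repeatable.
class FaultInjector {
public:
    FaultInjector(std::uint32_t seed, unsigned percent)
        : rng_(seed), percent_(percent) {}

    // Returns true when this call should be made to fail.
    // Every call consumes one random number, whether or not it fails.
    bool ShouldFail() { return rng_() % 100 < percent_; }

private:
    std::mt19937 rng_;
    unsigned percent_;
};

// Count how many calls are made up to and including the first injected
// fault; returns 0 if no fault fires within max_calls.
unsigned CallsUntilFirstFault(FaultInjector& inj, unsigned max_calls) {
    for (unsigned i = 1; i <= max_calls; ++i) {
        if (inj.ShouldFail()) return i;
    }
    return 0;
}
```

Because every intercepted call advances the generator, removing a command from the middle of a test case shifts which later call receives the fault. That is one plausible reason a shrunk test case can keep steps that look unnecessary.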
To reproduce, cut-and-paste to execute these commands. Tested with OS X. It ought to work with Linux, but honestly I haven't tested the faulterl library very much on Linux.
You now have an Erlang shell running with a fault-injectable VM. Cut-and-paste to execute the following three commands.