Quick succession of saving script leads to broken history state #19
I've been looking into the code in inferno/inferno-vc/src/Inferno/VersionControl/Operations.hs, lines 145 to 166 at 552fd3f.

Let's say that object …

There are still a few potential issues with the current code, however:

- We don't catch the IO error thrown by …
- Let's say we have two successive but concurrent calls to store, …

(cc @goodlyrottenapple @smurphy8)
This PR uses a read-write lock to protect the concurrent file IO performed by inferno-vc-server. This should hopefully prevent any concurrency bugs caused by two people trying to save/update/delete a script at the same time (#19). The writeup below describes the current implementation, its issues, and some proposed solutions. This PR corresponds to Option 0.

----

## Inferno Version Control Store

A simple version control for scripts:

```
store(object, predecessor) -> hash
fetch(hash) -> object
fetchHist(hash) -> [object]
del(hash)
```

## Current Implementation

Data structure: uses the file system. Objects are stored in the root directory with their hash as filenames, the `heads/` directory contains one file for each script history named by the hash of the head of that history, and the `to_head/` directory maps every script to the head of the history it belongs to:

```
/vc_store/
├── 2ZDR-_h3LhqkDsk3qIOx7gobNrAvwbSA67aakIQFabI=
├── Bu1zONJHnlYCqqACpMeJ4pac_V2jmMOYISqyAvaSoBs=
├── FqBr0k0PQ33qskv9c6z9IsKE8jbTkgMyHx_iqyJFXLc=
├── IlRYIOwovxwbHWor54PHrYWQL_pZVWCyFGw_lmvAGjI=
├── heads
│   ├── Bu1zONJHnlYCqqACpMeJ4pac_V2jmMOYISqyAvaSoBs
│   ├── IlRYIOwovxwbHWor54PHrYWQL_pZVWCyFGw_lmvAGjI=
└── to_head
    ├── 2ZDR-_h3LhqkDsk3qIOx7gobNrAvwbSA67aakIQFabI=
    ├── Bu1zONJHnlYCqqACpMeJ4pac_V2jmMOYISqyAvaSoBs=
    ├── FqBr0k0PQ33qskv9c6z9IsKE8jbTkgMyHx_iqyJFXLc=
    ├── IlRYIOwovxwbHWor54PHrYWQL_pZVWCyFGw_lmvAGjI=
```

Store method:

```python
def store(o, h_p):
    h_o = writeToStore(o)
    f_p = f'vc_store/heads/{h_p}'
    if fileExists(f_p):
        f_o = f'vc_store/heads/{h_o}'
        renameFile(f_p, f_o)
        appendFile(f_o, h_p)
        # Update to_head pointers:
        for h in readFile(f_o):
            writeFile(f'vc_store/to_head/{h}', h_o)
```

The `to_head` mappings are used by the `fetchHist(h)` operation, which uses the mapping to find the head of the given object and then reads the object's history from the corresponding file in the `heads/` directory.

## Issues with the current implementation

1. The `renameFile` and `appendFile` calls are not in an atomic block. By the time one operation tries to append to the file, the next store operation could already have renamed it to something else, so the first operation's predecessor would be lost from the history.
2. The update of the `to_head` pointers of one operation `store(o2, o1)` can race with the successive operation `store(o3, o2)`. If the latter overtakes the former, some objects in the history end up incorrectly pointing to `o2` as their head instead of `o3`.
3. Crash safety: a crash, for example between `renameFile` and `appendFile`, will leave the store in an inconsistent state.

## Option 0: slap a lock around all operations

- Pros: easy fix, no need to migrate the current store
- Cons: inefficient
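To make Option 0 (the approach this PR takes) concrete, here is a minimal sketch using `Control.Concurrent.ReadWriteLock` from the `concurrent-extra` package. The module and wrapper names are hypothetical illustrations, not the actual code in this PR:

```haskell
-- A minimal sketch of Option 0: a single read-write lock guarding all file
-- IO in the store.  Mutating operations (store/delete) take the write lock;
-- read-only operations (fetch/fetchHist) share the read lock.  The wrapped
-- IO actions stand in for the real operations in Operations.hs.
module LockedStore (LockedStore, newLockedStore, withStoreRead, withStoreWrite) where

import qualified Control.Concurrent.ReadWriteLock as RWL

newtype LockedStore = LockedStore RWL.RWLock

newLockedStore :: IO LockedStore
newLockedStore = LockedStore <$> RWL.new

-- Exclusive access for store/delete: no readers or other writers run
-- while heads/ and to_head/ are being mutated.
withStoreWrite :: LockedStore -> IO a -> IO a
withStoreWrite (LockedStore l) = RWL.withWrite l

-- Shared access for fetch/fetchHist: any number of readers may overlap,
-- but none runs concurrently with a writer.
withStoreRead :: LockedStore -> IO a -> IO a
withStoreRead (LockedStore l) = RWL.withRead l
```

Every `store`/`del` call in the server would then be wrapped in `withStoreWrite` and every `fetch`/`fetchHist` in `withStoreRead`, so mutations of `heads/` and `to_head/` never interleave with each other or with reads.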
## Option 1: fix the current implementation

- Idea: all versions of a script share a unique `histID`. Instead of `to_head`, you have a meta field `to_hist` in the object file, as this is stable.
- Maintain a `hist_to_head` map in memory that maps each `histID` to the hash of its head, and use [Control.Concurrent.Lock](https://hackage.haskell.org/package/concurrent-extra-0.7.0.12/docs/Control-Concurrent-Lock.html) or an MVar to update this map atomically.
- `store(o, p)` writes a new head file for `o`, but retries until it can successfully update `hist_to_head`.
- `fetchHist(h)` looks up the `histID` from the meta file, finds the head from `hist_to_head`, and returns the history saved in the appropriate head file.
- `hist_to_head` can be periodically saved to file; on startup the last snapshot can be loaded and updated if necessary (it is easy to reconstruct from all object metas).
- Pros: small migration from the current store; most operations don't need the lock, so it is relatively efficient.
- Cons: need to carefully check/prove concurrent correctness; needs migration.
- What happens if there is a crash and there are 2 concurrent new HEADs? Recovery can pick an arbitrary one, or discard both.
- Use a checksum to detect a crash. Remember a crash can happen while snapshotting! (How does VPDB deal with this?)

```python
def store(o, h_p):
    hist_id = readStore(h_p)['hist_id']
    h_o = writeToStore({**o, 'hist_id': hist_id})
    if hist_to_head[hist_id] == h_p:
        f_p = f'vc_store/heads/{h_p}'
        f_o = writeFile(f'vc_store/heads/{h_o}', readFile(f_p) + h_p)
        if CAS(hist_to_head[hist_id], h_p, h_o):
            return Success
    return TriedToAppendToNonHeadError
```

<!-- NOTE: old heads need to be cleaned up. getAllHeads should use `hist_to_head` not files in `heads/`. -->

## Option 2: use an existing file-backed DB

- For example: have a DB table with columns `hash, pred, isHead, object, histID`.
- Store adds a row to the table and unsets `isHead` of the predecessor (atomically).
- The index can be `hash`, so fetch is fast.
- All versions of a script share a unique `histID`, so `fetchHist(h)` looks up the `histID` of `h` and fetches all rows with that `histID`.
- The DB is periodically [backed up](https://www.postgresql.org/docs/current/continuous-archiving.html) to file.
- Pro: concurrent correctness and efficiency guaranteed by the DB
- Con: making all operations atomic might be slower than a correct implementation of Option 1
- Con: needs migration of the store

## Option 3: STM

- Can use an in-memory [STM data structure](https://web.mit.edu/price/a/2007/haskell/stm-lock-free-data-structures.pdf) and make each operation atomic (a rough sketch follows at the end of this writeup).
- Like VPDB, have a thread save a snapshot of the structure to file periodically.
- Pro: STM ensures no concurrency bugs
- Con: the entire store needs to fit in memory
- Con: needs migration of the store

## Testing

Are there libraries to test for such concurrency bugs? For example, something that runs a random sequence of operations in a single thread and in a concurrent setting and compares the outputs.
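As referenced under Option 3, here is a rough Haskell sketch of the STM approach. The types below (plain `String` hashes and objects, one `TVar` per map, a caller-supplied object hash) are illustrative assumptions, not the real inferno-vc types:

```haskell
{-# LANGUAGE NamedFieldPuns #-}
-- A rough sketch of Option 3: keep the whole store in memory behind STM so
-- that every operation is a single atomic transaction.
module StmStore where

import Control.Concurrent.STM
import qualified Data.Map.Strict as Map

type Hash   = String
type Object = String

data VCStore = VCStore
  { objects :: TVar (Map.Map Hash Object)  -- hash   -> object
  , heads   :: TVar (Map.Map Hash [Hash])  -- head   -> its predecessors, newest first
  , toHead  :: TVar (Map.Map Hash Hash)    -- object -> head of its history
  }

-- Write the object, move the head, and repoint the whole history in one
-- transaction, so the renameFile/appendFile race of the current
-- implementation cannot occur.  Returns False when the predecessor is not
-- a head (the case where the real code throws TryingToAppendToNonHead).
store :: VCStore -> Object -> Hash -> Hash -> STM Bool
store VCStore{objects, heads, toHead} obj hObj hPred = do
  hs <- readTVar heads
  case Map.lookup hPred hs of
    Nothing -> pure False
    Just predHist -> do
      let newHist = hPred : predHist
      modifyTVar' objects (Map.insert hObj obj)
      writeTVar heads (Map.insert hObj newHist (Map.delete hPred hs))
      modifyTVar' toHead (\m -> foldr (\h -> Map.insert h hObj) m (hObj : newHist))
      pure True

-- Look up the head via toHead and return the history, newest first.
fetchHist :: VCStore -> Hash -> STM [Object]
fetchHist VCStore{objects, heads, toHead} h = do
  hd   <- Map.findWithDefault h h <$> readTVar toHead
  hist <- Map.findWithDefault [] hd <$> readTVar heads
  objs <- readTVar objects
  pure [o | Just o <- map (`Map.lookup` objs) (hd : hist)]
```

`atomically (store s obj hObj hPred)` then runs the whole update as one transaction, and a separate thread could periodically `readTVarIO` the three maps and snapshot them to disk, as suggested above.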
If you save the script in quick succession, you will encounter a TryingToAppendToNonHead exception. The script history is also broken (see screenshot above): it thinks that the HEAD is the second-latest version. If you pick the actual latest version, you won't be able to save the script, as it again throws TryingToAppendToNonHead. I think we need some sort of lock while saving the script.