
TODO phi accrual failure detector
TODO merge in notes from CS192, 6.829
TODO comparison of database replication techniques based on total order broadcast
TODO http://bytepawn.com/readings-in-distributed-systems/
TODO http://www.cs.jhu.edu/~jak/docs/paxos_for_system_builders.pdf
TODO Availability in Globally Distributed Storage Systems

comparisons

Background

resources

core

  • distributed concurrency control: serializability
  • distributed commit: agree to do something or abort
  • consensus: agree on a value
  • replication: for availability, durability
    • paxos/viewstamped replication
    • virtual synchrony (spread)
  • CAP theorem's semantics
    • assume partition of 5 nodes, with X,Y in non-quorum and Z in quorum; X holds lock on resource, and Y and Z get request for resource
    • if consistent & available, then Y can reach X to acquire
    • if consistent & partition-tolerant, then Z can acquire, breaking X's lock
    • TODO what if lock were on quorum side? TODO tie in with "req" vs "node" failure
    • consistency:
      • one-copy serializability
      • linearizability
      • http://pl.atyp.us/wordpress/?p=2521
      • http://pl.atyp.us/wordpress/?p=2532
  • PACELC: if there is a partition (P) how does the system tradeoff between availability and consistency (A and C); else (E) when the system is running as normal in the absence of partitions, how does the system tradeoff between latency (L) and consistency (C)
  • consistency
    • eventual consistency: stale reads (OOO if read from diff nodes), divergent writes (clocks and conflicts)
    • strict cons
    • event cons: staleness
      • monotonic read cons aka timeline cons: can be stale but always in-order
      • RYOW aka RYW aka session cons: not nec implies (subset of) timeline cons
      • causal cons: use version numbers (lamport)
      • immediate cons
    • event cons: multi-writer
  • replication: quorums, primary-backup, master-master
    • $R+W>N$; no guarantees unless joins/leaves incur synchronization barriers
    • primary-backup: $R=1,W=2,N=3$; only need ack from majority (2), but write to all 3; majority prevents divergence in case of partition
    • quorums stronger guarantee: always read latest; PB read can be stale (if read from a replica that hasn't applied update yet)
    • PB gives in-order, since all updates serialized through primary
    • quorums require app-specified ordering, eg cassandra's app timestamps, since different clients write directly to their quorums (no global ordering)
    • with viewstamped replication, joins/leaves change views; this is when sync should happen (to bring new node up to speed)
      • but sync barriers are expensive, eg full merkle tree comparison or log shipping
    • dynamo uses sloppy quorums, bc trad'l strict quorums implies unavailability on failures/partitions
      • replicate to the first $N$ healthy nodes, not the first $N$ nodes; hence the set of $N$ nodes may change or be entirely different
      • dynamo tries to spread quorum across datacenters
      • cassandra claims to use strict quorums
    • multi-master: requires dist commit and dist concurrency control or eventual cons
  • design patterns
    • tombstones: for deletes in eventually consistent systems
    • vector timestamps
  • logical clocks (see the clock sketch after this list)
    • happened-before relationship
      • $a \rightarrow b$ if $a$ happened before $b$ in same process
      • $a \rightarrow b$ if $a,b$ are the send, recv of a msg
    • Lamport clocks/timestamps
      • timestamp = (counter, process ID); all msgs have unique timestamp; total order possible
      • on msg send, first incr then stamp
      • on msg recv, set counter = max(msg.counter, counter)
      • $a \rightarrow b \implies L(a) < L(b)$ but not vice versa
      • however, key utility: $L(a) \nless L(b) \implies a \nrightarrow b$
    • vector clocks/timestamps
      • all msgs include vector timestamp (snapshot of global vector clock)
      • on update, increment own component
      • on msg send, first incr own component
      • on msg recv, incr own component and use component-wise max
      • time is more advanced on receiver than sender
      • $a \rightarrow b \iff V(a) < V(b)$
    • version vectors: for syncing replicas
      • after sync, both vectors are same: component-wise max
    • only for detecting conflicts; resolving is still separate, hard problem
      • eg commonly need to maintain complete history, e.g. with counters or version control (ancestor info needed)
      • easier cases include unioning sets
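
A minimal sketch of the two clock types above (toy in-memory classes; names and structure are illustrative, not from any particular implementation):

    class LamportClock:
        def __init__(self):
            self.t = 0

        def tick(self):                    # local event, or just before sending
            self.t += 1
            return self.t

        def recv(self, msg_t):             # on msg recv: counter = max(local, msg), then incr
            self.t = max(self.t, msg_t) + 1
            return self.t

    class VectorClock:
        def __init__(self, node_id, n):
            self.i = node_id
            self.v = [0] * n

        def send(self):                    # incr own component; the snapshot goes on the wire
            self.v[self.i] += 1
            return list(self.v)

        def recv(self, msg_v):             # component-wise max, plus incr own component
            self.v = [max(a, b) for a, b in zip(self.v, msg_v)]
            self.v[self.i] += 1

    def happened_before(va, vb):           # V(a) < V(b) iff a -> b
        return all(x <= y for x, y in zip(va, vb)) and va != vb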

rsync

  • directed (one-way) synchronization
  • receiver sends rolling checksums of each non-overlapping block
  • sender compares these with its checksums of every (incl. overlapping) block
    • rolling checksums efficient to incrementally update
    • if match, calculate and compare MD5
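
A small sketch of the rsync-style weak rolling checksum (Adler-32-like; the modulus and names are illustrative): sliding the window by one byte is O(1), and only on a weak match does the sender go on to compare the strong (MD5) checksum.

    M = 1 << 16

    def weak_checksum(block):
        # block: bytes; a = sum of bytes, b = position-weighted sum
        a = sum(block) % M
        b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
        return a, b

    def roll(a, b, old_byte, new_byte, block_len):
        # slide the window one byte to the right in O(1)
        a = (a - old_byte + new_byte) % M
        b = (b - block_len * old_byte + a) % M
        return a, b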

NAS/SAN

  • SAN: block protocols, eg scsi, FC, iscsi, ataoe
  • NAS: remote FS, eg CIFS, NFS; NAS heads can be atop SANs

MPI/PVM

  • MPI
    • lib for apps
    • don't mandate thread-safe impls, but allow them
    • high perf on high perf systems
    • modular: all refs relative to a module, not entire program
      • eg linear solver module must be restricted to a subset of processes
      • process src/dst must be spec'd by rank in a group not absolute ID
    • extensible
    • heterogeneous computing (impls working together)
    • well-defined behavior: no race conds/avoidable impl-specific behavior
  • PVM
    • also supports heterogeneous computing, with a different approach to extensibility
    • provide portable, heterogeneous env for using clusters over TCP

MPI

  • system objects: apps have handles to these
    • group: set of $n$ processes; each process has a rank in $[0,n)$; group routines mainly for specifying processes for communicator ctor
    • communicator: always associated 1:1 with a group; a "tag" passed in MPI calls; MPI_COMM_WORLD represents all tasks
  • organize tasks into task groups
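
A tiny illustration of ranks and sub-communicators using the mpi4py binding (an assumption; the notes don't name a binding): a "module" such as a linear solver gets its own communicator and addresses processes by rank within it, not by absolute ID.

    # run with e.g.: mpiexec -n 4 python demo.py  (assumes mpi4py is installed)
    from mpi4py import MPI

    world = MPI.COMM_WORLD
    rank, size = world.Get_rank(), world.Get_size()

    # restrict a "module" to half the processes; ranks are relative to the sub-communicator
    color = 0 if rank < size // 2 else 1
    sub = world.Split(color, rank)
    print("world rank %d/%d -> sub rank %d/%d" % (rank, size, sub.Get_rank(), sub.Get_size()))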

industry

  • cloudera

    • consulting/training services
    • products: hadoop distro, desktop gui, sqoop for rdbms import/export
    • people
      • doug cutting: hadoop/nutch/lucene founder; yahoo
      • mike cafarella
      • jeff hammerbacher: facebook data team; hive
  • consistency

    • strict: timestamp everything; impractical
    • sequential: some total ordering that's consistent with the local orderings each computer sees
  • Merkle trees

  • consistent hashing: when changing hash table width (number of nodes), impose minimal disruption (see the sketch after this list)

  • gossip: decentralized but eventually consistent
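
A minimal consistent-hash ring (hypothetical helper using md5 + bisect, with a few virtual nodes per server) illustrating the "minimal disruption" point: adding a node only remaps the keys that fall onto its arcs.

    import bisect, hashlib

    def h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, nodes=(), vnodes=16):
            self.vnodes = vnodes
            self.ring = []                          # sorted list of (point, node)
            for n in nodes:
                self.add(n)

        def add(self, node):
            for i in range(self.vnodes):
                bisect.insort(self.ring, (h("%s#%d" % (node, i)), node))

        def lookup(self, key):
            points = [p for p, _ in self.ring]
            i = bisect.bisect(points, h(key)) % len(self.ring)
            return self.ring[i][1]

    ring = Ring(["a", "b", "c"])
    before = {k: ring.lookup(k) for k in map(str, range(1000))}
    ring.add("d")
    moved = sum(before[k] != ring.lookup(k) for k in before)
    print("keys remapped after adding a node:", moved, "of", len(before))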

general principles for reducing latency

  • distributed caches
  • CDN
  • caching proxy server
  • app-level (yslow, ajax)
  • edge DNS accelerator
  • zero copy
  • no serialization
  • load-balanced replicas
  • TCP offload engine

TODO DSM

TODO consistency

coordination

  • lamport clocks: total ordering; single values
  • vector clocks: causal ordering; partial ordering; more information to eg resolve conflicts; easy to move to total order (eg by appending node ids)

data grids

  • data grids: general distributed utilities
    • caches, in-mem data structures, txns, pub-sub/msging, membership
  • oracle coherence, ibm websphere extreme scale/object grid, terracotta, gigaspaces, gemstone, jbosscache/jgroups/infinispan, hazelcast
    • hazelcast: undocumented consistency guarantees

BFT

  • normal operation: why simple approach similar to failstop replication won't work

    invoke VIEWSTAMP OPERATION -> DEST

    invoke 1,1 X -> B
    invoke 1,1 Y -> C
      nobody's been fooled yet at this point, since you need 2f+1 agreements
    invoke 1,2 Z -> B
    invoke 1,2 Z -> C
      but at this point both will reply correctly even though they have diff states

  • what about sending hashes of logs?

  • SUNDR: guarantees the fork consistency property: if the server lies to clients (drops, reorders, etc.), the most it can do is fork the clients' histories; once forked, clients never see each other's subsequent updates, so the misbehavior is detectable if they ever compare views

  • actually what happens: see journal

    • a prepared certificate is a set of 2f+1 prepares
    • commit point: when $f+1$ non-faulty replicas have a prepared certificate
  • read requests: send to all replicas, just wait for enough consistent replies

  • view changes: completely diff from paxos

    • go through primaries round-robin; no leader election
    • R_n to (new) P: <DOVIEWCHANGE, list of prepared certificates> R_n
    • P to all: <NEWVIEW, list of $2f+1$ DOVIEWCHANGE msgs> P <PREPREPARE, req...> all the operations that were supposed to have been committed in the prev view
    • note: 2 replicas can execute the same operation in diff viewstamps; only thing that matters is that they all run in the same order
  • questions

    • diff btwn viewstamp replication and paxos?
    • what was the point about fork consistency in SUNDR and log hashes?
    • isn't the all-to-all comm necessary since the primary could lie?

queueing systems

TODO dtxns

TODO distributed query optimization

  • R*

datacenter

amazon aws

  • ebs: volumes can only be mounted from 1 instance at a time, so no consistency issues
    • behaves like SAN
  • autoscaling: set conditions eg CPU, network, disk util

microsoft chicago

  • trailer containers
  • $.5B, 700K ft^2, 2 levels, 112 containers, 224K servers
  • each container: 2K servers, .5MW critical load, 50K lb, air skates
  • 11 diesel generators, currently 30MW total load, fully 56MW critical load

FAWN (dga, sosp09)

  • best paper; vijay presented
  • low-power embedded CPUs + small flash storage
    • increasing CPU/IO gap: reduce CPU
    • power grows super-linearly with speed
    • dynamic power scaling on trad'l systems is inefficient; near min volt at highest freqs
  • flash
    • fast random reads: << 1 ms (up to 175x faster than disk)
    • power-efficient IO: < 1 W at heavy load (disks at load use 10 W)
    • slow random writes
  • FAWN-DS: log-structured store
    • main-mem hash index: maps 160-bit keys to log pos
      • only stores 15-bit fragment of key, 1 valid bit, 4 B log pos = 6 B per entry

      • index into hash table is lowest i bits (16 bits); tables have 2^i elts

      • key frag is next 15 bits; cmp this

      • pseudocode

        index = key & 0xffff
        keyfrag = (key >> 16) & 0x7fff
        for i = 0 to n:
          bucket = hash[i][index]
          if bucket.valid and bucket.keyfrag == keyfrag and readkey(bucket.offset) == key:
            return bucket
        return not_found

      • periodic snapshots

      • typical obj size: 256 B to 1 KB; for 256 MB RAM, yields approx 4 GB storage

    • each node has several virtual IDs in the key ring, so multiple files; these semi-random writes are still fast
    • maintenance: split, merge, compact are all scans that can run concurrently with ongoing operation (thanks to immutability/main-memory index)
  • TODO
  • FAWN-KV: consistent, replicated, HA KV store
    • chained replication
  • 350 KV queries/Joule: 2 OOM more than disk-based system

People

  • Michael J Freedman
  • Frans Kaashoek
  • Robert Morris
  • Ion Stoica
  • Scott Shenker

vuze aka azureus

  • custom msging protocol for extensions
    • peer exchange
  • superseeding
  • vivaldi coords
  • kademlia dht

DSM

Ivy: Memory Coherence in Shared Virtual Memory Systems (Yale)

  • migrate pages among writers; write invalidation
  • read-only copies can be replicated
  • implements sequential consistency
  • centralized manager algorithm
    • to get a read
      • C: RQ to M (read query)
      • M: RF to O (read forward)
      • O: RD to C (read data)
      • C: RC to M (read confirm)
    • to get a write
      • C2: WQ to M (write query)
      • M: IV to C1 (invalidate)
      • C1: IC to M (invalidate confirm)
      • M: WF to O (write forward)
      • O: WD to C2 (write data)
      • C2: WC to M (write confirm)
    • tricky: read/write confirms are done to order the read/write data requests
  • efficiency problems: comm intensive, page granularity, false sharing

TreadMarks: DSM on Standard Workstations and Operating Systems

  • "better" than Ivy

  • implements lazy release consistency with write notices/write diffs

  • compatibility with existing apps, since you can always re-write programs to respect sharing granularity (to avoid false sharing)

  • techniques

    • LRC: lazy release consistency
    • write diffs: just send RLE of changes
    • write notices: don't send write diffs immediately, but on demand (even lazier)
    • vector timestamps
  • details

    • release consistency: just publish changes following a lock release
    • eager RC: all coherence actions performed immediately on release operations
    • lazy RC: all coherence actions delayed until after a corresponding subsequent acquire
      • even lazier: use write notices
      • less work: use write diffs
    • coherence actions = publishing all changes
  • vector timestamps are necessary for this situation:

    CPU0: al1 x=1 rl1  al2 z=99 rl2
    CPU1: al1 y=x rl1
    CPU2: al1 print x, y, z rl1

    • want CPU2 to see not just CPU1's release but CPU0's as well
    • CPU2 knows about CPU0's change via CPU1's vector timestamp, which includes CPU0's entry

Consensus

Paxos

  • simplified overview: there are proposers and acceptors

    • propose
      • proposer sends proposal (with unique round number $n$) to all acceptors
      • acceptors promise to accept proposal if $n$ is highest $n$ seen (promise to ignore any lower $n$)
    • accept
      • if proposer hears back from quorum, send accept requests to quorum with the value it wants to set
      • if acceptor hasn't promised to a higher $n$, tell proposer it accepted
    • if proposer gets quorum of accepts, then commit
  • normal for all nodes to be both proposer and acceptor

  • actual overview: acceptor promise identifies accepted $n$ to prevent future proposers with higher $n$ from changing the value

    prepare(n) ->
        "if you've already accepted a proposal what's its n and value?"
    <- prep_ok(n_a, v_a)
        the proposer will adopt this value, if it hears it from a majority
    accept(n,v) ->
    <- acc_ok(n)
    decide ->

  • proposal n is chosen when a majority of acceptors sent acc_ok(n)

    • interesting property of paxos: once proposal is chosen, system will never change its mind
  • problem: if proposers all get prep_ok responses that say nothing was agreed upon, then there may be conflicts in the accept stage

    • this is why we have acc_ok
  • pseudocode

    proposer(v):
      choose n, unique and higher than any n seen so far
      send prepare(n) to acceptors
      if prep_ok(n_a, v_a) from majority:
        # adopt the value of the highest-numbered accepted proposal (but keep our own n)
        if any v_a != nil: v' = v_a with max n_a
        else: v' = v
        send accept(n, v') to all
        if acc_ok(n) from majority:
          send decided(v') to all

    acceptor state:
      n_p: highest prepare seen, initially 0
      v_a: value of highest accept seen, initially nil
      n_a: n of highest accept seen, initially 0

    prepare(n) handler:
      if n > n_p:
        n_p = n
        reply prep_ok(n_a, v_a)
    
    accept(n,v) handler:
      if n >= n_p: # reject anything lower than highest *prepare*
        # this ensures that we only accept the highest prepare we've seen
        n_a = n
        v_a = v
        reply acc_ok(n)
    

    commit point: majority of acceptors record a particular v_a
    if acceptor times out (doesn't receive decide):
      ask all servers for v_a, see if majority agrees
      otherwise become a proposer

  • more pseudocode (http://nil.csail.mit.edu/6.824/2015/notes/paxos-code.html)

      --- Paxos Proposer ---
    

    proposer(v):
      while not decided:
        choose n, unique and higher than any n seen so far
        send prepare(n) to all servers including self
        if prepare_ok(n, na, va) from majority:
          v' = va with highest na; choose own v otherwise
          send accept(n, v') to all
          if accept_ok(n) from majority:
            send decided(v') to all

      --- Paxos Acceptor ---
    

    acceptor state on each node (persistent):
      np --- highest prepare seen
      na, va --- highest accept seen

    acceptor's prepare(n) handler:
      if n > np
        np = n
        reply prepare_ok(n, na, va)
      else
        reply prepare_reject

    acceptor's accept(n, v) handler:
      if n >= np
        np = n
        na = n
        va = v
        reply accept_ok(n)
      else
        reply accept_reject

  • scenario

    A1: p10 a10v10 p11 a11v10
    A2: p10 a10v10 p11 a11v10
    A3: p10        p11 a10v10 X a11v10

  • say two proposers both get back a majority of prep_oks; n_p allows us to reject any accept request that isn't the highest

  • vague argument for why other interleavings work out right:

  • important: the variable values must persist across crashes; must not be allowed to forget values

    • otherwise nodes that have already accepted a value will forget that they have done so, and may subsequently reply incorrectly to a different prepare/accept
    • what this effectively means is that nodes must record their variable updates to disk

life beyond distributed transactions: an apostate's opinion (pat helland)

  • dtxns are fragile

Storage & Consistency

Errors in Database Systems, Eventual Consistency, and the CAP Theorem (Michael Stonebraker, cacm blog)

From http://www.allthingsdistributed.com/2008/12/eventually_consistent.html:

The first formal treatment of the consistency vs. latency (and if one wishes, availability and tolerance to network partitions) in fact appeared in the paper

Hagit Attiya, Jennifer L. Welch: Sequential Consistency versus Linearizability. ACM Trans. Comput. Syst. (TOCS) 12(2):91-122 (1994) - initially published in the SPAA 1991 conference.

It formally proves that for linearizability, neither read nor write operations can be faster than the network latency. In sequential consistency, either reads or writes can be faster than the network latency, but at least one of them has to be slow.

In particular, this means that strong consistency conditions cannot be made highly available in partitionable environments, and cannot be implemented in a scalable manner.

Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems (read 2/25/08)

  • cluster storage systems where data is striped across multiple servers experience tcp throughput collapse
  • cause: exceeding output port buffer capacity within switches -> packets dropped -> tcp timeouts
  • solutions: nothing very satisfactory

Chain Replication for Supporting High Throughput and Availability

  • simple idea: have a chain of servers
  • all writes start at head and return from tail
  • all reads go to tail
  • "high availability": you can stay up, but you're only reading from tail....

The Chubby Lock Service for Loosely-Coupled Distributed Systems

Rethink the Sync

  • TODO
  • basically: bluefs allows for speculative IO assuming that cached data is still valid; allow computations to proceed while buffering linux IO so that speculative output is not released until data is validated

Tra (Russ Cox)

  • sync time >= mod time; time btwn is guaranteed to have no changes
  • only benefit of VT pair is bounding the times for dirs; same thing can be done with VTs, but more verbose repr
  • impl: traserv scans fs for changes
  • paper bugs
    • why is fig 8 known to have no mods at B3? it could've....
    • why doesn't d have min of sync times in fig 12?

MapReduce

mapreduce

  • [work underappreciated for its answers to systems issues like stragglers and query fault-tolerance]
  • [issues]
    • pull-based reducers mean lots of disk seeks
      • $O(mr)$, where $m$ is mapper count, $r$ is reducer count; on single disk, $O(r)$
      • bc after each mapper sorts locally, all reducers read a slice from each sorted map output

hadoop

  • job: full map-reduce; task: an execution slice
  • jobtracker: master node that accepts jobs from clients, splits jobs into tasks, and assigns them to slaves (workers)
  • tasktracker: on each slave; manages task execution
  • map task execution
    • split: portion of input file for a task; defaults to 1 chunk (64MB)

    • mapper interface

      class mapper:
        map(key, val, OutputCollector)
        close(): called when map has been called on all records in split

    • OutputCollector: stores output in a format for reducer

      • apply partition function on key, buffer record/partition number in buffer
      • spill to disk occasionally; spill sorts by (partition, key)
    • commit phase: on task completion; merges spills; read when servicing reqs

    • combiner: optional map-side reducer

  • reduce task execution
    • each reduce task is assigned a partition of key range
    • shuffle phase: fetch task partition from each map task's output (HTTP; default 5 at a time)
    • sort phase: group records with same key
    • reduce phase: apply reducer to each key and corresponding list of values
  • oozie workflow engine uses DAG of hadoop, pig, ssh, http, email jobs
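
A single-process sketch of the map -> partition/sort -> shuffle -> reduce flow described above (word count; the number of reduce tasks and the splits are made up):

    from collections import defaultdict

    R = 2                                        # number of reduce tasks

    def mapper(key, line):
        for word in line.split():
            yield word, 1

    def partition(key):
        return hash(key) % R                     # role of the partition function

    def reducer(key, values):
        yield key, sum(values)

    splits = ["the quick brown fox", "the lazy dog the end"]

    # map phase: buffer (partition, key, value) records, then sort by (partition, key)
    map_outputs = []
    for i, split in enumerate(splits):
        map_outputs.append(sorted((partition(k), k, v) for k, v in mapper(i, split)))

    # shuffle + sort: each reduce task pulls its partition from every map output, groups by key
    for r in range(R):
        groups = defaultdict(list)
        for out in map_outputs:
            for p, k, v in out:
                if p == r:
                    groups[k].append(v)
        for k in sorted(groups):
            print(r, list(reducer(k, groups[k])))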

hive

  • TODO (below is just starter)
  • hiveql
    • from subqueries, inner/outer equijoins, multi-group-by, multi-table inserts, UDFs
    • sampling
    • nested collections: arrays, maps, UDTs
  • query optimizer
    • predicate pushdown
      • left outer join: left side preds pushed
      • right outer join: right side preds pushed
      • full outer join: none pushed
      • non-det functions not pushed (rand; specified with java annot)
    • map joins: user-specified small in-mem hash tables on mappers; no sort/reduce needed
    • group by optimizations: map-side partial aggs, load balancing for data skew
  • extensibility
    • disk data formats: text, sequence, ...
    • in-mem ('serde') data formats: java, hadoop writable, thrift, ...
    • user map-reduce scripts
    • UDFs: substr, trim, from_unixtime, ...
    • user-defined agg functions: sum, average, ...
    • user-defined table functions: explode, ...

mapreduce online (tyson condie, nconway, ..., jmh, cal TR 2009)

  • built hadoop online prototype (HOP); submitting to hadoop project
  • key idea: pipeline (no materialization)
  • features
    • online aggregation: see snapshots of results as they are generated ("early returns")
    • continuous data stream processing: indefinite input
    • jobs finish faster
  • single job: start as many map/reduce tasks as you have slots for
    • don't schedule all at once, since not enough slots and too many TCP conns
      • if no reducer task to take from a mapper, then just materialize
    • buffer records for opportunity to apply sort/combiner
      • resort/recombine periodically if reducer not keeping up
  • multi-job pipelining
  • fault-tolerance: reducer just avoids mixing tentative ("uncommitted") map task output with "committed"
  • online agg: snapshot accuracy is hard; report simple progress metric along with snapshot results
  • continuous mr: continuous online agg
    • circular buffer of spillfiles
  • [db group meeting discussion]
    • they claim that they can't overlap/pipeline mapreduce jobs, but unclear why
    • for regular non-online-agg MR jobs, unclear how much pipelining benefits perf
    • seems to be pushing the disk-writing to the reducer
      • mappers send smaller sorted units to reducers; reducers are reading from many input file descriptors
      • but reducers need to buffer these somehow; most likely need to spill to disk

dryad

  • TODO
  • DAGs, pipelining, explicit materialization
  • iterative algos incl. k-means, matrix power iteration, and graph traversal
  • cosmos distributed file system
  • ecosystem
    • scope: sql-like language; by bing team
    • dryadlinq: academic release
    • nebula: a distributed shell with file descriptors as connectors

manimal: relational optimization for data-intensive programs (mcafarella, webdb10, talk at cloudera)

  • dbms efficient, mr slow but flexible
  • given unmodified mr programs as bytecode
  • apply well understood opts like indexes, col stores
  • analyzer -> optimizer -> executor
  • special handling of regexes; eg constant regex must start w chars
  • eval 16 programs: 4 from pavlo, 12 from mahout
    • detected 3/4 sel opt oppty's
      • couldn't handle hashtable, but that's popular enough for special handling
  • counters, log msgs, etc - we're fine for abusing
  • opts: sel, proj, delta compression, dict compression w direct operation
    • dict compression can use the compressed value directly if you're just doing e.g. equality comparisons; no need to decompress/look up in dictionary
    • sel: 1% selectivity -> 1% runtime
    • proj: huge data reductions
    • compression: delta compression not worth the trouble; too small gains
      • need semantic analysis of the code
  • outstanding problems
    • indexing time overhead is another mr job; only makes sense for repeat jobs
  • details
    • works w writables and text
    • overridden recordreader can use the index
  • mentioned haloop from uw on optimizing gradient ascent mr jobs
  • discussion
    • what % of mr programs are java or pig/hive?
      • karmasphere: most (90%) mr jobs are in java
      • amr thinks it's the other way around; some even ban java
      • 99% are python streaming at fb, visa
    • flumejava/plume: closer to linq; java lib for mr jobs
      • craig chambers' uw project -> google
    • hive has indexes
    • optimizer is rule-based; cost-based optimizers that take runtime feedback would be nice
  • extra cloudera work (not part of webdb talk)
    • learnavro: 'zero to olap in 60s'; mcaf's cloudera hacking
      • based on at&t learnpads paper; python done; java wip
      • go from raw data files to structured data
    • schemadictionary
      • inputs: anon structured data + previously seen datasets
      • find $k$-closest datasets to anon, + schema mappings
        • schema type info, data statistics
        • "how likely is it that this data was drawn from the same distrib?"
      • after user picks one of the 3, label anon data using mapping
      • dist computation function
    • tool to auto pick the correct visualization

Block stores

DRBD

  • 2-node block-level replication, either sync or async
  • active/active only for FSs like OCFS2 or GFS2 (slower FSs), not ext3 etc.
  • supports online recovery
  • http://lwn.net/Articles/329543/

Filesystems

ceph (sage weil, osdi06) TODO incomplete

  • in linux 2.6.34; weil's ucsc phd thesis, then founded new dream/dreamhost
  • near-POSIX semantics; has linux VFS client
    • extends interface & selectively relaxes consistency for app needs & for perf
    • O_LAZY: don't need to disable read caching & write buffering when file opened by multiple clients
    • HPC community proposed extensions to posix
  • metadata server (MDS) cluster manages namespace and coordinates security, consistency, and coherence
    • dynamic subtree partitioning: can redundantly replicate portions for perf
    • measure metadata popularity using exponentially decaying counters; each op affects parent dir up to root
    • 2PC subtree authority transfers
  • CRUSH distribution function designed for heterogeneous, unreliable object storage devices (OSDs, i.e. backends)
    • file broken into objects mapped into placement groups (PGs) using simple hash fn
    • PGs assigned to OSDs using CRUSH
  • ceph monitor cluster manages MDS clusters
    • just has to handle MDS heartbeats
    • uses Paxos w leasing mechanism
  • largest scalability test: 430 nodes, 128 MDS, per-MDS perf 50% of 1-MDS cluster, ~250K metadata ops/s (enough for 1000s of OSDs)
  • OSDs only comm w peers
  • actively migrates data to new nodes and re-replicates data on failed nodes
  • uses (and contributes to) btrfs; orig used own EBOFS (extent and b-tree object file system)
    • tuned for obj semantics & other features eg async notif of commits to disk
  • other features: snapshots; file/capacity accounting at dir level

pohmelfs TODO

farsite distributed file system (MSR, 2000)

  • replicates each file onto multiple desktops

Reclaiming Space from Duplicate Files in a Serverless Distributed File System (2002) [32 citations — 2 self]

  • over half of all consumed space is occupied by duplicate files
  • convergent encryption: coalesce duplicates to take single file's space
  • database for aggregating file content and location information in decentralized, scalable, fault-tolerant manner

frangipani: scalable distributed file system (1997)

  • built on petal block storage

xtreemfs

  • WAN operation
  • TODO

Replication in the Harp File System (Liskov, Ghemawat, et al)

  • primary copy replication: single primary server waits on multiple backup servers before returning
    • primary ensures enough backups to guarantee operation effects will survive all subsequent failovers
    • failover algorithm masks failures, removes failed node
  • operations are immediately recorded into volatile memory log
    • write-behind to disk
    • UPS for power failures
  • each file managed by a group (with one master)
    • modifications require 2PC
    • non-modifications don't require 2PC, but instead TODO
  • replication method
    • overview
      • view change: each group configuration is a view
      • as with any replication scheme that tolerates network partitions, require $2n+1$ for $n$ failures
      • only actually need to store data on $n+1$ of the machines; can be propagated on failures (partition)
      • witnesses: replicas that don't store data
      • 1 designated primary (will act as primary whenever it can), $n$ designated backups, $n$ designated witnesses
      • arrange groups so that each node is designated primary of one group, designated backup of another, designated witness of a third; this distributes workload
    • normal-case processing
      • log of event records with growing IDs of operations, both phase 1 and committed
      • commit point (CP): ID of latest committed op (separates phase 1 and committed)
      • 2PC
        • primary logs, sends logged info to backups
        • backups only accept in log order; ack for $n$ means got up through $n$
        • primary commits by advancing CP and send CP to backups
      • applying changes
        • log is WAL no-steal: changes not applied on disk (FS) until committed
        • application point: ID of latest op that has started being applied to the local FS
        • lower bound (LB): ID of latest op that has finished being applied to the local FS
        • primary and backups exchange LBs
        • global LB: min(all LBs); may truncate things beyond this
      • recovery: log brings recovering node up to date
        • log is redo: only sufficient to redo op after failure
      • reads can be serviced immediately without 2PC
        • results reflect committed, no uncommitted (i.e. serialize these reads at the CP)
        • if primary is partitioned off, then the result of the read may not reflect a write that has been committed in the new view
        • compromises external consistency, which requires comm outside FS
        • make this unlikely by using loosely synchronized clocks
          • each message from backup to primary contains a time equal to backup's clock + $\delta$ (few hundred ms)
          • backup promises to not start new view until that time
          • expect that starting a new view will not be delayed, since $\delta$ will have passed before new view is ready to run
          • primary needs to comm with the backup about a read op only if the time of its local clock is greater than the promised time
        • access time: can enable loose consistency on this
          • return immediately from primary, before committing
          • may be lost if there's a failure/partition
    • view changes
      • views have view numbers
      • promoted witnesses may not have resources (eg FS) so must never trunc log
      • TODO left off here

The Google File System

  • 3x replication

GFS Evolution

  • main problem: master has memory index of all files' metadata
  • orig master per cell per DC, but isolation was difficult and 1-master not scalable in # files; moved to multi-cell, multi-master
    • multi-cell: multiple masters (isolated FSs) over shared pool of chunkservers; app must manually partition over masters
    • namespaces: static partitioning of one FS namespace over multi-cell
  • started having more online apps like gmail
    • 64MB chunks too big but smaller chunks -> more files; 1MB compromise; see above discussion on master scalability
    • orig GFS was for throughput not latency
    • also needed more availability; 1-master is SPOF
      • orig master failover was manual: cell could be down for an hr
      • then auto, but slow (minutes); now down to 10s
    • generally, bigtable solves a bunch of problems
      • bigtable is very failure-aware & responds faster
  • multihoming
    • bigtable txn log: have 2 logs, write to one and failover to other, merging later
    • gmail uses multihoming, for availability and to hide gfs problems
  • many disks claimed to the linux driver that they supported a range of ide protocol versions but actually responded reliably only to more recent ones, silently corrupting data
    • motivated GFS checksums, but these were E2E so covered everything (eg TCP corruption)
  • loose/eventual consistency
    • apps can read stale data
    • client failures are big issue
    • multi-writers are problematic; eg RecordAppend interface (dupes, ordering)
  • general google approaches
    • get things working reasonably well, then focus on scaling, not so much on efficiency; usually never focus on a binary, except for GFS master perf
    • had flexibility since apps & infrastructure all by google
  • http://queue.acm.org/detail.cfm?id=1594206
  • http://cacm.acm.org/magazines/2010/3/76283-gfs-evolution-on-fast-forward/fulltext

wheelfs: Don't Give Up On Distributed File Systems (strib)

  • specify hints about consistency

CloudStore aka Kosmos File System (KFS)

  • from kosmix auto-portal site
  • hadoop/hypertable/hbase integration [compat with HDFS?]
  • for: mainly write-once/read-many loads; Ms of big files; mostly seq access
  • replication (configurable per-file) with async re-replication on failure
  • periodic chunk rebalancing; elasticity
  • for data integrity, verify checksum on read; re-replicate to recover
  • leases for client cache consistency
  • chunk versioning to detect stale chunks
  • C++, STL, boost, aio, log4cpp; solaris, linux; fuse

DHTs

UsenetDHT (Emil Sit, NSDI08)

  • TODO: read paper
  • high bandwidth for peering and high storage/durability requirements
  • related work
    • DHash, OpenDHT optimized for small objects (eg CFS)
    • Glacier: focus on catastrophic failure
    • Total Recall: ...
    • Coral: perf, but a cache, so no durability
  • UsenetDHT: shared Usenet server
    • 80 TB/yr

    • Passing Tone: algo for data maintenance
      • re-replication on failures
      • durability for immutable data
      • using only pairwise sync
      • supports obj expiration
    • enhanced DHash for perf
      • bulk data
      • 6MB/s per server
  • goal: reduce distribution inefficiency and storage reqs
    • share articles among eg universities, ISPs
  • nodes do full exchange of headers only
  • challenges
    • in talk and paper
      • respond to each failure?
      • syncs expensive
      • TODO
    • in paper
      • limited disk cap: can spurious repairs be avoided
      • rand disk IO
  • Passing Tone
    • $r_L$ replicas for durability
    • extra replicas to mask transient failures
    • repair decisions need only pairwise syncing
    • two algos:
      • local: replication among current replica set
      • global: ensures correct placement
  • data placement
    • replicate to $r_L$ successors
    • idea from Carbonite system: extra replicas to mask transient failures
    • make new copies when $< r_L$ avail
  • predecessors enable local decisions
    • chord only knows about immed pred
      • forces immed succ to handle maint
      • coord hard: constant writes/expirations
    • pred list allows nodes to handle own maint
      • given $r_L$ and preds, know responsible range
      • extras reps can be reclaimed for space if nec
  • local maint
    • calc respons range from preds
    • sync with nbrs over range to
      • identify objs missing locally
      • make local reps of those objs

Kademlia (Petar Maymounkov, David Mazieres)

  • simplest

Pastry

Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications

Programming

Friday: global debugging (paper notes)

  • intro
    • objective: track global state
    • contributions [basically new debugger]
      • primitives detect events with watch/breakpoints
      • attach commands to these
  • liblog
    • for libc/posix c/c++ apps
    • each thread logs side effects of non-det syscalls (recvfrom, select)
    • maintains causally consistent group replay by inserting lamport clocks in messages
    • incrementally deployable
    • can simulate a reasonably large # nodes on debugging machine
    • still hard to debug - find needle in haystack
  • design
    • watch/breakpoints
      • breakpoints are the same
      • implementation [watchpoints]
        • write-protected pages
        • alternatives
          • hardware - limited # watchpoint regs (usu. 8) not enough
          • single-stepping - too slow
          • local breakpoints - hard to analyze program to know where to insert them
          • periodic sampling - coarse-grained
      • implementation complexity
        • replicated gdb functionality
        • replaced gdb's calling facilities due to conflicts with write-protection
    • commands
      • scripting language (running in own world; effectively 'global' state)
        • syntax to access app state, current node, lamport/real clocks, etc
      • language choice
        • python - easy to embed
      • syntax
        • translate to gdb's 'print'/'set'
        • string interpolation
        • memory protection on app function invocs
        • general data structure marshalling -> raw bytes, enough for equivalence testing
    • limitations
      • false positives slow down replay
        • can recompile app to spread out stack
      • must manually build debugging state if starting from middle of replay
        • future: add debugging state to liblog checkpoints (at debug time)
      • friday predicates often require debugging themselves
      • ...
  • case studies
    • routing consistency: i3/chord dht

Orc

P2

  • overlog

BOOM: data-centric programming for the data center

  • implemented HDFS, including multi-paxos master node, in overlog
  • master/namespace is partitioned for scale-out; took 1 day
  • overlog timestep

The NIL distributed systems programming language: a status report

Mace: language support for building distributed systems

JXTA

Prolac

A System for Constructing Configurable High-Level Protocols

From asynchronous to synchronous specifications for distributed program synthesis

File-Sharing

The Darknet and the Future of Content Distribution

A Survey of Anonymous Peer-to-Peer File-Sharing

Cloud

ec2

  • elastic compute cloud
  • "compute units" = some amd opteron
  • vm's (~2-3 min to start)
  • can't run arbitrary os; must be linux 2.6-based with paravirt (for xen)

s3 TODO

  • storage service

simpledb TODO

  • similar to bigtable
  • objects with loose set of multi-valued attrs
  • simple query lang: select ATTRS from DOMAIN where PREDS
  • eventual consistency
  • queries capped at 5 sec runtime
  • $1.50 / gb-month
  • $0.14 / cpu-hr consumed by queries

Deployment

On Designing and Deploying Internet-Scale Services

  • lessons learned from Windows Live

Misc

PeerReview: Practical Accountability for Distributed Systems

Algorithms

Hundreds of Impossibility Results in Distributed Computing http://citeseer.ist.psu.edu/fich03hundreds.html

  • TODO

Uncategorized

Arpeggio

SIP

  • endpoints talk to a gateway (eg belonging to the company)
  • gateways talk to each other (over IP)
  • no P2P network as in Skype

Debugging

r2: record and replay debugging tool

  • lib-based replay
    • binary rewriting (loader): liblog, jockey, recplay
      • intercept at syscall/libc
      • more lightweight than whole-sys replay
    • record run
      • call orig syscall/libc funcs
      • log input
      • log thread scheduling info
  • read() (interposition sketch after this list)
    • record: log input
    • replay: feed buf
  • challenges
    • mem addrs must be identical
      • but fundamentally diff, eg shared libs
      • control mem consumption
    • nondet must be enclosed
      • shared mem, ioctl, benign race, ...
      • fixing syscall/libc too restricted
    • fat interface
      • need to wrap too many api funcs
  • contribs
    • app-level isolation
      • kernel ideas: replay & system space
      • manage mem and exec
      • customize record & replay layer
    • annotation keywords
      • code template
    • experiments and experience
      • sqlite, bt, mpi
  • replay & system spaces
    • api as r2 syscalls: replay -> system
    • r2 upcall (from lib into app): system -> replay
  • space mode bit
    • 0 - system space; 1 - replay space
    • start in 0
    • when entering main, switch to 1
    • when invoking a syscall, switch to 0; e.g., read()
    • when issuing an upcall, switch to 1
      • wrap upcall (callback), e.g., createThread(...workerThread...)
  • mem mgmt
    • separate mem pools: all calls to malloc inside system space do not exist during replay run
    • explicit data xfers: read()
  • code templates
    • annotated prototype: int read([in]int fd, [out,bsize(return)]void *buf, [in]unsigned int nbytes)
      • language already in use (windows); see also SafeDrive(?)
    • php code template to generate c++ code snippet
  • mem mgmt (cont)
    • cross space pointers (xpointer)
      • make a copy in replay pool, e.g., getcwd(NULL,0)
    • encapsulated resources: as ints, e.g., files
  • capturing exec orders: lamport's algo
    • maintain logical timestamp for each thread
    • syscall-upcall (callback): createthread < workerthread
    • syscall-syscall (signal & wait): releasemutex < waitforsingleobject
    • total order vs. causal order: race
  • related work
    • lib-based replay
    • whole system replay: hardware (strata), vm (revirt, idna)
    • annots: sal icse06, deputy osdi06
    • isolation: xfi osdi06, nooks sosp03
    • domain-specific replay: ml, mpi, java
    • replay apps: predicate checking, race detection, intrusion detection
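
A toy illustration of the lib-based record/replay idea above, interposing on read() in Python (hypothetical; systems like r2/liblog intercept at the syscall/libc layer): the record run logs the input, the replay run feeds the logged buffer back without doing any real I/O.

    import os

    LOG = []            # record run appends inputs here
    MODE = "record"     # or "replay"
    _real_read = os.read

    def read(fd, nbytes):
        if MODE == "record":
            buf = _real_read(fd, nbytes)    # call the real syscall, log its result
            LOG.append(buf)
            return buf
        return LOG.pop(0)                   # replay: feed the logged buffer back

    r, w = os.pipe()
    os.write(w, b"hello")
    print(read(r, 5))       # b'hello' (logged)

    MODE = "replay"
    print(read(-1, 5))      # b'hello' again, straight from the log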

NSDI09

hashcache (nsdi09)

  • web proxy cache with pluggable indexing policy
    • goal: minimize in-mem index for storing largest amount of data
    • goal: maximize perf by minimize disk seeks
    • take advantage of the fact that there is a hash table on disk
    • index mainly to serve existence queries
    • achieved just 1 disk seek per req
  • 3 policies, from most compact to best-performing: Basic, SetMem, and Log
  • Basic: just store 8 bit low-false-positive hash
  • SetMem: store add'l 3 bits of LRU info
  • Log: also store the offset into the on-disk log; improves write performance
  • compared against open-source squid (least efficient) and commercial tiger (more efficient)

partial content replication (ucsd nsdi09)

  • eventual filter-based consistency

cimbiosys: content-based partial replication/partial content replication (doug terry, msrsv, ucsd, nsdi09)

  • motiv: photos
  • reqs: content-based selection, flexible (mesh) communication network
  • protocol: simple
    • target sends knowledge and filter to source
    • source sends unknown and filtered items to target
    • target adds items and updates knowledge
  • provides: eventual filter consistency (correctness), eventual knowledge singularity (perf)
  • move-out notifications for removing item when item no longer matches filter
    • "local" move-outs: must hold on to it so that it's not lost permanently (in case someone else wanted to get it)
    • move-out notification chains: when a node that has removed an item syncs with another node that should remove the item as well; send move-out iff:
      • source's filter is a superset of target's filter,
      • source is as knowledgeable,
      • source doesn't store item that target stores
  • filter changes: local operation
    • remove items that no longer match
    • retract knowledge
  • to ensure eventual singularity: tree embedded sync topology
  • misc
    • implemented in Mace on linux

rpc chains (nsdi09)

  • this primitive chains together multiple RPC invocations so that the computation can flow from one server to the next without involving the client every time
  • results are added to param list and fwded
  • evaluated with chained copy from one nfs server to another
  • nest/compose sub-chains by maintaining stack of chaining functions and state
    • also allow concurrent sub-chains using splits in the path
  • also ties in to efficient bandwidth usage a la GFS (chained replication)

google storage (google io 2010)

  • like S3
  • all data WAN-replicated, RYW consistency, any size objects
  • REST API, ACLs

transactions across datacenters (google io 2009)

  • google dubs it "multihoming"; focus on app engine datastore
  • consistency (wrt durability)
    • weak: reads might not see prior write
      • gae memcache
    • eventual: reads eventually see prior write
      • dns, email, S3/simpledb (~500ms lag in S3)
    • strong: reads will see prior write
      • gae datastore, filesystems (sync), rdbms, azure tables
  • stats
    • within DC: 1-100Gbps, <1ms within rack, 1-5ms cross racks, no cost
    • across DC: 10Mbps-1Gbps, 50-150ms, cost for fiber
      • 30ms baseline speed-of-light across US
  • options
    • bunkerize single DC: azure, simpledb
      • redundant power sources, backbone connectivity, cooling
    • single-master
      • if async, then window of data loss vulnerability; sometimes even inconsistent replication
      • AWS
        • EC2, S3, SQS: choose US or EU
        • EC2: availability zones, elastic load balancing
      • financial institutions (banks, brokerages, etc.) use this frequently
      • can geolocate reads, depending on your consistency requirements
    • multihoming/multi-master: sync replication
  • techniques
    • backups: super-async, weak consistency; low latency, high throughput
    • master-slave: async, eventual consistency; low latency, high throughput
      • gae datastore uses this
    • multi-master: async, eventual consistency; need ordering; can keep updating on DC failures (since no single master)
    • 2PC: for cross-shard operations
      • semi-distributed consensus protocol (always need coordinator)
      • synchronous, high latency, low throughput
      • 3PC buys async with extra round trip
    • paxos
      • fully distributed; no single coordinator
      • synchronous, high latency (prohibitively high even in same DC), medium throughput
      • used in chubby; for moving state, memcache, offline processing
    • no silver bullet; embrace tradeoffs
  • entity groups in megastore are the unit of partitioning/consistency
  • criticism: if you watch this, don't get too caught up in the taxonomies here; they aren't very carefully designed. unfortunately the talk doesn't bring to the forefront the fundamental dimensions, instead convolving them into a jumble of different "packages" without sufficiently teasing them apart. my fear is that this talk will lead to the proliferation of such falsehoods as "master-slave replication != paxos" (!) and "master-slave replication means == weak/eventual consistency." that, and "multihoming." (google managed to do it before with "sharding.")
  • http://code.google.com/events/io/2009/sessions/TransactionsAcrossDatacenters.html

CAP theorem

  • bigtable, hbase opt for consistency and availability
  • dynamo opts for availability and partition-tolerance

revision control systems

git/hg

  • cherry-picking: choosing individual changes from one branch and applying them to another branch
  • daggy fixes: fix the code as a tiny branch off of wherever the bug was introduced
    • nicer than cherry-picking because the problem is reduced to one of merge
    • reuse the same merge/conflict-resolution machinery

perforce

  • do work in client workspaces; submit changed files together in changelists
  • p4 client
  • used at google; g4 frontend

source depot

  • microsoft's internal rcs; fast
  • shares the same syntax and model as perforce
  • there's a vast hierarchy of branches
    • integration happens up the tree (reverse integration) and down the tree (forward integration)
    • integrations occur more frequently toward the leaves and less frequently toward the root
    • the leaves are what individual teams work on
    • they're all mirroring the same file/dir namespace
    • there's actually a set of these trees; the roots are called trunks

distributed oltp databases

spanner

  • multi-version, global, sync-repl, "externally consistent" txns
  • novel time API exposes clock uncertainty for external consistency
  • non-blocking reads in past, lock-free read-only txns, atomic schema changes
  • auto partition across paxos RSMs
  • trillions of rows
  • relational data model, SQL-like query lang
  • configurable repl
  • architectural hierarchy
    • universe: eg production, dev
      • universe master: display status of all zones
      • placement driver: manages data xfer across zones
      • zone: administrative unit; single location
        • zonemaster: assigns data to spanservers
        • location proxy: helps clients locate spanservers that house the data
        • spanserver: controls 100-1K tablets
          • tablet: bag of (key:string, timestamp -> string)
  • tablet data stored in ~B-trees & WALs on colossus (new GFS)
  • paxos replica group per tablet
    • leader also does CC w lock table
    • 2PC across leaders for dist txns
  • directory: keys w common prefix; unit of data placement; ~50MB
    • all data in dir have same (geo-)replication config
    • move data btwn paxos groups in dir-wise manner (TODO huh?); not txn'l (or blocks txns for few secs)
  • spanner layers a bucketing abstraction
  • TODO wound-wait truetime 2px+paxos 2pl

bigtable

  • in-memory memtable LSM for holding pending writes
  • on write, flush WAL and update memtable
  • on reads, read memtable & multiple disktables, but first checking for non-existence in their respective bloom filters (which may say "maybe" or "no")
  • when full, flush memtable to disk as a new disktable
  • periodically compact the disktables, merging them into a single disktable
  • not partition-tolerant
  • scale to PB range across thousands of machines; auto elasticity
  • coprocessors: run arbitrary code over rows/row ranges
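
A minimal LSM-style sketch matching the bullets above: a memtable for pending writes, flushes to immutable disktables, newest-first reads gated by a per-table filter (a plain set stands in for the bloom filter), and a merge-style compaction; names and sizes are illustrative.

    class LSM:
        def __init__(self, memtable_limit=2):
            self.memtable = {}
            self.disktables = []                  # list of (filter, table), newest first
            self.limit = memtable_limit

        def put(self, k, v):
            # a real system appends to the WAL first; elided here
            self.memtable[k] = v
            if len(self.memtable) >= self.limit:
                self.flush()

        def flush(self):
            table = dict(self.memtable)
            self.disktables.insert(0, (set(table), table))   # set() stands in for a bloom filter
            self.memtable = {}

        def get(self, k):
            if k in self.memtable:
                return self.memtable[k]
            for maybe, table in self.disktables:  # newest to oldest
                if k in maybe and k in table:     # filter says "maybe", then check the table
                    return table[k]
            return None

        def compact(self):
            merged = {}
            for _, table in reversed(self.disktables):  # oldest first, so newest values win
                merged.update(table)
            self.disktables = [(set(merged), merged)]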

voldemort (linkedin, 2009)

  • based on dynamo
  • linkedin: no joins, lot of denorm, no orm, no constraints/triggers, memcache, no-downtime releases, NOT horizontally partitioned (!), latency is key
  • large/persistent dataset cannot all be in mem
  • combines memory caching with the storage system, so no separate caching tier is needed
  • design
    • layers top to bottom
      • client API, conflict resolution, serialization
      • routing and read repair, failover (hinted handoff)
      • storage engine (bdb/mysql/memory)
    • layers all have same interface: put, get, delete
  • physical arch configs (all top to bottom)
    • 3-tier, server-routed: frontend, load bal, backend service, load bal, voldemort cluster
    • 3-tier, client-routed: frontend, load bal, backend service, partition-aware routing, voldemort cluster
    • 2-tier, frontend-routed: frontend, best-effort partition-aware routing, backend service with voldemorts
  • client api
    • put, get, get_all, delete
    • store: a table; keys are unique to stores
    • put, delete: can specify expected version
    • OCC to support multi-row updates and consistent read-update-delete
  • versioning and conflict resolution and consistency
    • vector clocks provide partial order on values; like an improved OCC
    • conflicts resolved at read time and write time; user can supply handler
    • no locking/blocking necessary
    • also recently added paxos
  • routing
    • turns single get/put/delete into multiple parallel ops
    • consistent hashing variant that allows for weighing nodes by their capacity
    • repairs stale data at read time (??)
    • can add user strategies, eg only do synchronous ops on nodes in local DC (??)
  • storage: bdb (default), mysql, mmap, or memory for testing
  • limitations: no elasticity
  • offline processing
    • parallel extract from voldemort with throttling to respect live workload
    • use hadoop to produce new voldemort stores then atomic swap with live
  • perf
    • latency: median 0.1ms, 99.9% <3ms
    • 1-cli 1-svr throughput: 20K reads/s, 16.5K writes/s
  • TODO: any relation to memcached?
  • http://project-voldemort.com/

cassandra (facebook, 2008)

  • based on dynamo's eventual consistency, P2P arch, DHT, gossip mgmt
  • based on bigtable's column family data model, SSTables, hadoop integration
  • designed by one of the authors of dynamo, avinash lakshman
  • richer data model than voldemort
  • data model and physical storage layout
    • KeySpace > CF > Key > SuperColumn > SubColumn
    • with range partitioning, can range_slice to range-query on keys and columns; with random partitioning, can range-query on columns
    • cluster: the machines in a logical cassandra instance; set of keyspaces
    • keyspace: a namespace for column families; typically one per app
    • column families: ordered list of columns; analogous to relational table
      • rows have column name, value, timestamp; referenced by row keys
      • each stored in separate file, in row (i.e. key) major order
    • supercolumns: columns that have subcolumns
  • writes in LSMs
  • consistency
    • every kv pair has a client-managed timestamp
      • no version vectors or vector clocks or server-managed timestamps (vclocks may be added in 0.7)
      • clients must have synchronized clocks
      • latest timestamp wins (reconciliation sketch after this list)
    • can specify replication degree per request
  • flexible replication: sync or async
    • RackUnaware: place replicas on $n-1$ subsequent nodes around ring
    • RackAware: place replica 2 in other DC, then rest on other racks in same DC
    • DatacenterShard: place $m$ of $n$ reps in other DC, rest on nodes in other racks in same DC
    • use Snitches, mappings from node (IP) to physical location (DC, rack)
  • usage
    • serialization
      • thrift, avro added in 2010
      • expects column names in network order (eg for ordering)
    • nodetool: check nodes, decommission, trigger compaction
    • working on compressed LSMs
    • need to regularly schedule major compactions to free up disk space
    • keyspaces, column families statically specified; changes require restarts
    • 0.6 allows direct access from hadoop
  • no mem caching
  • gossip-based distributed failure detection & membership protocol
    • sync membership & partitioning info w random node every sec
    • preserves symmetry among nodes
  • WIP
    • session level consistency
    • avro instead of thrift
    • backend for hive
  • adoption: fb, twitter, digg, rackspace
    • fb aborted adoption of cassandra, trying to move away but still stuck with it for inbox search; this helped the 3 devs leave (spullara)
    • twitter not planning to use cassandra for any critical data stores (7/7/10)
  • 14K writes/s, 7K reads/s; 15% write overhead from fsync (durable log)
  • refs
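
A sketch of the "latest client timestamp wins" reconciliation described above (illustrative only); note it silently drops the losing write, which is exactly why clients need synchronized clocks.

    def reconcile(replica_values):
        # replica_values: list of (client_timestamp, value) seen across replicas
        return max(replica_values, key=lambda tv: tv[0])

    print(reconcile([(100, "old"), (250, "new"), (180, "mid")]))   # (250, 'new')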

scalien keyspace TODO

  • KV + PaxosLease [seems interesting]
  • supports safe reads from master and dirty reads from any replica

riak (basho)

  • faithful impl of dynamo, incl all key components + extra features below
  • erlang; primary key lookup only; json fmt; consistent hashing; vnodes
  • eventual consistency w vector clocks & app conflict resolution
  • pluggable storage backend (in-mem, etc)
  • allows replicas to be temporarily inconsistent (tunable RW quora)
    • read-repair, anti-entropy
    • full single server durability
  • raw speed slower than cassandra
  • buckets are dynamically created
  • extra features
    • links: lightweight ptrs among data; can select along these paths
    • map-reduce over data in js
  • [basho founded by former akamai ppl; brewer joined BOD]
  • http://riak.basho.com/nyc-nosql/

perf comparisons with key-value usage model

  • chunkd: 100% tyrant in warm-read case, 83% in cold
  • cassandra: 30-85% chunkd
  • voldemort: 30-80% cassandra
  • tabled: 50-90% voldemort
  • riak: 15-45% tabled; 8-24x slower than tyrant
  • http://pl.atyp.us/wordpress/?p=2417

lightcloud

  • provides consistent hashing on top of tokyo tyrant
  • TODO

piccolo (jinyang, nyu, osdi10)

  • partitioned KV store; can eg force together partition $i$ from multiple tables
  • commutative updates
  • load balancing
  • python/c++ impl
  • eg pagerank
  • other apps: crawler, numeric algos

dynamo

  • works across DCs
  • eventual consistency
  • hinted handoff: if a node that should receive a write is down, the write is sent to another node along with a "hint" naming the intended destination; when the destination node becomes available again, the write is forwarded there
  • vector clocks to detect conflicts
  • quorum consensus: if R+W>N then [usu.] consistent reads [but see critique]; sketch at the end of this section
  • sloppy quorums?
    • writes don't need to happen on first N members of preference list
    • eg if first 3 are down then just take next 3 on the list; if first 3 came back, inconsistency
    • but then: "Using hinted handoff, Dynamo ensures that the read and write operations are not failed due to temporary node or network failures. Applications that need the highest level of availability can set W to 1, which ensures that a write is accepted as long as a single node in the system has durably written the key it to its local store."
    • so, if W < N, then N - W writes can be written using hinted handoff to nodes other than the final destination, and forwarded when the natural nodes come back online. BUT the W writes specified by the requested consistency level do have to be to the right nodes or everything falls apart.
  • 3 levels of repair: read-repairs, hinted handoffs, replica synchronization/anti-entropy
    • read-repairs and hinted handoffs are near-realtime, minimizing drift
    • read repair: when reading from quorum, update any stale replicas
    • hinted handoffs: bulk op done on node rejoin
    • replica sync/anti-entropy: complete merge once a day using merkle trees
  • critique: http://jsensarma.com/blog/2009/11/dynamo-a-flawed-architecture-part-i/
    • inconsistency: actually asymmetric when claiming to be symmetric; symmetry is a flawed goal because nothing else in the datacenter architecture is symmetric
    • disaster recovery: lack of point-in-time consistency pushes complicated recovery onto app
      • eg can't just replay logical operations from a point in time since the DB is not a consistent snapshot
      • standard log-shipping replication/recovery can at least track replication lag
    • quorum consensus: doesn't actually guarantee consistency, because nodes that (re-)join a quorum don't have to go through a sync barrier (they aren't brought up to date before being counted in the quorum); eg with R=1 a rejoined node may answer a read with stale data
    • decentralization !iff availability
      • centralized resources may be made highly available, eg dual-everything in SANs
      • centralization is a scalability bottleneck if anything
      • in fact dynamo cluster itself is likely centralized; eg no dual-routers means unavailability/partitions
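
A minimal sketch of the R+W>N idea plus read repair, ignoring the membership-change caveats raised in the critique; replicas are plain dicts and all names are illustrative.

```python
N, W, R = 3, 2, 2                      # R + W > N, so read and write sets overlap
replicas = [dict() for _ in range(N)]  # stand-ins for storage nodes

def put(key, value, ts):
    for rep in replicas[:W]:           # ack after W replicas have the write
        rep[key] = (value, ts)

def get(key):
    versions = [rep.get(key) for rep in replicas[:R] if rep.get(key)]
    latest = max(versions, key=lambda v: v[1])   # highest timestamp wins
    for rep in replicas:                         # read repair: push the winner
        if rep.get(key) != latest:               # out to any stale replica
            rep[key] = latest
    return latest[0]

put("cart", ["book"], ts=1)
put("cart", ["book", "pen"], ts=2)
print(get("cart"))   # latest value; stale replicas repaired as a side effect
```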

hbase

  • bigtable clone for HDFS (or KFS); apache hadoop subproject
  • region: key range in consistent hashing (a shard/partition)
    • rows stored in byte-lexicographic order [first bytes must be key]
  • similar to HDFS: master:namenode :: regionserver:datanode (usu. deployed alongside each other on same hosts)
    • regionserver: assigned regions
    • master: manages assignments, notices failures
  • memstore: fast random writes
  • column family data model provides user-controlled locality, versioning, and semi-structured data
    • think of sparse columns, so fields stored as key-values
    • separate column families stored in separate files
  • hbase 0.20: fast random reads
  • in-progress:
    • multi-master WAN replication
    • rewrite the master to leverage zookeeper more
    • need HDFS appends to avoid data loss
  • writes: WAL is HDFS file (see the write-path sketch at the end of this section)
    • commit after in mem on 3 replicas, but flushed to disk asynchronously
    • fast writes; CDH3 adds single-server durability
  • HDFS is rack-aware
  • http://hstack.org/why-were-using-hbase-part-2/
  • coprocessors: same as in bigtable
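
A minimal sketch of the write path described above (append to a write-ahead log, apply to the memstore, flush sorted runs once it grows); file formats and names are illustrative, not HBase's actual HLog/HFile layout.

```python
import json

class RegionSketch:
    def __init__(self, wal_path, flush_threshold=4):
        self.wal = open(wal_path, "a")
        self.memstore = {}                  # in-memory writes; sorted on flush
        self.flush_threshold = flush_threshold
        self.flushed = []                   # stand-in for on-disk store files

    def put(self, row, value):
        self.wal.write(json.dumps({"row": row, "value": value}) + "\n")
        self.wal.flush()                    # durability comes from the log
        self.memstore[row] = value          # fast random write: memory only
        if len(self.memstore) >= self.flush_threshold:
            self.flushed.append(sorted(self.memstore.items()))  # sorted run
            self.memstore = {}

region = RegionSketch("/tmp/region.wal")
for i in range(5):
    region.put(f"row{i}", f"v{i}")
print(region.flushed, region.memstore)
```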

http://developer.yahoo.net/blog/archives/2009/06/nosql_meetup.html
http://stackoverflow.com/questions/1189911/non-relational-database-design
http://www.google.com/search?hl=en&rlz=1C1GGLS_enUS337US337&q=voldemort+cassandra+hbase+hypertable+tyrant+dynomite&aq=f&oq=&aqi=
http://research.yahoo.com/files/ycsb.pdf

google app engine (GAE) data store

  • entity groups are units of partitioning
  • supports optional stale reads
  • separate blobstore for storing large objects
  • supports secondary indexes
  • built on megastore

megastore

  • built on big-table
  • provides declarative schemas, multi-row transactions, secondary indices
  • used to do in-order & atomic wide-area replication with Paxos, but now realizes the folly of its synchronous ways
    • switch to asynchronous replication
    • along with fixing replication to maintain transactional consistency (before, replication would leave destination in potentially transactionally inconsistent state)
  • means that apps now get to deal with eventual consistency (the alternative being unavailability)
  • multi-row txns limited to rows sharing the same key prefix
    • optimistic concurrency control: a separate "row" contains the version number that clients check before and after that operation (see the sketch at the end of this section)
  • asynchronous replication but durable commit log at master
    • if master fails before replication, then fresh read requests fail
    • later attempts to access fresh will replay the log
    • probably delay replay till subsequent access to give transient errors time to go away
  • supports declarative secondary indexes
    • like replication, these are updated asynchronously, hence may be stale
    • commit log also replays these
    • system understands protocol buffers; eg can index a PB's repeated fields
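
A minimal sketch of the version-check style of optimistic concurrency control described above; the in-memory store and entity-group keys are illustrative, not megastore's API.

```python
class ConflictError(Exception):
    pass

store = {"group:1/version": 0, "group:1/row:a": "old"}

def read_version(group):
    # read the group's version entry at the start of the transaction
    return store[f"{group}/version"]

def commit(group, expected_version, writes):
    # re-check the version at commit; a mismatch means another writer got in first
    if store[f"{group}/version"] != expected_version:
        raise ConflictError("concurrent update, retry the transaction")
    store.update(writes)
    store[f"{group}/version"] = expected_version + 1

v = read_version("group:1")
commit("group:1", v, {"group:1/row:a": "new"})        # succeeds, bumps version
try:
    commit("group:1", v, {"group:1/row:a": "other"})  # stale version: rejected
except ConflictError as e:
    print(e)
```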

infinispan (data grid)

  • FLOSS from jboss
  • formerly jboss cache
  • supports read committed and repeatable read
  • distributed transactions
  • supports eager distributed locking
  • mvcc

DRKP: working on "mul-tree-cast" membership service

  • multicast updates down multiple trees
  • each tree has the same root (leader)
  • leader changes by epoch
  • is BFT (without BFT protocol)
  • the idea is that if any leader is Byzantine-faulty, just wait for the next epoch
  • they use the currently constructed trees to disseminate the other trees being constructed
  • trees are constructed using a deterministic algorithm that each node can independently calculate
  • problem: what if you send different subsets of the membership to each tree?

memcached

  • cmds: set, add, replace, append, prepend, get, gets/cas, delete, incr/decr, stats*, flush_all, get_multi
  • all ops atomic; can set expiration time (impl: lazy on-access expiration); LRU
  • max key 250B; max value 1MB; slab mem allocator
  • TCP and UDP, ASCII and binary wire protocols
  • non-blocking; multi-threaded parsing, but not cache access; may be future work
  • interesting use cases
    • flood control: allow a user's post attempt iff an add with a 60s expiry succeeds (add fails while the previous entry is still live)
  • advantages vs db caching
    • size: no need for 1:1 balance btwn cache nodes and db nodes
    • efficiency: RDBMSs have a lot of other engineering (buffer manager, etc.); also, eg, mysql writes invalidate buffer
    • control: build very efficient caches, at the level application desires (cache not just DB accesses but the result of arbitrary computations/accesses)
  • pains, greatest to least
    • requires manual invalidation; del or put new value (usu preferred)
      • typically used as read-only cache, so only worry about staleness
      • though, causality can be easily guaranteed with version numbers
    • stampedes: overload backing DB on expiry/invalidation (if highly requested)
      • reduce window by:
        • using app-level expiry timestamp; on get, if expired, immediately put back with extended deadline to prevent others from stampeding; then hit DB and update cache (see the sketch at the end of this section)
        • use probabilistic app-level expiry timestamps
        • somewhat helps: invalidate via cron instead of on-demand
        • somewhat helps: for invalidation, update eagerly instead of delete
      • avoid by: issue a unique work queue item to hit DB & update cache, eg using gearman
      • [ideally: allow the above "notify-one" expiry, informing later requests that a request is already working on it; then the others can wait a while for the value to appear in memcache, trying again if they still don't see it after the delay, in case the first requestor failed to hit DB & update]
    • see note on facebook's fancier invalidation experiences
    • no secondary indexing: can't reference multiple keys to one value; must do this app-level
  • tips
    • scalable lists
      • maintain lists of keys instead of data; can spread load too
      • chunk up the lists

velocity

mongodb

  • data model/organization
    • server can manage multiple isolated DBs
    • each DB has 1+ collections; collections have documents (schema-less objects with fields) + ACL
    • all objects have _id field; always indexed
  • BSON: like JSON data model; general binary serialization format; primitives, objects, arrays
  • dynamic queries
    • JSON-like notation; equality and gt/lt comparators
    • atomic update modifiers via set operators (constrained to same document); see the pymongo sketch at the end of this section
    • generates plan, uses indexes; cursors
  • indexing on inner objects and embedded arrays; compound keys; specify sorting order; indexes support unique
  • query profiler
  • master-slave replication: async; both manual and auto
    • replica pairs (sets): node 1 is master 1 slave 2, node 2 is master 2 slave 1
    • auto failover
  • efficient storage of binary data incl. large objects (photos, videos)
  • auto-sharding for high scalability
    • each shard is replica pair and manages some chunks
    • chunk: contiguous range of objects from a particular collection, ordered by a key (specified similarly to an index)
    • other components: frontend coordinators; config/metadata servers
  • aggregation (group by); support for large map-reduce operations
  • no durability: no sync disk/replica writes
  • more features: capped collections, gridfs for large blobs
  • focus on performance
    • C++; mmap'd files
    • no REST; binary TCP protocol
  • many supported langs, platforms
  • no txns
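
A minimal sketch of the dynamic-query and atomic-modifier style above, using the pymongo driver; the database, collection, and fields are illustrative.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client.testdb.users

users.insert_one({"_id": 1, "name": "ada", "logins": 0, "tags": ["admin"]})

# dynamic query: JSON-like notation with comparison operators, sorted cursor
for doc in users.find({"logins": {"$gte": 0}}).sort("name"):
    print(doc)

# atomic update modifiers, constrained to a single document
users.update_one({"_id": 1}, {"$inc": {"logins": 1}, "$push": {"tags": "beta"}})
```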

sinfonia (HP including mehul shah, sosp07)

  • application nodes & storage nodes are separate
  • data model: raw linear address space; accessed directly by client libs
  • minitransaction: a single-shot txn containing compare, read, and write phases (see the sketch below)
    • batchable; can be executed within phase 1 of 2PC; in par with replication
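
A minimal sketch of a minitransaction: compare items are checked first, and only if they all match are the reads returned and the writes applied in one shot. The per-node byte arrays stand in for the raw linear address space; everything here is illustrative.

```python
memory = {0: bytearray(64), 1: bytearray(64)}   # node id -> linear address space

def exec_minitransaction(compares, reads, writes):
    """compares/writes: (node, addr, data); reads: (node, addr, length)."""
    for node, addr, data in compares:            # compare phase
        if bytes(memory[node][addr:addr + len(data)]) != data:
            return None                          # abort: a compare item failed
    results = [bytes(memory[node][addr:addr + n]) for node, addr, n in reads]
    for node, addr, data in writes:              # write phase
        memory[node][addr:addr + len(data)] = data
    return results

# compare-and-swap style use: write only if the old value is still in place
memory[0][0:4] = b"AAAA"
print(exec_minitransaction(compares=[(0, 0, b"AAAA")],
                           reads=[(1, 0, 4)],
                           writes=[(0, 0, b"BBBB")]))
```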

RAMcloud (ousterhout, HPTS09)

  • goal: low latency
  • rationale
    • 64GB RAM/server w 1M ops/s, 5-10us RPC; 1TB RAM in 5-10 yrs
    • $60/GB today; $4/GB in 5-10 yrs
    • lower latency reduces txn durations, conflicts

gizzard (twitter)

  • middleware sharding framework
  • for any storage system, not just sql
  • specify range routing table; more control than consistent hashing
  • each shard is a replication tree
    • each partition can have distinct replication tree
    • better fault tolerance
  • async replication only
  • writes are buffered till shard or replica available again
  • requires writes to be idempotent; retry-later strategy can apply ops OOO (eg newer ops before older failed ops); see the sketch below
  • handles migrations by applying WriteOnly shard in front of new node and replicating ops to both
  • http://engineering.twitter.com/2010/04/introducing-gizzard-framework-for.html
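
A minimal sketch of why idempotent writes matter here: an op that "sets field f as of time t" can be retried or applied out of order without regressing the row, whereas a bare increment cannot. Names are illustrative.

```python
def apply_set_if_newer(row, field, value, ts):
    """Idempotent write: replaying or reordering this op cannot regress the row."""
    current_ts = row.get(field + "_ts", 0)
    if ts > current_ts:
        row[field] = value
        row[field + "_ts"] = ts
    return row

row = {}
apply_set_if_newer(row, "bio", "new bio", ts=200)   # newer op arrives first
apply_set_if_newer(row, "bio", "old bio", ts=100)   # retried older op: no-op
print(row)   # keeps the ts=200 value despite out-of-order application
```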