TODO merge in notes from CS186

Background

resources

http://blog.marcua.net/post/117671929/mit-database-systems-6-830-ta-course-notes

vocab

PDBMS = MPP
data cube: hypercube of multi-dim info; analytical aggregation along dims
a predicate is sargable if DBMS can use index to speed up execution
- usu. requires invertible, order-preserving (monotonic) function on indexed attribs

fields

analytical processing, BI, data warehousing
transaction processing: worry about scaling, consistency
stream processing, sensor nets
data integration, deep web, federated db's
- template extraction, schema recognition
- schema resolution
- entity resolution
- data extraction, wrapping (deep web)
- query routing (federated db's)
information extraction

companies

BI
- vectorwise: orig monetdb; from CWI, holland; led by marten kersten
  - partnered with ingres
  - shared-everything (single machine) multicore
  - open-source column store (actually row/col hybrid)
  - query operators are run via query execution primitives written in low-level code that allow compilers to produce extremely efficient processing instrs
  - vectors of 100-1000 values get SIMD'd
  - founders from CWI
  - 3GB/s decompression, 4-5 cycles/tuple, 3-4X ratio
  - operate in decompressed domain
  - shared scans
- kickfire/C2: entered with TPC-H win
  - column store, execution, compression
  - FPGA for data- and task-parallelism
  - targeting mysql
  - died, acquired by teradata at bargain price
  - http://dbmsmusings.blogspot.com/2010/08/thoughts-on-kickfires-apparent-demise.html
- sybase iq: data mining
  - scale-out; C++ UDFs; embedded data mining/predictive analytics lib (modeling & scoring)
- sas: has own internal db
- teradata
- XSPRADA/Algebraix
- paraccel
- kognitio
- enterprisedb: postgresql company
  - major contributor to postgresql
  - postgresql plus standard server: extensions, eg connectors, geospatial, replication, caching
  - postgresql plus advanced server: deeper oracle compat; infinite cache
- greenplum: postgresql based; acquired by emc
- aster data: customers: myspace; hadoop interop; COTS; investors: sequoia
  - ncluster: row-based MPP DBMS based on PG; supports map-reduce
- calpont's infinidb: mysql + columnar storage engine (MPP, dict/token compression, multicore queries, ACID, no indexes/mat views, "extends" like netezza's zone maps)
  - tried this out since it seemed like most promising free c-store; suffers from mysql row size limit (no large tables, which is common in OLAP!)
  - varchar 8000 max
- infobright: wants multicore queries, ACID
  - also column store, competes with infinidb
  - tried this out; not the fastest but pretty good sql support out of box
  - no CREATE TABLE AS
- luciddb, monetdb: supposed to be slower/less mature than infinidb/infobright
- netezza: row-based db hw with compression, clustering; acquired by ibm
  - compression: prefix compression, huffman coding
  - clustering: multi-dimensional simultaneous partitioning w space-filling hilbert curve
  - twinfin: well-designed proprietary hw appliance
  - twinfini: parallel
- datallegro/ms
- xtremedata: plugs into cpu socket for cpu bus bandwidth
- groovy sql switch: mem oltp rdbms, appliance & sw
- vertica: column store
  - stonebraker
  - h-store
main-memory
- ibm soliddb: uses tries instead of btrees for indexes
- oracle timesten
- scaledb: mysql clustering; uses tries instead of btrees for indexes; better in-mem perf, more compact
distributed
- continuent tungsten replicator: DB-neutral master/slave async replication
elephants
- MS: sql server
  - shared scans
- oracle/sun: mysql
  - oracle exadata: analytical
- postgresql
- hp
  - hp neoview: analytical
- ibm
emc
marklogic: XML DB with XQuery as primary QL

people

dave dewitt: uwisc prof; started msr wisc

misc

XA transactions: open group specification for common dtxn standard

denormalization: simplex/complex reads vs writes

optimizations TODO

from ormperf: caching, outer-join fetching, transactional write-behind, etc.

applications

on-line transaction processing (OLTP): many small updates
on-line analytical processing (OLAP) aka data warehousing: ad-hoc, long-running queries
decision support systems
business intelligence (BI)
data mining

OLAP

multi dimensional expressions (MDX): lang for querying OLAP cubes
- resembles SQL on surface; expresses things that may be clumsy in SQL
- basically like pivot tables over RDBMSs
closely related: cubes, pivot tables, contingency tables

OLAP cubes

hypercubes of summarized data
- often generated from raw data stored in data warehouse as star schema, where central facts table links to dimension tables
- think of this as an excel pivot table; distinct values are hoisted to the forefront to define the dimensions of the hypercube
- "dimensions" (an overloaded term) are discrete/compound data types,
- "measures" are scalar and can be raw values that get aggregated (Temperature), or generated as aggregates (COUNT(), etc.)
- ref: https://dba.stackexchange.com/questions/45655/what-are-measures-and-dimensions-in-cubes
MOLAP: pre-aggregated dense arrays; resembles pivot tables
- fast to query/update but restricts queries and no raw data drill-down
- also, tend to have sparse cubes
ROLAP: ad-hoc queries with SQL, esp. column stores; flexible
- PG and other DBMSs have ROLLUP/CUBE/GROUPING SETS that conveniently generate more aggregations, but all you really need is GROUP BY
- Microsoft has a convenient SQLish language for querying OLAP cubes: https://en.m.wikipedia.org/wiki/MultiDimensional_eXpressions
- generating cubes means specifically generating this intermediate representation; column stores encourage end-to-end ad-hoc queries: https://www.quora.com/How-do-I-start-OLAP-cubes-with-Amazon-Redshift
- tangentially, can also do excel pivot tables in PG: https://stackoverflow.com/questions/3002499/postgresql-crosstab-query/11751905#11751905
completely unrelated to postgresql's cube data type

From https://www.scribd.com/document/259252509/Difference-Between-Column-Stores-and-OLAP-Data-Cubes# and https://webcache.googleusercontent.com/search?q=cache:nJHhYpMp85EJ:https://my.vertica.com/tag/data-cube/+&cd=3&hl=en&ct=clnk&gl=us and https://docs.google.com/document/d/1ZkqFu5Q5GP9N8ePlBmXhzF_zIFpoDav9BCvL5fWeN0w/edit#:

Both column-stores and data cubes are designed to provide high performance on analytical database workloads (often referred to as Online Analytical Processing, or OLAP.) These workloads are characterized by queries that select a subset of tuples, and then aggregate and group along one or more dimensions. For example, in a sales database, one might wish to find the sales of technology products by month and store–the SQL query to do this would look like:

SELECT month, store, COUNT(*) FROM sales, products WHERE productType = ‘technology’ AND products.id = sales.productID GROUP BY month, store

In this post, we study how column-stores and data cubes would evaluate this query on a sample database:

Column Store Analysis In column-stores, this query would be answered by scanning the productType column of the products table to find the ids that have type technology. These ids would then be used to filter the productID column of the sales table to find positions of records with the appropriate product type. Finally, these positions would be used to select data from themonths and stores columns for input into the GROUP BY operator. Unlike in a row-store, the column-store only has to read a few columns of the sales table (which, in most data warehouses, would contain tens of columns), making it significantly faster than most commercial relational databases that use row-based technology. Also, if the table is sorted on some combination of the attributes used in the query (or if a materialized view or projection of the table sorted on these attributes is available), then substantial performance gains can be obtained both from compression and the ability to directly offset to ranges of satisfying tuples. For example, notice that the sales table is sorted on productID, then month, then storeID. Here, all of the records for a givenproductID are co-located, so the extraction of matching productIDs can be done very quickly using binary search or a sparse index that gives the first record of each distinctproductID. Furthermore, the productID column can be effectively run-length encoded to avoid storing repeated values, which will use much less storage space. Run-length encoding will also be effective on the month and storeID columns, since for a group of records representing a specific productID, month is sorted, and for a group of records representing a given (productID,month) pair, storeID is sorted. For example, if there are 1,000,000 sales records of about 1,000 products sold by 10 stores, with sales uniformly distributed across products, months and stores, then the productID column can be stored in 1,000 records (one entry per product), the month column can be stored in 1,000 x 12 = 12,000 records, and the storeID column can be stored in and 1,000 x 12 x 10 = 120,000 records. This compression means that less the amount of data read from disk is less than 5% of its uncompressed size.

Data Cube Analysis Data cube-based solutions (sometimes referred to as MOLAP systems for “multidimensional online analytical processing”), are represented by commercial products such as EssBase. They store data in array-like structures, where the dimensions of the array represent columns of the underlying tables, and the values of the cells represent pre-computed aggregates over the data. A data cube on the product, store, and month attributes of the sales table, for example, would be stored in an array format as shown in the figure above. Here, the cube includes “roll-up” cells that summarize the values of the cells in the same row, column, or “stack” (x,y position.) If we want to use a cube to compute the values of the COUNT aggregate, as in the query above, the cells of this cube would look like: Here, each cell contains the count of the number of records with a given (productID,month,storeID) value. For example, there is one record with storeID=1, productID=2, and month=April. The “sum” fields indicate the values of the COUNT “rolled up” on specific dimensions; for example, looking at the lower left hand corner of the cube for Store 1, we can see that in storeID 1, productID 1 was sold twice across all months. Thus, to answer the above query using a data cube, we first identify the subset of the cube that satisfies the WHERE clause (here, products 3, 4, and 5 are technology products–this is indicated by their dark shading in the above figure.) Then, the system reads the pre-aggregated values from sum fields for the unrestricted attributes (store and month), which gives the result that store 2 had 1 technology sale in Feburary and 1 in June, and that store 3 had 1 technology sale in February and 1 in October. The advantages of a data cube should be clear–it contains pre-computed aggregate values that make it a very compact and efficient way to retrieve answers for specific aggregate queries. It can be used to efficiently compute a hierarchy of aggregates–for example, the sum columns in the above cube make it is very fast to compute the number of sales in a given month across all stores, or the number of sales or a particular product across the entire year in a given store. Because the data is stored in an array-structure, and each element is the same size, direct offsetting to particular values may be possible. However, data cubes have several limitations:

Sparsity: Looking at the above cube, most of the cells are empty. This is not simply an artifact our sample data set being small–the number of cells in a cube is the product of the cardinalities of the dimensions in the cube. Our 3D cube with 10 stores and 1,000 products would have 120,000 cells, and adding a fourth dimension, such as customerID (with, say, 10,000 values), would cause the number of cells to balloon to 1.2 billion! Such high dimensionality cubes cannot be stored without compression. Unfortunately, compression can limit performance somewhat, as direct offsetting is no longer possible. For example, a common technique is to store them as a table with the values and positions of the non-empty cells, resulting in an implementation much like a row-oriented relational database! Inflexible, Limited ad-hoc query support: Data cubes work great when a cube aggregated on the dimensions of interest and using the desired aggregation functions is available. Consider, however, what happens in the above example if the user wants to compute the average sale price rather than the count of sales, or if the user wants to include aggregates on customerID in addition to the other attributes. If no cube is available, the user has no choice but to fall back to queries on an underlying relational system. Furthermore, if the user wants to drill down into the underlying data–asking, for example “who was the customer who bought a technology product at store 2 in February?”–the cube cannot be used (one could imagine storing entire tuples, or pointers to tuples, in the cells of a cube, but like sparse representations, this significantly complicates the representation of a cube and can lead to storage space explosions.) To deal with these limitations, some cube systems support what is called “HOLAP” or “hybrid online analytical processing”, where they will automatically redirect queries that cannot be answered with cubes to a relational system, but such queries run as fast as whatever relational system executes them. Long load times: Computing a cube requires a complex aggregate query over all of the data in a warehouse (essentially, every record has to be read from the database.) Though it is possible to incrementally update cubes as new data arrives, it is impractical to dynamically create new cubes to answer ad-hoc queries. Summary and Discussion Data cubes work well in environments where the query workload is predictable, so that cubes needed to answer specific queries can be pre-computed. They are inappropriate for ad-hoc queries or in situations where complex relational expressions are needed. In contrast, column-stores provide very good performance across a much wider range of queries (all of SQL!) However, for low-dimensionality pre-computed aggregates, it is likely that a data-cube solution will outperform a column store. For many-dimensional aggregates, the tradeoff is less clear, as sparse cube representations are unlikely to perform any better than a column store. Finally, it is worth noting that there is no reason that cubes cannot be combined with column-stores, especially in a HOLAP-style configuration where queries not directly answerable from a cube are redirected to an underlying column-store system. That said, given that column-stores will typically get very good performance on simple aggregate queries (even if cubes are slightly faster), it is not clear if the incremental cost of maintaining and loading an additional cube system to compute aggregates is ever worthwhile in a column-store world. Furthermore, existing HOLAP products, which are based on row-stores, are likely to be an order of magnitude or more slower than column-stores on ad-hoc queries that cannot be answered by the MOLAP system, for the same reasons discussed elsewhere in this blog.

benchmarks

TPC-C: OLTP
TPC-E: new OLTP (TODO: does anyone use this? what's different?)
TPC-H: OLAP

indexes

primary vs. secondary
- primary: no duplicates, usu. primary key, may store rec directly in index (otherwise point to rec using rid)
- secondary: support duplicates, points to rec using rid or primary key
clustered vs. unclustered
- TODO

joins

bitmap join indices
- build bitmap index over low cardinality outer table R (at least on join key)
- map into this while scanning over S
simple hash join
- build hash table of R in memory, and probe into it with S
- on overflow, start spilling to file R'; S similarly spills to S'
  - a new hash function h' is used to determine whether incoming R tuples should be sent to the hash table or to disk
  - puzzle: should you just stop reading and perform joins with what you have so far (i.e. potentially many keys, but not all tuples with that key are necessarily in the table)? or should you evict things to make space for all the tuples belonging to each key in the hash table (i.e. have fewer keys)?
    - the former degenerates into O(n^2) nested loop joins - worse because you can't start to throw out tuples of S at any point
    - so you'll need to start making room for the incoming tuples that h' says should go in memory - but now you need to choose an eviction policy (probably unimportant since it's never discussed - or I'm just misunderstanding something)
- recursively (iteratively?) repeat on the overflow files R' and S'
grace hash join (assume |R| < |S|)
- phase 1: partition R and S into buckets R_i and S_i (all spilled onto disk)
  - hash function dictates partitioning; same for R, S so R_i need only be joined with S_i
  - if ~B buffer pages, can have at most ~B partitions (1 output buffer each)
- phase 2: for each R_i, build hash table, then probe S_i against this
  - if R_i is too big and doesn't fit in memory, overflow
    - can recursively partition R_i back onto disk into sub-partitions
hybrid hash join
- uses same partitioning, but instead of using all of memory for output buffers, use part of it for the first bucket's hashtable (i.e. build hash table for bucket 1 immediately)
- usually you want to make sure you have enough buckets so that each bucket will subsequently be able to fit in memory; if this is the primary goal, then the immediate hash-tabling of bucket 1 is opportunistic (using whatever pages remain)
- makes sense when you have larger memory (N buffer pages >> B buckets)

logging and recovery

buffer management policies
- steal: steal frames with dirty pages
  - thus dirty pages are written ("swapped") to disk
  - faster but complicates recovery
- force: force all updates to disk before commit
  - thus committed pages might not be on disk
  - slower but helps recovery
```
           no steal             steal
```

no force no undo, redo undo, redo force no undo, no redo undo, no redo

write-ahead logging (WAL)
- undo info: force log record to disk for an (uncommitted) update before corresponding data page is written to disk (swapped/stolen)
- redo info: force log record to disk for a xact before commit
log seq num (LSN): increasing ID for each log record
- each data page contains a page LSN, the LSN of the last log record for an update to that page
- system keeps track of flushed LSN, the max LSN flushed so far
technique to ensure coherent page written to disk: include in commit marker the CRC of the data
- add'l barriers vs CRCs: http://lwn.net/Articles/283161/
WAL: before page $i$ written to disk, log must satisfy pageLSN $\le$ flushedLSN
log records
- fields: LSN, prevLSN, XID, type, [pageID, len, offset, beforeImg, afterImg]
- prevLSN: LSN of last record in this xact
- type: update, commit, abort, checkpoint (for log maintenance), compensation (for undo), end (end of commit/abort)
- [...]: updates only
in-memory tables
- xact table: XID, status (running/committing/aborting), lastLSN (last written by xact)
- dirty page table: pageID (entry exists for each dirty page), recLSN (log record which first dirtied the page)
checkpoint periodically for faster recovery

B+ trees TODO

B-link-trees: sibling links are for concurrency, as a "temp repair bridge" while parent nodes are being rebalanced

LSM trees

uses log-structured merge (LSM) trees to create full DB replicas using purely sequential IO
LSM-trees don't become fragmented; provide fast, predictable index scans
LSM-tree lookup perf comparable to B-tree lookups
good for many (amortized) random writes; bad for random reads (need to hit multiple components)

mysql

engines
- myisam
- innodb
- percona's xtradb: innodb replacement that scales better to mcores and better utilizes mem
- mariadb's maria: evolve myisam toward innodb; adds txns, crash safety, etc
forks
- mariadb: founder's community-oriented fork; dissatisfied with sun procedure
- drizzle: complete refactoring/slimming by brian aker, long-time mysql dev

innodb

index-organized tables
- full row data in PK index; saves space
- 2ary indexes refer to PK value, not phys loc (can change due to splitting)
- can't add/drop/change PK def'n without full-table rewrite
- can't scan in physical page order; must scan in key order
supports index-only scans; covering indexes run faster
MVCC
- updates copy old row to rollback segment first
- deletes mark row for deletion, insert info into rollback segment too
- "purge" GC can quickly determine what to discard from rollback segment
- purge is single-threaded
- not sure how deleted space is reclaimed/reused/kept around (doesn't info about it get removed from the rollback segment by the purge process?)

sqlite

concurrency
- one global lock
- multi-process concurrency: only whole-database locks
  - unlocked, shared, reserved, pending, exclusive
    - reserved: intend to write (upgrade to exclusive)
    - pending: waiting to upgrade to exclusive; prevents new locks
  - isolation levels: eg BEGIN DEFERRED
    - deferred: acquire only on first query
    - immediate: acquire reserved lock right away; reads can still proceed
    - exclusive: acquire exclusive lock right away
- shared-cache concurrency (disabled by default)
  - supports multiple cnxns to shared cache; 1 txn per cnxn at a time
  - 3 levels of locking
    - txn-level locking: at most 1 write txn at a time; txn implicitly becomes writer on first INSERT/UPDATE/DELETE; coexist with any readers
    - table-level locking: reader and writer locks
    - schema-level locking: on sqlite_master catalog table; some special rules
  - isolation levels
    - default isolation level is (actually) serializable, since <=1 writer per db at a time
    - read uncommitted: cnxn doesn't take read locks; no blocking
- features
  - heavy testing: 100% branch coverage; fault injection (malloc, IO); virtual file system snapshots to simulate crashes
  - virtual table: you can export a table interface and provide callback implementations for events on this table
storage
- data stored directly in btree

sql server 2008

integration services: ETL tools
reporting services
analysis services: BI tools
powerpivot for excel/sharepoint makes report designing easier/interactive
full text search engine
r2
- streaminsight: CEP
- master data management: dealing with multiple DBs/sources
- reporting services integration w sharepoint CMS: publish olap analytics

postgresql

table data and PK index are separate
- can scan in physical page order; fast
- supports changing PKs wo full-table rewrite; even allows concurrent writes
no index-only scans yet; covering indexes don't help
- adding more cols to indexes slows things down
MVCC: on update, old row left in place, new row quickly inserted
- VACUUM must scan all data as a result
- VACUUM is multi-threaded
users: disqus, skype, debian, omniti
- http://www.postgresql.org/about/users

postgresql vs mysql

table org: PK index & table data sep vs same (see sections)
- neither can scan index in physical order, only in key order
- http://rhaas.blogspot.com/2010/11/mysql-vs-postgresql-part-1-table.html
MVCC GC: VACUUM vs purge
- innodb: updates must copy old row to rollback segment
- PG: faster updates, slower VACUUM scans all data
- http://rhaas.blogspot.com/2011/02/mysql-vs-postgresql-part-2-vacuum-vs.html

oracle

table data and PK index can be same (mysql) or sep (pg)

distributed postgresql

postgres-r: async master-master replication
pgcluster: sync multi-master replication
pgpool-II: sync

distributed mysql

mysql cluster/NDB (network database)
- node types
  - mgmt node: config server, start/stop nodes, run backup, etc
  - data node: number of data nodes = replicas * fragments
  - sql node or api node
- database
  - partition: a fragment of the database; these are replicated
  - replica: a copy of a partition
  - hash-partitioned on table primary keys
  - stored in mem or on disk asynchronously
- organization
  - data node: manages a replica
  - node group: manages a partition
- checkpoints
  - local: save all data to disk; every few mins
  - global: txns for all nodes synced; redo log flushed; every few secs
- system
  - replication may be sync or async
  - 2PC
  - no referential integrity
  - no txn isolation above read committed, so dtxns are naive
  - replication via mysql replication (replicate entire cluster)
mysql replication
- system
  - async replication or semi-sync replication (wait for receipt; contributed by google in 2007)
  - supports multi-master
  - simple "greatest timestamp wins" conflict resolution
  - no auto failover
- formats
  - statement-based replication: stream the queries
  - row-based replication: stream the updates
  - mixed format: default; usu statements, but rows for things like UUIDs
mysql multi-master replication manager (MMM)
- async replication with auto failover
- also: flipper, tungsten

clustering

blenddb

colocate related tuples on same pages in navigational queries
- eg web page for a movie: actors >< movies >< directors; movies -- ratings
- eg further navigate into
- construct paths starting from prim key into one table and following foreign key joins to other tables; eg movie.id -> acted.movieid -> actor.id, movie.id -> directed.movieid -> director.id, movie.id -> ratings.movieid
- pull in all tuples read by those paths
- clustering is an offline process
  - optimization i didn't try to fully understand: prioritize path heads by how frequently they spill over one page (don't waste space on path sets that are too big anyway and require multiple seeks?)
- faster than performing joins online and less space than materialized views
related work
- (stochastic) clustering in OODBMSs: could use their algos here, but they're about migration not replication
- merged indices of multiple cols/tables: can cluster on this, but that only benefits joins on a single key, i.e. no navigation
- memcached: blenddb is not a cache; prepped for any data
- multi-table indices, eg oracle clusters: crappy implementation of merged indices that requires sep keys to be on sep pages
first work on clustering in RDBMS

RDBMSs

amazon relational database as a service (RDS)

mysql 5.1
synchronous wan repl (multi-az)
TODO how does it scale?

OLTP Through the Looking Glass, and What We Found There (stavros, dna, madden, stonenbraker)

lots of CPU overhead in conventional RDBMS
removed features from Shore to get to main memory DB
- Shore uses user level non-preemptive threads with IO processes
- [hence latching overheads not from multiprocessing; unclear why latching is so impactful or even prevalent]
went from 640 to 12700 TPS; down to 6.8% of orig work
removed in order:
- 35% buffer mgr
- 14% latching: many structs
- 16% locking: 2PL lock mgr
- 12% WAL: building records; pinning records; IO
- 16% hand-coded optimizations: tuned B trees
"every reason to believe that many OLTP systems already fit or will soon fit into main memory"
- [but no cost benefit analysis; james hamilton says even move to SSDs is not worth it, contrary to myspace's claims]

scientific databases

general

projects
extremely large database (xldb): workshop started by jacek becla of slac
- 55 PB raw images, 100 PB astronomical data for LSST
- involvement
  - ppl: stonebraker, dewitt, kersten
  - vendors: teradata, greenplum, cloudera
  - companies: ebay, web companies
scidb: stonebraker, dewitt
- A data model based on multidimensional arrays, not sets of tuples
- A storage model based on versions and not update in place
- Built-in support for provenance (lineage), workflows, and uncertainty
- Scalability to 100s of petabytes and 1,000s of nodes with high degrees of tolerance to failures (including query fault tolerance)
- Support for "external" objects so that data sets can be queried and manipulated without ever having to be loaded into the database
- Open source in order to foster a community of contributors and to insure that data is never "locked up"---a critical requirement for scientists
refs
- http://www.dbms2.com/2009/10/03/issues-in-scientific-data-management/
- http://www.dbms2.com/2009/09/12/xldb-scid/

scientifica (DB group meeting talk, Mike Stonebraker, 4/24/08)

multi-dimensional array-based
- interesting question is storage manager: how to partition the data (and then in each partition, how to chunk it)
uncertainty
- minimal, since error analysis is different for each application
- use error bars
lineage, provenance
- need also to remember the derivation process, not just origin
interest from various scientific institutions but also amazon, google, etc.
CERN had planned to use Objectivity

Efficient Provenance Storage

main problem: compressing binary xml
intro
- fields: science, law, etc.
- dataset 270mb, provenance store 6gb

parallel/distributed olap

"query evaluation techniques for large databases" TODO

HARBOR (Lau, VLDB06)

intro
- integrate mechanisms for recovery and high availability
- take advantage of data redundancy in replicated dbms
- features
  - no logging
  - recover without quiescing
traditional approaches
- logging
approach overview
- data warehousing techniques (assumptions)
  - data replication: logical (phys unnec); helps perf anyway
  - snapshot isolation
    - large ad-hoc query workloads over large read data sets intermingled with a smaller number of OLTP transactions
    - using snapshot isolation avoids lock contention (TODO is this a requirement?)
    - harbor uses time travel mechanism similar to snapshot isolation
    - historical queries of past can proceed without locks since past data never changes
- fault tolerance model
  - k-safety: requires k+1, k may fail
  - no network partitions, no BFT
- historical queries
  - assume T < now, so no locks needed (result set don't change due to updates/deletes after T)
- recovery approach
  - checkpoints: flush dirty pages and record current time T
  - after coming back up, execute historical query on other live sites that replicate your data; this catches you up to a point closer to present
  - execute standard non-hist query to catch up to present; this uses read lock, but is shorter than the prev query
query execution
- recovery query is a range query on T, so break up data into time-partition segments; means normal queries will need to merge from all the segments
- commit processing
  - 2pc with changes
    - commits include commit times (for modified tuples)
    - in-mem lists of modified tuple ids can be deleted on commit/abort
  - opt 2pc: no need for logging/disk writes, except coord
  - opt 3pc: no need for logging/disk writes, including coord (persist state onto workers, basically); non-blocking (can establish timeout)
recovery
- query remote, online sites for missing updates
evaluation
- compare against 2pc and aries

A Performance Evaluation of Four Parallel Join Algorithms in a Shared-Nothing Multiprocessor Environment (DeWitt, SIGMOD '89)

tried out some hash joins in gamma
simple hash join
- partition tuples among the join nodes (via a split table)
- overflow possible at any of the join nodes; overflow handling is same as in centralized version, i.e., each node has its own overflow h' and R'
- the split table is augmented with h', so S tuples are directed to either the correct join node, or to S'
grace hash join
- each partition (bucket) is in turn partitioned across all partition nodes
- then, for each bucket, its tuples are partitioned across the join nodes, where the hash table is constructed
- hence, all partition nodes talk to all join nodes
hybrid hash join
- the bucket-1 tuples go straight to some join nodes, and the rest are spilled to the partition nodes
- it seems in their system that join nodes are diskless, so that's all they can do - but if the set of join nodes overlaps with the set of partition nodes, then you can imagine immediately processing a number of buckets immediately (as many as there are nodes - the immediate hash table will just consume part of the memory of each of the join nodes)

chained declustering (dewitt 92)

problem space: distributed data placement (replication, clustering, etc.) strategies for high availability, reliability in shared nothing dbs
compared against:
- tandem: mirrored disks
  - each partition mirrored on 2 disks that are each connected to same 2 controllers that are each connected to same 2 cpu's
  - on cpu failure, cpu 2 has to support the load of parts 1 & 2
  - [silly: the problem with this comparison is that it's comparing a shared-nothing system against a shared-something system]
    - equivalent shared nothing system is one where you treat each disk pair as the single partition they're storing, and replicate each such partition onto 2 machines
    - and that would be obv. stupid
- teradata
  - each partition has primary and backup copies
  - in $n$-node system, fragment the backup copy into $n-1$ pieces scattered over the other nodes
  - good innate load redistribution on failure
  - but expensive writes in normal operation
- RAID5: maintain (parity) bytes
  - even a blind write of one sector of a block requires 4 accesses (1 extra "round"): read all 3 other disks then write parity
chained declustering
- eg
  - node 1: part 1 primary + part $n$ backup
  - node 2: part 2 primary + part 1 backup
  - node 3: part 3 primary + part 2 backup
  - etc
- on failure, evenly redistribute the partition over all other hosts
- in larger systems, don't spread across all nodes; partition up system (called "relation clusters")
[this was not designed for high scaling distributed systems]
- assuming indep failures
- back to inefficient system if nodes $i$ and $i+2$ fail
- compared against weird old architectures, not giant commodity pc clusters in datacenters

hybrid-range partitioning strategy: a new declustering strategy for multiprocessor database machines (dewitt, vldb90)

partitionings in gamma: round-robin, range, hash
- for small range queries, range partitioning can localize execution to only relevant processors and is faster
- for large range queries consuming significant CPU/IO resources, hash/RR can parallelize, lowering response time
hybrid-range partitioning strategy (HRPS): do query analysis to get optimal degree of intra-query parallelism, balancing range and hash/RR
- fragments contain $FC$ tuples and fragments contain unique range of values of partitioning attr
- collect some stats on avg query characteristics, plugging into formulas for first derivatives to optimize (e.g.) # servers to distributed across
The hybrid-range partitioning strategy is an alternative to hash- and range-partitioning that uses query analysis to compute the typical resource consumption requirements of queries and from this determines the optimal partition size for range-partitioning. The scheme attempts to reduce resource consumption by localizing small range queries while declustering and parallelizing long-running range queries; however, HRPS is only applicable to range queries.

adaptively parallelizing distributed range queries (adam silberstein, brian cooper, vldb09)

problem space: parallelizing range queries
- traditionally, lay out data to achieve highest throughput/max parallelism
- in high-scaling systems like PNUTS/bigtable, maximizing server parallelism produces too much throughput for 1 client
assumption: tables are range-partitioned
proposal: adaptive approach
- step 1: adaptive server allocation
  - find ideal parallelism for single query execution
  - depends on: query selectivity, client load, client-server BW, etc
  - min # servers $K_q$ for query $q$ is adjusted as scanning proceeds
- step 2: multi-query scheduling
  - even after knowing $K_q$, must avoid contention (assigning too many queries to same server simultaneously)
  - num concurrent scans on a server to avoid random IO can result in faster execution for all queries
  - minimize disk contention and ensure all queries get good perf
  - max # queries $L_s$ on server $s$
tunable: can favor short vs long queries or high vs low priority
implemented & evaluated in PNUTS

gamma (project dewitt led 8x-92)

shared-nothing dbms; horizontal partitioning
implemented and evaluated parallel joins: simple hash, grace hash, hybrid hash, sort-merge

sharing

shared memory
shared disk
shared nothing
the first two are relics anyway

why main memory is bad for olap (DB group meeting talk, Mike Stonebraker, 4/24/08)

compared: Vertica with disk, Vertica with ramdisk, and Vertica WOS only
- seek times disappear; a few disks can satisfy bandwidth (even for several cores)
- absence of indexes contributes 2x
- compression contributes 7x
SAP will compare their TREX against Vertica

distributed dbms

building a database on s3

this system completely gives up concurrency control (write-write conflicts) and coherency (write-read consistency)
durability via SQS
highly inefficient, but claims "good for internet speeds"; several seconds per transaction
http://www.dbis.ethz.ch/people/kraskat/SIGMOD-s3-slides.pdf

rose: compressed, log-structured replication (russell sears, VLDB08)

ROSE: replication oriented storage engine
- storage engine for high-tput replication
- targets seek-limited, write-intensive workloads that do near-real-time OLAP
- target usage in replication: replication log (of actual data) comes streaming into replicas, which are fed as input to ROSE (which itself just uses STASIS for logging LSM tree ops)
- although targeting replication, provides high-tput writes to any app that updates tuples without reading existing data, eg append-only, streaming, and versioning DBMSs
- write perf depends on replicas' ability to do writes without looking up old values; reads are expensive (esp. of deleted values)
maintains multiple versions/snapshots for each item, supporting eg OCC, MVCC, and more
- deletion: tombstone insertion
introduce page compression format that takes advantage of LSM-tree's sequential, sorted data layout
- increases replication tput by reducing sequential IO
- enables efficient tree lookups by supporting small page sizes & doubling as an index of the values it stores
- any scheme that can compress data in a single pass and provide random access to compressed values could be used by rose
replication envs have multiple readers, 1 writer; thus rose gets atomicity, consistency, isolation to concurrent txns without resorting to rollbacks, blocking index reqs or interfering with maintenance tasks
rose avoids random IO during replication and scans
- leaves more IO capacity for queries than existing systems
- provides scalable, real-time replication of seek-bound workloads
analytical models and experiments: OOMs greater replication bandwidth
column-store within pages (a la PAX)
- within each column, doesn't sort by column values; just stores in tuple slot id order
- adds overhead to lookups
- appends: append to each col; buffered staging space

OLTP

the end of an architectural era (it's time for a complete rewrite) (stonebraker, madden, dna, stavros, nabil, pat helland)

features
- main memory
  - "vast majority of OLTPs <1 TB in size, and growing slowly"
  - [implies you're throughput bound vs. capacity bound]
  - timesten, soliddb: inherit RDBMS baggage of System R, incl. disk-based recovery log, dynamic locking
- single-threading vs. multi-threading and resource control (sync)
  - OLTP txns lightweight
  - current systems interleave CPU and IO via synchronization
- elasticity vs. fork-lift upgrades
- high avail through shared-nothing replication
- no knobs
txn and schema characteristics
- OLTP tend to have highly partitionable tree schemas [think entity groups]
  - [unsubstantiated; I do believe a more cyclic reasoning: high-scaling apps already are partitionable]
- constrained-tree application: txns run within 1 partition
- one-shot txn: can be executed in parallel without requiring intermed results to be serialized; still needs 2PC; can be done via vert part
- also: two-phase, strongly two-phase, commutativity
system architecture
- all txns are stored procedures; runtime in same thread as DB
- distributed txn mgmt: single-sited, one-shot, general
- replication and recovery
- DB designer: make as many txn classes as possible single sited
  - [this is focused on static property of txn classes; I would like to focus on dynamic property of txn instances, i.e. individual objects]
[lot of faith in TPC-C]

h-store concurrency control (evan jones)

meat is speculative execution with multi-partition tweak
local speculation scenario
- say you have the following txns, where x=2 is on partition p and y=10 is on q:
  - txn A: swap x, y
    - this is a multi-partition txn
    - this is multi-fragment on p
  - txn B1: x++
  - txn B2: x++
  - end result should be x=12, y=2 OR x=11, y=3 OR x=10, y=4
- timeline: say we execute in A, B1, B2 order
  - read x and read y, sending to coord
  - cannot yet begin speculation of B1/B2, since A left inconsistent state
    - ow, second fragment of A will overwrite B1/B2's effects
    - end result will be x=10, y=2; not serializable
  - locking or occ can "execute" B1, but they have overhead, and in this (conflicting) case, they will be blocked
  - each partition gets second fragment of A from coord
  - each partition sends ack for A, waits to commit (2PC)
  - while waiting, can start speculating B1/B2
  - must queue results; release only once no longer speculative
    - first fragment of a multi-part txn can always be speculated like this
    - can do better for speculating these first fragments of multi-part txns if using central coord and coord is aware of speculative txns; see next sec
  - must keep undo buffers; must undo/re-exec B1/B2 iff A aborts
if you have one coord for a batch of multi-partition txns...
- A, B1, C, B2, where C increments both x,y; note C is multi-part single-frag
- everything goes as before up till B1 speculation
- because A and C have same coordinator, q can return the result to its portion of C, along with indication that result is speculating on A
  - this is the end of all of C; it's a speculative ack
  - after committing A, coord can immediately commit C
- p can speculate its fragment of C, but like B1, cannot send this result, because coord is unaware of prior B1 txn, since single-partition txns do't go through the coord
- where does this help?
  - allows a sequence of multi-part txns, each composed of a single frag at each part, to be executed without blocking; call these simple multi-part txns
  - these are common, e.g. updates to highly replicated read-mostly tables
  - e.g. updates based on a non-partition column need to search all partitions
  - all distributed txns in TPC-C are simple multi-part txns
- limitations
  - only applicable after executing last fragment of a multi-part txns, so not useful for non-simple txns
  - require same coord; on abort, need to be able to cascade aborts
    - might be able to avert this by assigning txns a global order
speculation is on the success of a 2PC
- unlike concurrency control, it assumes all txns conflict

voltdb

1.0
- server-compiled java stored procedures
- specify partitioning columns for each table
- $k$-safety; not viewstamped replication; static config
- supports DB-wide replicated (read-mostly) tables
- no concurrency control; strictly sequential operation
- statements can't do joins on 2 partitioned tables
- supports aggs
- periodic/continuous snapshots to disk
- 2PC for dtxns?
- early beta w 150 customers featured bunch of web gaming companies and bunch of use cases resembling stream processing

percolator (daniel peng, frank dabek, google, osdi10)

btwn DBMS & mapreduce: online processing that tolerates latency
latency comes from conflicts
built on bigtable, chubby
no blocking - on conflict, abort
locks taken at the sites
one of the locked sites designated as primary
ACID snapshot isolation (write-write conflicts only; reads are not monitored)
centralized timestamp oracle hands out timestamps; can reply to 2B req/s
- allows others to run their own systems
solves blockage on 2PC coordinator failure: after some time, try to undo the primary lock
- the primary lock site is "the" commit site
- bigtable has row txns, which allows for atomic test-and-set to rollback or commit txns at the primary lock site
abort & backoff if see a lock w any timestamp or a write w more recent timestamp than current txn timestamp
observers run on storage servers; multiple changes may coalesce into one observer run
- implemented via notifications in special bigtable column group/column
slower than MR due to excessive RPCs, but major engineering gains
- lock operations made faster by adding special RPC to bigtable
app: incremental indexing

haystack (facebook, osdi10)

facebook photo store
replaces NFS NAS solution that choked on sheer # files and excessive disk seeks
photos stored in 100GB append-only physical volumes (files) that are replicas of logical volumes
lookup operation
1. get metadata/location-specifying URL from the metadata server (haystack directory), which balances load across physical volumes
2. browser issues img requests to the CDN, which fwds to haystack, or directly to haystack
3. lookup in haystack cache (DHT, basically a CDN)
4. haystack store looks up some position info in in-memory index and grabs the photo in 1 seek
5. put into haystack cache iff requesting directly (CDN can cache) & requesting from a write-enabled store (recently-written photos are most-read)
index has filename, offset, and size for each photo ID
photo IDs have random number (cookie) to prevent guessing valid URLs
writes synchronously append to volume and asynchronously write index checkpoints for faster recovery (the volume is like the DB log)
- batched uploads when possible; luckily many users upload whole albums
- updates: latest ver in physical offset has highest offset; if diff physical vol, must update
- deletes synchronously update physical volumes & index; compaction does GC
compactions: processes deletes & duplicate keys
occasionally, full reset needed of a node, which takes many hours
use XFS extent-based FS bc it has efficient prealloc & blockmaps that fit in mem

Data Integration

Indexing Dataspaces (Xin Dong, Alon Halevy, sigmod07)

INTRO
- users explore and use predicate queries and neighborhood
- keyword queries
- traditionally: inverted list or index tree or XML
PROBLEM DEFINITION
- INDEXING HETEROGENEOUS DATA
  - eg: bibtex, word, ppt, email, www, xls, xml, rdb
  - triples: (instance, attr, val) or (inst, assoc, inst) (directional)
  - how to extract from data sources to this model is a separate topic
  - synonyms among attrs & assocs
  - hierarchies: 2 types (don't distinguish between them)
    - sub-property: eg father < parent, contactAuthor < author
    - sub-field: eg city < addr, firstName < name
      - [not about complex structures' components, but see 4.4]
      - [what happens when name is a unique inst? question of extraction]
- QUERYING HETEROGENEOUS DATA
  - predicate queries
    - attr pred: matches on direct attr/sub-attr
    - assoc preds: matches instances that have a (sub-)association with another instance that has any matching attr
    - [sub-attr/sub-assoc: either sub-property or sub-attr]
    - [(firstName 'Yang') won't match an obj with name 'Yang']
  - neighborhood keyword queries
    - relevant inst: has direct attr
    - associated inst: assoc with a relevant inst
  - user need not have knowledge of exact schema
  - ranking is a separate issue
  - [both queries limited to 2 degrees]
- INVERTED LISTS
  - store occurrence counts
INDEXING STRUCTURE
- INDEXING ATTRS
  - attr inverted lists (ATIL): inverted list on (keyword//attr//)
- INDEXING ASSOCS
  - attr-assoc inverted list (AAIL): ATIL + (keyword//assoc//) entries too
  - [doesn't count occurrences within each assoc inst]
  - easy extension to support k-ary assocs
INDEXING HIERARCHIES
- INDEX WITH DUPLICATION
  - dup-ATIL: duplicate (eg Yang//firstName//, Yang//name//)
  - increases index size
- INDEX WITH HIERARCHY PATH
  - hier-ATIL: eg Yang//name//firstName//
  - use row-wise deltas to minimize space overhead
  - prefix serach more expensive
- HYBRID INDEX
  - summary rows shadow other rows and end in additional //
  - [need to skip the shadowed rows somehow - easy problem]
  - start with hier-ATIL; add summary row for prefix p if count of other rows with prefix p exceeds threshold
  - [what if: a/a/{a,b,c/{a,b,c}}?]
- SCHEMA-LEVEL SYNONYMS
  - assoc in one src = attr in another: naturally handled
  - term heterogeneity: have table mapping sets of synonyms to canonical names, and use this replacement in index and queries
- NEIGHBORHOOD KEYWORD QUERIES
  - keyword inverted list (KIL): hybrid-AAIL, but also summarize prefixes that correspond directly to keywords, i.e. k//
EXPERIMENTAL EVALUATION

functiondb - arvindt

bootstrapping pay as you go - alon halevy

pay as you go - alon halevy

oltp ttlg

uncovering the relational web

column-stores vs. row-stores

column stores

general

column stores only recently gained traction because there's enough memory for fetching large ranges of blocks

hybrid approaches

PAX (Weaving Relations for Cache Performance by Natassa Ailamaki, David DeWitt, Mark Hill, and Marios Skounakis, VLDB 2001)
- store data by column within disk block
- pros
  - CPU efficiency of C-stores
  - improved cache hit ratio & mem bandwidth
  - better compression
  - easy to implement in row-store
- cons
  - IO properties of row-stores
  - cache prefetching on modern CPUs makes PAX cache efficiency obsolete
fractured mirrors (A Case for Fractured Mirrors by Ravi Ramamurthy, David DeWitt, and Qi Su, VLDB 2002)
- replicate multiple ways; c-stores are OOM faster than row-stores and vice-versa
- cons
  - may need higher degrees of replication
  - harder impl: need full c-store & row-store exec engines & query optimizers
  - writes can be a problem
fine-grained hybrids (Cudre-Mauroux, Wu, and Madden, CIDR 2009)
- individual tables can be row- or col-oriented; declarative storage
- pros: get perf advantages of fractured mirrors without additional replication
- cons: complex to implement; may be easier to start with c-store since that already supports early materialization and so requires operators be able to handle input in both column and row-oriented format
- vertica
http://dbmsmusings.blogspot.com/2009/09/tour-through-hybrid-columnrow-oriented.html

distributed olap

hadoopdb

hadoop scalability while faster on structured data workloads
hadoop coordinates/exports from hdfs; dbms's execute; allows fast sql
currently using pg; slower than db2; may use vectorwise
pg significantly outperforms hadoop on structured data
benefits vs PDBMSs
- open/cheap/free
- query fault-tolerance
- tolerate node degradation (eg stragglers, not outright failures)

greenplum

postgresql based
table storage options
- traditional pg
- row-oriented append-only: compressible, scan-optimized
- col-oriented append-only (storage-level-only solution)
- treat external source as relational
compression: LZ, gzip
no need for joins to get multi columns
no: columnwise compression options, in-mem compression, late materialization
greenplum chorus: some new kind of ETL

teradata

major customers, 10/2008
- ebay: 5 PB
- walmart: 2.5 PB
- bofa: 1.5 PB
- anon financial services co: 1.4 PB
- dell: 1 PB

alternative DBs

nimbusdb (jim starkey, ieee/acm talk at mit, 3/18/10)

consistency is overloaded
serializability: sufficient for consistency, but not necessary
- expensive; "almost serializable" is useless (no guarantees and all cost)
- serializable -> sequential txn order -> at every point, db has definitive state
keeps relating to relativity
- time is seq of events, not just clock
- comm requires latency
- 2 nodes just can't see same events in same order
- not a bug, but a fact of life
mvcc/si: alternative to serializability
- row updates create new versions pointing to old versions
- each version tagged with creating txn
- a txn sees a version consistent w when it started
- a txn can't update a version it can't see
- each txn sees stable, consistent state
nimbusdb: elastic, scalable, acid, sql rdbms
- modest goals:
  - very high perf in data center
  - high perf geographically dispersed
  - fault tol: sw, hw, geological
- less modest goals:
  - put DBAs outta work: 0 admin; dynamic, self-tuning
  - arbitrary redundancy; multi-tenant; of, for, in the cloud
chorus: set of nodes that instantiate a db
- txn nodes: do sql
- archive nodes: maintain a persistent disk archive; have copy of entire db
a nimbus db composed of distributed objs called atoms
- can be serialized to net/disk
- can be on any # of chorus nodes
- all insts of atom know of each other
- atoms replicate p2p
- every atom has a chairman node: enforces consistency
  - deterministically chosen; if chairman disappears, everyone knows who next one is
- [partition/replica group]
examples of types of nimbus atoms
- txn mgr: start/end txns
- table: metadata for a relational table
- data: container for user data
- catalog: tracks atom locations
comm is fully connected, async, ordered, batched
messaging
- most data is archival and inactive
- small fraction is active but stable (eg what was best seller 2 yrs ago)
- smaller fraction is volatile but local (customer's shopping cart or account in active use)
- even less data is volatile and global
- replicate only to those who care
nodes are autonomous: choose where to get an atom, which atoms to keep, which to drop
txn control
- record version based
- txn sees only the results of txns reported committed when & where it started
- consistency maintained by atom chairman
  - eg can't insert dupe into unique index
  - eg can't update record 42 if another txn updating it
- atom updates bcast repl msgs; repl msgs precede commit msg
nimbusdb is relativistic
- db/consistency viewed only thru txns; no single definitive db state
- nodes may differ due to msg skew
archive nodes provide durability
- see all atom updates followed by a pre-commit msg
- bcast the actual commit
- txn nodes retain 'dirty' atoms till an archive node reports the atoms archived
- multiple archive nodes provide redundancy
txn nodes provide scalability
- any txn node can do anything
- conn key is client session #
- conn broker can give effect of sharding
- txn nodes tend to request atoms from local nodes
- data dynamically trends toward locality
misc features
- semantic extensions eg structural inheritance
- unbounded strs, unbounded nums
- all metadata is dynamic
- chorus members platform indep, rolling sw updates, 24/365
network partitions, CAP, nimbusdb
- certain archive nodes are designated as commit agents (really, canaries)
- subsets of commit agents form into coteries
  - subsets where no 2 are disjoint
- a pre-commit must be received at least one commit agent in every coterie to commit
- post partition, the partition that contains a coterie survives
[this talk did not convince me that this system would actually work; others felt similarly]

Alternative DBs

key‐value‐cache: memcached, repcached, coherence, infinispan, eXtreme scale, jboss cache, velocity, terracoqa
key‐value‐store: keyspace, flare, schema‐free, RAMCloud
eventually‐consistent key‐value‐store: dynamo, voldemort, Dynomite, SubRecord, Mo8onDb, Dovetaildb
ordered‐key‐value‐store: tokyo tyrant, lightcloud, NMDB, luxio, memcachedb, actord
data‐structures server: redis
tuple‐store: gigaspaces, coord, apache river
object database: ZopeDB, db4o, Shoal
document store: CouchDB, Mongo, Jackrabbit, XML Databases, ThruDB, CloudKit, Perservere, Riak Basho, Scalaris
wide columnar store: BigTable, Hbase, Cassandra, Hypertable, KAI, OpenNeptune, Qbase, KDI

hyder (phil bernstein, msr 2009)

modern scalable systems: abandon dtxns; eventual consistency; memcached
Hyder: oltp KV store on raw flash
- single, reliable, distributed log accessed by all servers; "log is DB"
- "no partitioning, all data avail to all"; servers can fetch from each others' caches
- DB is multi-versioned; server caches are trivially coherent
all homogeneous nodes (symmetric); app+db in-process
- unified cache (no server/client/memcached)
- no RPC
enabling hardware assumptions
- Flash
  - 10^4 more IOPS/GB than HDD; higher perf but lower density
  - 4K pages: 100us to read, 200us to write, per chip
- data center networks: RTTs <25us, 10GigE; so, many servers can share load
- large mem: so, reduce rate that hyder needs to hit the log
- manycore servers: so, maintain consistent views of DB
Flash
- no mutates; must erase before writing
- end up treating as sequential device
- wear-leveling: multi-layer (MLC): ~10K erases; single-layer (SLC): ~100K
- 12/09: 4GB MLC is $8.50; 1GB SLC is $6 (SLC ~= 3*MLC)
- densities can double only 2-3 more times, but other non-vol memories (PCM, memristors)
Hyder stack: API (SQL/LINQ), OCC txn layer, indexed record layer, storage layer
storage layer
- custom controller interface
- log uses RAID erasure coding for reliability
txn
- read a snapshot as cache of last committed db state
- appends intention log record to log
- record points to last committed record
- lookups result in random access; unthinkable in disks but not in flash
- OCC: detect conflicts (R/W sets) after appending, abort
- broadcast intention at same time as appending intention
- when appending, no idea where intention will land; append returns addr
- all servers get a copy of this ack
indexed record layer: multi-versioned binary tree
- think of whole DB as one big binary search tree
- multiversioned binary search trees marshalled into log leaf-first (backptrs)
- no update-in-place; must make new copy; shadow copies are wasteful
- instead, just record the path, not actually the internal node contents
- to avoid having every txn conflict with each other, distinguish between structural updates to internal nodes vs content updates
- rationale: Flash space is costly
- binary trees vs b-trees: b-tree nodes are large, so cheaper to update lots of small nodes
txn execution: get snapshot (ptr to root), generate updates locally, append intention log record
- if there was an earlier conflicting update, cannot commit the change of this log entry
- ergo entries that appear in log may not have been committed
- key ingredient: log updates are bcast to all machines
- but no global ordering
  - custom flash controller supports append-page() that returns addr where it stored the page
  - [could've alternatively had a txn manager sit in the controller that simply tells you whether to abort]
    - avoided this to minimize latency; longer-running txns increases conflicts in OCC
    - but now writing all this crap to flash (previously noted as "precious")
- a lot of effort spent on reliability of controller
every server maintains copy of log tail in memory
- every server rolls fwd update records one by one
- on each one, check if the txn actually committed
- all servers will do the same things and make the same abort/commit decisions
bottlenecks
- network: 15K update txns/s over 1 GigE; 150K update txns/s over 10 GigE
- conflict detection & merge: every machine must roll fwd every op; up to 300K updates/s
- root gets conflicting structural (physical) updates all the time, but db content actually doesn't change conflictingly; they can be deterministically merged
- aborts on write-hot data; currently takes 200us to write txn
major techs
- flash is append-only: custom controller has mechanisms for sync & fault tol
- storage is striped, with self-adaptive algo for storage alloc & load bal
- fault-tol protocol for total ordering
- fast algo for conflict detection & merging of intention records into last-committed state
[essentially building shared disk system]
- one of criticisms is scalability: requires lots of cross-sectional bandwidth
- answer: "predictable cross-sectional talk (bcasting as network was intended)"
- answer?: cross-sectional bandwidth solved by fat trees

benchmarks

http://colinhowe.wordpress.com/2009/04/27/redis-vs-mysql/
- mysql: 3030 sets/s, 4670 gets/s
- redis in-mem: 11200 sets/s, 9840 gets/s
- tokyo tyrant disk: 9030 sets/s, 9250 gets/s

redis

non-volatile memcached that can also map to lists and sets with atomic push/pop ops
designed for in-memory workloads
can be persistent, either async or sync, but sync is currently slow
microbench: 110K sets/s, 81K gets/s on entry level linux box
impl in C; single-threaded event-driven
simple async master/slave replication out of the box

scalaris

in-mem erlang kv store w sync replication; multi-hop chord-style lookups; not much community/support

memcachedb

supports replication
thread pool

bitcask

simple backend introduced in 4/2010 by riak guys for riak
on-disk log files of CRC'd key-val pairs
in-mem keydir maps keys to log file/offset; is a complete mapping
reads req disk seek; rely on OS file cache
periodically merge/compact logs

tokyo cabinet

successor to GDBM, QDBM
features compactness, speed, parallelism, simple API, fault tol, 64-bit
C lib
much faster than BDB
- 1M inserts/.4 s; 1M lookups/.33 s
record-level locking
txns: WAL and shadow paging [but apparently can disable this, since tyrant relies only on replication for durability]
backends
- mem: string, array list dequeue, hash map, ordered tree
- disk: hash, B+tree, fixed, column
  - hash: chaining with binary search
  - column: built on hash DB; col indexes built on B+ tree
lots of low-level opt/tuning, eg combining mmap and pwrite/pread syscalls
optional compression
tokyo dystopia: FTS engine; uses tokyo cabinet for inverted index

tokyo tyrant

server interface to tokyo cabinet
hot backup/update log
async replication; master-slaves or master-master (but no atomic conflict detection, only eventual)
speaks binary, memcached, HTTP
built-in ops: no-reply updates, multi-gets
lua extensibility for more ops (stored procs)
epoll/kqueue + thread pool scalability
1M queries/17.2 s

chunkd

single-node storage daemon for project hail
similar to tokyo cabinet but with multi users and channel security
2nd place to tokyo cabinet in perf

simple DBMs

DBM: simple database engine by ken thompson
CDB: constant database by dan j bernstein; super fast

berkeleydb

very slow for random writes into DBs of >100K rows
hard to configure

GT.M

old DB from 1986
- OCC txns; non-distributed
- users can also use locks
from wikipedia:

GT.M is a high-performance schemaless database engine, optimized for transaction processing. GT.M is also an application development platform and a compiler for the ANSI/ISO standard M language, also known as MUMPS.

GT.M compiles the M source code into the target machine language. These object files are dynamically linked into an image. GT.M databases are UNIX files identified by a file called Global directory. By changing the Global Directories, one can make the same program access different databases. Internal to these files, GT.M stores data in b-tree based multidimensional arrays (otherwise known as MUMPS globals), similarly to most modern M implementations.
MUMPS language predates C
M/DB is a SDB interface to GT.M

FluidDB

impl in Python
schemaless attrib-val pairs
attrib names are OWNER/NAME/NAME/...
each of read/update/create/see is open/closed has exceptions
basic data types
CQL-like query lang: has sara/rating and (tim/rating > 5 or mike/comment matches fantastic or mary/product-reviews/keywords contains "kids")

misc db

talk on mcore systems (nedbday08, anastasia)

total latency going up
larger aggregate l2 cache due to more l2 per core => higher-latency access times into l2
- due to reaching across cores
only thing to do is to write sw that explicitly manages memory locality
dss is more cache-friendly than oltp
- [what distinguishes these 2 workloads?]
contribution of
- travel grows log-linearly
- access drops log-linearly
move to shared-nothing arch
see also: gamma machine (uwisc, dewitt)

couchdb (apache, 2008)

b-tree storage
append-only
MVCC, lockless
REST API
incremental map-reduce
async replication
queries are called views, defined with JS to specify field constraints
erlang
http://sitr.us/2009/09/13/couchdb-notes.html
http://sitr.us/2009/06/30/database-queries-the-couchdb-way.html

site stats

google

2006: 400M queries/day
9/2009: 12.8B queries/mo, 427M queries/day, 5K queries/s
6/8/2010: index ~100PB, growing 100-999TB/day
- http://googleblog.blogspot.com/2010/06/our-new-search-index-caffeine.html
11/2009 jeff dean LADIS talk
- typical server: 16GB RAM, 2TB disk
  - easily hold full web index in mem with just 500K servers
- failures/year: 1-5% disks, 2-4% servers; ea server expected to crash twice+
- DC storage hierarchy
  - Server: DRAM: 16GB, 100ns, 20GB/s; Disk: 2TB, 10ms, 200MB/s
  - Rack: DRAM: 1TB, 300us, 100MB/s; Disk: 160TB, 11ms, 100MB/s
  - Aggregate Switch: DRAM: 30TB, 500us, 10MB/s; Disk: 4.8PB, 12ms, 10MB/s
- times everyone should know (ns)
  - L1 cache reference: 0.5
  - Branch mispredict: 5
  - L2 cache reference: 7
  - Mutex lock/unlock: 25
  - Main memory reference: 100
  - Compress 1KB bytes with Zippy: 3,000
  - Send 2K bytes over 1 Gbps network: 20,000
  - Read 1MB sequentially from memory: 250,000
  - Roundtrip within same datacenter: 500,000
  - Disk seek: 10,000,000
  - Read 1MB sequentially from disk: 20,000,000
  - Send packet CA -> Netherlands -> CA: 150,000,000
- GFS usage: 200+ 1000+-machine clusters, 4+ PB clients, 40 GB/s R/W load
- mapreduce usage: 3.5M jobs/yr (488 machines/job, 8min/job)
- bigtable usage: 500 clusters; largest 70PB, 30+GB/s IO
- bigtable changes
  - better scaling, perf isolation, corruption protection
  - per-table replication; uses eventual consistency
    - all user-facing production uses of bigtable use replication
  - coprocessors: code that lives/moves/splits with data; auto scaling, load bal, req routing
    - eg used in managing metadata for colossus (new GFS)
    - eg lang model serving for machine xlation system
    - eg distributed query processing for full-text indexing support
    - eg regex code search
- spanner: new bigtable; similar tabular data model, locality groups, coprocs
  - auto move/replicate data based on usage
  - optimizes resources of entire cluster
  - hierarchical dir structure
  - allows fine-grained access and replication control
  - supports distributed txns for apps that need it
  - plan to scale to 10M machines and 1 EB data
- refs
  - http://glinden.blogspot.com/2009/10/advice-from-google-on-large-distributed.html
  - http://perspectives.mvdirona.com/2009/10/17/JeffDeanDesignLessonsAndAdviceFromBuildingLargeScaleDistributedSystems.aspx
05/2010 non-volatile memory workshop at UCSD, talk by al borchers of platform group
- apps: 99.9%ile latency, disk queueing, is ~100ms
- experimented with flash as cache on GFS chunkservers
  - simulated traces of bigtable tablet servers for several apps
  - write-heavy
- 16GB cache: 130 misses/s; 512GB: 450; logarithmic (exponential increase yields linear reduction)
- write perf highly variable/chaotic
  - seq writes fine, rand writes kill perf & lifetime
  - lifetime: FIFO ~ 2x LRU (monitored GCs)
  - difference evens out as cache grows; at 512GB, FIFO ~ LRU
  - research opp: replacement policy w good hit rate & lifetime
- shift from seek-bound to CPU-bound
  - PCIe less overhead than SAS/SATA; no need for blcok layer
  - NUMA (going through diff core) hurt just as much
  - sync multithreaded used 2-3x CPU of async
  - block layer: 2-3x CPU overhead of accessing RAM
  - IO scheduler: 30% overhead; unnecessary disk optimizations, lock contention
  - FS: 39% overhead; metadata writes, cold data hurt perf & lifetime
  - NUMA: 20-40% overhead
- IO size constrained by page/block sizes; need new interface
- read error rate higher than predicted by bit error rate
  - block, plane, die failures dominated
  - concentrated in a few bad chips
  - small impact to caching apps; treat as cache miss
  - large impact to DB apps: data loss
  - traditional RAID drivers yields terrible perf, more CPU overhead
  - research opp: fault tolerance in flash devices; long-term/large-scale failure rates and mechanisms
  - SLC/MLC both saw early-life failures
- http://dga.livejournal.com/44132.html

internet archive

index size: 2 PB
refs: http://www.archive.org/about/faqs.php

ebay

olap 4/2009
- main teradata warehouse
  - 2 PB user data, 10,000s of users, Ms of queries/day
  - 72 nodes, >140 GB/s IO (2 GB/node/sec) (peak?)
  - 100s of production DBs being fed in
- greenplum warehouse
  - 6.5 PB user data, 17e12 records, 150B new records/day (50 TB/day)
  - 96 nodes, 200 MB/node/sec IO, 4.5 PB storage, 70% compression
  - few concurrent users
  - only for web/network event logs
- http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/
oltp 11/2006
- 212M users, 1B photos, 1B views/day, $1590/sec, 105M concurrent listings, 2 PB data, 3B API calls/mo, 100K new LOC/2 wks, 26B SQL queries/day
- TODO
- http://www.dbms2.com/2008/02/27/ebay-oltp-architecture/
randy shoup, HPTS09
- 89M active users, 190M items in 50K categories, 8B hits/day, 10% items start/end auction, 70B R/W ops/day
- arch lessons: partition everything, use async, automate everything, tolerate failures, embrace inconsistency, expect evolution, understand dependencies, cache, save everything, invest in infrastructure
article 5/2008
- SOA; horizontal partitioning; avoid dtxns; async; virtualize; cache
- http://www.infoq.com/articles/ebay-scalability-best-practices

amazon 9/2007

TODO
http://highscalability.com/amazon-architecture

wikipedia 2007

300 servers
- 200 app servers
- 70 squid servers
- 30 memcached servers, 2 GB mem each
- 20 mysql innodb servers, 16 GB mem each (200-300 GB each, total 5TB)
- also: squid, nagios, dsh, nfs, ganglia, linux virtual service, lucene on mono, powerdns, lighttpd, apache, php, mediawiki
50K http reqs/sec
80K mysql reqs/sec
7M registered users
18M objects in english version
http://perspectives.mvdirona.com/2008/12/28/WikipediaArchitecture.aspx

youtube, 2007 seattle conference on scalability

apache -> lighttpd
mini-clusters (partitions/replica groups)
CDNs for most popular
squid: perf degraded w load
netscaler load balancer
most scaling issues in db
- mysql w async master-slave replication
- mysql slaves replay logs in serial order w 1 CPU, 1 disk; master can zoom ahead with multiprocessor and multidisk
  - warmed up buffer by reading in relevant pages first before the log could get replayed
- split DB into 2 instances: video watch and general
  - video watch higher priority than general
- initially served off RAID 10, but linux only sees 1 vol so doesn't schedule too many parallel disk IOs
  - changed to software RAID 0 (striping) over 5 RAID 1 pairs (mirroring), ie manual RAID 10
- ultimately partitioned by user

hadoop summit 2010
- 120B relationships, 82 jobs/day 16TB tmp data, 2 engrs
- http://mndoci.com/2010/06/30/massive-data/

hive at facebook, 11/2010

http://www.slideshare.net/jsichi/hive-evolution-apachecon-2010

facebook, 2010-11-02 talk at facebook on "mysql at scale"

stats
- Query response times: 4ms reads, 5ms writes.
- Rows read per second: 450M peak
- Network bytes per second: 38GB peak
- Queries per second: 13M peak
- Rows changed per second: 3.5M peak
- InnoDB disk ops per second: 5.2M peak
operations
- 99.9th percentile metrics
- lots of query optimization
performance
- extensive, redundant monitoring: error logs, frequent show variables sampling, slow query logs, client error logs, sampled query logs
- look at macro trends: by day, by week
engineering: fb patch for mysql: lots of non-general changes
- ppl: vamsi ponnekanti, mark callaghan, (someone else)
- features:
  - online schema changes (OSC): eg add col, add/change index
    - setup triggers to track changes; briefly lock table
    - copy data to new table w desired schema
      - break up into small copies; see openarkkit
    - replay changes on new table
    - rename new table sa target table; again briefly lock table
    - updates can happen except during table locks
  - support major upgrades
  - fixed many stalls
  - fixed many bugs
  - perf improvements
- current changes
  - incremental hot backup w xtrabackup
    - fast binary backup with innobackup + fast compression (qpress)
    - incremental w xtra to save space, working w percona
      - enhance w remote/streaming
    - fast selective table restore
      - sys tablespace ibdata1
      - per table tablespaces foo.ibd
  - single-threaded replication: want multi-threaded slaves
- non-stored procedures
  - problem: want to reduce lock contention
  - trad'l approaches: SPs, triggers
    - thousands of DBs, 20 hards each, 50 edge tables each
    - millions of SPs to maintain
  - soln: pass CLIENT_MULTI_STATEMENTS to mysql_connect()
    - insert into counts on duplicate key
- Insights
  - hadoop to mysql, then advertisers can see analytics
  - didn't really follow this
http://www.livestream.com/facebookevents/video?clipId=flv_cc08bf93-7013-41e3-81c9-bfc906ef8442
http://highscalability.com/blog/2010/11/4/facebook-at-13-million-queries-per-second-recommends-minimiz.html

facebook, misc

http://social.venturebeat.com/2010/07/07/facebook-like-buttons/
- 3B like buttons served per day
hadoop summit 2010
- 36 PB uncompressed, 2250 boxes, 23K cores, 32GB RAM/box, 80-90TB/day
- http://mndoci.com/2010/06/30/massive-data/
http://www.datacenterknowledge.com/archives/2010/06/28/facebook-server-count-60000-or-more/
- 60K+ servers

facebook, 2009 talk at MSR

300M active users
80B photos; upload 1B+/mo
videos: 10M/mo
1B pieces of content
100,000s of apps
per day talk
- 4 TB compressed new data per day
- 135 TB compressed data scanned per day
- 7500+ hive jobs on production cluster per day
- 80K compute hours per day
accessibility
- 200 ppl/mo run jobs on hadoop/hive
- analysts (non-engs) use hive
warehouse
- 4800 cores, 5.5 PB
- 12 TB per node
- 2-lev topology
  - 1 Gb/s node to rack switch
  - 4 Gb to top level rack switch
hadoop
- 36 PB of uncompressed data, >2250 machines, 23,000 cores, 32 GB of RAM per machine, processing 80-90TB/day (hadoop summit 2010)
last year "20 percent more photo servers and 50 percent more upload servers" to process Halloween traffic, adding up to an additional 40 terabytes (that's 40,000,000,000,000 bytes) of storage. That was when the site had 140 million users, less than half its current active user base of more than 300 million. http://www.datacenterknowledge.com/archives/2009/10/30/facebooks-spooktacular-scaling-for-halloween/

datacenter sizes, 2010-04-03

google 1M+, intel 100K, 1&1 Internet 70K, akamai 61K, rackspace 57K, FB 30K
http://www.intac.net/a-comparison-of-dedicated-servers-by-company_2010-04-13/

facebook, 2009 talk at UCSD

general
- #2 property on internet, measuring time spent on site
- 200B hits/mo, >3.9T feed actions/day, 15K sites using FB content
- exponential user growth; 70% users outside US
20B photos in 4 resolutions; 2-3B uploads/mo; serve 600K photos/s
- orig served from NFS; can't handle many files; added cache
- optimized from 10 down to 2-4 IOs per serve
- haystack: 1 IO per serve
arch
- load balancers -> web servers -> memcached
- php: easy but slow, sloppy, hard to interop w C++ backend; optimized php engine
- syslog breaks at scale; built scribe; 25TB msgs/day
- 30K (mostly web) servers
misc
- looked at clustering algos; found little win for add'l complexity; use hash partitioning
- requests get data from 100s of hosts; network perf critical
memcached: total 120M req/s
- initially, 15-20K req/s/host; today, 250K
- added: multithreading, polling device drivers, stats locking, batched packet handling
incast: get all responses in same 40us window -> overflow commodity switch buffers -> 200ms TCP timeout
- custom congestion-aware UDP-based transport to manage congestion across multiple reqs instead of within single conn
authoritative storage in mysql
- no joins; no dynamically changing data; no heavily-referenced static data
- hard:
  - logical data migration
  - load balancing many logical db's over varying number of hosts
  - scaling CPU
WAN replication in west/east (master/slave) US DCs
- layer 7 load balancer routes to west (master) any requests that perform writes (based on a hard-coded list)
- problem
  - async (cross-country) master-slave DB stmt replication
  - originally, invalidate both memcaches on update at master
  - but remote (slave) memcache could get repopulated with stale value from remote DB before new stmt arrives at remote DB (under load, up to 20s)
  - soln: don't invalidate remote from master, invalidate on stmt receipt
    - replicated stmts must thus carry info on what memcached keys to invalidate
    - eg: REPLACE INTO profile (first_name) VALUES ('Monkey') WHERE user_id='jsobel' MEMCACHE_DIRTY 'jsobel:first_name'
- problem: read stale values after updating at master due to lag
  - soln: on write, set cookie w current time; app-aware load balancer routes all reqs to master for 20s
- http://www.facebook.com/note.php?note_id=23844338919&id=9445547199&index=0
future work
- load balancing
- middle tier: balance btwn productivity/efficiency
- graph-based caching/storage systems
- search relevance via social graph
- object discovery and ranking
- storage systems
- personalization
http://idleprocess.wordpress.com/2009/11/24/presentation-summary-high-performance-at-massive-scale-lessons-learned-at-facebook/

facebook & hive, 3/23/10 talk at qcon

5800 cores, 8.7TB storage capacity (div by 3 for replication), 12TB/node
2-level network topology: 1Gb/s node-to-rack switch, 4Gb/s top-level rack switch
stats
- 12TB/day compressed new data, 135TB/day compressed scanned data
- 7500+ hive jobs/day, 80K compute hrs/day, 95% jobs are hive jobs
- used by analysts/non-engrs
apps
- reporting
  - daily/weekly aggs of impression/click counts
  - measures of user engagement
  - microstrategy dashboards
- ad-hoc analysis: # group admins broken down by state/country
- machine learning (assembling training data)
  - ad optimization
  - user engagement as function of user attribs
- many others
challenges: space constraints, resource scheduling, job isolation
- RAID-like parity scheme for less replication space
- implementing columnar compression
- using fair share scheduler
http://www.infoq.com/presentations/Facebook-Hive-Hadoop

myspace

claimed to have moved entirely from HDDs to SSDs
- 1% of orig power
- http://perspectives.mvdirona.com/2009/10/14/ReplacingALLDiskWithSSD.aspx
- james hamilton: "in every company I’ve ever worked or visited, the vast majority of the persistent disk resident data is cold"

stackoverflow

8/2009
- asp.net, C#, LINQ
- 2 * (4 core, 2.83 GHz, 12 MB L2, 500 GB datacenter RAID1, 8 GB RAM)
- sql server 2008 on 1 * (8 core, 3.5 GHz, 24 MB L2, 48 GB RAM)
  - CPU surprisingly important; 1.86 to 2.5 to 3.5 GHz scales linearly typical query times
- scaled up instead of out
- http://highscalability.com/blog/2009/8/5/stack-overflow-architecture.html
- http://blog.stackoverflow.com/2008/09/what-was-stack-overflow-built-with/
5/2010 http://blog.serverfault.com/post/633403731/phase-two
- datacenter hardware details

twitter

hadoop summit 2010
- 7 TB/day into HDFS
- http://mndoci.com/2010/06/30/massive-data/
stats 2010-09-19 http://www.lukew.com/ff/entry.asp
- Twitter is seeing on average 90 million Tweets per day.
- Twitter currently has more than 145 million registered users.
- Twitter's total mobile users jumped 62% since mid-April.
- 16% of all new users to Twitter start on mobile now.
- 46% of active Twitter users make mobile a regular part of their Twitter experience.
- Traffic to Twitter.com has grown about 100% this year.
- More than 60% of Twitter users aren't in America.
- 90% of Tweets are available to the public. And 25% of tweets contain links.
- Twitter serves 80MB of updates per second.
- Twitter delivers six billion API calls per day (70,000 per second.
- The number of registered Twitter OAuth applications is now at almost 300,000
stats revealed at Chirp 2010
- registered users: 105,779,710 (1,500% growth over the last three years.)
- new sign-ups per day: ~ 300,000 (More recently, 60% of new accounts were from outside the U.S.)
- new tweets per day: 55 million
- unique daily visitors to the site twitter.com: ~ 180 million. (That's actually dwarfed by the traffic that flows through twitter's API -- 75% of traffic is through the API.)
- API requests per day: 3 billion
- registered apps: 100,000 (from 50,000 in Dec/2009)
- search queries per day: 600 milion
- Twitter's instance, of their recently open-sourced graph database (FlockDB), has 300 billion edges and handles 100,000 reads per second.
- servers: "... in the hundreds"
- http://radar.oreilly.com/2010/04/twitter-by-the-numbers.html
big data in real-time at twitter (qcon 2010-04-20)
- tweets
  - partitioned by time
  - future: cassandra; back to partitioning by tweet id, but this time maintain index from user id to list of tweet ids
  - memcached for 90+% of reads
- timeline: what user sees when they log in (relevant tweets from friends)
  - materialize messages in followers' inbox caches
- social graph: set operations (eg intersect) can't be pre-computed
  - store fwd & bwd edges, partitioned by user
  - flockdb: real-time, distributed graph store
- search: real-time indexing
  - currently partitioning by (term id, doc id) by time in mysql
  - poor write tput; rare terms hit many partitions; poor space efficiency
  - future: partition by document and time; may switch to lucene
- stats
  - reads/s, writes/s, cardinality, bytes/item, durable?
  - tweets: 100K, 850, 12B, 300, yes
  - timelines: 80K, 1.2M, a lot, 3.2K, no
  - graphs: 100K, 20K, 13B, 110, yes
  - search: 13K, 21K, 315M, 1K, yes
- principles
  - all reads must be from mem; disk for writes only
  - can precompute some problems but not all
- http://www.slideshare.net/nkallen/q-con-3770885
use pig and hadoop http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009
- basic stats
  - reqs/day
  - req latency
  - distribution of response codes
  - searches/day
  - unique searches
  - unique users
  - geo dist
- correlations
  - diffs for other clients (mobile, desktop, services, etc)
  - cohort analyses
  - what fails at the same time
  - what features get users hooked
  - what features do successful users use often
  - search corrections/suggestions
  - A/B testing
- research
  - profiling users based on tweets (theirs, followers, followees)
  - what graph structures lead to successful networks
  - reputation
  - sentiments
  - retweets, retweet depths
  - duplicates
  - language detection
nosql eu 2010
- collect data using scribe
- http://www.slideshare.net/kevinweil/nosql-at-twitter-nosql-eu-2010

yahoo

hadoop summit 2010
- 38K hadoop servers, 70 PB, 1M+ jobs/mo, 50% in pig
- 3 PB/day processed, 120 TB/day flow through hadoop
- 170 PB across world
- largest hadoop clusters at 4000/2300 nodes w 70/40 PB, full repl
- krb for multitenant hadoop clusters
- http://perspectives.mvdirona.com/2010/07/03/HadoopSummit2010.aspx
- http://mndoci.com/2010/06/30/massive-data/
Hadoop infrastructure, 2010-04-26
- 30,000 nodes (16GB RAM, 8cores); 250,000 cores; 100,000 MR jobs perday
- http://twitter.com/marin_dimitrov/status/12900368052
Yahoo!'s 10k+ Hadoop cluster had 1 namenode failure in 6 years during normal operation; 15 failures during upgrades http://twitter.com/dataspora/status/13154690071

zynga

200-node Vertica data warehouse: 40M players/day, 3TB/day http://twitter.com/bigdata/status/13109356141

NYSE (brian clark's talk at linux foundation end user summit 2009)

systems run under high pressure and tight constraints
process 3B txns/day (more than goog); each txn must be <1ms
"customers can switch to competing exchanges instantly and for almost no cost, so if NYSE's systems are not performing, its customers will vanish"
typically 1.5TB/day; PBs of data kept online
"subject to thousands of attacks every day"
downtime must be limited to 90s/yr
wants to be able to move everything except specific app off a given core
- also wants to lock a proc's mem in place, but this has been avail via mlock
- also wants non-disruptive way to measure latency, esp in net stack

stumbleupon

moved from oracle to hbase
chose hbase over cassandra and hypertable bc of community, features, hadoop integration
hbase serves su.pr: no add'l caching, 9B rows, 1300GB, map-red 700GB in 20m on 20 machines
http://www.cloudera.com/blog/2009/11/25/hadoop-world-practical-hbase-from-jonathan-gray-and-ryan-rawson/

streamy

moved from postgresql to hbase
http://www.cloudera.com/blog/2009/11/25/hadoop-world-practical-hbase-from-jonathan-gray-and-ryan-rawson/

blizzard

AT&T has provided space, network monitoring/mgmt for 9 yrs
10 DCs around world: WA, CA, TX, MA, fr, de, sw, kr, cn, tw
20K systems, 1.3 PB storage, 13,250 blades, 75K cores, 112.5 TB RAM
68 admins/ops; monitored from global NOC (GNOC), "which like many NOCs, features televisions tuned to the weather stations to track potential uptime threats across its data center footprint"

adobe

uses hbase
need sub-50ms accesses, no downtime, no data loss
mirrors git repository
- faces transient bugs eg ZK byte-loss bug
preferred over cassandra despite its complexity; argues complexity is non-accidental
http://hstack.org/why-were-using-hbase-part-2/

microsoft

ieee computing article http://glinden.blogspot.com/2010/09/insights-into-performance-of-microsofts.html
- hotmail
  - several PBs in 10,000s of servers
  - typical server: 2 CPU, 2 disks, add'l storage enclosure of up to 40 SATAs
- typical cosmos server: 2 CPU, 16-24GB RAM, up to 4 SATAs
- bing
  - 10,000s of servers
  - use RAM of 1,000s of servers
  - typical server: 2 CPU, 2-3GB RAM/core, 2-4 SATAs
- odd: bing/cosmos servers CPU bound; bc of data compression?

github

http://github.com/blog/667-webpulp-tv-interview-with-tom-preston-werner
- git repos on 6 file server pairs now
http://github.com/blog/530-how-we-made-github-fast
- ldirectord on active/standby pair of xens
- ships your request off to one of the four frontend machines
- 4 frontend machines w nginx -> 16 unicorns w rails
- mysql on 2 8-core 32GB RAM boxes w 15KRPM SAS HDDs, replicating over DRBD
- git access
  - grit is ruby lib that speaks git
  - swapped grit's FS backend to use the smoke RPC proxying service
  - proxymachine layer 7 TCP routing proxy with ruby routing logic
    - extracts username of specified repo
    - uses chimney to look up route for that user; look up in redis
    - each frontend runs 4 proxymachine instances behind haproxy
- 4 active/standby pairs of (git) file servers: 8 core, 16GB RAM, 6 * 300GB 15K RPM SAS HDDs in RAID10
  - each has 2 ernie rpc servers spawning 16 ruby workers
- ssh
  - sshd patched to look up PKs in mysql
  - gerve (git serve) is smarter git-shell (restricts exec)
    - grant access if you are owner or in ACL (in mysql)
    - chimney for route lookup
    - exec ssh git@ROUTE COMMAND ARGS
- git: only does clones (a public operation); handled by proxymachine
- sub-systems: job queue, archive downloads, billing, mirroring, svn importer
- side-systems: github pages, gist, gem server, internal tools

disqus (djangocon 2010)

~100 servers, 30% web, 10% db, 25% memcache, 20% HAProxy/heartbeat, 15% util
syslog-ng, pgbouncer, apache+mod_wsgi, ganglia, libmemcached
db: postgresql, slony-i (trigger-based) replication
- pgfouine log analyzer, slow query logging
- app-level partitioning (tables/vert, rows/horiz), replayed by slony
- non-transactional; not sure how they deal with concurrency control, atomicity
- single row atomic updates done on db, not in python (row.update(attr=val))
prevent thundering herd on memcached objects that have a predetermined lifespan with mintcache approach
- have stale date as well as expire date
- when stale, first requester is told to refresh the data
- returning invalidate value instead of deleting then missing
cont deployment, fabric, hudson, selenium, redmine
- can easily revert to earlier vers; can toggle features (for % users)
- interesting idea: sentry aggregates their exceptions (instead of one per email)
http://www.slideshare.net/zeeg/djangocon-2010-scaling-disqus

flickr

@sampullara: replicates objs to shards where they're needed

others

workday http://www.dbms2.com/2010/08/22/workday-technology-stack/

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DB.page

DB.page

Background

clustering

RDBMSs

scientific databases

parallel/distributed olap

distributed dbms

OLTP

Data Integration

column stores

distributed olap

alternative DBs

misc db

site stats

Files

DB.page

Latest commit

History

DB.page

File metadata and controls

Background

clustering

RDBMSs

scientific databases

parallel/distributed olap

distributed dbms

OLTP

Data Integration

column stores

distributed olap

alternative DBs

misc db

site stats