
Column Store Adaptivity


Many companies, like Facebook and Bing, use block-based storage systems for their data warehouses. Many others use column stores. The goal is to figure out what it means to have an adaptive column store.

Parquet is the columnar storage format used here. It is marketed as a "column store for Hadoop".

Q. Figure out how a Parquet file ends up stored on HDFS. It seems we still need to organize it as blocks; see the metadata-inspection sketch after the links below.

A presentation that describes Parquet's storage format: link video-link
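One way to check the block question empirically (a minimal sketch, assuming pyarrow is available; the file name and HDFS block size are hypothetical) is to read a file's footer metadata and compare each row group's on-disk size against the HDFS block size:

```python
import pyarrow.parquet as pq

HDFS_BLOCK_BYTES = 128 * 1024 * 1024  # assumed HDFS block size (common default)

# Read only the footer metadata; no column data is loaded.
meta = pq.ParquetFile("events.parquet").metadata
print(f"{meta.num_row_groups} row groups, {meta.num_rows} rows total")

for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    # total_byte_size is uncompressed; sum the column chunks' compressed
    # sizes to approximate what actually lands on disk / in HDFS blocks.
    on_disk = sum(rg.column(j).total_compressed_size
                  for j in range(rg.num_columns))
    fits = "fits in" if on_disk <= HDFS_BLOCK_BYTES else "straddles"
    print(f"row group {i}: {on_disk / 2**20:.1f} MiB on disk, {fits} an HDFS block")
```

If a row group straddles block boundaries, a task scheduled for locality on one block ends up doing remote reads for the rest, which is why Parquet tries to align row groups with HDFS blocks.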

We buffer a row group of incoming records (which arrive in row format) in memory; row groups are between 64 MB and 1 GB. Then we write them out in column format. Can we still apply the same techniques?
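A minimal sketch of that buffering scheme (assuming pyarrow; the schema, threshold, and `incoming_rows` source are made up for illustration):

```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("user_id", pa.int64()), ("event", pa.string())])
ROWS_PER_GROUP = 1_000_000  # row-count stand-in for the 64 MB - 1 GB byte budget

def incoming_rows():
    # Hypothetical row-format source; real records would arrive from upstream.
    for i in range(2_500_000):
        yield {"user_id": i, "event": "click"}

with pq.ParquetWriter("events.parquet", schema) as writer:
    buffer = []
    for row in incoming_rows():
        buffer.append(row)
        if len(buffer) >= ROWS_PER_GROUP:
            # Pivot the buffered rows to columnar form and flush one row group.
            writer.write_table(pa.Table.from_pylist(buffer, schema=schema))
            buffer = []
    if buffer:  # flush the final partial row group
        writer.write_table(pa.Table.from_pylist(buffer, schema=schema))
```

Counting rows is only a proxy; a real writer would track the encoded byte size of the buffered columns and flush when it crosses the 64 MB to 1 GB window.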

Detail: there are different compression codecs. Snappy (from Google) seems to be the popular one; it is more CPU-efficient than GZIP at the cost of somewhat larger output.
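A quick way to see the trade-off (a sketch, again assuming pyarrow; the data and paths are hypothetical, and ratios on real data will differ) is to write the same table under each codec and compare sizes and write times:

```python
import os
import time
import pyarrow as pa
import pyarrow.parquet as pq

# Synthetic table; substitute real warehouse data for a meaningful comparison.
table = pa.table({"v": list(range(2_000_000)) * 2})

for codec in ("none", "snappy", "gzip"):
    path = f"demo_{codec}.parquet"
    start = time.perf_counter()
    pq.write_table(table, path, compression=codec)
    elapsed = time.perf_counter() - start
    size_mib = os.path.getsize(path) / 2**20
    print(f"{codec:>6}: {size_mib:6.1f} MiB written in {elapsed:.2f}s")
```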

How bad is Snappy + Parquet compared to Vertica?
