Column Store Adaptivity
Many companies like Facebook and Bing use block-based storage systems for their data warehouses. Many other companies use column stores as their data warehouse. The goal is to figure out what it means to have adaptive column stores.
Parquet is the columnar storage format used here. It is marketed as a "column store for Hadoop".
Q. Figure out how a Parquet file ends up stored on HDFS. It seems we still need to organize it as blocks.
Presentation that describes the storage format of Parquet: link, video-link.
We buffer a row group of incoming records (which arrive in row format) in memory; row-group size is between 64 MB and 1 GB. Then we write them out in column format. Can we still apply the same techniques? See the sketch below.
Detail: there are different compression techniques. Snappy seems to be the popular one (it's by Google); it is more CPU-efficient at the cost of slightly larger output compared to GZIP.
How bad is Snappy+Parquet versus Vertica?