Parquet Modular Encryption demo

Introduction

Summary

Parquet files containing sensitive information can be protected by the modular encryption mechanism, which encrypts and authenticates the file data and metadata while preserving regular Parquet functionality (columnar projection, predicate pushdown, encoding and compression).

Problem statement

Existing data protection solutions (such as flat encryption of files, in-storage encryption, or use of an encrypting storage client) can be applied to Parquet files, but have various security or performance issues. An encryption mechanism, integrated in the Parquet format, allows for an optimal combination of data security, processing speed and encryption granularity.

Goals

- Protect Parquet data and metadata by encryption, while enabling selective reads (columnar projection, predicate push-down).
- Implement "client-side" encryption/decryption (storage client). The storage server must not see plaintext data, metadata or encryption keys.
- Leverage authenticated encryption that allows clients to check integrity of the retrieved data - making sure the file (or file parts) have not been replaced with a wrong version, or tampered with otherwise.
- Enable different encryption keys for different columns and for the footer.
- Allow for partial encryption - encrypt only column(s) with sensitive data.
- Work with all compression and encoding mechanisms supported in Parquet.
- Support multiple encryption algorithms, to account for different security and performance requirements.
- Enable two modes for metadata protection:
  - full protection of file metadata
  - partial protection of file metadata that allows legacy readers to access unencrypted columns in an encrypted file.
- Minimize overhead of encryption - in terms of size of encrypted files, and throughput of write/read operations.

How it works

The Parquet writer generates a DEK (data encryption key) for each plaintext chunk to be encrypted, encrypts the plaintext chunk, then sends the DEK to the KMS (key management service) to be wrapped by the chosen KEK (key encryption key). The KMS returns the wrapped DEK to the Parquet writer, which stores the wrapped DEK alongside the corresponding ciphertext chunk.

To read a ciphertext chunk, the Parquet reader sends the corresponding wrapped DEK to the KMS, which unwraps it and returns the DEK to the Parquet reader. The reader decrypts the ciphertext chunk with the DEK.
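The write and read paths described above can be sketched with the standard library alone. This is a minimal illustration of the envelope flow, not the code in this repository: `ToyKms`, `write_chunk`, `read_chunk` and the `toy_encrypt`/`toy_decrypt` helpers are hypothetical names, and the cipher is a deliberately simple stand-in (SHA-256 keystream plus HMAC tag) that is NOT secure. Parquet itself uses AES-based authenticated encryption.

```python
import hashlib
import hmac
import secrets


def _keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stream cipher: XOR data with SHA-256(key || nonce || counter) blocks."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        block = hashlib.sha256(key + nonce + counter).digest()
        out.extend(b ^ k for b, k in zip(data[offset:offset + 32], block))
    return bytes(out)


def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Stand-in for authenticated encryption. Illustration only, NOT secure."""
    nonce = secrets.token_bytes(12)
    body = _keystream_xor(key, nonce, plaintext)
    tag = hmac.new(key, nonce + body, hashlib.sha256).digest()
    return nonce + body + tag


def toy_decrypt(key: bytes, blob: bytes) -> bytes:
    """Verify the tag first, then decrypt; raise ValueError on tampering."""
    nonce, body, tag = blob[:12], blob[12:-32], blob[-32:]
    expected = hmac.new(key, nonce + body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("authentication failed")
    return _keystream_xor(key, nonce, body)


class ToyKms:
    """Hypothetical in-process KMS: KEKs never leave this object."""

    def __init__(self) -> None:
        self._keks = {}

    def create_kek(self, kek_id: str) -> None:
        self._keks[kek_id] = secrets.token_bytes(32)

    def wrap(self, kek_id: str, dek: bytes) -> bytes:
        return toy_encrypt(self._keks[kek_id], dek)

    def unwrap(self, kek_id: str, wrapped_dek: bytes) -> bytes:
        return toy_decrypt(self._keks[kek_id], wrapped_dek)


def write_chunk(kms: ToyKms, kek_id: str, plaintext: bytes):
    """Writer side: generate a DEK, encrypt the chunk, have the KMS wrap the DEK."""
    dek = secrets.token_bytes(32)
    ciphertext = toy_encrypt(dek, plaintext)
    wrapped_dek = kms.wrap(kek_id, dek)
    # The wrapped DEK is stored alongside the ciphertext chunk.
    return wrapped_dek, ciphertext


def read_chunk(kms: ToyKms, kek_id: str, wrapped_dek: bytes, ciphertext: bytes) -> bytes:
    """Reader side: have the KMS unwrap the DEK, then decrypt the chunk."""
    dek = kms.unwrap(kek_id, wrapped_dek)
    return toy_decrypt(dek, ciphertext)
```

In this sketch the storage layer only ever sees `wrapped_dek` and `ciphertext`; the plaintext DEK exists only in the writer or reader process, and the KEK only inside the KMS.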

Prerequisites

- Pixi
- Configure Git hooks:
```
pixi run -- pre-commit-install
```

Example

Launch the KMS (key management service).

```
pixi run -- serve
```

Explore the KMS' OpenAPI specification. Try POSTing the JSON payload

```
{
  "key": "rlCLtKLrH/b9GZbuZaneQB6yU6vp8tlC1R2LINMYYrM="
}
```

to one of the wrap endpoints, then try unwrapping the result via the corresponding unwrap endpoint at various privilege levels. To set a privilege level, click the "Authorize" button and set the value of the x-api-key request header to INTERNAL, CONFIDENTIAL or RESTRICTED. PUBLIC does not require the x-api-key request header. Privilege levels are ordered from lowest to highest: plaintext < PUBLIC < INTERNAL < CONFIDENTIAL < RESTRICTED.
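One way to picture the privilege ordering on the KMS side is as an integer comparison. The sketch below is an assumption about how such a check could look, not the logic in kms_server.py: `Privilege`, `caller_privilege` and `may_unwrap` are hypothetical names, and the rule that a caller may unwrap keys at or below its own level is inferred from the ordering given above.

```python
from enum import IntEnum
from typing import Optional


class Privilege(IntEnum):
    """Privilege levels in ascending order, as listed in this README."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


def caller_privilege(api_key: Optional[str]) -> Privilege:
    """Map the x-api-key header value to a level; no header means PUBLIC."""
    if api_key is None:
        return Privilege.PUBLIC
    try:
        return Privilege[api_key]
    except KeyError:
        raise PermissionError(f"unknown API key: {api_key!r}") from None


def may_unwrap(api_key: Optional[str], key_level: Privilege) -> bool:
    """Assumed rule: a caller may unwrap keys at or below its own level."""
    return caller_privilege(api_key) >= key_level
```

For example, `may_unwrap("INTERNAL", Privilege.CONFIDENTIAL)` is `False` under this rule, while `may_unwrap("RESTRICTED", Privilege.CONFIDENTIAL)` is `True`.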

Write an encrypted Parquet dataset with columns of varying privilege levels to the dataset directory.

```
pixi run -- write
```

Read the entire dataset from the dataset directory.

```
pixi run -- read
```

Edit read_encrypted_parquet.py and experiment with different combinations of KMS_ACCESS_TOKEN and COLUMNS to project. The default is:

```
KMS_ACCESS_TOKEN = WrappingKeyId.RESTRICTED
COLUMNS = [
    "id",  # minimum required privilege: none (plaintext)
    "date_of_birth",  # minimum required privilege: INTERNAL
    "first_name",  # minimum required privilege: CONFIDENTIAL
    "last_name",  # minimum required privilege: CONFIDENTIAL
    "social_security_number",  # minimum required privilege: RESTRICTED
]
```

RESTRICTED is the highest privilege level and may decrypt all columns, which is why the earlier read of the entire dataset succeeded.

Note that id is the only plaintext column, and no access token is required to project it (i.e. KMS_ACCESS_TOKEN = None).
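To predict which projections should succeed before editing the script, the column requirements listed above can be tabulated. The helper below is a hypothetical planning aid, not part of read_encrypted_parquet.py; `LEVELS`, `MIN_PRIVILEGE` and `readable_columns` are illustrative names, and it assumes that a missing token is treated as PUBLIC and that each level can read everything below it.

```python
from typing import List, Optional

# Privilege levels from lowest to highest, as described in this README.
LEVELS = ["PUBLIC", "INTERNAL", "CONFIDENTIAL", "RESTRICTED"]

# Minimum privilege per column; None marks a plaintext column.
MIN_PRIVILEGE = {
    "id": None,
    "date_of_birth": "INTERNAL",
    "first_name": "CONFIDENTIAL",
    "last_name": "CONFIDENTIAL",
    "social_security_number": "RESTRICTED",
}


def readable_columns(token: Optional[str]) -> List[str]:
    """Return the columns a reader holding `token` should be able to project."""
    # Assumption: no token is equivalent to PUBLIC.
    rank = LEVELS.index("PUBLIC" if token is None else token)
    return [
        column
        for column, required in MIN_PRIVILEGE.items()
        if required is None or LEVELS.index(required) <= rank
    ]
```

Under these assumptions, `readable_columns(None)` yields only `id`, and `readable_columns("CONFIDENTIAL")` yields every column except `social_security_number`.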

Final comments

In production, KEKs should be narrowly scoped (e.g. project-specific), rotated periodically, and gated behind IAM (Identity and Access Management) controls that are more secure than static API keys.

Examples of production-grade KMS include:
