manselmi/parquet-modular-encryption

Parquet Modular Encryption demo

"If you secure everything with a key, how are you going to protect the key?" "With another key" 😐

🔼 source

Introduction

Summary

🔽 source

Parquet files containing sensitive information can be protected by the modular encryption mechanism, which encrypts and authenticates the file data and metadata while preserving regular Parquet functionality (columnar projection, predicate pushdown, encoding and compression).

Problem statement

Existing data protection solutions (such as flat encryption of files, in-storage encryption, or use of an encrypting storage client) can be applied to Parquet files, but have various security or performance issues. An encryption mechanism, integrated in the Parquet format, allows for an optimal combination of data security, processing speed and encryption granularity.

Goals

  1. Protect Parquet data and metadata by encryption, while enabling selective reads (columnar projection, predicate pushdown).

  2. Implement "client-side" encryption/decryption (storage client). The storage server must not see plaintext data, metadata or encryption keys.

  3. Leverage authenticated encryption that allows clients to check the integrity of the retrieved data, ensuring that the file (or parts of it) has not been replaced with a wrong version or otherwise tampered with.

  4. Enable different encryption keys for different columns and for the footer.

  5. Allow for partial encryption: encrypt only the column(s) with sensitive data.

  6. Work with all compression and encoding mechanisms supported in Parquet.

  7. Support multiple encryption algorithms, to account for different security and performance requirements.

  8. Enable two modes for metadata protection:

    • full protection of file metadata

    • partial protection of file metadata that allows legacy readers to access unencrypted columns in an encrypted file.

  9. Minimize the overhead of encryption, in terms of both the size of encrypted files and the throughput of write/read operations.

How it works

The Parquet writer generates a DEK (data encryption key) for each plaintext chunk to be encrypted, encrypts the plaintext chunk, then sends the DEK to the KMS (key management service) to be wrapped by the chosen KEK (key encryption key). The KMS returns the wrapped DEK to the Parquet writer, which stores the wrapped DEK alongside the corresponding ciphertext chunk.

To read a ciphertext chunk, the Parquet reader sends the corresponding wrapped DEK to the KMS, which unwraps it and returns the DEK to the Parquet reader. The reader decrypts the ciphertext chunk with the DEK.
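The write and read flows above can be sketched in a few lines of Python. This is a minimal illustration of the DEK/KEK envelope pattern only, not the code in this repository: ToyKms, seal, open_sealed, write_chunk and read_chunk are hypothetical names, and the cipher is a standard-library stand-in (an HMAC-SHA256 keystream with encrypt-then-MAC) used in place of the AES-based algorithms that Parquet actually specifies. Do not use it to protect real data.

```python
import hashlib
import hmac
import os


def _subkey(key: bytes, label: bytes) -> bytes:
    # Derive separate encryption and MAC keys from one key.
    return hmac.new(key, label, hashlib.sha256).digest()


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # HMAC-SHA256 in counter mode: a stand-in keystream, NOT AES.
    blocks = [
        hmac.new(key, nonce + i.to_bytes(8, "big"), hashlib.sha256).digest()
        for i in range((length + 31) // 32)
    ]
    return b"".join(blocks)[:length]


def seal(key: bytes, plaintext: bytes) -> bytes:
    # Encrypt-then-MAC; output layout is nonce || ciphertext || tag.
    nonce = os.urandom(16)
    stream = _keystream(_subkey(key, b"enc"), nonce, len(plaintext))
    ciphertext = bytes(p ^ s for p, s in zip(plaintext, stream))
    tag = hmac.new(_subkey(key, b"mac"), nonce + ciphertext, hashlib.sha256).digest()
    return nonce + ciphertext + tag


def open_sealed(key: bytes, sealed: bytes) -> bytes:
    # Verify the tag before decrypting; raise if the data was tampered with.
    nonce, ciphertext, tag = sealed[:16], sealed[16:-32], sealed[-32:]
    expected = hmac.new(_subkey(key, b"mac"), nonce + ciphertext, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("authentication failed")
    stream = _keystream(_subkey(key, b"enc"), nonce, len(ciphertext))
    return bytes(c ^ s for c, s in zip(ciphertext, stream))


class ToyKms:
    """Hypothetical in-process KMS; the KEK never leaves this object."""

    def __init__(self) -> None:
        self._kek = os.urandom(32)

    def wrap(self, dek: bytes) -> bytes:
        return seal(self._kek, dek)

    def unwrap(self, wrapped_dek: bytes) -> bytes:
        return open_sealed(self._kek, wrapped_dek)


def write_chunk(kms: ToyKms, plaintext: bytes) -> tuple[bytes, bytes]:
    # Writer: fresh DEK per chunk, encrypt locally, ask the KMS to wrap the DEK.
    dek = os.urandom(32)
    ciphertext = seal(dek, plaintext)
    return kms.wrap(dek), ciphertext


def read_chunk(kms: ToyKms, wrapped_dek: bytes, ciphertext: bytes) -> bytes:
    # Reader: ask the KMS to unwrap the DEK, then decrypt locally.
    dek = kms.unwrap(wrapped_dek)
    return open_sealed(dek, ciphertext)
```

Note that the KMS only ever sees DEKs, never the chunk data, and the storage layer only ever sees the wrapped DEK and the ciphertext.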

Prerequisites

  • Pixi

  • Configure Git hooks:

    pixi run -- pre-commit-install

Example

Launch the KMS (key management service).

pixi run -- serve

Explore the KMS's OpenAPI specification. Try POSTing the JSON payload

{
  "key": "rlCLtKLrH/b9GZbuZaneQB6yU6vp8tlC1R2LINMYYrM="
}

to one of the wrap endpoints, then try unwrapping the result via the corresponding unwrap endpoint at various privilege levels. To set a privilege level, click the "Authorize" button and set the value of the x-api-key request header to INTERNAL, CONFIDENTIAL or RESTRICTED; PUBLIC does not require the x-api-key request header. Privilege levels are ordered from lowest to highest: plaintext < PUBLIC < INTERNAL < CONFIDENTIAL < RESTRICTED.
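To experiment with keys of your own, a payload of the same shape can be generated as sketched below. The sample key above is 44 base64 characters, which decodes to 32 bytes, so this sketch assumes the "key" field carries a base64-encoded 256-bit key. make_wrap_payload is a hypothetical helper, and no endpoint is called here because the endpoint paths are defined by the OpenAPI specification rather than by this example.

```python
import base64
import json
import os


def make_wrap_payload(key: bytes) -> str:
    # Assumption: the wrap endpoints accept {"key": <base64 string>}.
    return json.dumps({"key": base64.b64encode(key).decode("ascii")})


# The sample key from the payload above decodes to 32 bytes (256 bits).
sample_key = base64.b64decode("rlCLtKLrH/b9GZbuZaneQB6yU6vp8tlC1R2LINMYYrM=")

# A fresh random key of the same size, ready to paste into the request body.
payload = make_wrap_payload(os.urandom(32))
print(payload)
```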

Write an encrypted Parquet dataset with columns of varying privilege levels to the dataset directory.

pixi run -- write

Read the entire dataset from the dataset directory.

pixi run -- read

Edit read_encrypted_parquet.py and experiment with different combinations of KMS_ACCESS_TOKEN and COLUMNS to project. The default is:

KMS_ACCESS_TOKEN = WrappingKeyId.RESTRICTED
COLUMNS = [
    "id",  # minimum required privilege: none (plaintext)
    "date_of_birth",  # minimum required privilege: INTERNAL
    "first_name",  # minimum required privilege: CONFIDENTIAL
    "last_name",  # minimum required privilege: CONFIDENTIAL
    "social_security_number",  # minimum required privilege: RESTRICTED
]

RESTRICTED is the highest privilege level and may decrypt all columns, which is why the earlier read of all columns succeeded.

Note that id is the only plaintext column, and no access token is required to project it (i.e. KMS_ACCESS_TOKEN = None).
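To predict which projections will succeed before editing the script, the ordering of privilege levels can be modeled as a small lookup. This is a sketch for reasoning about the demo, not code from the repository: Privilege, REQUIRED and readable_columns are hypothetical names, and the per-column minimums are copied from the comments in the default COLUMNS list.

```python
from enum import IntEnum
from typing import Optional


class Privilege(IntEnum):
    # Ordered as in the README: plaintext < PUBLIC < INTERNAL < CONFIDENTIAL < RESTRICTED.
    PLAINTEXT = 0
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4


# Minimum privilege required per column, taken from the COLUMNS comments.
REQUIRED = {
    "id": Privilege.PLAINTEXT,
    "date_of_birth": Privilege.INTERNAL,
    "first_name": Privilege.CONFIDENTIAL,
    "last_name": Privilege.CONFIDENTIAL,
    "social_security_number": Privilege.RESTRICTED,
}


def readable_columns(token: Optional[Privilege]) -> list[str]:
    # Assumption: no token is modeled as PLAINTEXT. The dataset has no
    # PUBLIC columns, so this choice does not affect the results here.
    level = Privilege.PLAINTEXT if token is None else token
    return [name for name, needed in REQUIRED.items() if needed <= level]


print(readable_columns(None))
print(readable_columns(Privilege.CONFIDENTIAL))
```

Under these assumptions, a CONFIDENTIAL token can project every column except social_security_number, and requesting that column with anything below RESTRICTED should fail at the unwrap step.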

Final comments

Please note that in a real deployment, KEKs should be narrowly scoped (e.g. project-specific), periodically rotated, and gated behind IAM (Identity and Access Management) controls that are more secure than static API keys.

Examples of production-grade KMS include:
