---

copyright:
  years: 2017, 2022
lastupdated: "2022-01-17"

subcollection: AnalyticsEngine

---

{:new_window: target="_blank"} {:shortdesc: .shortdesc} {:codeblock: .codeblock} {:screen: .screen} {:pre: .pre} {:external: target="_blank" .external}

# Configuring Parquet encryption on Apache Hive tables
{: #parquet-encryption-on-hive-tables-serverless}

Spark supports reading and writing data stored in Hive tables. If you want to encrypt the Parquet file content of Hive tables, you can store the information about which columns to encrypt with which keys in the Apache Hive Metastore; Spark then applies the encryption automatically whenever data is saved to the table files.

To configure Parquet encryption on Hive tables:

  1. Create a table using the Hive Beeline CLI and configure which Parquet file content columns to encrypt with which keys by using SERDEPROPERTIES. The following sample code snippet sets the encryption keys that are used to encrypt the "credit_card" column (and the footer):

    CREATE TABLE my_table(name STRING, credit_card STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    WITH SERDEPROPERTIES (
        'parquet.encryption.column.keys'='c4a21521-2a78-4968-a7c2-57c481f58d5c: credit_card',
        'parquet.encryption.footer.key'='d1ae3fc2-6b7d-4a03-abb6-644e02933734')
    STORED AS parquet
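
    The value of parquet.encryption.column.keys maps a master key ID to a comma-separated list of column names, and multiple mappings can be separated by semicolons, for example '<keyID1>: col1, col2; <keyID2>: col3' (the key IDs and column names here are placeholders). The parquet.encryption.footer.key value is a single master key ID.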
    

    If you don't use the Hive Beeline CLI, you can use Spark SQL to set up Parquet encryption on a Hive table that you create, passing the table-specific Parquet encryption parameters via the OPTIONS keyword. For example:

    spark.sql("""
    CREATE TABLE my_table(name STRING, credit_card STRING)
    USING HIVE
    OPTIONS(
      parquet.encryption.column.keys 'c4a21521-2a78-4968-a7c2-57c481f58d5c: credit_card',
      parquet.encryption.footer.key 'd1ae3fc2-6b7d-4a03-abb6-644e02933734',
      fileFormat 'parquet')
      """)
    

    You can also configure KMS properties through SERDEPROPERTIES or OPTIONS. For example, with SERDEPROPERTIES (an equivalent OPTIONS sketch follows the block):

    CREATE TABLE my_table(name STRING, credit_card STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    WITH SERDEPROPERTIES (
      'parquet.encryption.kms.instance.url'='https://<region>.kms.cloud.ibm.com',
      'parquet.encryption.kms.instance.id'='<KeyProtect instance ID>',
      'parquet.encryption.column.keys'='c4a21521-2a78-4968-a7c2-57c481f58d5c: credit_card',
      'parquet.encryption.footer.key'='d1ae3fc2-6b7d-4a03-abb6-644e02933734')
    STORED AS parquet;
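
    The equivalent Spark SQL form passes the same KMS properties through OPTIONS (a sketch that mirrors the SERDEPROPERTIES example above, using the same placeholder values):

    spark.sql("""
    CREATE TABLE my_table(name STRING, credit_card STRING)
    USING HIVE
    OPTIONS(
      parquet.encryption.kms.instance.url 'https://<region>.kms.cloud.ibm.com',
      parquet.encryption.kms.instance.id '<KeyProtect instance ID>',
      parquet.encryption.column.keys 'c4a21521-2a78-4968-a7c2-57c481f58d5c: credit_card',
      parquet.encryption.footer.key 'd1ae3fc2-6b7d-4a03-abb6-644e02933734',
      fileFormat 'parquet')
    """)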
    
  2. After you have created the table, add data to it by using the DataFrameWriter.insertInto operator.

    For example, you could load data from a CSV file, and add it to your table:

    df = spark.read.format("csv").load("/path/to/csv")
    df.write.insertInto("my_table")
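
    Note that insertInto matches DataFrame columns to table columns by position, not by name, so the column order in the CSV must match the table definition. To verify the write, you can read the table back (a minimal check, reusing the table from step 1):

    spark.sql("SELECT * FROM my_table").show()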
    

## Usage examples
{: #usage-examples-parquet-encryption-hive}

The following examples show how to use Parquet encryption on data saved in Hive tables.

  - Python example: Load data from a CSV file into a Hive table whose content is saved as encrypted Parquet files, with keys managed by IBM KeyProtect:

    ACCESS_TOKEN = "A valid IAM access token"
    KEY_PROTECT_INSTANCE_ID = "My KP instance ID"
    K1 = "d1ae3fc2-6b7d-4a03-abb6-644e02933734"
    K2 = "c4a21521-2a78-4968-a7c2-57c481f58d5c"
    
    # set up Key Protect on the Hadoop configuration
    hc = spark.sparkContext._jsc.hadoopConfiguration()
    hc.set("parquet.crypto.factory.class", "com.ibm.parquet.key.management.IBMKeyToolsFactory")
    hc.set("parquet.encryption.kms.instance.url", "https://<region>.kms.cloud.ibm.com")
    hc.set("parquet.encryption.kms.instance.id", KEY_PROTECT_INSTANCE_ID)
    hc.set("parquet.encryption.key.access.token", ACCESS_TOKEN)
    
    # load data from CSV
    csvDF = spark.read.format("csv").load("squares.csv")
    
    # create Hive table
    spark.sql("""
    CREATE TABLE squares_table(int_column int, square_int_column int)
    USING HIVE
    OPTIONS(
      parquet.encryption.column.keys '%s: square_int_column',
      parquet.encryption.footer.key '%s',
      fileFormat 'parquet')
    """ % (K1, K2))
    
    # write csvDF to hive table
    csvDF.write.insertInto("squares_table")
    
    # read from table and print
    spark.sql("SELECT * FROM squares_table").show()
  - Python example: Load data from a CSV file into a Hive table whose content is saved as encrypted Parquet files, with keys managed by the application:

    # set up explicit keys on the Hadoop configuration
    hc = sc._jsc.hadoopConfiguration()
    hc.set("parquet.encryption.key.list", "key1: AAECAwQFBgcICQoLDA0ODw==, key2: AAECAAECAAECAAECAAECAA==")
    hc.set("parquet.crypto.factory.class","com.ibm.parquet.key.management.IBMKeyToolsFactory")
    
    # load data from CSV
    csvDF = spark.read.format("csv").load("squares.csv")
    
    # create Hive table
    spark.sql("""
    CREATE TABLE squares_table(int_column int, square_int_column int)
    USING HIVE
    OPTIONS(
      parquet.encryption.column.keys 'key1: square_int_column',
      parquet.encryption.footer.key 'key2',
      fileFormat 'parquet')
    """)
    
    # write csvDF to hive table
    csvDF.write.insertInto("squares_table")
    
    # read from table and print
    spark.sql("SELECT * FROM squares_table").show()
  - Python example: Read from a Hive table whose content is saved as encrypted Parquet files, with keys managed by IBM KeyProtect:

    ACCESS_TOKEN = "A valid IAM access token"
    # set up the access token for Key Protect on the Hadoop configuration
    spark.sparkContext._jsc.hadoopConfiguration().set("parquet.encryption.key.access.token", ACCESS_TOKEN)
    # read from table and print
    spark.sql("SELECT * FROM squares_table").show()