diff --git a/.gitignore b/.gitignore index 4541b70c3ba3..d9848cab06d3 100644 --- a/.gitignore +++ b/.gitignore @@ -25,7 +25,7 @@ lib/ # web site build site/site/ -site/docs/docs/ +#site/docs/docs/ site/docs/javadoc/ # benchmark output diff --git a/docs/_index.md b/docs/_index.md new file mode 100644 index 000000000000..80925e6b3bd8 --- /dev/null +++ b/docs/_index.md @@ -0,0 +1,54 @@ +--- +title: "Introduction" +menu: main +weight: 0 +--- + + +# Documentation + +**Apache Iceberg is an open table format for huge analytic datasets.** Iceberg adds tables to compute engines including Spark, Trino, PrestoDB, Flink, Hive and Impala using a high-performance table format that works just like a SQL table. + +### User experience + +Iceberg avoids unpleasant surprises. Schema evolution works and won't inadvertently un-delete data. Users don't need to know about partitioning to get fast queries. + +* [Schema evolution](evolution#schema-evolution) supports add, drop, update, or rename, and has [no side-effects](evolution#correctness) +* [Hidden partitioning](partitioning) prevents user mistakes that cause silently incorrect results or extremely slow queries +* [Partition layout evolution](evolution#partition-evolution) can update the layout of a table as data volume or query patterns change +* [Time travel](spark-queries#time-travel) enables reproducible queries that use exactly the same table snapshot, or lets users easily examine changes +* Version rollback allows users to quickly correct problems by resetting tables to a good state + +### Reliability and performance + +Iceberg was built for huge tables. Iceberg is used in production where a single table can contain tens of petabytes of data and even these huge tables can be read without a distributed SQL engine. 
+ +* [Scan planning is fast](performance#scan-planning) -- a distributed SQL engine isn't needed to read a table or find files +* [Advanced filtering](performance#data-filtering) -- data files are pruned with partition and column-level stats, using table metadata + +Iceberg was designed to solve correctness problems in eventually-consistent cloud object stores. + +* [Works with any cloud store](reliability) and reduces NN congestion when in HDFS, by avoiding listing and renames +* [Serializable isolation](reliability) -- table changes are atomic and readers never see partial or uncommitted changes +* [Multiple concurrent writers](reliability#concurrent-write-operations) use optimistic concurrency and will retry to ensure that compatible updates succeed, even when writes conflict + +### Open standard + +Iceberg has been designed and developed to be an open community standard with a [specification](../../spec) to ensure compatibility across languages and implementations. + +[Apache Iceberg is open source](../../community), and is developed at the [Apache Software Foundation](https://www.apache.org/). diff --git a/docs/aws.md b/docs/aws.md new file mode 100644 index 000000000000..da7345291f17 --- /dev/null +++ b/docs/aws.md @@ -0,0 +1,663 @@ +--- +title: "AWS" +url: aws +menu: + main: + parent: Integrations + identifier: aws_integration + weight: 0 +--- + + +# Iceberg AWS Integrations + +Iceberg provides integration with different AWS services through the `iceberg-aws` module. +This section describes how to use Iceberg with AWS. + +## Enabling AWS Integration + +The `iceberg-aws` module is bundled with Spark and Flink engine runtimes for all versions from `0.11.0` onwards. +However, the AWS clients are not bundled so that you can use the same client version as your application. +You will need to provide the AWS v2 SDK because that is what Iceberg depends on. 
+You can choose to use the [AWS SDK bundle](https://mvnrepository.com/artifact/software.amazon.awssdk/bundle), +or individual AWS client packages (Glue, S3, DynamoDB, KMS, STS) if you would like to have a minimal dependency footprint. + +All the default AWS clients use the [Apache HTTP Client](https://mvnrepository.com/artifact/software.amazon.awssdk/apache-client) +for HTTP connection management. +This dependency is not part of the AWS SDK bundle and needs to be added separately. +To choose a different HTTP client library such as [URL Connection HTTP Client](https://mvnrepository.com/artifact/software.amazon.awssdk/url-connection-client), +see the section [client customization](#aws-client-customization) for more details. + +All the AWS module features can be loaded through custom catalog properties; +see the documentation of each engine to learn how to load a custom catalog. +Here are some examples. + +### Spark + +For example, to use AWS features with Spark 3.4 (with Scala 2.12) and AWS clients (which are packaged in the `iceberg-aws-bundle`), you can start the Spark SQL shell with: + +```sh +# start Spark SQL client shell +spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:{{% icebergVersion %}},org.apache.iceberg:iceberg-aws-bundle:{{% icebergVersion %}} \ + --conf spark.sql.defaultCatalog=my_catalog \ + --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \ + --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my/key/prefix \ + --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \ + --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO +``` + +In the shell command above, `--packages` specifies the additional `iceberg-aws-bundle`, which contains all relevant AWS dependencies. 
+ +### Flink + +To use AWS module with Flink, you can download the necessary dependencies and specify them when starting the Flink SQL client: + +```sh +# download Iceberg dependency +ICEBERG_VERSION={{% icebergVersion %}} +MAVEN_URL=https://repo1.maven.org/maven2 +ICEBERG_MAVEN_URL=$MAVEN_URL/org/apache/iceberg +wget $ICEBERG_MAVEN_URL/iceberg-flink-runtime/$ICEBERG_VERSION/iceberg-flink-runtime-$ICEBERG_VERSION.jar +wget $ICEBERG_MAVEN_URL/iceberg-aws-bundle/$ICEBERG_VERSION/iceberg-aws-bundle-$ICEBERG_VERSION.jar + +# start Flink SQL client shell +/path/to/bin/sql-client.sh embedded \ + -j iceberg-flink-runtime-$ICEBERG_VERSION.jar \ + -j iceberg-aws-bundle-$ICEBERG_VERSION.jar \ + shell +``` + +With those dependencies, you can create a Flink catalog like the following: + +```sql +CREATE CATALOG my_catalog WITH ( + 'type'='iceberg', + 'warehouse'='s3://my-bucket/my/key/prefix', + 'catalog-impl'='org.apache.iceberg.aws.glue.GlueCatalog', + 'io-impl'='org.apache.iceberg.aws.s3.S3FileIO' +); +``` + +You can also specify the catalog configurations in `sql-client-defaults.yaml` to preload it: + +```yaml +catalogs: + - name: my_catalog + type: iceberg + warehouse: s3://my-bucket/my/key/prefix + catalog-impl: org.apache.iceberg.aws.glue.GlueCatalog + io-impl: org.apache.iceberg.aws.s3.S3FileIO +``` + +### Hive + +To use AWS module with Hive, you can download the necessary dependencies similar to the Flink example, +and then add them to the Hive classpath or add the jars at runtime in CLI: + +``` +add jar /my/path/to/iceberg-hive-runtime.jar; +add jar /my/path/to/aws/bundle.jar; +``` + +With those dependencies, you can register a Glue catalog and create external tables in Hive at runtime in CLI by: + +```sql +SET iceberg.engine.hive.enabled=true; +SET hive.vectorized.execution.enabled=false; +SET iceberg.catalog.glue.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog; +SET iceberg.catalog.glue.warehouse=s3://my-bucket/my/key/prefix; + +-- suppose you have an Iceberg 
table database_a.table_a created by GlueCatalog +CREATE EXTERNAL TABLE database_a.table_a +STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' +TBLPROPERTIES ('iceberg.catalog'='glue'); +``` + +You can also preload the catalog by setting the configurations above in `hive-site.xml`. + +## Catalogs + +There are multiple different options that users can choose to build an Iceberg catalog with AWS. + +### Glue Catalog + +Iceberg enables the use of [AWS Glue](https://aws.amazon.com/glue) as the `Catalog` implementation. +When used, an Iceberg namespace is stored as a [Glue Database](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-databases.html), +an Iceberg table is stored as a [Glue Table](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html), +and every Iceberg table version is stored as a [Glue TableVersion](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-TableVersion). +You can start using Glue catalog by specifying the `catalog-impl` as `org.apache.iceberg.aws.glue.GlueCatalog`, +just like what is shown in the [enabling AWS integration](#enabling-aws-integration) section above. +More details about loading the catalog can be found in individual engine pages, such as [Spark](../spark-configuration/#loading-a-custom-catalog) and [Flink](../flink/#creating-catalogs-and-using-catalogs). + +#### Glue Catalog ID + +There is a unique Glue metastore in each AWS account and each AWS region. +By default, `GlueCatalog` chooses the Glue metastore to use based on the user's default AWS client credential and region setup. +You can specify the Glue catalog ID through `glue.id` catalog property to point to a Glue catalog in a different AWS account. +The Glue catalog ID is your numeric AWS account ID. 
+If the Glue catalog is in a different region, you should configure your AWS client to point to the correct region; +see [AWS client customization](#aws-client-customization) for more details. + +#### Skip Archive + +AWS Glue has the ability to archive older table versions and a user can roll back the table to any historical version if needed. +By default, the Iceberg Glue Catalog will skip the archival of older table versions. +If a user wishes to archive older table versions, they can set `glue.skip-archive` to false. +Note that for streaming ingestion into Iceberg tables, setting `glue.skip-archive` to false will quickly create a lot of Glue table versions. +For more details, please read [Glue Quotas](https://docs.aws.amazon.com/general/latest/gr/glue.html) and the [UpdateTable API](https://docs.aws.amazon.com/glue/latest/webapi/API_UpdateTable.html). + +#### Skip Name Validation + +Allows users to skip name validation for table names and namespaces. +It is recommended to stick to [Glue best practices](https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html) +to make sure operations are Hive compatible. +This is only added for users that have existing conventions using non-standard characters. When database name +and table name validation are skipped, there is no guarantee that downstream systems will all support the names. + +#### Optimistic Locking + +By default, Iceberg uses Glue's optimistic locking for concurrent updates to a table. +With optimistic locking, each table has a version ID. +When a user retrieves the table metadata, Iceberg records the version ID of that table. +Users can update the table as long as the version ID on the server side remains unchanged. +A version mismatch occurs if someone else modified the table before you did, causing an update failure. +Iceberg then refreshes metadata and checks if there is a conflict. +If there is no commit conflict, the operation will be retried. 
+Optimistic locking guarantees atomic transaction of Iceberg tables in Glue. +It also prevents others from accidentally overwriting your changes. + +{{< hint info >}} +Please use AWS SDK version >= 2.17.131 to leverage Glue's Optimistic Locking. +If the AWS SDK version is below 2.17.131, only in-memory lock is used. To ensure atomic transaction, you need to set up a [DynamoDb Lock Manager](#dynamodb-lock-manager). +{{< /hint >}} + +#### Warehouse Location + +Similar to all other catalog implementations, `warehouse` is a required catalog property to determine the root path of the data warehouse in storage. +By default, Glue only allows a warehouse location in S3 because of the use of `S3FileIO`. +To store data in a different local or cloud store, Glue catalog can switch to use `HadoopFileIO` or any custom FileIO by setting the `io-impl` catalog property. +Details about this feature can be found in the [custom FileIO](../custom-catalog/#custom-file-io-implementation) section. + +#### Table Location + +By default, the root location for a table `my_table` of namespace `my_ns` is at `my-warehouse-location/my-ns.db/my-table`. +This default root location can be changed at both namespace and table level. + +To use a different path prefix for all tables under a namespace, use AWS console or any AWS Glue client SDK you like to update the `locationUri` attribute of the corresponding Glue database. +For example, you can update the `locationUri` of `my_ns` to `s3://my-ns-bucket`, +then any newly created table will have a default root location under the new prefix. +For instance, a new table `my_table_2` will have its root location at `s3://my-ns-bucket/my_table_2`. + +To use a completely different root path for a specific table, set the `location` table property to the desired root path value you want. 
+For example, in Spark SQL you can do: + +```sql +CREATE TABLE my_catalog.my_ns.my_table ( + id bigint, + data string, + category string) +USING iceberg +OPTIONS ('location'='s3://my-special-table-bucket') +PARTITIONED BY (category); +``` + +For engines like Spark that support the `LOCATION` keyword, the above SQL statement is equivalent to: + +```sql +CREATE TABLE my_catalog.my_ns.my_table ( + id bigint, + data string, + category string) +USING iceberg +LOCATION 's3://my-special-table-bucket' +PARTITIONED BY (category); +``` + +### DynamoDB Catalog + +Iceberg supports using a [DynamoDB](https://aws.amazon.com/dynamodb) table to record and manage database and table information. + +#### Configurations + +The DynamoDB catalog supports the following configurations: + +| Property | Default | Description | +| --------------------------------- | -------------------------------------------------- | ------------------------------------------------------ | +| dynamodb.table-name | iceberg | name of the DynamoDB table used by DynamoDbCatalog | + + +#### Internal Table Design + +The DynamoDB table is designed with the following columns: + +| Column | Key | Type | Description | +| ----------------- | --------------- | ----------- |--------------------------------------------------------------------- | +| identifier | partition key | string | table identifier such as `db1.table1`, or string `NAMESPACE` for namespaces | +| namespace | sort key | string | namespace name. A [global secondary index (GSI)](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html) is created with namespace as partition key, identifier as sort key, no other projected columns | +| v | | string | row version, used for optimistic locking | +| updated_at | | number | timestamp (millis) of the last update | +| created_at | | number | timestamp (millis) of the table creation | +| p. 
| | string | Iceberg-defined table properties including `table_type`, `metadata_location` and `previous_metadata_location` or namespace properties | + +This design has the following benefits: + +1. it avoids potential [hot partition issues](https://aws.amazon.com/premiumsupport/knowledge-center/dynamodb-table-throttled/) if there is heavy write traffic to the tables within the same namespace, because the partition key is at the table level +2. namespace operations are clustered in a single partition to avoid affecting table commit operations +3. a sort key to partition key reverse GSI is used for list table operation, and all other operations are single row ops or single partition query. No full table scan is needed for any operation in the catalog. +4. a string UUID version field `v` is used instead of `updated_at` to avoid collisions when 2 processes commit in the same millisecond +5. multi-row transaction is used for `catalog.renameTable` to ensure idempotency +6. properties are flattened as top level columns so that users can add a custom GSI on any property field to customize the catalog. For example, users can store owner information as table property `owner`, and search tables by owner by adding a GSI on the `p.owner` column. + +### RDS JDBC Catalog + +Iceberg also supports the JDBC catalog which uses a table in a relational database to manage Iceberg tables. +You can configure the JDBC catalog to use relational database services like [AWS RDS](https://aws.amazon.com/rds). +Read [the JDBC integration page](../jdbc/#jdbc-catalog) for guides and examples about using the JDBC catalog. +Read [this AWS documentation](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/UsingWithRDS.IAMDBAuth.Connecting.Java.html) for more details about configuring the JDBC catalog with IAM authentication. + +### Which catalog to choose? + +With all the available options, we offer the following guidelines when choosing the right catalog to use for your application: + +1. 
if your organization has an existing Glue metastore or plans to use the AWS analytics ecosystem including Glue, [Athena](https://aws.amazon.com/athena), [EMR](https://aws.amazon.com/emr), [Redshift](https://aws.amazon.com/redshift) and [LakeFormation](https://aws.amazon.com/lake-formation), Glue catalog provides the easiest integration. +2. if your application requires frequent updates to tables or high read and write throughput (e.g. streaming writes), the Glue and DynamoDB catalogs provide the best performance through optimistic locking. +3. if you would like to enforce access control for tables in a catalog, Glue tables can be managed as an [IAM resource](https://docs.aws.amazon.com/service-authorization/latest/reference/list_awsglue.html), whereas DynamoDB catalog tables can only be managed through [item-level permissions](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/specifying-conditions.html), which is much more complicated. +4. if you would like to query tables based on table property information without the need to scan the entire catalog, DynamoDB catalog allows you to build secondary indexes for any arbitrary property field and provides efficient query performance. +5. if you would like to have the benefits of DynamoDB catalog while also connecting to Glue, you can enable [DynamoDB stream with Lambda trigger](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.Lambda.Tutorial.html) to asynchronously update your Glue metastore with table information in the DynamoDB catalog. +6. if your organization already maintains an existing relational database in RDS or uses [serverless Aurora](https://aws.amazon.com/rds/aurora/serverless/) to manage tables, the JDBC catalog provides the easiest integration. 
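
Guideline 2 relies on optimistic locking, so it may help to see the version-check-and-retry loop from the Optimistic Locking section as a toy model. The sketch below is only an in-memory illustration: `ToyCatalog`, `commit_with_retry`, and the `before_update` hook are hypothetical names made up for this example and are not part of the Iceberg, Glue, or DynamoDB APIs.

```python
import uuid


class VersionMismatch(Exception):
    """Raised when the stored version ID no longer matches the expected one."""


class ToyCatalog:
    """Hypothetical in-memory stand-in for a catalog entry guarded by a version ID."""

    def __init__(self):
        self.version_id = str(uuid.uuid4())
        self.metadata = {}

    def load(self):
        # Return the version ID seen together with a copy of the metadata.
        return self.version_id, dict(self.metadata)

    def update(self, expected_version, new_metadata):
        # Compare-and-swap: succeed only if nobody committed in between.
        if expected_version != self.version_id:
            raise VersionMismatch(expected_version)
        self.metadata = dict(new_metadata)
        self.version_id = str(uuid.uuid4())


def commit_with_retry(catalog, key, value, max_attempts=3, before_update=None):
    """Set one property, refreshing and retrying when the version has moved."""
    for attempt in range(1, max_attempts + 1):
        version, metadata = catalog.load()
        if key in metadata and metadata[key] != value:
            # Someone else changed the same property: a real conflict, do not retry.
            raise ValueError(f"conflicting value for {key!r}")
        metadata[key] = value
        if before_update is not None:
            before_update(attempt)  # hook used below to simulate a concurrent writer
        try:
            catalog.update(version, metadata)
            return attempt
        except VersionMismatch:
            continue  # refresh metadata and try again
    raise RuntimeError("gave up after repeated version mismatches")


catalog = ToyCatalog()


def concurrent_writer(attempt):
    # On the first attempt only, another writer commits a compatible change.
    if attempt == 1:
        version, metadata = catalog.load()
        metadata["owner"] = "team-a"
        catalog.update(version, metadata)


attempts = commit_with_retry(
    catalog, "comment", "nightly load", before_update=concurrent_writer
)
print(attempts, catalog.metadata)
```

The first attempt fails because the simulated concurrent writer changes the version ID. The second attempt reloads the metadata, finds no conflicting change, and succeeds with both updates preserved.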
+ +## DynamoDb Lock Manager + +[Amazon DynamoDB](https://aws.amazon.com/dynamodb) can be used by `HadoopCatalog` or `HadoopTables` so that for every commit, +the catalog first obtains a lock using a helper DynamoDB table and then tries to safely modify the Iceberg table. +This is necessary for a file system-based catalog to ensure atomic transactions in storage systems like S3 that do not provide file write mutual exclusion. + +This feature requires the following lock-related catalog properties: + +1. Set `lock-impl` as `org.apache.iceberg.aws.dynamodb.DynamoDbLockManager`. +2. Set `lock.table` as the DynamoDB table name you would like to use. If the lock table with the given name does not exist in DynamoDB, a new table is created with billing mode set as [pay-per-request](https://aws.amazon.com/blogs/aws/amazon-dynamodb-on-demand-no-capacity-planning-and-pay-per-request-pricing). + +Other lock-related catalog properties can also be used to adjust locking behaviors such as heartbeat interval. +For more details, please refer to [Lock catalog properties](../configuration/#lock-catalog-properties). + + +## S3 FileIO + +Iceberg allows users to write data to S3 through `S3FileIO`. +`GlueCatalog` by default uses this `FileIO`, and other catalogs can load this `FileIO` using the `io-impl` catalog property. + +### Progressive Multipart Upload + +`S3FileIO` implements a customized progressive multipart upload algorithm to upload data. +Data files are uploaded by parts in parallel as soon as each part is ready, +and each file part is deleted as soon as its upload process completes. +This provides maximized upload speed and minimized local disk usage during uploads. 
+Here are the configurations that users can tune related to this feature: + +| Property | Default | Description | +| --------------------------------- | -------------------------------------------------- | ------------------------------------------------------ | +| s3.multipart.num-threads | the available number of processors in the system | number of threads to use for uploading parts to S3 (shared across all output streams) | +| s3.multipart.part-size-bytes | 32MB | the size of a single part for multipart upload requests | +| s3.multipart.threshold | 1.5 | the threshold expressed as a factor times the multipart size at which to switch from uploading using a single put object request to uploading using multipart upload | +| s3.staging-dir | `java.io.tmpdir` property value | the directory to hold temporary files | + +### S3 Server Side Encryption + +`S3FileIO` supports all 3 S3 server side encryption modes: + +* [SSE-S3](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingServerSideEncryption.html): When you use Server-Side Encryption with Amazon S3-Managed Keys (SSE-S3), each object is encrypted with a unique key. As an additional safeguard, it encrypts the key itself with a master key that it regularly rotates. Amazon S3 server-side encryption uses one of the strongest block ciphers available, 256-bit Advanced Encryption Standard (AES-256), to encrypt your data. +* [SSE-KMS](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingKMSEncryption.html): Server-Side Encryption with Customer Master Keys (CMKs) Stored in AWS Key Management Service (SSE-KMS) is similar to SSE-S3, but with some additional benefits and charges for using this service. There are separate permissions for the use of a CMK that provides added protection against unauthorized access of your objects in Amazon S3. SSE-KMS also provides you with an audit trail that shows when your CMK was used and by whom. 
Additionally, you can create and manage customer managed CMKs or use AWS managed CMKs that are unique to you, your service, and your Region. +* [SSE-C](https://docs.aws.amazon.com/AmazonS3/latest/dev/ServerSideEncryptionCustomerKeys.html): With Server-Side Encryption with Customer-Provided Keys (SSE-C), you manage the encryption keys and Amazon S3 manages the encryption, as it writes to disks, and decryption when you access your objects. + +To enable server side encryption, use the following configuration properties: + +| Property | Default | Description | +| --------------------------------- | ---------------------------------------- | ------------------------------------------------------ | +| s3.sse.type | `none` | `none`, `s3`, `kms` or `custom` | +| s3.sse.key | `aws/s3` for `kms` type, null otherwise | A KMS Key ID or ARN for `kms` type, or a custom base-64 AES256 symmetric key for `custom` type. | +| s3.sse.md5 | null | If SSE type is `custom`, this value must be set as the base-64 MD5 digest of the symmetric key to ensure integrity. | + +### S3 Access Control List + +`S3FileIO` supports S3 access control list (ACL) for detailed access control. +User can choose the ACL level by setting the `s3.acl` property. +For more details, please read [S3 ACL Documentation](https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html). + +### Object Store File Layout + +S3 and many other cloud storage services [throttle requests based on object prefix](https://aws.amazon.com/premiumsupport/knowledge-center/s3-request-limit-avoid-throttling/). +Data stored in S3 with a traditional Hive storage layout can face S3 request throttling as objects are stored under the same file path prefix. + +Iceberg by default uses the Hive storage layout but can be switched to use the `ObjectStoreLocationProvider`. +With `ObjectStoreLocationProvider`, a deterministic hash is generated for each stored file, with the hash appended +directly after the `write.data.path`. 
This ensures files written to S3 are evenly distributed across multiple [prefixes](https://aws.amazon.com/premiumsupport/knowledge-center/s3-object-key-naming-pattern/) in the S3 bucket, which minimizes throttling and maximizes throughput for S3-related IO operations. When using `ObjectStoreLocationProvider`, having a shared and short `write.data.path` across your Iceberg tables will improve performance. + +For more information on how S3 scales API QPS, check out the 2018 re:Invent session on [Best Practices for Amazon S3 and Amazon S3 Glacier](https://youtu.be/rHeTn9pHNKo?t=3219). At [53:39](https://youtu.be/rHeTn9pHNKo?t=3219) it covers how S3 scales and partitions, and at [54:50](https://youtu.be/rHeTn9pHNKo?t=3290) it discusses the 30-60 minute wait time before new partitions are created. + +To use the `ObjectStoreLocationProvider`, add `'write.object-storage.enabled'=true` in the table's properties. +Below is an example Spark SQL command to create a table using the `ObjectStoreLocationProvider`: +```sql +CREATE TABLE my_catalog.my_ns.my_table ( + id bigint, + data string, + category string) +USING iceberg +OPTIONS ( + 'write.object-storage.enabled'=true, + 'write.data.path'='s3://my-table-data-bucket') +PARTITIONED BY (category); +``` + +We can then insert a single row into this new table: +```sql +INSERT INTO my_catalog.my_ns.my_table VALUES (1, 'Pizza', 'orders'); +``` + +This writes the data to S3 with a hash (`2d3905f8`) appended directly after the `write.data.path`, so that reads to the table are spread evenly across [S3 bucket prefixes](https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html), which improves performance. +``` +s3://my-table-data-bucket/2d3905f8/my_ns.db/my_table/category=orders/00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet +``` + +Note that the path resolution logic for `ObjectStoreLocationProvider` is `write.data.path` then `/data`. 
+However, for the older versions up to 0.12.0, the logic is as follows: +- before 0.12.0, `write.object-storage.path` must be set. +- at 0.12.0, `write.object-storage.path` then `write.folder-storage.path` then `/data`. + +For more details, please refer to the [LocationProvider Configuration](../custom-catalog/#custom-location-provider-implementation) section. + +### S3 Strong Consistency + +In November 2020, S3 announced [strong consistency](https://aws.amazon.com/s3/consistency/) for all read operations, and Iceberg is updated to fully leverage this feature. +There is no redundant consistency wait and check which might negatively impact performance during IO operations. + +### Hadoop S3A FileSystem + +Before `S3FileIO` was introduced, many Iceberg users chose to use `HadoopFileIO` to write data to S3 through the [S3A FileSystem](https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java). +As introduced in the previous sections, `S3FileIO` adopts the latest AWS clients and S3 features for optimized security and performance +and is thus recommended for S3 use cases rather than the S3A FileSystem. + +`S3FileIO` writes data with the `s3://` URI scheme, but it is also compatible with schemes written by the S3A FileSystem. +This means that `S3FileIO` is still able to read any table manifests containing `s3a://` or `s3n://` file paths. +This feature allows people to easily switch from S3A to `S3FileIO`. + +If for any reason you have to use S3A, here are the instructions: + +1. To store data using S3A, specify the `warehouse` catalog property to be an S3A path, e.g. `s3a://my-bucket/my-warehouse` +2. For `HiveCatalog`, to also store metadata using S3A, specify the Hadoop config property `hive.metastore.warehouse.dir` to be an S3A path. +3. Add [hadoop-aws](https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws) as a runtime dependency of your compute engine. +4. 
Configure AWS settings based on [hadoop-aws documentation](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html) (make sure you check the version, because S3A configuration varies a lot based on the version you use). + +### S3 Write Checksum Verification + +To ensure integrity of uploaded objects, checksum validations for S3 writes can be turned on by setting catalog property `s3.checksum-enabled` to `true`. +This is turned off by default. + +### S3 Tags + +Custom [tags](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-tagging.html) can be added to S3 objects while writing and deleting. +For example, to write S3 tags with Spark 3.3, you can start the Spark SQL shell with: +``` +spark-sql --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \ + --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my/key/prefix \ + --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \ + --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \ + --conf spark.sql.catalog.my_catalog.s3.write.tags.my_key1=my_val1 \ + --conf spark.sql.catalog.my_catalog.s3.write.tags.my_key2=my_val2 +``` +For the above example, the objects in S3 will be saved with tags: `my_key1=my_val1` and `my_key2=my_val2`. Note that the specified write tags are applied only at object creation. + +When the catalog property `s3.delete-enabled` is set to `false`, the objects are not hard-deleted from S3. +This is expected to be used in combination with S3 delete tagging, so objects are tagged and removed using [S3 lifecycle policy](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html). +The property is set to `true` by default. + +With the `s3.delete.tags` config, objects are tagged with the configured key-value pairs before deletion. +Users can configure tag-based object lifecycle policy at bucket level to transition objects to different tiers. 
+For example, to add S3 delete tags with Spark 3.3, you can start the Spark SQL shell with: + +```sh +spark-sql --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \ + --conf spark.sql.catalog.my_catalog.warehouse=s3://iceberg-warehouse/s3-tagging \ + --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \ + --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \ + --conf spark.sql.catalog.my_catalog.s3.delete.tags.my_key3=my_val3 \ + --conf spark.sql.catalog.my_catalog.s3.delete-enabled=false +``` + +For the above example, the objects in S3 will be saved with tags: `my_key3=my_val3` before deletion. +Users can also use the catalog property `s3.delete.num-threads` to specify the number of threads to be used for adding delete tags to the S3 objects. + +When the catalog properties `s3.write.table-tag-enabled` and `s3.write.namespace-tag-enabled` are set to `true`, the objects in S3 will be saved with the tags `iceberg.table=` and `iceberg.namespace=`, followed by the table name and the namespace name respectively. +Users can define access and data retention policy per namespace or table based on these tags. +For example, to write table and namespace name as S3 tags with Spark 3.3, you can start the Spark SQL shell with: +```sh +spark-sql --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \ + --conf spark.sql.catalog.my_catalog.warehouse=s3://iceberg-warehouse/s3-tagging \ + --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \ + --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \ + --conf spark.sql.catalog.my_catalog.s3.write.table-tag-enabled=true \ + --conf spark.sql.catalog.my_catalog.s3.write.namespace-tag-enabled=true +``` +For more details on tag restrictions, please refer to [User-Defined Tag Restrictions](https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/allocation-tag-restrictions.html). 
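
The delete tags described above only take effect once a bucket lifecycle rule acts on them. As a rough sketch, the rule could be built as follows. The helper name `expire_tagged_objects_rule`, the rule ID, and the 7-day expiration are assumptions chosen for illustration. The tag `my_key3=my_val3` and the bucket `iceberg-warehouse` come from the delete-tag example.

```python
import json


def expire_tagged_objects_rule(tag_key, tag_value, days, rule_id="expire-iceberg-deleted"):
    """Build one S3 lifecycle rule that expires objects carrying the given tag."""
    return {
        "ID": rule_id,
        "Status": "Enabled",
        # Only objects tagged by S3FileIO at "delete" time match this filter.
        "Filter": {"Tag": {"Key": tag_key, "Value": tag_value}},
        # Assumed retention period; pick whatever grace period suits your tables.
        "Expiration": {"Days": days},
    }


lifecycle = {"Rules": [expire_tagged_objects_rule("my_key3", "my_val3", days=7)]}
print(json.dumps(lifecycle, indent=2))

# With boto3 installed and credentials configured, the rule could be applied with:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="iceberg-warehouse", LifecycleConfiguration=lifecycle)
```

The same JSON document can also be pasted into the S3 console or passed to the AWS CLI. Note that applying a lifecycle configuration replaces the bucket's existing one, so merge with any existing rules first.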
+ +### S3 Access Points + +[Access Points](https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-access-points.html) can be used to perform +S3 operations by specifying a mapping of bucket to access points. This is useful for multi-region access, cross-region access, +disaster recovery, etc. + +To use cross-region access points, we additionally need to set the `s3.use-arn-region-enabled` catalog property to +`true` to enable `S3FileIO` to make cross-region calls; this is not required for same-region or multi-region access points. + +For example, to use S3 access points with Spark 3.3, you can start the Spark SQL shell with: +``` +spark-sql --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \ + --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket2/my/key/prefix \ + --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \ + --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \ + --conf spark.sql.catalog.my_catalog.s3.use-arn-region-enabled=false \ + --conf spark.sql.catalog.my_catalog.s3.access-points.my-bucket1=arn:aws:s3::123456789012:accesspoint:mfzwi23gnjvgw.mrap \ + --conf spark.sql.catalog.my_catalog.s3.access-points.my-bucket2=arn:aws:s3::123456789012:accesspoint:mfzwi23gnjvgw.mrap +``` +For the above example, the objects in S3 on `my-bucket1` and `my-bucket2` buckets will use `arn:aws:s3::123456789012:accesspoint:mfzwi23gnjvgw.mrap` +access-point for all S3 operations. + +For more details on using access-points, please refer to [Using access points with compatible Amazon S3 operations](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-points-usage-examples.html). + +### S3 Acceleration + +[S3 Acceleration](https://aws.amazon.com/s3/transfer-acceleration/) can be used to speed up transfers to and from Amazon S3 by as much as 50-500% for long-distance transfer of larger objects. 
+ +To use S3 Acceleration, we need to set the `s3.acceleration-enabled` catalog property to `true` to enable `S3FileIO` to make accelerated S3 calls. + +For example, to use S3 Acceleration with Spark 3.3, you can start the Spark SQL shell with: +``` +spark-sql --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \ + --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket2/my/key/prefix \ + --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \ + --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \ + --conf spark.sql.catalog.my_catalog.s3.acceleration-enabled=true +``` + +For more details on using S3 Acceleration, please refer to [Configuring fast, secure file transfers using Amazon S3 Transfer Acceleration](https://docs.aws.amazon.com/AmazonS3/latest/userguide/transfer-acceleration.html). + +### S3 Dual-stack + +[S3 Dual-stack](https://docs.aws.amazon.com/AmazonS3/latest/userguide/dual-stack-endpoints.html) allows a client to access an S3 bucket through a dual-stack endpoint. +When clients request a dual-stack endpoint, the bucket URL resolves to an IPv6 address if possible, and otherwise falls back to IPv4. + +To use S3 Dual-stack, we need to set the `s3.dualstack-enabled` catalog property to `true` to enable `S3FileIO` to make dual-stack S3 calls. 
+ +For example, to use S3 Dual-stack with Spark 3.3, you can start the Spark SQL shell with: +``` +spark-sql --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \ + --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket2/my/key/prefix \ + --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \ + --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \ + --conf spark.sql.catalog.my_catalog.s3.dualstack-enabled=true +``` + +For more details on using S3 Dual-stack, please refer to [Using dual-stack endpoints from the AWS CLI and the AWS SDKs](https://docs.aws.amazon.com/AmazonS3/latest/userguide/dual-stack-endpoints.html#dual-stack-endpoints-cli). + +## AWS Client Customization + +Many organizations have customized their way of configuring AWS clients with their own credential provider, access proxy, retry strategy, etc. +Iceberg allows users to plug in their own implementation of `org.apache.iceberg.aws.AwsClientFactory` by setting the `client.factory` catalog property. + +### Cross-Account and Cross-Region Access + +It is a common use case for organizations to have a centralized AWS account for Glue metastore and S3 buckets, and use different AWS accounts and regions for different teams to access those resources. +In this case, a [cross-account IAM role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use.html) is needed to access those centralized resources. +Iceberg provides an AWS client factory `AssumeRoleAwsClientFactory` to support this common use case. +This also serves as an example for users who would like to implement their own AWS client factory. 
+ +This client factory has the following configurable catalog properties: + +| Property | Default | Description | +| --------------------------------- | ---------------------------------------- | ------------------------------------------------------ | +| client.assume-role.arn | null, requires user input | ARN of the role to assume, e.g. arn:aws:iam::123456789:role/myRoleToAssume | +| client.assume-role.region | null, requires user input | All AWS clients except the STS client will use the given region instead of the default region chain | +| client.assume-role.external-id | null | An optional [external ID](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-user_externalid.html) | +| client.assume-role.timeout-sec | 1 hour | Timeout of each assume role session. At the end of the timeout, a new set of role session credentials will be fetched through an STS client. | + +By using this client factory, an STS client is initialized with the default credential and region to assume the specified role. +The Glue, S3 and DynamoDB clients are then initialized with the assume-role credential and region to access resources. 
+Here is an example to start Spark shell with this client factory: + +```shell +spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:{{% icebergVersion %}},org.apache.iceberg:iceberg-aws-bundle:{{% icebergVersion %}} \ + --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \ + --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my/key/prefix \ + --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \ + --conf spark.sql.catalog.my_catalog.client.factory=org.apache.iceberg.aws.AssumeRoleAwsClientFactory \ + --conf spark.sql.catalog.my_catalog.client.assume-role.arn=arn:aws:iam::123456789:role/myRoleToAssume \ + --conf spark.sql.catalog.my_catalog.client.assume-role.region=ap-northeast-1 +``` + +### HTTP Client Configurations +AWS clients support two types of HTTP Client, [URL Connection HTTP Client](https://mvnrepository.com/artifact/software.amazon.awssdk/url-connection-client) +and [Apache HTTP Client](https://mvnrepository.com/artifact/software.amazon.awssdk/apache-client). +By default, AWS clients use **Apache** HTTP Client to communicate with the service. +This HTTP client supports various functionalities and customized settings, such as expect-continue handshake and TCP KeepAlive, at the cost of extra dependency and additional startup latency. +In contrast, URL Connection HTTP Client optimizes for minimum dependencies and startup latency but supports less functionality than other implementations. + +For more details of configuration, see sections [URL Connection HTTP Client Configurations](#url-connection-http-client-configurations) and [Apache HTTP Client Configurations](#apache-http-client-configurations). 
+ +Configure the following property to set the type of HTTP client: + +| Property | Default | Description | +|------------------|---------|------------------------------------------------------------------------------------------------------------| +| http-client.type | apache | Type of HTTP Client: `urlconnection` (URL Connection HTTP Client) or `apache` (Apache HTTP Client) | + +#### URL Connection HTTP Client Configurations + +URL Connection HTTP Client has the following configurable properties: + +| Property | Default | Description | +|-------------------------------------------------|---------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| http-client.urlconnection.socket-timeout-ms | null | An optional [socket timeout](https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/http/urlconnection/UrlConnectionHttpClient.Builder.html#socketTimeout(java.time.Duration)) in milliseconds | +| http-client.urlconnection.connection-timeout-ms | null | An optional [connection timeout](https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/http/urlconnection/UrlConnectionHttpClient.Builder.html#connectionTimeout(java.time.Duration)) in milliseconds | + +Users can use catalog properties to override the defaults. 
For example, to configure the socket timeout for URL Connection HTTP Client when starting a spark shell, one can add: +```shell +--conf spark.sql.catalog.my_catalog.http-client.urlconnection.socket-timeout-ms=80 +``` + +#### Apache HTTP Client Configurations + +Apache HTTP Client has the following configurable properties: + +| Property | Default | Description | +|-------------------------------------------------------|---------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| http-client.apache.socket-timeout-ms | null | An optional [socket timeout](https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/http/apache/ApacheHttpClient.Builder.html#socketTimeout(java.time.Duration)) in milliseconds | +| http-client.apache.connection-timeout-ms | null | An optional [connection timeout](https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/http/apache/ApacheHttpClient.Builder.html#connectionTimeout(java.time.Duration)) in milliseconds | +| http-client.apache.connection-acquisition-timeout-ms | null | An optional [connection acquisition timeout](https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/http/apache/ApacheHttpClient.Builder.html#connectionAcquisitionTimeout(java.time.Duration)) in milliseconds | +| http-client.apache.connection-max-idle-time-ms | null | An optional [connection max idle timeout](https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/http/apache/ApacheHttpClient.Builder.html#connectionMaxIdleTime(java.time.Duration)) in milliseconds | +| http-client.apache.connection-time-to-live-ms | null | An optional [connection time to live](https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/http/apache/ApacheHttpClient.Builder.html#connectionTimeToLive(java.time.Duration)) in milliseconds | +| 
http-client.apache.expect-continue-enabled | null, disabled by default | An optional `true/false` setting that controls whether [expect continue](https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/http/apache/ApacheHttpClient.Builder.html#expectContinueEnabled(java.lang.Boolean)) is enabled | +| http-client.apache.max-connections | null | An optional [max connections](https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/http/apache/ApacheHttpClient.Builder.html#maxConnections(java.lang.Integer)) in integer | +| http-client.apache.tcp-keep-alive-enabled | null, disabled by default | An optional `true/false` setting that controls whether [tcp keep alive](https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/http/apache/ApacheHttpClient.Builder.html#tcpKeepAlive(java.lang.Boolean)) is enabled | +| http-client.apache.use-idle-connection-reaper-enabled | null, enabled by default | An optional `true/false` setting that controls whether [use idle connection reaper](https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/http/apache/ApacheHttpClient.Builder.html#useIdleConnectionReaper(java.lang.Boolean)) is used | + +Users can use catalog properties to override the defaults. For example, to configure the max connections for Apache HTTP Client when starting a spark shell, one can add: +```shell +--conf spark.sql.catalog.my_catalog.http-client.apache.max-connections=5 +``` + +## Run Iceberg on AWS + +### Amazon Athena + +[Amazon Athena](https://aws.amazon.com/athena/) provides a serverless query engine that could be used to perform read, write, update and optimization tasks against Iceberg tables. +More details could be found [here](https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg.html). 
+ +### Amazon EMR + +[Amazon EMR](https://aws.amazon.com/emr/) can provision clusters with [Spark](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html) (EMR 6 for Spark 3, EMR 5 for Spark 2), +[Hive](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html), [Flink](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-flink.html), and +[Trino](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto.html) that can run Iceberg. + +Starting with EMR version 6.5.0, EMR clusters can be configured to have the necessary Apache Iceberg dependencies installed without requiring bootstrap actions. +Please refer to the [official documentation](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-iceberg-use-cluster.html) on how to create a cluster with Iceberg installed. + +For versions before 6.5.0, you can use a [bootstrap action](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html) similar to the following to pre-install all necessary dependencies: +```sh +#!/bin/bash + +ICEBERG_VERSION={{% icebergVersion %}} +MAVEN_URL=https://repo1.maven.org/maven2 +ICEBERG_MAVEN_URL=$MAVEN_URL/org/apache/iceberg +# NOTE: this is just an example shared class path between Spark and Flink, +# please choose a proper class path for production. +LIB_PATH=/usr/share/aws/aws-java-sdk/ + +ICEBERG_PACKAGES=( + "iceberg-spark-runtime-3.3_2.12" + "iceberg-flink-runtime" + "iceberg-aws-bundle" +) + +install_dependencies () { + install_path=$1 + download_url=$2 + version=$3 + shift 3 # drop install_path, download_url and version so only package names remain + pkgs=("$@") + for pkg in "${pkgs[@]}"; do + sudo wget -P "$install_path" "$download_url/$pkg/$version/$pkg-$version.jar" + done +} + +install_dependencies $LIB_PATH $ICEBERG_MAVEN_URL $ICEBERG_VERSION "${ICEBERG_PACKAGES[@]}" +``` + +### AWS Glue + +[AWS Glue](https://aws.amazon.com/glue/) provides a serverless data integration service +that can be used to perform read, write and update tasks against Iceberg tables. 
+More details could be found [here](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-iceberg.html). + + +### AWS EKS + +[AWS Elastic Kubernetes Service (EKS)](https://aws.amazon.com/eks/) can be used to start any Spark, Flink, Hive, Presto or Trino clusters to work with Iceberg. +Search the [Iceberg blogs](../../../blogs) page for tutorials around running Iceberg with Docker and Kubernetes. + +### Amazon Kinesis + +[Amazon Kinesis Data Analytics](https://aws.amazon.com/about-aws/whats-new/2019/11/you-can-now-run-fully-managed-apache-flink-applications-with-apache-kafka/) provides a platform +to run fully managed Apache Flink applications. You can include Iceberg in your application Jar and run it in the platform. diff --git a/docs/branching-and-tagging.md b/docs/branching-and-tagging.md new file mode 100644 index 000000000000..957675c81c03 --- /dev/null +++ b/docs/branching-and-tagging.md @@ -0,0 +1,123 @@ +--- +title: "Branching and Tagging" +url: branching +aliases: + - "tables/branching" +menu: + main: + parent: Tables + identifier: tables_branching + weight: 0 +--- + + + +# Branching and Tagging + +## Overview + +Iceberg table metadata maintains a snapshot log, which represents the changes applied to a table. +Snapshots are fundamental in Iceberg as they are the basis for reader isolation and time travel queries. +For controlling metadata size and storage costs, Iceberg provides snapshot lifecycle management procedures such as [`expire_snapshots`](../spark-procedures/#expire-snapshots) for removing unused snapshots and no longer necessary data files based on table snapshot retention properties. + +**For more sophisticated snapshot lifecycle management, Iceberg supports branches and tags which are named references to snapshots with their own independent lifecycles. This lifecycle is controlled by branch and tag level retention policies.** +Branches are independent lineages of snapshots and point to the head of the lineage. 
+Branches and tags have a maximum reference age property which controls when the reference to the snapshot itself should be expired. +Branches have retention properties which define the minimum number of snapshots to retain on a branch as well as the maximum age of individual snapshots to retain on the branch. +These properties are used when the expireSnapshots procedure is run. +For details on the algorithm for expireSnapshots, refer to the [spec](../../../spec#snapshot-retention-policy). + +## Use Cases + +Branching and tagging can be used for handling GDPR requirements and retaining important historical snapshots for auditing. +Branches can also be used as part of data engineering workflows, for enabling experimental branches for testing and validating new jobs. +See below for some examples of how branching and tagging can facilitate these use cases. + +### Historical Tags + +Tags can be used for retaining important historical snapshots for auditing purposes. + +![Historical Tags](../img/historical-snapshot-tag.png) + +The above diagram demonstrates retaining important historical snapshots with the following retention policy, defined +via Spark SQL. + +1. Retain 1 snapshot per week for 1 month. This can be achieved by tagging the weekly snapshot and setting the tag retention to be a month. +```sql +-- Create a tag for the first end of week snapshot. Retain the snapshot for a week +ALTER TABLE prod.db.table CREATE TAG `EOW-01` AS OF VERSION 7 RETAIN 7 DAYS; +``` + +2. Retain 1 snapshot per month for 6 months. This can be achieved by tagging the monthly snapshot and setting the tag retention to be 6 months. +```sql +-- Create a tag for the first end of month snapshot. Retain the snapshot for 6 months +ALTER TABLE prod.db.table CREATE TAG `EOM-01` AS OF VERSION 30 RETAIN 180 DAYS; +``` + +3. Retain 1 snapshot per year forever. This can be achieved by tagging the annual snapshot.
The default retention for branches and tags is forever. +```sql +-- Create a tag for the end of the year and retain it forever. +ALTER TABLE prod.db.table CREATE TAG `EOY-2023` AS OF VERSION 365; +``` + +4. Create a temporary "test-branch" that is retained for 7 days, keeping the latest 2 snapshots on the branch. +```sql +-- Create a branch "test-branch" which will be retained for 7 days along with the latest 2 snapshots +ALTER TABLE prod.db.table CREATE BRANCH `test-branch` RETAIN 7 DAYS WITH SNAPSHOT RETENTION 2 SNAPSHOTS; +``` + +### Audit Branch + +![Audit Branch](../img/audit-branch.png) + +The above diagram shows an example of using an audit branch for validating a write workflow. + +1. First ensure `write.wap.enabled` is set to `true`. +```sql +ALTER TABLE prod.db.table SET TBLPROPERTIES ( + 'write.wap.enabled'='true' +); +``` +2. Create `audit-branch` starting from snapshot 3, which will be written to and retained for 1 week. +```sql +ALTER TABLE prod.db.table CREATE BRANCH `audit-branch` AS OF VERSION 3 RETAIN 7 DAYS; +``` +3. Writes are performed on a separate `audit-branch` independent from the main table history. +```sql +-- WAP Branch write +SET spark.wap.branch = audit-branch; +INSERT INTO prod.db.table VALUES (3, 'c'); +``` +4. A validation workflow can validate (e.g. data quality) the state of `audit-branch`. +5. After validation, the main branch can be fast-forwarded to the head of `audit-branch` to update the main table state. +```sql +CALL catalog_name.system.fast_forward('prod.db.table', 'main', 'audit-branch'); +``` +6. The branch reference will be removed when `expireSnapshots` is run 1 week later. + +## Usage + +Creating, querying and writing to branches and tags are supported in the Iceberg Java library, and in Spark and Flink engine integrations. 
+ +- [Iceberg Java Library](../java-api-quickstart/#branching-and-tagging) +- [Spark DDLs](../spark-ddl/#branching-and-tagging-ddl) +- [Spark Reads](../spark-queries/#time-travel) +- [Spark Branch Writes](../spark-writes/#writing-to-branches) +- [Flink Reads](../flink-queries/#reading-branches-and-tags-with-SQL) +- [Flink Branch Writes](../flink-writes/#branch-writes) \ No newline at end of file diff --git a/docs/configuration.md b/docs/configuration.md new file mode 100644 index 000000000000..7931394f3c0b --- /dev/null +++ b/docs/configuration.md @@ -0,0 +1,200 @@ +--- +title: "Configuration" +url: configuration +aliases: + - "tables/configuration" +menu: + main: + parent: Tables + identifier: tables_configuration + weight: 0 +--- + + +# Configuration + +## Table properties + +Iceberg tables support table properties to configure table behavior, like the default split size for readers. + +### Read properties + +| Property | Default | Description | +| --------------------------------- | ------------------ | ------------------------------------------------------ | +| read.split.target-size | 134217728 (128 MB) | Target size when combining data input splits | +| read.split.metadata-target-size | 33554432 (32 MB) | Target size when combining metadata input splits | +| read.split.planning-lookback | 10 | Number of bins to consider when combining input splits | +| read.split.open-file-cost | 4194304 (4 MB) | The estimated cost to open a file, used as a minimum weight when combining splits. 
| +| read.parquet.vectorization.enabled| true | Controls whether Parquet vectorized reads are used | +| read.parquet.vectorization.batch-size| 5000 | The batch size for parquet vectorized reads | +| read.orc.vectorization.enabled | false | Controls whether orc vectorized reads are used | +| read.orc.vectorization.batch-size | 5000 | The batch size for orc vectorized reads | + +### Write properties + +| Property | Default | Description | +|------------------------------------------------------|-----------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| write.format.default | parquet | Default file format for the table; parquet, avro, or orc | +| write.delete.format.default | data file format | Default delete file format for the table; parquet, avro, or orc | +| write.parquet.row-group-size-bytes | 134217728 (128 MB) | Parquet row group size | +| write.parquet.page-size-bytes | 1048576 (1 MB) | Parquet page size | +| write.parquet.page-row-limit | 20000 | Parquet page row limit | +| write.parquet.dict-size-bytes | 2097152 (2 MB) | Parquet dictionary page size | +| write.parquet.compression-codec | zstd | Parquet compression codec: zstd, brotli, lz4, gzip, snappy, uncompressed | +| write.parquet.compression-level | null | Parquet compression level | +| write.parquet.bloom-filter-enabled.column.col1 | (not set) | Hint to parquet to write a bloom filter for the column: col1 | +| write.parquet.bloom-filter-max-bytes | 1048576 (1 MB) | The maximum number of bytes for a bloom filter bitset | +| write.avro.compression-codec | gzip | Avro compression codec: gzip(deflate with 9 level), zstd, snappy, uncompressed | +| write.avro.compression-level | null | Avro compression level | +| write.orc.stripe-size-bytes | 67108864 (64 MB) | Define the default ORC stripe size, in bytes | +| write.orc.block-size-bytes | 
268435456 (256 MB) | Define the default file system block size for ORC files | +| write.orc.compression-codec | zlib | ORC compression codec: zstd, lz4, lzo, zlib, snappy, none | +| write.orc.compression-strategy | speed | ORC compression strategy: speed, compression | +| write.orc.bloom.filter.columns | (not set) | Comma separated list of column names for which a Bloom filter must be created | +| write.orc.bloom.filter.fpp | 0.05 | False positive probability for Bloom filter (must > 0.0 and < 1.0) | +| write.location-provider.impl | null | Optional custom implementation for LocationProvider | +| write.metadata.compression-codec | none | Metadata compression codec; none or gzip | +| write.metadata.metrics.max-inferred-column-defaults | 100 | Defines the maximum number of columns for which metrics are collected | +| write.metadata.metrics.default | truncate(16) | Default metrics mode for all columns in the table; none, counts, truncate(length), or full | +| write.metadata.metrics.column.col1 | (not set) | Metrics mode for column 'col1' to allow per-column tuning; none, counts, truncate(length), or full | +| write.target-file-size-bytes | 536870912 (512 MB) | Controls the size of files generated to target about this many bytes | +| write.delete.target-file-size-bytes | 67108864 (64 MB) | Controls the size of delete files generated to target about this many bytes | +| write.distribution-mode | none | Defines distribution of write data: __none__: don't shuffle rows; __hash__: hash distribute by partition key ; __range__: range distribute by partition key or sort key if table has an SortOrder | +| write.delete.distribution-mode | hash | Defines distribution of write delete data | +| write.update.distribution-mode | hash | Defines distribution of write update data | +| write.merge.distribution-mode | none | Defines distribution of write merge data | +| write.wap.enabled | false | Enables write-audit-publish writes | +| write.summary.partition-limit | 0 | Includes 
partition-level summary stats in snapshot summaries if the changed partition count is less than this limit | +| write.metadata.delete-after-commit.enabled | false | Controls whether to delete the oldest **tracked** version metadata files after commit | +| write.metadata.previous-versions-max | 100 | The max number of previous version metadata files to keep before deleting after commit | +| write.spark.fanout.enabled | false | Enables the fanout writer in Spark that does not require data to be clustered; uses more memory | +| write.object-storage.enabled | false | Enables the object storage location provider that adds a hash component to file paths | +| write.data.path | table location + /data | Base location for data files | +| write.metadata.path | table location + /metadata | Base location for metadata files | +| write.delete.mode | copy-on-write | Mode used for delete commands: copy-on-write or merge-on-read (v2 only) | +| write.delete.isolation-level | serializable | Isolation level for delete commands: serializable or snapshot | +| write.update.mode | copy-on-write | Mode used for update commands: copy-on-write or merge-on-read (v2 only) | +| write.update.isolation-level | serializable | Isolation level for update commands: serializable or snapshot | +| write.merge.mode | copy-on-write | Mode used for merge commands: copy-on-write or merge-on-read (v2 only) | +| write.merge.isolation-level | serializable | Isolation level for merge commands: serializable or snapshot | + +### Table behavior properties + +| Property | Default | Description | +| ---------------------------------- | ---------------- | ------------------------------------------------------------- | +| commit.retry.num-retries | 4 | Number of times to retry a commit before failing | +| commit.retry.min-wait-ms | 100 | Minimum time in milliseconds to wait before retrying a commit | +| commit.retry.max-wait-ms | 60000 (1 min) | Maximum time in milliseconds to wait before retrying a commit | +| 
commit.retry.total-timeout-ms | 1800000 (30 min) | Total retry timeout period in milliseconds for a commit |
+| commit.status-check.num-retries | 3 | Number of times to check whether a commit succeeded after a connection is lost before failing due to an unknown commit state |
+| commit.status-check.min-wait-ms | 1000 (1s) | Minimum time in milliseconds to wait before retrying a status-check |
+| commit.status-check.max-wait-ms | 60000 (1 min) | Maximum time in milliseconds to wait before retrying a status-check |
+| commit.status-check.total-timeout-ms | 1800000 (30 min) | Total timeout period in which the commit status-check must succeed, in milliseconds |
+| commit.manifest.target-size-bytes | 8388608 (8 MB) | Target size when merging manifest files |
+| commit.manifest.min-count-to-merge | 100 | Minimum number of manifests to accumulate before merging |
+| commit.manifest-merge.enabled | true | Controls whether to automatically merge manifests on writes |
+| history.expire.max-snapshot-age-ms | 432000000 (5 days) | Default max age of snapshots to keep on the table and all of its branches while expiring snapshots |
+| history.expire.min-snapshots-to-keep | 1 | Default min number of snapshots to keep on the table and all of its branches while expiring snapshots |
+| history.expire.max-ref-age-ms | `Long.MAX_VALUE` (forever) | For snapshot references except the `main` branch, default max age of snapshot references to keep while expiring snapshots. The `main` branch never expires. |
+
+### Reserved table properties
+Reserved table properties are only used to control behaviors when creating or updating a table.
+The values of these properties are not persisted as a part of the table metadata.
+
+| Property | Default | Description |
+| -------------- | -------- | ------------------------------------------------------------- |
+| format-version | 2 | Table's format version (can be 1 or 2) as defined in the [Spec](../../../spec/#format-versioning). Defaults to 2 since version 1.4.0. |
+
+### Compatibility flags
+
+| Property | Default | Description |
+| --------------------------------------------- | -------- | ------------------------------------------------------------- |
+| compatibility.snapshot-id-inheritance.enabled | false | Enables committing snapshots without explicit snapshot IDs (always true if the format version is > 1) |
+
+## Catalog properties
+
+Iceberg catalogs support using catalog properties to configure catalog behaviors. Here is a list of commonly used catalog properties:
+
+| Property | Default | Description |
+| --------------------------------- | ------------------ | ------------------------------------------------------ |
+| catalog-impl | null | A custom `Catalog` implementation for use by an engine |
+| io-impl | null | A custom `FileIO` implementation to use in a catalog |
+| warehouse | null | The root path of the data warehouse |
+| uri | null | A URI string, such as the Hive metastore URI |
+| clients | 2 | Client pool size |
+| cache-enabled | true | Whether to cache catalog entries |
+| cache.expiration-interval-ms | 30000 | How long catalog entries are locally cached, in milliseconds; 0 disables caching, negative values disable expiration |
+| metrics-reporter-impl | org.apache.iceberg.metrics.LoggingMetricsReporter | Custom `MetricsReporter` implementation to use in a catalog. See the [Metrics reporting](metrics-reporting) section for additional details |
+
+`HadoopCatalog` and `HiveCatalog` can access the properties in their constructors.
+Any other custom catalog can access the properties by implementing `Catalog.initialize(catalogName, catalogProperties)`.
+The properties can be manually constructed or passed in from a compute engine like Spark or Flink.
+Spark uses its session properties as catalog properties; for details, see the [Spark configuration](../spark-configuration#catalog-configuration) section.
+Flink passes in catalog properties through the `CREATE CATALOG` statement; for details, see the [Flink](../flink/#creating-catalogs-and-using-catalogs) section.
+
+### Lock catalog properties
+
+Here are the catalog properties related to locking. They are used by some catalog implementations to control the locking behavior during commits.
+
+| Property | Default | Description |
+| --------------------------------- | ------------------ | ------------------------------------------------------ |
+| lock-impl | null | A custom implementation of the lock manager; the actual interface depends on the catalog used |
+| lock.table | null | An auxiliary table for locking, such as in [AWS DynamoDB lock manager](../aws/#dynamodb-for-commit-locking) |
+| lock.acquire-interval-ms | 5000 (5 s) | The interval to wait between each attempt to acquire a lock |
+| lock.acquire-timeout-ms | 180000 (3 min) | The maximum time to try acquiring a lock |
+| lock.heartbeat-interval-ms | 3000 (3 s) | The interval to wait between each heartbeat after acquiring a lock |
+| lock.heartbeat-timeout-ms | 15000 (15 s) | The maximum time without a heartbeat to consider a lock expired |
+
+
+## Hadoop configuration
+
+The following properties from the Hadoop configuration are used by the Hive Metastore connector.
+The HMS table locking is a 2-step process:
+1. Lock Creation: Create the lock in HMS and queue it for acquisition
+2. Lock Check: Check whether the lock was successfully acquired
+
+| Property | Default | Description |
+|-------------------------------------------|-----------------|------------------------------------------------------------------------------|
+| iceberg.hive.client-pool-size | 5 | The size of the Hive client pool when tracking tables in HMS |
+| iceberg.hive.lock-creation-timeout-ms | 180000 (3 min) | Maximum time in milliseconds to create a lock in the HMS |
+| iceberg.hive.lock-creation-min-wait-ms | 50 | Minimum time in milliseconds between retries of creating the lock in the HMS |
+| iceberg.hive.lock-creation-max-wait-ms | 5000 | Maximum time in milliseconds between retries of creating the lock in the HMS |
+| iceberg.hive.lock-timeout-ms | 180000 (3 min) | Maximum time in milliseconds to acquire a lock |
+| iceberg.hive.lock-check-min-wait-ms | 50 | Minimum time in milliseconds between checking the acquisition of the lock |
+| iceberg.hive.lock-check-max-wait-ms | 5000 | Maximum time in milliseconds between checking the acquisition of the lock |
+| iceberg.hive.lock-heartbeat-interval-ms | 240000 (4 min) | The heartbeat interval for the HMS locks. |
+| iceberg.hive.metadata-refresh-max-retries | 2 | Maximum number of retries when the metadata file is missing |
+| iceberg.hive.table-level-lock-evict-ms | 600000 (10 min) | The timeout for the JVM table lock is |
+| iceberg.engine.hive.lock-enabled | true | Use HMS locks to ensure atomicity of commits |
+
+Note: `iceberg.hive.lock-check-max-wait-ms` and `iceberg.hive.lock-heartbeat-interval-ms` should be less than the [transaction timeout](https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.txn.timeout)
+of the Hive Metastore (`hive.txn.timeout` or `metastore.txn.timeout` in the newer versions). Otherwise, the heartbeats on the lock (which happen during the lock checks) would end up expiring in the
+Hive Metastore before the lock is retried from Iceberg.
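The constraint in the note above can be checked before a job starts. The following is a minimal sketch and not part of Iceberg: `HiveLockTimingCheck` and `isSafe` are hypothetical names, the configuration is modeled as a plain `Map` instead of a Hadoop `Configuration`, and the 300000 ms transaction timeout in `main` is only an assumed example value that should be replaced with the value configured on your metastore.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; it is not an Iceberg or Hive class.
final class HiveLockTimingCheck {
  // Defaults copied from the Hadoop configuration table above.
  static final long DEFAULT_LOCK_CHECK_MAX_WAIT_MS = 5000L;
  static final long DEFAULT_LOCK_HEARTBEAT_INTERVAL_MS = 240000L;

  // Returns true when both intervals are strictly below the metastore transaction timeout.
  static boolean isSafe(Map<String, String> conf, long txnTimeoutMs) {
    long checkMaxWait =
        getLong(conf, "iceberg.hive.lock-check-max-wait-ms", DEFAULT_LOCK_CHECK_MAX_WAIT_MS);
    long heartbeat =
        getLong(conf, "iceberg.hive.lock-heartbeat-interval-ms", DEFAULT_LOCK_HEARTBEAT_INTERVAL_MS);
    return checkMaxWait < txnTimeoutMs && heartbeat < txnTimeoutMs;
  }

  // Reads a long-valued property, falling back to the documented default when unset.
  private static long getLong(Map<String, String> conf, String key, long defaultValue) {
    String value = conf.get(key);
    return value == null ? defaultValue : Long.parseLong(value);
  }

  public static void main(String[] args) {
    long txnTimeoutMs = 300000L; // assumed example value for hive.txn.timeout

    Map<String, String> conf = new HashMap<>();
    System.out.println("defaults safe: " + isSafe(conf, txnTimeoutMs));

    // A heartbeat interval longer than the transaction timeout violates the constraint.
    conf.put("iceberg.hive.lock-heartbeat-interval-ms", "600000");
    System.out.println("long heartbeat safe: " + isSafe(conf, txnTimeoutMs));
  }
}
```

The check is strict (`<`) because the note asks for values less than the transaction timeout, not equal to it.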
+
+Warn: Setting `iceberg.engine.hive.lock-enabled`=`false` will cause HiveCatalog to commit to tables without using Hive locks.
+This should only be set to `false` if all of the following conditions are met:
+ - [HIVE-26882](https://issues.apache.org/jira/browse/HIVE-26882) is available on the Hive Metastore server
+ - All other HiveCatalogs committing to tables that this HiveCatalog commits to are also on Iceberg 1.3 or later
+ - All other HiveCatalogs committing to tables that this HiveCatalog commits to have also disabled Hive locks on commit.
+
+**Failing to ensure these conditions risks corrupting the table.**
+
+Even with `iceberg.engine.hive.lock-enabled` set to `false`, a HiveCatalog can still use locks for individual tables by setting the table property `engine.hive.lock-enabled`=`true`.
+This is useful in the case where other HiveCatalogs cannot be upgraded and set to commit without using Hive locks.
+
diff --git a/docs/dell.md b/docs/dell.md
new file mode 100644
index 000000000000..2560332e5bed
--- /dev/null
+++ b/docs/dell.md
@@ -0,0 +1,136 @@
+---
+title: "Dell"
+url: dell
+menu:
+  main:
+    parent: Integrations
+    identifier: dell_integration
+    weight: 0
+---
+
+
+
+# Iceberg Dell Integration
+
+## Dell ECS Integration
+
+Since 0.15.0, Iceberg can be used with Dell's Enterprise Object Storage (ECS) through the ECS catalog.
+
+See [Dell ECS](https://www.dell.com/en-us/dt/storage/ecs/index.htm) for more information on Dell ECS.
+
+### Parameters
+
+When using Dell ECS with Iceberg, these configuration parameters are required:
+
+| Name | Description |
+| ------------------------ | --------------------------------- |
+| ecs.s3.endpoint | ECS S3 service endpoint |
+| ecs.s3.access-key-id | ECS Username |
+| ecs.s3.secret-access-key | S3 Secret Key |
+| warehouse | The location of data and metadata |
+
+The warehouse should use one of the following formats:
+
+| Example | Description |
+| -------------------------- | --------------------------------------------------------------- |
+| ecs://bucket-a | Use the whole bucket as the data |
+| ecs://bucket-a/ | Use the whole bucket as the data. The last `/` is ignored. |
+| ecs://bucket-a/namespace-a | Use a prefix to access the data only in this specific namespace |
+
+The Iceberg `runtime` jar supports different versions of Spark and Flink. Pick the jar that matches your engine version.
+
+Even though the [Dell ECS client](https://github.com/EMCECS/ecs-object-client-java) jar is backward compatible, Dell EMC still recommends using the latest version of the client.
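The three warehouse formats listed above can be illustrated by splitting a location into a bucket and an optional prefix. This is a minimal sketch under stated assumptions: `EcsWarehouseLocation` is a hypothetical helper written for this example, it is not how the ECS catalog itself parses the value, and it strips every trailing `/` although the table only states that the last one is ignored.

```java
// Hypothetical helper for illustration; it does not mirror the ECS catalog implementation.
final class EcsWarehouseLocation {
  private static final String SCHEME = "ecs://";

  final String bucket;
  final String prefix; // empty when the whole bucket is used

  private EcsWarehouseLocation(String bucket, String prefix) {
    this.bucket = bucket;
    this.prefix = prefix;
  }

  static EcsWarehouseLocation parse(String warehouse) {
    if (!warehouse.startsWith(SCHEME)) {
      throw new IllegalArgumentException("Expected an ecs:// location: " + warehouse);
    }
    String rest = warehouse.substring(SCHEME.length());
    // Ignore trailing slashes so "ecs://bucket-a/" behaves like "ecs://bucket-a".
    while (rest.endsWith("/")) {
      rest = rest.substring(0, rest.length() - 1);
    }
    int slash = rest.indexOf('/');
    String bucket = slash < 0 ? rest : rest.substring(0, slash);
    String prefix = slash < 0 ? "" : rest.substring(slash + 1);
    if (bucket.isEmpty()) {
      throw new IllegalArgumentException("Missing bucket in: " + warehouse);
    }
    return new EcsWarehouseLocation(bucket, prefix);
  }

  public static void main(String[] args) {
    String[] examples = {"ecs://bucket-a", "ecs://bucket-a/", "ecs://bucket-a/namespace-a"};
    for (String warehouse : examples) {
      EcsWarehouseLocation location = parse(warehouse);
      System.out.println(
          warehouse + " -> bucket=" + location.bucket + ", prefix='" + location.prefix + "'");
    }
  }
}
```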
+
+### Spark
+
+To use the Dell ECS catalog with Spark 3.5.0, create a Spark session like this:
+
+```bash
+ICEBERG_VERSION=1.4.2
+SPARK_VERSION=3.5_2.12
+ECS_CLIENT_VERSION=3.3.2
+
+DEPENDENCIES="org.apache.iceberg:iceberg-spark-runtime-${SPARK_VERSION}:${ICEBERG_VERSION},\
+org.apache.iceberg:iceberg-dell:${ICEBERG_VERSION},\
+com.emc.ecs:object-client-bundle:${ECS_CLIENT_VERSION}"
+
+spark-sql --packages ${DEPENDENCIES} \
+    --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
+    --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
+    --conf spark.sql.catalog.my_catalog.warehouse=ecs://bucket-a/namespace-a \
+    --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.dell.ecs.EcsCatalog \
+    --conf spark.sql.catalog.my_catalog.ecs.s3.endpoint=http://10.x.x.x:9020 \
+    --conf spark.sql.catalog.my_catalog.ecs.s3.access-key-id= \
+    --conf spark.sql.catalog.my_catalog.ecs.s3.secret-access-key=
+```
+
+Then, use `my_catalog` to access the data in ECS. You can use `SHOW NAMESPACES IN my_catalog` and `SHOW TABLES IN my_catalog` to fetch the namespaces and tables of the catalog.
+
+Known issues with catalog usage:
+
+1. `SparkSession.catalog` does not access third-party Spark catalogs in either Python or Scala, so use SQL statements to list tables and namespaces.
+
+
+### Flink
+
+To use the Dell ECS catalog with Flink, you must first create a Flink environment.
+
+```bash
+# HADOOP_HOME is your Hadoop root directory after unpacking the binary package.
+export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
+
+# download Iceberg dependency
+MAVEN_URL=https://repo1.maven.org/maven2
+ICEBERG_VERSION=0.15.0
+FLINK_VERSION=1.14
+wget ${MAVEN_URL}/org/apache/iceberg/iceberg-flink-runtime-${FLINK_VERSION}/${ICEBERG_VERSION}/iceberg-flink-runtime-${FLINK_VERSION}-${ICEBERG_VERSION}.jar
+wget ${MAVEN_URL}/org/apache/iceberg/iceberg-dell/${ICEBERG_VERSION}/iceberg-dell-${ICEBERG_VERSION}.jar
+
+# download ECS object client
+ECS_CLIENT_VERSION=3.3.2
+wget ${MAVEN_URL}/com/emc/ecs/object-client-bundle/${ECS_CLIENT_VERSION}/object-client-bundle-${ECS_CLIENT_VERSION}.jar
+
+# open the SQL client.
+/path/to/bin/sql-client.sh embedded \
+    -j iceberg-flink-runtime-${FLINK_VERSION}-${ICEBERG_VERSION}.jar \
+    -j iceberg-dell-${ICEBERG_VERSION}.jar \
+    -j object-client-bundle-${ECS_CLIENT_VERSION}.jar \
+    shell
+```
+
+Then, use Flink SQL to create a catalog named `my_catalog`:
+
+```SQL
+CREATE CATALOG my_catalog WITH (
+    'type'='iceberg',
+    'warehouse' = 'ecs://bucket-a/namespace-a',
+    'catalog-impl'='org.apache.iceberg.dell.ecs.EcsCatalog',
+    'ecs.s3.endpoint' = 'http://10.x.x.x:9020',
+    'ecs.s3.access-key-id' = '',
+    'ecs.s3.secret-access-key' = '');
+```
+
+Then, you can run `USE CATALOG my_catalog`, `SHOW DATABASES`, and `SHOW TABLES` to fetch the namespaces and tables of the catalog.
+
+### Limitations
+
+When you use the catalog with Dell ECS only, be aware of these limitations:
+
+1. `RENAME` statements are supported without other protections. Before renaming a table, you must ensure that all commits to the original table have finished.
+2. `RENAME` statements only rename the table without moving any data files. This can lead to a table's data being stored in a path outside of the configured warehouse path.
+3. The CAS operations used by table commits are based on the checksum of the object. There is a very small probability of a checksum conflict.
diff --git a/docs/delta-lake-migration.md b/docs/delta-lake-migration.md
new file mode 100644
index 000000000000..b0e7b221f960
--- /dev/null
+++ b/docs/delta-lake-migration.md
@@ -0,0 +1,123 @@
+---
+title: "Delta Lake Migration"
+url: delta-lake-migration
+menu:
+  main:
+    parent: "Migration"
+    identifier: delta_lake_migration
+    weight: 300
+---
+
+
+# Delta Lake Table Migration
+Delta Lake is a table format that supports the Parquet file format and provides time travel and versioning features. When migrating data from Delta Lake to Iceberg,
+it is common to migrate all snapshots to maintain the history of the data.
+
+Currently, Iceberg supports the Snapshot Table action for migrating from Delta Lake to Iceberg tables.
+Since Delta Lake tables maintain transactions, all available transactions will be committed to the new Iceberg table as transactions in order.
+For Delta Lake tables, any additional data files added after the initial migration will be included in their corresponding transactions and subsequently added to the new Iceberg table using the Add Transaction action.
+The Add Transaction action, a variant of the Add File action, is still under development.
+
+## Enabling Migration from Delta Lake to Iceberg
+The `iceberg-delta-lake` module is not bundled with Spark and Flink engine runtimes. To enable migration from Delta Lake, the minimum required dependencies are:
+- [iceberg-delta-lake](https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-delta-lake/1.2.1/iceberg-delta-lake-1.2.1.jar)
+- [delta-standalone-0.6.0](https://repo1.maven.org/maven2/io/delta/delta-standalone_2.13/0.6.0/delta-standalone_2.13-0.6.0.jar)
+- [delta-storage-2.2.0](https://repo1.maven.org/maven2/io/delta/delta-storage/2.2.0/delta-storage-2.2.0.jar)
+
+### Compatibilities
+The module is built and tested with `Delta Standalone:0.6.0` and supports Delta Lake tables with the following protocol version:
+* `minReaderVersion`: 1
+* `minWriterVersion`: 2
+
+Please refer to [Delta Lake Table Protocol Versioning](https://docs.delta.io/latest/versioning.html) for more details about Delta Lake protocol versions.
+
+### API
+The `iceberg-delta-lake` module provides an interface named `DeltaLakeToIcebergMigrationActionsProvider`, which contains actions that help convert Delta Lake tables to Iceberg.
+The supported actions are:
+* `snapshotDeltaLakeTable`: snapshot an existing Delta Lake table to an Iceberg table
+
+### Default Implementation
+The `iceberg-delta-lake` module also provides a default implementation of the interface, which can be accessed by:
+```java
+DeltaLakeToIcebergMigrationActionsProvider defaultActions = DeltaLakeToIcebergMigrationActionsProvider.defaultActions();
+```
+
+## Snapshot Delta Lake Table to Iceberg
+The action `snapshotDeltaLakeTable` reads the Delta Lake table's transactions and converts them to a new Iceberg table with the same schema and partitioning in one Iceberg transaction.
+The original Delta Lake table remains unchanged.
+
+The newly created table can be changed or written to without affecting the source table, but the snapshot uses the original table's data files.
+Existing data files are added to the Iceberg table's metadata and can be read using a name-to-id mapping created from the original table schema.
+
+When inserts or overwrites run on the snapshot, new files are placed in the snapshot table's location. By default, this location is the same as that
+of the source Delta Lake table. Users can also specify a different location for the snapshot table.
+
+{{< hint info >}}
+Because tables created by `snapshotDeltaLakeTable` are not the sole owners of their data files, they are prohibited from
+actions like `expire_snapshots` which would physically delete data files. Iceberg deletes, which only affect metadata,
+are still allowed. In addition, any operations which affect the original data files will disrupt the snapshot's
+integrity. DELETE statements executed against the original Delta Lake table will remove original data files and the
+`snapshotDeltaLakeTable` table will no longer be able to access them.
+{{< /hint >}}
+
+#### Usage
+| Required Input | Configured By | Description |
+|------------------------------|---------------|-------------|
+| Source Table Location | Argument [`sourceTableLocation`](https://iceberg.apache.org/javadoc/latest/org/apache/iceberg/delta/DeltaLakeToIcebergMigrationActionsProvider.html#snapshotDeltaLakeTable(java.lang.String)) | The location of the source Delta Lake table |
+| New Iceberg Table Identifier | Configuration API [`as`](https://iceberg.apache.org/javadoc/latest/org/apache/iceberg/delta/SnapshotDeltaLakeTable.html#as(org.apache.iceberg.catalog.TableIdentifier)) | The identifier that specifies the namespace and table name for the new Iceberg table |
+| Iceberg Catalog | Configuration API [`icebergCatalog`](https://iceberg.apache.org/javadoc/latest/org/apache/iceberg/delta/SnapshotDeltaLakeTable.html#icebergCatalog(org.apache.iceberg.catalog.Catalog)) | The catalog used to create the new Iceberg table |
+| Hadoop Configuration | Configuration API [`deltaLakeConfiguration`](https://iceberg.apache.org/javadoc/latest/org/apache/iceberg/delta/SnapshotDeltaLakeTable.html#deltaLakeConfiguration(org.apache.hadoop.conf.Configuration)) | The Hadoop Configuration used to read the source Delta Lake table. |
+
+For detailed usage and other optional configurations, please refer to the [SnapshotDeltaLakeTable API](https://iceberg.apache.org/javadoc/latest/org/apache/iceberg/delta/SnapshotDeltaLakeTable.html).
+
+#### Output
+| Output Name | Type | Description |
+| ------------|------|-------------|
+| `imported_files_count` | long | Number of files added to the new table |
+
+#### Added Table Properties
+By default, the following table properties are added to the newly created Iceberg table:
+
+| Property Name | Value | Description |
+|-------------------------------|-------------------------------------------|--------------------------------------------------------------------|
+| `snapshot_source` | `delta` | Indicates that the table is a snapshot of a Delta Lake table |
+| `original_location` | location of the Delta Lake table | The absolute path to the location of the original Delta Lake table |
+| `schema.name-mapping.default` | JSON name mapping derived from the schema | The name mapping string used to read the Delta Lake table's data files |
+
+#### Examples
+```java
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.catalog.Catalog;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.iceberg.delta.DeltaLakeToIcebergMigrationActionsProvider;
+
+String sourceDeltaLakeTableLocation = "s3://my-bucket/delta-table";
+String destTableLocation = "s3://my-bucket/iceberg-table";
+TableIdentifier destTableIdentifier = TableIdentifier.of("my_db", "my_table");
+Catalog icebergCatalog = ...; // Iceberg Catalog fetched from engines like Spark or created via CatalogUtil.loadCatalog
+Configuration hadoopConf = ...; // Hadoop Configuration fetched from engines like Spark, with the file system configuration needed to access the Delta Lake table.
+
+DeltaLakeToIcebergMigrationActionsProvider.defaultActions()
+    .snapshotDeltaLakeTable(sourceDeltaLakeTableLocation)
+    .as(destTableIdentifier)
+    .icebergCatalog(icebergCatalog)
+    .tableLocation(destTableLocation)
+    .deltaLakeConfiguration(hadoopConf)
+    .tableProperty("my_property", "my_value")
+    .execute();
+```
diff --git a/docs/evolution.md b/docs/evolution.md
new file mode 100644
index 000000000000..7baf463ef50e
--- /dev/null
+++ b/docs/evolution.md
@@ -0,0 +1,109 @@
+---
+title: Evolution
+url: evolution
+aliases:
+  - "tables/evolution"
+menu:
+  main:
+    parent: Tables
+    identifier: tables_evolution
+    weight: 0
+---
+
+
+# Evolution
+
+Iceberg supports **in-place table evolution**. You can [evolve a table schema](#schema-evolution) just like SQL -- even in nested structures -- or [change partition layout](#partition-evolution) when data volume changes. Iceberg does not require costly distractions, like rewriting table data or migrating to a new table.
+
+For example, Hive table partitioning cannot change, so moving from a daily partition layout to an hourly partition layout requires a new table. And because queries are dependent on partitions, queries must be rewritten for the new table. In some cases, even changes as simple as renaming a column are either not supported, or can cause [data correctness](#correctness) problems.
+
+## Schema evolution
+
+Iceberg supports the following schema evolution changes:
+
+* **Add** -- add a new column to the table or to a nested struct
+* **Drop** -- remove an existing column from the table or a nested struct
+* **Rename** -- rename an existing column or field in a nested struct
+* **Update** -- widen the type of a column, struct field, map key, map value, or list element
+* **Reorder** -- change the order of columns or fields in a nested struct
+
+Iceberg schema updates are **metadata changes**, so no data files need to be rewritten to perform the update.
+
+Note that map keys do not support adding or dropping struct fields that would change equality.
+
+### Correctness
+
+Iceberg guarantees that **schema evolution changes are independent and free of side-effects**, without rewriting files:
+
+1. Added columns never read existing values from another column.
+2. Dropping a column or field does not change the values in any other column.
+3. Updating a column or field does not change values in any other column.
+4. Changing the order of columns or fields in a struct does not change the values associated with a column or field name.
+
+Iceberg uses unique IDs to track each column in a table. When you add a column, it is assigned a new ID so existing data is never used by mistake.
+
+* Formats that track columns by name can inadvertently un-delete a column if a name is reused, which violates #1.
+* Formats that track columns by position cannot delete columns without changing the names that are used for each column, which violates #2.
+
+
+## Partition evolution
+
+Iceberg table partitioning can be updated in an existing table because queries do not reference partition values directly.
+
+When you evolve a partition spec, the old data written with an earlier spec remains unchanged. New data is written using the new spec in a new layout. Metadata for each of the partition versions is kept separately. Because of this, when you start writing queries, you get split planning. This is where each partition layout plans files separately using the filter it derives for that specific partition layout. Here's a visual representation of a contrived example:
+
+![Partition evolution diagram](../img/partition-spec-evolution.png)
+*The data for 2008 is partitioned by month. Starting from 2009 the table is updated so that the data is instead partitioned by day. Both partitioning layouts are able to coexist in the same table.*
+
+Iceberg uses [hidden partitioning](../partitioning), so you don't *need* to write queries for a specific partition layout to be fast. Instead, you can write queries that select the data you need, and Iceberg automatically prunes out files that don't contain matching data.
+
+Partition evolution is a metadata operation and does not eagerly rewrite files.
+
+Iceberg's Java table API provides the `updateSpec` API to update the partition spec.
+For example, the following code could be used to update the partition spec to add a new partition field that places `id` column values into 8 buckets and remove an existing partition field `category`:
+
+```java
+import static org.apache.iceberg.expressions.Expressions.bucket;
+
+Table sampleTable = ...;
+sampleTable.updateSpec()
+    .addField(bucket("id", 8))
+    .removeField("category")
+    .commit();
+```
+
+Spark supports updating the partition spec through its `ALTER TABLE` SQL statement; for details, see [Spark SQL](../spark-ddl/#alter-table--add-partition-field).
+
+## Sort order evolution
+
+Similar to the partition spec, the Iceberg sort order can also be updated in an existing table.
+When you evolve a sort order, the old data written with an earlier order remains unchanged.
+Engines can always choose to write data in the latest sort order or unsorted when sorting is prohibitively expensive.
+
+Iceberg's Java table API provides the `replaceSortOrder` API to update the sort order.
+For example, the following code could be used to create a new sort order
+with the `id` column sorted in ascending order with nulls last,
+and the `category` column sorted in descending order with nulls first:
+
+```java
+import org.apache.iceberg.NullOrder;
+
+Table sampleTable = ...;
+sampleTable.replaceSortOrder()
+    .asc("id", NullOrder.NULLS_LAST)
+    .desc("category", NullOrder.NULLS_FIRST)
+    .commit();
+```
+
+Spark supports updating the sort order through its `ALTER TABLE` SQL statement; for details, see [Spark SQL](../spark-ddl/#alter-table--write-ordered-by).
diff --git a/docs/flink-actions.md b/docs/flink-actions.md
new file mode 100644
index 000000000000..32bc3b9df557
--- /dev/null
+++ b/docs/flink-actions.md
@@ -0,0 +1,43 @@
+---
+title: "Flink Actions"
+url: flink-actions
+aliases:
+  - "flink/flink-actions"
+menu:
+  main:
+    parent: Flink
+    identifier: flink_actions
+    weight: 500
+---
+
+
+## Rewrite files action
+
+Iceberg provides an API to rewrite small files into large files by submitting Flink batch jobs. The behavior of this Flink action is the same as Spark's [rewriteDataFiles](../maintenance/#compact-data-files).
+
+```java
+import org.apache.iceberg.Table;
+import org.apache.iceberg.actions.RewriteDataFilesActionResult;
+import org.apache.iceberg.flink.TableLoader;
+import org.apache.iceberg.flink.actions.Actions;
+
+TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path");
+Table table = tableLoader.loadTable();
+RewriteDataFilesActionResult result = Actions.forTable(table)
+    .rewriteDataFiles()
+    .execute();
+```
+
+For more details of the rewrite files action, please refer to [RewriteDataFilesAction](../../../javadoc/{{% icebergVersion %}}/org/apache/iceberg/flink/actions/RewriteDataFilesAction.html)
diff --git a/docs/flink-configuration.md b/docs/flink-configuration.md
new file mode 100644
index 000000000000..8cb1f799580b
--- /dev/null
+++ b/docs/flink-configuration.md
@@ -0,0 +1,167 @@
+---
+title: "Flink Configuration"
+url: flink-configuration
+aliases:
+  - "flink/flink-configuration"
+menu:
+  main:
+    parent: Flink
+    identifier: flink_configuration
+    weight: 600
+---
+
+
+# Flink Configuration
+
+## Catalog Configuration
+
+A catalog is created and named by executing the following query (replace `<catalog_name>` with your catalog name and
+`<config_key>`=`<config_value>` with catalog implementation config):
+
+```sql
+CREATE CATALOG <catalog_name> WITH (
+  'type'='iceberg',
+  `<config_key>`=`<config_value>`
+);
+```
+
+The following properties can be set globally and are not limited to a specific catalog implementation:
+
+| Property | Required | Values | Description |
+| ---------------------------- |----------| -------------------------- |-------------|
+| type | ✔️ | iceberg | Must be `iceberg`. |
+| catalog-type | | `hive`, `hadoop` or `rest` | `hive`, `hadoop` or `rest` for built-in catalogs, or left unset for custom catalog implementations using catalog-impl. |
+| catalog-impl | | | The fully-qualified class name of a custom catalog implementation. Must be set if `catalog-type` is unset. |
+| property-version | | | Version number to describe the property version. This property can be used for backwards compatibility in case the property format changes. The current property version is `1`. |
+| cache-enabled | | `true` or `false` | Whether to enable catalog cache. The default value is `true`. |
+| cache.expiration-interval-ms | | | How long catalog entries are locally cached, in milliseconds. Negative values such as `-1` disable expiration; `0` is not allowed. The default value is `-1`. |
+
+The following properties can be set if using the Hive catalog:
+
+| Property | Required | Values | Description |
+| --------------- |----------| ------ |-------------|
+| uri | ✔️ | | The Hive metastore's thrift URI. |
+| clients | | | The Hive metastore client pool size. The default value is 2. |
+| warehouse | | | The Hive warehouse location. Specify this path if you neither set `hive-conf-dir` to a location containing a `hive-site.xml` configuration file nor add a correct `hive-site.xml` to the classpath. |
+| hive-conf-dir | | | Path to a directory containing a `hive-site.xml` configuration file which will be used to provide custom Hive configuration values. If both `hive-conf-dir` and `warehouse` are set when creating the Iceberg catalog, the value of `hive.metastore.warehouse.dir` from the `hive-site.xml` in that directory (or from the Hive configuration file on the classpath) is overwritten with the `warehouse` value. |
+| hadoop-conf-dir | | | Path to a directory containing `core-site.xml` and `hdfs-site.xml` configuration files which will be used to provide custom Hadoop configuration values. |
+
+The following properties can be set if using the Hadoop catalog:
+
+| Property | Required | Values | Description |
+| --------- |-------------| ------ | ---------------------------------------------------------- |
+| warehouse | ✔️ | | The HDFS directory to store metadata files and data files. |
+
+The following properties can be set if using the REST catalog:
+
+| Property | Required | Values | Description |
+| ---------- |----------| ------ |-----------------------------------------------------------------------------|
+| uri | ✔️ | | The URL to the REST Catalog. |
+| credential | | | A credential to exchange for a token in the OAuth2 client credentials flow. |
+| token | | | A token which will be used to interact with the server. |
+
+## Runtime configuration
+
+### Read options
+
+Flink read options are passed when configuring the Flink IcebergSource:
+
+```
+IcebergSource.forRowData()
+    .tableLoader(TableLoader.fromCatalog(...))
+    .assignerFactory(new SimpleSplitAssignerFactory())
+    .streaming(true)
+    .streamingStartingStrategy(StreamingStartingStrategy.INCREMENTAL_FROM_SNAPSHOT_ID)
+    .startSnapshotId(3821550127947089987L)
+    .monitorInterval(Duration.ofSeconds(10)) // or .set("monitor-interval", "10s") or .set(FlinkReadOptions.MONITOR_INTERVAL, "10s")
+    .build()
+```
+
+For Flink SQL, read options can be passed in via SQL hints like this:
+
+```
+SELECT * FROM tableName /*+ OPTIONS('monitor-interval'='10s') */
+...
+```
+
+Options can also be passed in via the Flink configuration, in which case they apply to the current session. Note that not all options support this mode.
+
+```
+env.getConfig()
+    .getConfiguration()
+    .set(FlinkReadOptions.SPLIT_FILE_OPEN_COST_OPTION, 1000L);
+...
+```
+
+`Read option` has the highest priority, followed by `Flink configuration` and then `Table property`.
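The precedence rule above can be sketched as a simple lookup chain. This is an illustrative sketch only: `ReadOptionPrecedence`, `resolve`, and `splitSize` are hypothetical names that are not part of the Iceberg or Flink APIs, and the three sources are modeled as plain maps. The keys `split-size`, `connector.iceberg.split-size`, and `read.split.target-size` are the documented names for the split size option, whose documented default is 128 MB.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the documented precedence; not Iceberg's own implementation.
final class ReadOptionPrecedence {
  // Returns the first value found: read option, then Flink configuration, then table property.
  static String resolve(
      Map<String, String> readOptions, String readOptionKey,
      Map<String, String> flinkConf, String flinkConfKey,
      Map<String, String> tableProps, String tablePropKey,
      String defaultValue) {
    if (readOptionKey != null && readOptions.containsKey(readOptionKey)) {
      return readOptions.get(readOptionKey);
    }
    if (flinkConfKey != null && flinkConf.containsKey(flinkConfKey)) {
      return flinkConf.get(flinkConfKey);
    }
    if (tablePropKey != null && tableProps.containsKey(tablePropKey)) {
      return tableProps.get(tablePropKey);
    }
    return defaultValue;
  }

  // Resolves the split size in bytes, using 128 MB (134217728 bytes) as the fallback.
  static String splitSize(
      Map<String, String> readOptions,
      Map<String, String> flinkConf,
      Map<String, String> tableProps) {
    return resolve(
        readOptions, "split-size",
        flinkConf, "connector.iceberg.split-size",
        tableProps, "read.split.target-size",
        "134217728");
  }

  public static void main(String[] args) {
    Map<String, String> readOptions = new HashMap<>();
    Map<String, String> flinkConf = new HashMap<>();
    Map<String, String> tableProps = new HashMap<>();

    tableProps.put("read.split.target-size", "268435456");
    flinkConf.put("connector.iceberg.split-size", "67108864");
    // The Flink configuration value wins over the table property.
    System.out.println(splitSize(readOptions, flinkConf, tableProps));

    readOptions.put("split-size", "33554432");
    // The read option wins over both other sources.
    System.out.println(splitSize(readOptions, flinkConf, tableProps));
  }
}
```

Options marked `N/A` for a given source simply skip that step, which is why the keys are allowed to be `null` in this sketch.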
+
+| Read option | Flink configuration | Table property | Default | Description |
+|-------------------------------|-------------------------------------------------|------------------------------|----------------------------------|-------------|
+| snapshot-id | N/A | N/A | null | For time travel in batch mode. Read data from the specified snapshot-id. |
+| case-sensitive | connector.iceberg.case-sensitive | N/A | false | If true, match column names in a case-sensitive way. |
+| as-of-timestamp | N/A | N/A | null | For time travel in batch mode. Read data from the most recent snapshot as of the given time in milliseconds. |
+| starting-strategy | connector.iceberg.starting-strategy | N/A | INCREMENTAL_FROM_LATEST_SNAPSHOT | Starting strategy for streaming execution. TABLE_SCAN_THEN_INCREMENTAL: Do a regular table scan then switch to the incremental mode. The incremental mode starts from the current snapshot exclusive. INCREMENTAL_FROM_LATEST_SNAPSHOT: Start incremental mode from the latest snapshot inclusive. If the table is empty, all future append snapshots should be discovered. INCREMENTAL_FROM_EARLIEST_SNAPSHOT: Start incremental mode from the earliest snapshot inclusive. If the table is empty, all future append snapshots should be discovered. INCREMENTAL_FROM_SNAPSHOT_ID: Start incremental mode from a snapshot with a specific id inclusive. INCREMENTAL_FROM_SNAPSHOT_TIMESTAMP: Start incremental mode from a snapshot with a specific timestamp inclusive. If the timestamp is between two snapshots, it should start from the snapshot after the timestamp. Only for the FLIP-27 source. |
+| start-snapshot-timestamp | N/A | N/A | null | Start to read data from the most recent snapshot as of the given time in milliseconds. |
+| start-snapshot-id | N/A | N/A | null | Start to read data from the specified snapshot-id. |
+| end-snapshot-id | N/A | N/A | The latest snapshot id | Specifies the end snapshot. |
+| branch | N/A | N/A | main | Specifies the branch to read from in batch mode |
+| tag | N/A | N/A | null | Specifies the tag to read from in batch mode |
+| start-tag | N/A | N/A | null | Specifies the starting tag to read from for incremental reads |
+| end-tag | N/A | N/A | null | Specifies the ending tag to read from for incremental reads |
+| split-size | connector.iceberg.split-size | read.split.target-size | 128 MB | Target size when combining input splits. |
+| split-lookback | connector.iceberg.split-lookback | read.split.planning-lookback | 10 | Number of bins to consider when combining input splits. |
+| split-file-open-cost | connector.iceberg.split-file-open-cost | read.split.open-file-cost | 4MB | The estimated cost to open a file, used as a minimum weight when combining splits. |
+| streaming | connector.iceberg.streaming | N/A | false | Sets whether the current task runs in streaming or batch mode. |
+| monitor-interval | connector.iceberg.monitor-interval | N/A | 60s | Monitor interval to discover splits from new snapshots. Applicable only for streaming read. |
+| include-column-stats | connector.iceberg.include-column-stats | N/A | false | Create a new scan from this that loads the column stats with each data file. Column stats include: value count, null value count, lower bounds, and upper bounds. |
+| max-planning-snapshot-count | connector.iceberg.max-planning-snapshot-count | N/A | Integer.MAX_VALUE | Max number of snapshots limited per split enumeration. Applicable only to streaming read. |
+| limit | connector.iceberg.limit | N/A | -1 | Limited output number of rows. |
+| max-allowed-planning-failures | connector.iceberg.max-allowed-planning-failures | N/A | 3 | Max allowed consecutive failures for scan planning before failing the job. Set to -1 to never fail the job because of scan planning failures. |
+| watermark-column | connector.iceberg.watermark-column | N/A | null | Specifies the watermark column to use for watermark generation. If this option is present, the `splitAssignerFactory` will be overridden with `OrderedSplitAssignerFactory`. |
+| watermark-column-time-unit | connector.iceberg.watermark-column-time-unit | N/A | TimeUnit.MICROSECONDS | Specifies the watermark time unit to use for watermark generation. The possible values are DAYS, HOURS, MINUTES, SECONDS, MILLISECONDS, MICROSECONDS, NANOSECONDS. |
+
+### Write options
+
+Flink write options are passed when configuring the FlinkSink, like this:
+
+```
+FlinkSink.Builder builder = FlinkSink.forRow(dataStream, SimpleDataUtil.FLINK_SCHEMA)
+    .table(table)
+    .tableLoader(tableLoader)
+    .set("write-format", "orc")
+    .set(FlinkWriteOptions.OVERWRITE_MODE, "true");
+```
+
+For Flink SQL, write options can be passed in via SQL hints like this:
+
+```
+INSERT INTO tableName /*+ OPTIONS('upsert-enabled'='true') */
+...
+```
+
+| Flink option | Default | Description |
+| ---------------------- | ------------------------------------------ | ------------------------------------------------------------ |
+| write-format | Table write.format.default | File format to use for this write operation; parquet, avro, or orc |
+| target-file-size-bytes | As per table property | Overrides this table's write.target-file-size-bytes |
+| upsert-enabled | Table write.upsert.enabled | Overrides this table's write.upsert.enabled |
+| overwrite-enabled | false | Overwrite the table's data. Overwrite mode shouldn't be enabled when the sink is configured to use an UPSERT data stream. |
+| distribution-mode | Table write.distribution-mode | Overrides this table's write.distribution-mode |
+| compression-codec | Table write.(fileformat).compression-codec | Overrides this table's compression codec for this write |
+| compression-level | Table write.(fileformat).compression-level | Overrides this table's compression level for Parquet and Avro tables for this write |
+| compression-strategy | Table write.orc.compression-strategy | Overrides this table's compression strategy for ORC tables for this write |
+| write-parallelism | Upstream operator parallelism | Overrides the writer parallelism |
diff --git a/docs/flink-connector.md b/docs/flink-connector.md
new file mode 100644
index 000000000000..aaf27fff9bab
--- /dev/null
+++ b/docs/flink-connector.md
@@ -0,0 +1,149 @@
+---
+title: "Flink Connector"
+url: flink-connector
+aliases:
+  - "flink/flink-connector"
+menu:
+  main:
+    parent: Flink
+    identifier: flink_connector
+    weight: 200
+---
+
+
+# Flink Connector
+Apache Flink supports creating an Iceberg table directly in Flink SQL without explicitly creating a Flink catalog.
That means an Iceberg table can be created by specifying the `'connector'='iceberg'` table option in Flink SQL, similar to the usage described in the official Flink [document](https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/connectors/table/overview/). + +In Flink, the SQL `CREATE TABLE test (..) WITH ('connector'='iceberg', ...)` creates a Flink table in the current Flink catalog ([GenericInMemoryCatalog](https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/table/catalogs/#genericinmemorycatalog) by default), +which only maps to the underlying Iceberg table instead of maintaining the Iceberg table directly in the current Flink catalog. + +To create the table with the SQL syntax `CREATE TABLE test (..) WITH ('connector'='iceberg', ...)`, the Flink Iceberg connector provides the following table properties: + +* `connector`: Use the constant `iceberg`. +* `catalog-name`: User-specified catalog name. It's required because the connector doesn't have a default value. +* `catalog-type`: `hive` or `hadoop` for built-in catalogs (defaults to `hive`), or left unset for custom catalog implementations using `catalog-impl`. +* `catalog-impl`: The fully-qualified class name of a custom catalog implementation. Must be set if `catalog-type` is unset. See also [custom catalog](../flink/flink-getting-started.md#custom-catalog) for more details. +* `catalog-database`: The Iceberg database name in the backend catalog. Defaults to the current Flink database name. +* `catalog-table`: The Iceberg table name in the backend catalog. Defaults to the table name in the Flink `CREATE TABLE` statement. + +## Table managed in Hive catalog + +Before executing the following SQL, please make sure you've configured the Flink SQL client correctly according to the quick start [document](../flink).
+ +The following SQL will create a Flink table in the current Flink catalog, which maps to the iceberg table `default_database.flink_table` managed in iceberg catalog. + +```sql +CREATE TABLE flink_table ( + id BIGINT, + data STRING +) WITH ( + 'connector'='iceberg', + 'catalog-name'='hive_prod', + 'uri'='thrift://localhost:9083', + 'warehouse'='hdfs://nn:8020/path/to/warehouse' +); +``` + +If you want to create a Flink table mapping to a different iceberg table managed in Hive catalog (such as `hive_db.hive_iceberg_table` in Hive), then you can create Flink table as following: + +```sql +CREATE TABLE flink_table ( + id BIGINT, + data STRING +) WITH ( + 'connector'='iceberg', + 'catalog-name'='hive_prod', + 'catalog-database'='hive_db', + 'catalog-table'='hive_iceberg_table', + 'uri'='thrift://localhost:9083', + 'warehouse'='hdfs://nn:8020/path/to/warehouse' +); +``` + +{{< hint info >}} +The underlying catalog database (`hive_db` in the above example) will be created automatically if it does not exist when writing records into the Flink table. +{{< /hint >}} + +## Table managed in hadoop catalog + +The following SQL will create a Flink table in current Flink catalog, which maps to the iceberg table `default_database.flink_table` managed in hadoop catalog. + +```sql +CREATE TABLE flink_table ( + id BIGINT, + data STRING +) WITH ( + 'connector'='iceberg', + 'catalog-name'='hadoop_prod', + 'catalog-type'='hadoop', + 'warehouse'='hdfs://nn:8020/path/to/warehouse' +); +``` + +## Table managed in custom catalog + +The following SQL will create a Flink table in current Flink catalog, which maps to the iceberg table `default_database.flink_table` managed in +a custom catalog of type `com.my.custom.CatalogImpl`. 
+ +```sql +CREATE TABLE flink_table ( + id BIGINT, + data STRING +) WITH ( + 'connector'='iceberg', + 'catalog-name'='custom_prod', + 'catalog-impl'='com.my.custom.CatalogImpl', + -- More table properties for the customized catalog + 'my-additional-catalog-config'='my-value', + ... +); +``` + +Please check sections under the Integrations tab for all custom catalogs. + +## A complete example. + +Take the Hive catalog as an example: + +```sql +CREATE TABLE flink_table ( + id BIGINT, + data STRING +) WITH ( + 'connector'='iceberg', + 'catalog-name'='hive_prod', + 'uri'='thrift://localhost:9083', + 'warehouse'='file:///path/to/warehouse' +); + +INSERT INTO flink_table VALUES (1, 'AAA'), (2, 'BBB'), (3, 'CCC'); + +SET execution.result-mode=tableau; +SELECT * FROM flink_table; + ++----+------+ +| id | data | ++----+------+ +| 1 | AAA | +| 2 | BBB | +| 3 | CCC | ++----+------+ +3 rows in set +``` + +For more details, please refer to the Iceberg [Flink document](../flink). \ No newline at end of file diff --git a/docs/flink-ddl.md b/docs/flink-ddl.md new file mode 100644 index 000000000000..1ab550ec55f3 --- /dev/null +++ b/docs/flink-ddl.md @@ -0,0 +1,229 @@ +--- +title: "Flink DDL" +url: flink-ddl +aliases: + - "flink/flink-ddl" +menu: + main: + parent: Flink + identifier: flink_ddl + weight: 200 +--- + + +## DDL commands + +### `CREATE Catalog` + +#### Hive catalog + +This creates an Iceberg catalog named `hive_catalog` that can be configured using `'catalog-type'='hive'`, which loads tables from Hive metastore: + +```sql +CREATE CATALOG hive_catalog WITH ( + 'type'='iceberg', + 'catalog-type'='hive', + 'uri'='thrift://localhost:9083', + 'clients'='5', + 'property-version'='1', + 'warehouse'='hdfs://nn:8020/warehouse/path' +); +``` + +The following properties can be set if using the Hive catalog: + +* `uri`: The Hive metastore's thrift URI. (Required) +* `clients`: The Hive metastore client pool size, default value is 2. 
(Optional) +* `warehouse`: The Hive warehouse location, users should specify this path if neither set the `hive-conf-dir` to specify a location containing a `hive-site.xml` configuration file nor add a correct `hive-site.xml` to classpath. +* `hive-conf-dir`: Path to a directory containing a `hive-site.xml` configuration file which will be used to provide custom Hive configuration values. The value of `hive.metastore.warehouse.dir` from `/hive-site.xml` (or hive configure file from classpath) will be overwritten with the `warehouse` value if setting both `hive-conf-dir` and `warehouse` when creating iceberg catalog. +* `hadoop-conf-dir`: Path to a directory containing `core-site.xml` and `hdfs-site.xml` configuration files which will be used to provide custom Hadoop configuration values. + +#### Hadoop catalog + +Iceberg also supports a directory-based catalog in HDFS that can be configured using `'catalog-type'='hadoop'`: + +```sql +CREATE CATALOG hadoop_catalog WITH ( + 'type'='iceberg', + 'catalog-type'='hadoop', + 'warehouse'='hdfs://nn:8020/warehouse/path', + 'property-version'='1' +); +``` + +The following properties can be set if using the Hadoop catalog: + +* `warehouse`: The HDFS directory to store metadata files and data files. (Required) + +Execute the sql command `USE CATALOG hadoop_catalog` to set the current catalog. 
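The `CREATE CATALOG` statements above all follow the same shape: a catalog name, `'type'='iceberg'`, and a list of quoted key/value properties. As a minimal sketch, assuming you generate such statements from a script (for example before passing them to PyFlink's `execute_sql`), the helper below renders one from a dictionary. `render_create_catalog` is a hypothetical name used only for illustration; it is not part of Iceberg or Flink.

```python
def render_create_catalog(name, properties):
    """Render a Flink SQL CREATE CATALOG statement for an Iceberg catalog.

    Hypothetical helper for illustration only; not an Iceberg or Flink API.
    The statement is returned without a trailing semicolon.
    """
    # 'type'='iceberg' is always required, so it is placed first.
    props = {"type": "iceberg", **properties}
    # Single quotes inside a SQL string literal are escaped by doubling them.
    rendered = ",\n".join(
        "  '{}'='{}'".format(key, str(value).replace("'", "''"))
        for key, value in props.items()
    )
    return "CREATE CATALOG {} WITH (\n{}\n)".format(name, rendered)


# Same properties as the Hadoop catalog example in this section.
hadoop_catalog_ddl = render_create_catalog(
    "hadoop_catalog",
    {
        "catalog-type": "hadoop",
        "warehouse": "hdfs://nn:8020/warehouse/path",
        "property-version": "1",
    },
)
print(hadoop_catalog_ddl)
```

When the statement goes into a SQL file for the SQL client, terminate it with `;` and follow it with `USE CATALOG hadoop_catalog;` as described above.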
+ +#### REST catalog + +This creates an iceberg catalog named `rest_catalog` that can be configured using `'catalog-type'='rest'`, which loads tables from a REST catalog: + +```sql +CREATE CATALOG rest_catalog WITH ( + 'type'='iceberg', + 'catalog-type'='rest', + 'uri'='https://localhost/' +); +``` + +The following properties can be set if using the REST catalog: + +* `uri`: The URL to the REST Catalog (Required) +* `credential`: A credential to exchange for a token in the OAuth2 client credentials flow (Optional) +* `token`: A token which will be used to interact with the server (Optional) + +#### Custom catalog + +Flink also supports loading a custom Iceberg `Catalog` implementation by specifying the `catalog-impl` property: + +```sql +CREATE CATALOG my_catalog WITH ( + 'type'='iceberg', + 'catalog-impl'='com.my.custom.CatalogImpl', + 'my-additional-catalog-config'='my-value' +); +``` + +#### Create through YAML config + +Catalogs can be registered in `sql-client-defaults.yaml` before starting the SQL client. + +```yaml +catalogs: + - name: my_catalog + type: iceberg + catalog-type: hadoop + warehouse: hdfs://nn:8020/warehouse/path +``` + +#### Create through SQL Files + +The Flink SQL Client supports the `-i` startup option to execute an initialization SQL file to set up environment when starting up the SQL Client. + +```sql +-- define available catalogs +CREATE CATALOG hive_catalog WITH ( + 'type'='iceberg', + 'catalog-type'='hive', + 'uri'='thrift://localhost:9083', + 'warehouse'='hdfs://nn:8020/warehouse/path' +); + +USE CATALOG hive_catalog; +``` + +Using `-i ` option to initialize SQL Client session: + +```bash +/path/to/bin/sql-client.sh -i /path/to/init.sql +``` + +### `CREATE DATABASE` + +By default, Iceberg will use the `default` database in Flink. 
Use the following example to create a separate database and avoid creating tables under the `default` database: + +```sql +CREATE DATABASE iceberg_db; +USE iceberg_db; +``` + +### `CREATE TABLE` + +```sql +CREATE TABLE `hive_catalog`.`default`.`sample` ( + id BIGINT COMMENT 'unique id', + data STRING NOT NULL +) WITH ('format-version'='2'); +``` + +Table create commands support the commonly used [Flink create clauses](https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/create/) including: + +* `PARTITIONED BY (column1, column2, ...)` to configure partitioning. Flink does not yet support hidden partitioning. +* `COMMENT 'table document'` to set a table description. +* `WITH ('key'='value', ...)` to set [table configuration](../configuration) which will be stored in Iceberg table properties. + +Computed columns and watermark definitions are not currently supported. + +#### `PRIMARY KEY` + +A primary key constraint can be declared on a column or a set of columns, which must be unique and must not contain nulls. +It's required for [`UPSERT` mode](../flink-writes/#upsert). + +```sql +CREATE TABLE `hive_catalog`.`default`.`sample` ( + id BIGINT COMMENT 'unique id', + data STRING NOT NULL, + PRIMARY KEY(`id`) NOT ENFORCED +) WITH ('format-version'='2'); +``` + +#### `PARTITIONED BY` + +To create a partitioned table, use `PARTITIONED BY`: + +```sql +CREATE TABLE `hive_catalog`.`default`.`sample` ( + id BIGINT COMMENT 'unique id', + data STRING NOT NULL +) +PARTITIONED BY (data) +WITH ('format-version'='2'); +``` + +Iceberg supports hidden partitioning, but Flink doesn't support partitioning by a function on columns, so there is no way to express hidden partitions in the Flink DDL. + +### `CREATE TABLE LIKE` + +To create a table with the same schema, partitioning, and table properties as another table, use `CREATE TABLE LIKE`.
+ +```sql +CREATE TABLE `hive_catalog`.`default`.`sample` ( + id BIGINT COMMENT 'unique id', + data STRING +); + +CREATE TABLE `hive_catalog`.`default`.`sample_like` LIKE `hive_catalog`.`default`.`sample`; +``` + +For more details, refer to the [Flink `CREATE TABLE` documentation](https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/dev/table/sql/create/). + + +### `ALTER TABLE` + +Iceberg only supports altering table properties: + +```sql +ALTER TABLE `hive_catalog`.`default`.`sample` SET ('write.format.default'='avro'); +``` + +### `ALTER TABLE .. RENAME TO` + +```sql +ALTER TABLE `hive_catalog`.`default`.`sample` RENAME TO `hive_catalog`.`default`.`new_sample`; +``` + +### `DROP TABLE` + +To delete a table, run: + +```sql +DROP TABLE `hive_catalog`.`default`.`sample`; +``` diff --git a/docs/flink-getting-started.md b/docs/flink-getting-started.md new file mode 100644 index 000000000000..c5dbf594b55e --- /dev/null +++ b/docs/flink-getting-started.md @@ -0,0 +1,407 @@ +--- +title: "Flink Getting Started" +url: flink +aliases: + - "flink/flink" +menu: + main: + parent: Flink + identifier: flink_getting_started + weight: 100 +--- + + +# Flink + +Apache Iceberg supports both [Apache Flink](https://flink.apache.org/)'s DataStream API and Table API. See the [Multi-Engine Support#apache-flink](https://iceberg.apache.org/multi-engine-support/#apache-flink) page for the integration of Apache Flink.
+ +| Feature support | Flink | Notes | +| ----------------------------------------------------------- |-------|----------------------------------------------------------------------------------------| +| [SQL create catalog](#creating-catalogs-and-using-catalogs) | ✔️ | | +| [SQL create database](#create-database) | ✔️ | | +| [SQL create table](#create-table) | ✔️ | | +| [SQL create table like](#create-table-like) | ✔️ | | +| [SQL alter table](#alter-table) | ✔️ | Only support altering table properties, column and partition changes are not supported | +| [SQL drop_table](#drop-table) | ✔️ | | +| [SQL select](#querying-with-sql) | ✔️ | Support both streaming and batch mode | +| [SQL insert into](#insert-into) | ✔️ ️ | Support both streaming and batch mode | +| [SQL insert overwrite](#insert-overwrite) | ✔️ ️ | | +| [DataStream read](#reading-with-datastream) | ✔️ ️ | | +| [DataStream append](#appending-data) | ✔️ ️ | | +| [DataStream overwrite](#overwrite-data) | ✔️ ️ | | +| [Metadata tables](#inspecting-tables) | ✔️ | | +| [Rewrite files action](#rewrite-files-action) | ✔️ ️ | | + +## Preparation when using Flink SQL Client + +To create Iceberg table in Flink, it is recommended to use [Flink SQL Client](https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/sqlClient.html) as it's easier for users to understand the concepts. + +Download Flink from the [Apache download page](https://flink.apache.org/downloads.html). Iceberg uses Scala 2.12 when compiling the Apache `iceberg-flink-runtime` jar, so it's recommended to use Flink 1.16 bundled with Scala 2.12. 
+ +```bash +FLINK_VERSION=1.16.2 +SCALA_VERSION=2.12 +APACHE_FLINK_URL=https://archive.apache.org/dist/flink/ +wget ${APACHE_FLINK_URL}/flink-${FLINK_VERSION}/flink-${FLINK_VERSION}-bin-scala_${SCALA_VERSION}.tgz +tar xzvf flink-${FLINK_VERSION}-bin-scala_${SCALA_VERSION}.tgz +``` + +Start a standalone Flink cluster within a Hadoop environment: + +```bash +# HADOOP_HOME is your hadoop root directory after unpacking the binary package. +APACHE_HADOOP_URL=https://archive.apache.org/dist/hadoop/ +HADOOP_VERSION=2.8.5 +wget ${APACHE_HADOOP_URL}/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz +tar xzvf hadoop-${HADOOP_VERSION}.tar.gz +HADOOP_HOME=`pwd`/hadoop-${HADOOP_VERSION} + +export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath` + +# Start the flink standalone cluster +./bin/start-cluster.sh +``` + +Start the Flink SQL client. There is a separate `flink-runtime` module in the Iceberg project to generate a bundled jar, which can be loaded by the Flink SQL client directly. To build the `flink-runtime` bundled jar manually, build the `iceberg` project, and it will generate the jar under `/flink-runtime/build/libs`. Alternatively, download the `flink-runtime` jar from the [Apache repository](https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-flink-runtime-1.16/{{% icebergVersion %}}/). + +```bash +# HADOOP_HOME is your hadoop root directory after unpacking the binary package. +export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath` + +# The -j option below works for Flink 1.15 or earlier +./bin/sql-client.sh embedded -j /iceberg-flink-runtime-1.15-{{% icebergVersion %}}.jar shell + +# Flink 1.16 or above has a regression in loading external jars via the -j option. See FLINK-30035 for details. +# Instead, put iceberg-flink-runtime-1.16-{{% icebergVersion %}}.jar in the flink/lib directory, then start the client: +./bin/sql-client.sh embedded shell +``` + +By default, Iceberg ships with Hadoop jars for Hadoop catalog. To use Hive catalog, load the Hive jars when opening the Flink SQL client.
Fortunately, Flink provides a [bundled hive jar](https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-hive-2.3.9_2.12/1.16.2/flink-sql-connector-hive-2.3.9_2.12-1.16.2.jar) for the SQL client. The following example downloads the dependencies and starts the SQL client: + +```bash +# HADOOP_HOME is your hadoop root directory after unpacking the binary package. +export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath` + +ICEBERG_VERSION={{% icebergVersion %}} +# Assumed value: matches the iceberg-flink-runtime-1.16 jar used elsewhere on this page. +FLINK_VERSION_MAJOR=1.16 +MAVEN_URL=https://repo1.maven.org/maven2 +ICEBERG_MAVEN_URL=${MAVEN_URL}/org/apache/iceberg +ICEBERG_PACKAGE=iceberg-flink-runtime +wget ${ICEBERG_MAVEN_URL}/${ICEBERG_PACKAGE}-${FLINK_VERSION_MAJOR}/${ICEBERG_VERSION}/${ICEBERG_PACKAGE}-${FLINK_VERSION_MAJOR}-${ICEBERG_VERSION}.jar -P lib/ + +HIVE_VERSION=2.3.9 +SCALA_VERSION=2.12 +FLINK_VERSION=1.16.2 +FLINK_CONNECTOR_URL=${MAVEN_URL}/org/apache/flink +FLINK_CONNECTOR_PACKAGE=flink-sql-connector-hive +wget ${FLINK_CONNECTOR_URL}/${FLINK_CONNECTOR_PACKAGE}-${HIVE_VERSION}_${SCALA_VERSION}/${FLINK_VERSION}/${FLINK_CONNECTOR_PACKAGE}-${HIVE_VERSION}_${SCALA_VERSION}-${FLINK_VERSION}.jar + +./bin/sql-client.sh embedded shell +``` + +## Flink's Python API + +{{< hint info >}} +PyFlink 1.6.1 [does not work on OSX with a M1 cpu](https://issues.apache.org/jira/browse/FLINK-28786) +{{< /hint >}} + +Install the Apache Flink dependency using `pip`: + +```bash +pip install apache-flink==1.16.2 +``` + +Provide a `file://` path to the `iceberg-flink-runtime` jar, which can be obtained by building the project and looking at `/flink-runtime/build/libs`, or downloading it from the [Apache official repository](https://repo.maven.apache.org/maven2/org/apache/iceberg/iceberg-flink-runtime/).
Third-party jars can be added to `pyflink` via: + +- `env.add_jars("file:///my/jar/path/connector.jar")` +- `table_env.get_config().get_configuration().set_string("pipeline.jars", "file:///my/jar/path/connector.jar")` + +This is also mentioned in the official [docs](https://ci.apache.org/projects/flink/flink-docs-release-1.16/docs/dev/python/dependency_management/). The example below uses `env.add_jars(..)`: + +```python +import os + +from pyflink.datastream import StreamExecutionEnvironment + +env = StreamExecutionEnvironment.get_execution_environment() +iceberg_flink_runtime_jar = os.path.join(os.getcwd(), "iceberg-flink-runtime-1.16-{{% icebergVersion %}}.jar") + +env.add_jars("file://{}".format(iceberg_flink_runtime_jar)) +``` + +Next, create a `StreamTableEnvironment` and execute Flink SQL statements. The below example shows how to create a custom catalog via the Python Table API: + +```python +from pyflink.table import StreamTableEnvironment +table_env = StreamTableEnvironment.create(env) +table_env.execute_sql(""" +CREATE CATALOG my_catalog WITH ( + 'type'='iceberg', + 'catalog-impl'='com.my.custom.CatalogImpl', + 'my-additional-catalog-config'='my-value' +) +""") +``` + +Run a query: + +```python +(table_env + .sql_query("SELECT PULocationID, DOLocationID, passenger_count FROM my_catalog.nyc.taxis LIMIT 5") + .execute() + .print()) +``` + +``` ++----+----------------------+----------------------+--------------------------------+ +| op | PULocationID | DOLocationID | passenger_count | ++----+----------------------+----------------------+--------------------------------+ +| +I | 249 | 48 | 1.0 | +| +I | 132 | 233 | 1.0 | +| +I | 164 | 107 | 1.0 | +| +I | 90 | 229 | 1.0 | +| +I | 137 | 249 | 1.0 | ++----+----------------------+----------------------+--------------------------------+ +5 rows in set +``` + +For more details, please refer to the [Python Table 
API](https://ci.apache.org/projects/flink/flink-docs-release-1.16/docs/dev/python/table/intro_to_table_api/). + +## Adding catalogs + +Flink supports creating catalogs using Flink SQL. + +### Catalog Configuration + +A catalog is created and named by executing the following query (replace `<catalog_name>` with your catalog name and +`<config_key>`=`<config_value>` with catalog implementation config): + +```sql +CREATE CATALOG <catalog_name> WITH ( + 'type'='iceberg', + `<config_key>`=`<config_value>` +); +``` + +The following properties can be set globally and are not limited to a specific catalog implementation: + +* `type`: Must be `iceberg`. (Required) +* `catalog-type`: `hive`, `hadoop` or `rest` for built-in catalogs, or left unset for custom catalog implementations using `catalog-impl`. (Optional) +* `catalog-impl`: The fully-qualified class name of a custom catalog implementation. Must be set if `catalog-type` is unset. (Optional) +* `property-version`: Version number to describe the property version. This property can be used for backwards compatibility in case the property format changes. The current property version is `1`. (Optional) +* `cache-enabled`: Whether to enable the catalog cache. The default value is `true`. (Optional) +* `cache.expiration-interval-ms`: How long catalog entries are locally cached, in milliseconds. Negative values such as `-1` disable expiration, and a value of `0` is not allowed. The default value is `-1`. (Optional) + +### Hive catalog + +This creates an Iceberg catalog named `hive_catalog` that can be configured using `'catalog-type'='hive'`, which loads tables from Hive metastore: + +```sql +CREATE CATALOG hive_catalog WITH ( + 'type'='iceberg', + 'catalog-type'='hive', + 'uri'='thrift://localhost:9083', + 'clients'='5', + 'property-version'='1', + 'warehouse'='hdfs://nn:8020/warehouse/path' +); +``` + +The following properties can be set if using the Hive catalog: + +* `uri`: The Hive metastore's thrift URI. (Required) +* `clients`: The Hive metastore client pool size, default value is 2.
(Optional) +* `warehouse`: The Hive warehouse location, users should specify this path if neither set the `hive-conf-dir` to specify a location containing a `hive-site.xml` configuration file nor add a correct `hive-site.xml` to classpath. +* `hive-conf-dir`: Path to a directory containing a `hive-site.xml` configuration file which will be used to provide custom Hive configuration values. The value of `hive.metastore.warehouse.dir` from `/hive-site.xml` (or hive configure file from classpath) will be overwritten with the `warehouse` value if setting both `hive-conf-dir` and `warehouse` when creating iceberg catalog. +* `hadoop-conf-dir`: Path to a directory containing `core-site.xml` and `hdfs-site.xml` configuration files which will be used to provide custom Hadoop configuration values. + +## Creating a table + +```sql +CREATE TABLE `hive_catalog`.`default`.`sample` ( + id BIGINT COMMENT 'unique id', + data STRING +); +``` + +## Writing + +To append new data to a table with a Flink streaming job, use `INSERT INTO`: + +```sql +INSERT INTO `hive_catalog`.`default`.`sample` VALUES (1, 'a'); +INSERT INTO `hive_catalog`.`default`.`sample` SELECT id, data from other_kafka_table; +``` + +To replace data in the table with the result of a query, use `INSERT OVERWRITE` in batch job (flink streaming job does not support `INSERT OVERWRITE`). Overwrites are atomic operations for Iceberg tables. + +Partitions that have rows produced by the SELECT query will be replaced, for example: + +```sql +INSERT OVERWRITE `hive_catalog`.`default`.`sample` VALUES (1, 'a'); +``` + +Iceberg also support overwriting given partitions by the `select` values: + +```sql +INSERT OVERWRITE `hive_catalog`.`default`.`sample` PARTITION(data='a') SELECT 6; +``` + +Flink supports writing `DataStream` and `DataStream` to the sink iceberg table natively. + +```java +StreamExecutionEnvironment env = ...; + +DataStream input = ... 
; +Configuration hadoopConf = new Configuration(); +TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path", hadoopConf); + +FlinkSink.forRowData(input) + .tableLoader(tableLoader) + .append(); + +env.execute("Test Iceberg DataStream"); +``` + +### Branch Writes +Writing to branches in Iceberg tables is also supported via the `toBranch` API in `FlinkSink` +For more information on branches please refer to [branches](../tables/branching). +```java +FlinkSink.forRowData(input) + .tableLoader(tableLoader) + .toBranch("audit-branch") + .append(); +``` + +## Reading + +Submit a Flink __batch__ job using the following sentences: + +```sql +-- Execute the flink job in batch mode for current session context +SET execution.runtime-mode = batch; +SELECT * FROM `hive_catalog`.`default`.`sample`; +``` + +Iceberg supports processing incremental data in flink __streaming__ jobs which starts from a historical snapshot-id: + +```sql +-- Submit the flink job in streaming mode for current session. +SET execution.runtime-mode = streaming; + +-- Enable this switch because streaming read SQL will provide few job options in flink SQL hint options. +SET table.dynamic-table-options.enabled=true; + +-- Read all the records from the iceberg current snapshot, and then read incremental data starting from that snapshot. +SELECT * FROM `hive_catalog`.`default`.`sample` /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s')*/ ; + +-- Read all incremental data starting from the snapshot-id '3821550127947089987' (records from this snapshot will be excluded). +SELECT * FROM `hive_catalog`.`default`.`sample` /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s', 'start-snapshot-id'='3821550127947089987')*/ ; +``` + +SQL is also the recommended way to inspect tables. 
To view all of the snapshots in a table, use the snapshots metadata table: + +```sql +SELECT * FROM `hive_catalog`.`default`.`sample`.`snapshots` +``` + +Iceberg support streaming or batch read in Java API: + +``` +DataStream batch = FlinkSource.forRowData() + .env(env) + .tableLoader(tableLoader) + .streaming(false) + .build(); +``` + + + + +## Type conversion + +Iceberg's integration for Flink automatically converts between Flink and Iceberg types. When writing to a table with types that are not supported by Flink, like UUID, Iceberg will accept and convert values from the Flink type. + +### Flink to Iceberg + +Flink types are converted to Iceberg types according to the following table: + +| Flink | Iceberg | Notes | +| ------------------- | -------------------------- | ------------- | +| boolean | boolean | | +| tinyint | integer | | +| smallint | integer | | +| integer | integer | | +| bigint | long | | +| float | float | | +| double | double | | +| char | string | | +| varchar | string | | +| string | string | | +| binary | binary | | +| varbinary | fixed | | +| decimal | decimal | | +| date | date | | +| time | time | | +| timestamp | timestamp without timezone | | +| timestamp_ltz | timestamp with timezone | | +| array | list | | +| map | map | | +| multiset | map | | +| row | struct | | +| raw | | Not supported | +| interval | | Not supported | +| structured | | Not supported | +| timestamp with zone | | Not supported | +| distinct | | Not supported | +| null | | Not supported | +| symbol | | Not supported | +| logical | | Not supported | + +### Iceberg to Flink + +Iceberg types are converted to Flink types according to the following table: + +| Iceberg | Flink | +| -------------------------- | --------------------- | +| boolean | boolean | +| struct | row | +| list | array | +| map | map | +| integer | integer | +| long | bigint | +| float | float | +| double | double | +| date | date | +| time | time | +| timestamp without timezone | timestamp(6) | +| 
timestamp with timezone | timestamp_ltz(6) | +| string | varchar(2147483647) | +| uuid | binary(16) | +| fixed(N) | binary(N) | +| binary | varbinary(2147483647) | +| decimal(P, S) | decimal(P, S) | + +## Future improvements + +The following features are not yet supported in the current Flink Iceberg integration: + +* Creating an Iceberg table with hidden partitioning. See the [discussion](http://mail-archives.apache.org/mod_mbox/flink-dev/202008.mbox/%3cCABi+2jQCo3MsOa4+ywaxV5J-Z8TGKNZDX-pQLYB-dG+dVUMiMw@mail.gmail.com%3e) on the Flink mailing list. +* Creating an Iceberg table with computed columns. +* Creating an Iceberg table with a watermark. +* Adding, removing, renaming, or changing columns. [FLINK-19062](https://issues.apache.org/jira/browse/FLINK-19062) is tracking this. +* \ No newline at end of file diff --git a/docs/flink-queries.md b/docs/flink-queries.md new file mode 100644 index 000000000000..cf68fa367c21 --- /dev/null +++ b/docs/flink-queries.md @@ -0,0 +1,566 @@ +--- +title: "Flink Queries" +url: flink-queries +aliases: + - "flink/flink-queries" +menu: + main: + parent: Flink + identifier: flink_queries + weight: 300 +--- + + +# Flink Queries + +Iceberg supports streaming and batch reads with [Apache Flink](https://flink.apache.org/)'s DataStream API and Table API. + +## Reading with SQL + +Iceberg supports both streaming and batch reads in Flink.
Execute the following sql command to switch execution mode from `streaming` to `batch`, and vice versa: + +```sql +-- Execute the flink job in streaming mode for current session context +SET execution.runtime-mode = streaming; + +-- Execute the flink job in batch mode for current session context +SET execution.runtime-mode = batch; +``` + +### Flink batch read + +Submit a Flink __batch__ job using the following sentences: + +```sql +-- Execute the flink job in batch mode for current session context +SET execution.runtime-mode = batch; +SELECT * FROM sample; +``` + +### Flink streaming read + +Iceberg supports processing incremental data in Flink streaming jobs which starts from a historical snapshot-id: + +```sql +-- Submit the flink job in streaming mode for current session. +SET execution.runtime-mode = streaming; + +-- Enable this switch because streaming read SQL will provide few job options in flink SQL hint options. +SET table.dynamic-table-options.enabled=true; + +-- Read all the records from the iceberg current snapshot, and then read incremental data starting from that snapshot. +SELECT * FROM sample /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s')*/ ; + +-- Read all incremental data starting from the snapshot-id '3821550127947089987' (records from this snapshot will be excluded). +SELECT * FROM sample /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s', 'start-snapshot-id'='3821550127947089987')*/ ; +``` + +There are some options that could be set in Flink SQL hint options for streaming job, see [read options](#Read-options) for details. + +### FLIP-27 source for SQL + +Here are the SQL settings for the [FLIP-27](https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface) source. All other SQL settings and options documented above are applicable to the FLIP-27 source. + +```sql +-- Opt in the FLIP-27 source. Default is false. 
+SET table.exec.iceberg.use-flip27-source = true; +``` + +### Reading branches and tags with SQL +Branches and tags can be read via SQL by specifying options. For more details +refer to [Flink Configuration](../flink-configuration/#read-options). + +```sql +--- Read from branch b1 +SELECT * FROM table /*+ OPTIONS('branch'='b1') */ ; + +--- Read from tag t1 +SELECT * FROM table /*+ OPTIONS('tag'='t1') */; + +--- Incremental scan from tag t1 to tag t2 +SELECT * FROM table /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s', 'start-tag'='t1', 'end-tag'='t2') */; +``` + +## Reading with DataStream + +Iceberg supports streaming and batch reads in the Java API. + +### Batch Read + +This example reads all records from an Iceberg table and prints them to stdout in a Flink batch job: + +```java +StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment(); +TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path"); +DataStream<RowData> batch = FlinkSource.forRowData() + .env(env) + .tableLoader(tableLoader) + .streaming(false) + .build(); + +// Print all records to stdout. +batch.print(); + +// Submit and execute this batch read job. +env.execute("Test Iceberg Batch Read"); +``` + +### Streaming read + +This example reads incremental records starting from snapshot-id '3821550127947089987' and prints them to stdout in a Flink streaming job: + +```java +StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment(); +TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path"); +DataStream<RowData> stream = FlinkSource.forRowData() + .env(env) + .tableLoader(tableLoader) + .streaming(true) + .startSnapshotId(3821550127947089987L) + .build(); + +// Print all records to stdout. +stream.print(); + +// Submit and execute this streaming read job. 
+env.execute("Test Iceberg Streaming Read"); +``` + +Other options can be set; see [FlinkSource#Builder](../../../javadoc/{{% icebergVersion %}}/org/apache/iceberg/flink/source/FlinkSource.html). + +## Reading with DataStream (FLIP-27 source) + +[FLIP-27 source interface](https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface) +was introduced in Flink 1.12. It aims to solve several shortcomings of the old `SourceFunction` +streaming source interface. It also unifies the source interfaces for both batch and streaming executions. +Most source connectors (like Kafka, file) in Flink repo have migrated to the FLIP-27 interface. +Flink is planning to deprecate the old `SourceFunction` interface in the near future. + +A FLIP-27 based Flink `IcebergSource` is added in the `iceberg-flink` module. The FLIP-27 `IcebergSource` is currently an experimental feature. + +### Batch Read + +This example reads all records from an Iceberg table and prints them to stdout in a Flink batch job: + +```java +StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment(); +TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path"); + +IcebergSource<RowData> source = IcebergSource.forRowData() + .tableLoader(tableLoader) + .assignerFactory(new SimpleSplitAssignerFactory()) + .build(); + +DataStream<RowData> batch = env.fromSource( + source, + WatermarkStrategy.noWatermarks(), + "My Iceberg Source", + TypeInformation.of(RowData.class)); + +// Print all records to stdout. +batch.print(); + +// Submit and execute this batch read job. +env.execute("Test Iceberg Batch Read"); +``` + +### Streaming read + +This example starts the streaming read from the latest table snapshot (inclusive). +Every 60s, it polls the Iceberg table to discover new append-only snapshots. +CDC read is not supported yet. 
+ +```java +StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment(); +TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path"); + +IcebergSource<RowData> source = IcebergSource.forRowData() + .tableLoader(tableLoader) + .assignerFactory(new SimpleSplitAssignerFactory()) + .streaming(true) + .streamingStartingStrategy(StreamingStartingStrategy.INCREMENTAL_FROM_LATEST_SNAPSHOT) + .monitorInterval(Duration.ofSeconds(60)) + .build(); + +DataStream<RowData> stream = env.fromSource( + source, + WatermarkStrategy.noWatermarks(), + "My Iceberg Source", + TypeInformation.of(RowData.class)); + +// Print all records to stdout. +stream.print(); + +// Submit and execute this streaming read job. +env.execute("Test Iceberg Streaming Read"); +``` + +Other options can be set through the Java API; see +[IcebergSource#Builder](../../../javadoc/{{% icebergVersion %}}/org/apache/iceberg/flink/source/IcebergSource.html). + +### Reading branches and tags with DataStream +Branches and tags can also be read via the DataStream API. + +```java +StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment(); +TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path"); +// Read from branch +DataStream<RowData> branchBatch = FlinkSource.forRowData() + .env(env) + .tableLoader(tableLoader) + .branch("test-branch") + .streaming(false) + .build(); + +// Read from tag +DataStream<RowData> tagBatch = FlinkSource.forRowData() + .env(env) + .tableLoader(tableLoader) + .tag("test-tag") + .streaming(false) + .build(); + +// Streaming read from start-tag +DataStream<RowData> tagStream = FlinkSource.forRowData() + .env(env) + .tableLoader(tableLoader) + .streaming(true) + .startTag("test-tag") + .build(); +``` + +### Read as Avro GenericRecord + +FLIP-27 Iceberg source provides `AvroGenericRecordReaderFunction` that converts +Flink `RowData` to Avro `GenericRecord`. 
You can use the converter to read from +an Iceberg table as an Avro GenericRecord DataStream. + +Please make sure the `flink-avro` jar is included in the classpath. +Also, the `iceberg-flink-runtime` shaded bundle jar can't be used +because the runtime jar shades the avro package. +Please use the non-shaded `iceberg-flink` jar instead. + +```java +TableLoader tableLoader = ...; +Table table; +try (TableLoader loader = tableLoader) { + loader.open(); + table = loader.loadTable(); +} + +AvroGenericRecordReaderFunction readerFunction = AvroGenericRecordReaderFunction.fromTable(table); + +IcebergSource<GenericRecord> source = + IcebergSource.<GenericRecord>builder() + .tableLoader(tableLoader) + .readerFunction(readerFunction) + .assignerFactory(new SimpleSplitAssignerFactory()) + ... + .build(); + +DataStream<GenericRecord> stream = env.fromSource(source, WatermarkStrategy.noWatermarks(), + "Iceberg Source as Avro GenericRecord", new GenericRecordAvroTypeInfo(avroSchema)); +``` + +### Emitting watermarks +Emitting watermarks from the source itself could be beneficial for several purposes, like harnessing the +[Flink Watermark Alignment](https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/event-time/generating_watermarks/#watermark-alignment), +or preventing [windows](https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/windows/) +from triggering too early when reading multiple data files concurrently. + +Enable watermark generation for an `IcebergSource` by setting the `watermarkColumn`. +The supported column types are `timestamp`, `timestamptz` and `long`. +Iceberg `timestamp` and `timestamptz` columns inherently carry the time precision, so there is no need +to specify the time unit. A `long` column doesn't carry time unit information, so use +`watermarkTimeUnit` to configure the conversion for long columns. 
+ +The watermarks are generated based on column metrics stored for data files and emitted once per split. 
+If multiple smaller files with different time ranges are combined into a single split, it can increase +the out-of-orderliness and extra data buffering in the Flink state. The main purpose of watermark alignment +is to reduce out-of-orderliness and excess data buffering in the Flink state. Hence it is recommended to +set `read.split.open-file-cost` to a very large value to prevent combining multiple smaller files into a +single split. The negative impact (of not combining small files into a single split) is on read throughput, +especially if there are many small files. In typical stateful processing jobs, source read throughput is not +the bottleneck. Hence this is probably a reasonable tradeoff. + +This feature requires column-level min-max stats. Make sure stats are generated for the watermark column +during write phase. By default, the column metrics are collected for the first 100 columns of the table. +If watermark column doesn't have stats enabled by default, use +[write properties](configuration.md#write-properties) starting with `write.metadata.metrics` when needed. + +The following example could be useful if watermarks are used for windowing. 
The source reads Iceberg data files +in order, using a timestamp column and emits watermarks: +```java +StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment(); +TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path"); + +DataStream stream = + env.fromSource( + IcebergSource.forRowData() + .tableLoader(tableLoader) + // Watermark using timestamp column + .watermarkColumn("timestamp_column") + .build(), + // Watermarks are generated by the source, no need to generate it manually + WatermarkStrategy.noWatermarks() + // Extract event timestamp from records + .withTimestampAssigner((record, eventTime) -> record.getTimestamp(pos, precision).getMillisecond()), + SOURCE_NAME, + TypeInformation.of(RowData.class)); +``` + +Example for reading Iceberg table using a long event column for watermark alignment: +```java +StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment(); +TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path"); + +DataStream stream = + env.fromSource( + IcebergSource source = IcebergSource.forRowData() + .tableLoader(tableLoader) + // Disable combining multiple files to a single split + .set(FlinkReadOptions.SPLIT_FILE_OPEN_COST, String.valueOf(TableProperties.SPLIT_SIZE_DEFAULT)) + // Watermark using long column + .watermarkColumn("long_column") + .watermarkTimeUnit(TimeUnit.MILLI_SCALE) + .build(), + // Watermarks are generated by the source, no need to generate it manually + WatermarkStrategy.noWatermarks() + .withWatermarkAlignment(watermarkGroup, maxAllowedWatermarkDrift), + SOURCE_NAME, + TypeInformation.of(RowData.class)); +``` + +## Options + +### Read options + +Flink read options are passed when configuring the Flink IcebergSource: + +``` +IcebergSource.forRowData() + .tableLoader(TableLoader.fromCatalog(...)) + .assignerFactory(new SimpleSplitAssignerFactory()) + .streaming(true) + 
.streamingStartingStrategy(StreamingStartingStrategy.INCREMENTAL_FROM_LATEST_SNAPSHOT) + .startSnapshotId(3821550127947089987L) + .monitorInterval(Duration.ofSeconds(10L)) // or .set("monitor-interval", "10s") or .set(FlinkReadOptions.MONITOR_INTERVAL, "10s") + .build() +``` + +For Flink SQL, read options can be passed in via SQL hints like this: + +``` +SELECT * FROM tableName /*+ OPTIONS('monitor-interval'='10s') */ +... +``` + +Options can also be passed in via Flink configuration, in which case they apply to the current session. Note that not all options support this mode. + +``` +env.getConfig() + .getConfiguration() + .set(FlinkReadOptions.SPLIT_FILE_OPEN_COST_OPTION, 1000L); +... +``` + +Check out all the options here: [read-options](/flink-configuration#read-options) + +## Inspecting tables + +To inspect a table's history, snapshots, and other metadata, Iceberg supports metadata tables. + +Metadata tables are identified by adding the metadata table name after the original table name. For example, history for `db.table` is read using `db.table$history`. 
+ +### History + +To show table history: + +```sql +SELECT * FROM prod.db.table$history; +``` + +| made_current_at | snapshot_id | parent_id | is_current_ancestor | +| ----------------------- | ------------------- | ------------------- | ------------------- | +| 2019-02-08 03:29:51.215 | 5781947118336215154 | NULL | true | +| 2019-02-08 03:47:55.948 | 5179299526185056830 | 5781947118336215154 | true | +| 2019-02-09 16:24:30.13 | 296410040247533544 | 5179299526185056830 | false | +| 2019-02-09 16:32:47.336 | 2999875608062437330 | 5179299526185056830 | true | +| 2019-02-09 19:42:03.919 | 8924558786060583479 | 2999875608062437330 | true | +| 2019-02-09 19:49:16.343 | 6536733823181975045 | 8924558786060583479 | true | + +{{< hint info >}} +**This shows a commit that was rolled back.** In this example, snapshots 296410040247533544 and 2999875608062437330 have the same parent snapshot 5179299526185056830. 
Snapshot 296410040247533544 was rolled back and is *not* an ancestor of the current table state. +{{< /hint >}} + +### Metadata Log Entries + +To show table metadata log entries: + +```sql +SELECT * from prod.db.table$metadata_log_entries; +``` + +| timestamp | file | latest_snapshot_id | latest_schema_id | latest_sequence_number | +| ----------------------- | ------------------------------------------------------------ | ------------------ | ---------------- | ---------------------- | +| 2022-07-28 10:43:52.93 | s3://.../table/metadata/00000-9441e604-b3c2-498a-a45a-6320e8ab9006.metadata.json | null | null | null | +| 2022-07-28 10:43:57.487 | s3://.../table/metadata/00001-f30823df-b745-4a0a-b293-7532e0c99986.metadata.json | 170260833677645300 | 0 | 1 | +| 2022-07-28 10:43:58.25 | s3://.../table/metadata/00002-2cc2837a-02dc-4687-acc1-b4d86ea486f4.metadata.json | 958906493976709774 | 0 | 2 | + +### Snapshots + +To show the valid snapshots for a table: + +```sql +SELECT * FROM prod.db.table$snapshots; +``` + +| committed_at | snapshot_id | parent_id | operation | manifest_list | summary | +| ----------------------- | -------------- | --------- | --------- | -------------------------------------------------- | ------------------------------------------------------------ | +| 2019-02-08 03:29:51.215 | 57897183625154 | null | append | s3://.../table/metadata/snap-57897183625154-1.avro | { added-records -> 2478404, total-records -> 2478404, added-data-files -> 438, total-data-files -> 438, flink.job-id -> 2e274eecb503d85369fb390e8956c813 } | + +You can also join snapshots to table history. 
For example, this query shows table history, with the Flink job ID that wrote each snapshot: + +```sql +select + h.made_current_at, + s.operation, + h.snapshot_id, + h.is_current_ancestor, + s.summary['flink.job-id'] +from prod.db.table$history h +join prod.db.table$snapshots s + on h.snapshot_id = s.snapshot_id +order by made_current_at; +``` + +| made_current_at | operation | snapshot_id | is_current_ancestor | summary[flink.job-id] | +| ----------------------- | --------- | -------------- | ------------------- | -------------------------------- | +| 2019-02-08 03:29:51.215 | append | 57897183625154 | true | 2e274eecb503d85369fb390e8956c813 | + +### Files + +To show a table's current data files: + +```sql +SELECT * FROM prod.db.table$files; +``` + +| content | file_path | file_format | spec_id | partition | record_count | file_size_in_bytes | column_sizes | value_counts | null_value_counts | nan_value_counts | lower_bounds | upper_bounds | key_metadata | split_offsets | equality_ids | sort_order_id | +| ------- | ------------------------------------------------------------ | ----------- | ------- | ---------------- | ------------ | ------------------ | ------------------ | ---------------- | ----------------- | ---------------- | --------------- | --------------- | ------------ | ------------- | ------------ | ------------- | +| 0 | s3://.../table/data/00000-3-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET | 0 | {1999-01-01, 01} | 1 | 597 | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0] | [] | [1 -> , 2 -> c] | [1 -> , 2 -> c] | null | [4] | null | null | +| 0 | s3://.../table/data/00001-4-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET | 0 | {1999-01-01, 02} | 1 | 597 | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0] | [] | [1 -> , 2 -> b] | [1 -> , 2 -> b] | null | [4] | null | null | +| 0 | s3://.../table/data/00002-5-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET | 0 | {1999-01-01, 03} | 1 | 597 | [1 -> 90, 2 -> 
62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0] | [] | [1 -> , 2 -> a] | [1 -> , 2 -> a] | null | [4] | null | null | + +### Manifests + +To show a table's current file manifests: + +```sql +SELECT * FROM prod.db.table$manifests; +``` + +| path | length | partition_spec_id | added_snapshot_id | added_data_files_count | existing_data_files_count | deleted_data_files_count | partition_summaries | +| ------------------------------------------------------------ | ------ | ----------------- | ------------------- | ---------------------- | ------------------------- | ------------------------ | ------------------------------------ | +| s3://.../table/metadata/45b5290b-ee61-4788-b324-b1e2735c0e10-m0.avro | 4479 | 0 | 6668963634911763636 | 8 | 0 | 0 | [[false,null,2019-05-13,2019-05-15]] | + +Note: + +1. Fields within `partition_summaries` column of the manifests table correspond to `field_summary` structs within [manifest list](../../../spec#manifest-lists), with the following order: + - `contains_null` + - `contains_nan` + - `lower_bound` + - `upper_bound` +2. `contains_nan` could return null, which indicates that this information is not available from the file's metadata. + This usually occurs when reading from V1 table, where `contains_nan` is not populated. 
+ +### Partitions + +To show a table's current partitions: + +```sql +SELECT * FROM prod.db.table$partitions; +``` + +| partition | spec_id | record_count | file_count | total_data_file_size_in_bytes | position_delete_record_count | position_delete_file_count | equality_delete_record_count | equality_delete_file_count | last_updated_at(μs) | last_updated_snapshot_id | +| -------------- |---------|---------------|------------|--------------------------|------------------------------|----------------------------|------------------------------|----------------------------|---------------------|--------------------------| +| {20211001, 11} | 0 | 1 | 1 | 100 | 2 | 1 | 0 | 0 | 1633086034192000 | 9205185327307503337 | +| {20211002, 11} | 0 | 4 | 3 | 500 | 1 | 1 | 0 | 0 | 1633172537358000 | 867027598972211003 | +| {20211001, 10} | 0 | 7 | 4 | 700 | 0 | 0 | 0 | 0 | 1633082598716000 | 3280122546965981531 | +| {20211002, 10} | 0 | 3 | 2 | 400 | 0 | 0 | 1 | 1 | 1633169159489000 | 6941468797545315876 | + +Note: +For unpartitioned tables, the partitions table will not contain the partition and spec_id fields. + +### All Metadata Tables + +These tables are unions of the metadata tables specific to the current snapshot, and return metadata across all snapshots. + +{{< hint danger >}} +The "all" metadata tables may produce more than one row per data file or manifest file because metadata files may be part of more than one table snapshot. 
+{{< /hint >}} + +#### All Data Files + +To show all of the table's data files and each file's metadata: + +```sql +SELECT * FROM prod.db.table$all_data_files; +``` + +| content | file_path | file_format | partition | record_count | file_size_in_bytes | column_sizes | value_counts | null_value_counts | nan_value_counts | lower_bounds | upper_bounds | key_metadata | split_offsets | equality_ids | sort_order_id | +| ------- | ------------------------------------------------------------ | ----------- | ---------- | ------------ | ------------------ | ------------------ | ------------------ | ----------------- | ---------------- | ----------------------- | ----------------------- | ------------ | ------------- | ------------ | ------------- | +| 0 | s3://.../dt=20210102/00000-0-756e2512-49ae-45bb-aae3-c0ca475e7879-00001.parquet | PARQUET | {20210102} | 14 | 2444 | {1 -> 94, 2 -> 17} | {1 -> 14, 2 -> 14} | {1 -> 0, 2 -> 0} | {} | {1 -> 1, 2 -> 20210102} | {1 -> 2, 2 -> 20210102} | null | [4] | null | 0 | +| 0 | s3://.../dt=20210103/00000-0-26222098-032f-472b-8ea5-651a55b21210-00001.parquet | PARQUET | {20210103} | 14 | 2444 | {1 -> 94, 2 -> 17} | {1 -> 14, 2 -> 14} | {1 -> 0, 2 -> 0} | {} | {1 -> 1, 2 -> 20210103} | {1 -> 3, 2 -> 20210103} | null | [4] | null | 0 | +| 0 | s3://.../dt=20210104/00000-0-a3bb1927-88eb-4f1c-bc6e-19076b0d952e-00001.parquet | PARQUET | {20210104} | 14 | 2444 | {1 -> 94, 2 -> 17} | {1 -> 14, 2 -> 14} | {1 -> 0, 2 -> 0} | {} | {1 -> 1, 2 -> 20210104} | {1 -> 3, 2 -> 20210104} | null | [4] | null | 0 | + +#### All Manifests + +To show all of the table's manifest files: + +```sql +SELECT * FROM prod.db.table$all_manifests; +``` + +| path | length | partition_spec_id | added_snapshot_id | added_data_files_count | existing_data_files_count | deleted_data_files_count | partition_summaries | +| ------------------------------------------------------------ | ------ | ----------------- | ------------------- | ---------------------- | 
------------------------- | ------------------------ | ------------------------------------ | +| s3://.../metadata/a85f78c5-3222-4b37-b7e4-faf944425d48-m0.avro | 6376 | 0 | 6272782676904868561 | 2 | 0 | 0 | [{false, false, 20210101, 20210101}] | + +Note: + +1. Fields within the `partition_summaries` column of the manifests table correspond to `field_summary` structs within the [manifest list](../../../spec#manifest-lists), with the following order: + - `contains_null` + - `contains_nan` + - `lower_bound` + - `upper_bound` +2. `contains_nan` could return null, which indicates that this information is not available from the file's metadata. + This usually occurs when reading from a V1 table, where `contains_nan` is not populated. + +### References + +To show a table's known snapshot references: + +```sql +SELECT * FROM prod.db.table$refs; +``` + +| name | type | snapshot_id | max_reference_age_in_ms | min_snapshots_to_keep | max_snapshot_age_in_ms | +| ------- | ------ | ------------------- | ----------------------- | --------------------- | ---------------------- | +| main | BRANCH | 4686954189838128572 | 10 | 20 | 30 | +| testTag | TAG | 4686954189838128572 | 10 | null | null | + diff --git a/docs/flink-writes.md b/docs/flink-writes.md new file mode 100644 index 000000000000..53859de21b5e --- /dev/null +++ b/docs/flink-writes.md @@ -0,0 +1,282 @@ +--- +title: "Flink Writes" +url: flink-writes +aliases: + - "flink/flink-writes" +menu: + main: + parent: Flink + identifier: flink_writes + weight: 400 +--- + +# Flink Writes + +Iceberg supports batch and streaming writes with [Apache Flink](https://flink.apache.org/)'s DataStream API and Table API. + +## Writing with SQL + +Iceberg supports both `INSERT INTO` and `INSERT OVERWRITE`. 
+ +### `INSERT INTO` + +To append new data to a table with a Flink streaming job, use `INSERT INTO`: + +```sql +INSERT INTO `hive_catalog`.`default`.`sample` VALUES (1, 'a'); +INSERT INTO `hive_catalog`.`default`.`sample` SELECT id, data from other_kafka_table; +``` + +### `INSERT OVERWRITE` + +To replace data in the table with the result of a query, use `INSERT OVERWRITE` in a batch job (Flink streaming jobs do not support `INSERT OVERWRITE`). Overwrites are atomic operations for Iceberg tables. + +Partitions that have rows produced by the SELECT query will be replaced, for example: + +```sql +INSERT OVERWRITE sample VALUES (1, 'a'); +``` + +Iceberg also supports overwriting given partitions by the `select` values: + +```sql +INSERT OVERWRITE `hive_catalog`.`default`.`sample` PARTITION(data='a') SELECT 6; +``` + +For a partitioned Iceberg table, when every partition column is given a value in the `PARTITION` clause, the statement inserts into a static partition. If only some partition columns (a prefix of the partition columns) are given values in the `PARTITION` clause, the query result is written into dynamic partitions. +For an unpartitioned Iceberg table, its data will be completely overwritten by `INSERT OVERWRITE`. + +### `UPSERT` + +Iceberg supports `UPSERT` based on the primary key when writing data into v2 table format. There are two ways to enable upsert. + +1. Enable the `UPSERT` mode as table-level property `write.upsert.enabled`. Here is an example SQL statement to set the table property when creating a table. It applies to all write paths to this table (batch or streaming) unless overridden by write options as described later. + +```sql +CREATE TABLE `hive_catalog`.`default`.`sample` ( + `id` INT COMMENT 'unique id', + `data` STRING NOT NULL, + PRIMARY KEY(`id`) NOT ENFORCED +) with ('format-version'='2', 'write.upsert.enabled'='true'); +``` + +2. 
Enabling `UPSERT` mode using `upsert-enabled` in the [write options](#Write-options) provides more flexibility than a table level config. Note that you still need to use v2 table format and specify the primary key when creating the table. + +```sql +INSERT INTO tableName /*+ OPTIONS('upsert-enabled'='true') */ +... +``` + +{{< hint info >}} +OVERWRITE and UPSERT can't be set together. In UPSERT mode, if the table is partitioned, the partition fields should be included in equality fields. +{{< /hint >}} + + + +## Writing with DataStream + +Iceberg supports writing to an Iceberg table from different DataStream inputs. + + +### Appending data. + +Flink supports writing `DataStream<RowData>` and `DataStream<Row>` to the sink Iceberg table natively. + +```java +StreamExecutionEnvironment env = ...; + +DataStream<RowData> input = ... ; +Configuration hadoopConf = new Configuration(); +TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path", hadoopConf); + +FlinkSink.forRowData(input) + .tableLoader(tableLoader) + .append(); + +env.execute("Test Iceberg DataStream"); +``` + +The Iceberg API also allows users to write a generic `DataStream<T>` to an Iceberg table. More examples can be found in this [unit test](https://github.com/apache/iceberg/blob/main/flink/v1.16/flink/src/test/java/org/apache/iceberg/flink/sink/TestFlinkIcebergSink.java). + +### Overwrite data + +Set the `overwrite` flag in FlinkSink builder to overwrite the data in existing Iceberg tables: + +```java +StreamExecutionEnvironment env = ...; + +DataStream<RowData> input = ... 
; +Configuration hadoopConf = new Configuration(); +TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path", hadoopConf); + +FlinkSink.forRowData(input) + .tableLoader(tableLoader) + .overwrite(true) + .append(); + +env.execute("Test Iceberg DataStream"); +``` + +### Upsert data + +Set the `upsert` flag in FlinkSink builder to upsert the data in an existing Iceberg table. 
The table must use v2 table format and have a primary key. + +```java +StreamExecutionEnvironment env = ...; + +DataStream<RowData> input = ... ; +Configuration hadoopConf = new Configuration(); +TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path", hadoopConf); + +FlinkSink.forRowData(input) + .tableLoader(tableLoader) + .upsert(true) + .append(); + +env.execute("Test Iceberg DataStream"); +``` + +{{< hint info >}} +OVERWRITE and UPSERT can't be set together. In UPSERT mode, if the table is partitioned, the partition fields should be included in equality fields. +{{< /hint >}} + +### Write with Avro GenericRecord + +Flink Iceberg sink provides `AvroGenericRecordToRowDataMapper` that converts +Avro `GenericRecord` to Flink `RowData`. You can use the mapper to write +an Avro GenericRecord DataStream to Iceberg. + +Please make sure the `flink-avro` jar is included in the classpath. +Also, the `iceberg-flink-runtime` shaded bundle jar can't be used +because the runtime jar shades the avro package. +Please use the non-shaded `iceberg-flink` jar instead. + +```java +DataStream<org.apache.avro.generic.GenericRecord> dataStream = ...; + +Schema icebergSchema = table.schema(); + + +// The Avro schema converted from Iceberg schema can't be used +// due to precision difference between how Iceberg schema (micro) +// and Flink AvroToRowDataConverters (milli) deal with time type. +// Instead, use the Avro schema defined directly. +// See AvroGenericRecordToRowDataMapper Javadoc for more details. 
+org.apache.avro.Schema avroSchema = AvroSchemaUtil.convert(icebergSchema, table.name()); + +GenericRecordAvroTypeInfo avroTypeInfo = new GenericRecordAvroTypeInfo(avroSchema); +RowType rowType = FlinkSchemaUtil.convert(icebergSchema); + +FlinkSink.builderFor( + dataStream, + AvroGenericRecordToRowDataMapper.forAvroSchema(avroSchema), + FlinkCompatibilityUtil.toTypeInfo(rowType)) + .table(table) + .tableLoader(tableLoader) + .append(); +``` + +### Branch Writes +Writing to branches in Iceberg tables is also supported via the `toBranch` API in `FlinkSink`. +For more information on branches please refer to [branches](../tables/branching). +```java +FlinkSink.forRowData(input) + .tableLoader(tableLoader) + .toBranch("audit-branch") + .append(); +``` + +### Metrics + +The following Flink metrics are provided by the Flink Iceberg sink. + +Parallel writer metrics are added under the sub group of `IcebergStreamWriter`. +They should have the following key-value tags. + +* table: full table name (like iceberg.my_db.my_table) +* subtask_index: writer subtask index starting from 0 + +| Metric name | Metric type | Description | +| ------------------------- |------------|-----------------------------------------------------------------------------------------------------| +| lastFlushDurationMs | Gauge | The duration (in milliseconds) that writer subtasks take to flush and upload the files during checkpoint. | +| flushedDataFiles | Counter | Number of data files flushed and uploaded. | +| flushedDeleteFiles | Counter | Number of delete files flushed and uploaded. | +| flushedReferencedDataFiles| Counter | Number of data files referenced by the flushed delete files. | +| dataFilesSizeHistogram | Histogram | Histogram distribution of data file sizes (in bytes). | +| deleteFilesSizeHistogram | Histogram | Histogram distribution of delete file sizes (in bytes). | + +Committer metrics are added under the sub group of `IcebergFilesCommitter`. 
+They should have the following key-value tags. 
+ +* table: full table name (like iceberg.my_db.my_table) + +| Metric name | Metric type | Description | +|---------------------------------|--------|----------------------------------------------------------------------------| +| lastCheckpointDurationMs | Gauge | The duration (in milliseconds) that the committer operator takes to checkpoint its state. | +| lastCommitDurationMs | Gauge | The duration (in milliseconds) that the Iceberg table commit takes. | +| committedDataFilesCount | Counter | Number of data files committed. | +| committedDataFilesRecordCount | Counter | Number of records contained in the committed data files. | +| committedDataFilesByteCount | Counter | Number of bytes contained in the committed data files. | +| committedDeleteFilesCount | Counter | Number of delete files committed. | +| committedDeleteFilesRecordCount | Counter | Number of records contained in the committed delete files. | +| committedDeleteFilesByteCount | Counter | Number of bytes contained in the committed delete files. | +| elapsedSecondsSinceLastSuccessfulCommit| Gauge | Elapsed time (in seconds) since last successful Iceberg commit. | + +`elapsedSecondsSinceLastSuccessfulCommit` is an ideal alerting metric +to detect failed or missing Iceberg commits. + +* The Iceberg commit happens after a successful Flink checkpoint, in the `notifyCheckpointComplete` callback. + Iceberg commits can fail (for whatever reason) while Flink checkpoints keep succeeding. +* It is also possible that `notifyCheckpointComplete` isn't triggered (because of a bug, for example). + As a result, no Iceberg commits are attempted. + +If the checkpoint interval (and expected Iceberg commit interval) is 5 minutes, set up an alert with a rule like `elapsedSecondsSinceLastSuccessfulCommit > 60 minutes` to detect failed or missing Iceberg commits in the past hour. 
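The alert rule above can be sketched in plain Java. This is only an illustration of the threshold comparison, assuming the gauge value has already been read from whatever metrics system is in use. `CommitAlertRule`, `shouldAlert`, and `missedIntervals` are hypothetical names and are not part of Iceberg or Flink.

```java
import java.time.Duration;

// Hypothetical helper (not an Iceberg or Flink API): evaluates an alert rule
// on the value of the elapsedSecondsSinceLastSuccessfulCommit gauge.
final class CommitAlertRule {
  private CommitAlertRule() {}

  // True when the time since the last successful commit exceeds the threshold,
  // e.g. a threshold of 60 minutes for a 5 minute checkpoint interval.
  static boolean shouldAlert(long elapsedSecondsSinceLastSuccessfulCommit, Duration threshold) {
    return elapsedSecondsSinceLastSuccessfulCommit > threshold.getSeconds();
  }

  // Rough count of whole checkpoint intervals that passed without a commit.
  static long missedIntervals(long elapsedSeconds, Duration checkpointInterval) {
    long intervalSeconds = checkpointInterval.getSeconds();
    if (intervalSeconds <= 0) {
      throw new IllegalArgumentException("checkpoint interval must be at least one second");
    }
    return elapsedSeconds / intervalSeconds;
  }

  public static void main(String[] args) {
    long gaugeValue = 3900L; // assumed sample reading, in seconds
    System.out.println("alert: " + shouldAlert(gaugeValue, Duration.ofMinutes(60)));
    System.out.println("missed intervals: " + missedIntervals(gaugeValue, Duration.ofMinutes(5)));
  }
}
```

A threshold of several checkpoint intervals, rather than one, keeps a single slow or retried commit from raising an alert.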
+ + + +## Options + +### Write options + +Flink write options are passed when configuring the FlinkSink, like this: + +```java +FlinkSink.Builder builder = FlinkSink.forRow(dataStream, SimpleDataUtil.FLINK_SCHEMA) + .table(table) + .tableLoader(tableLoader) + .set("write-format", "orc") + .set(FlinkWriteOptions.OVERWRITE_MODE, "true"); +``` + +For Flink SQL, write options can be passed in via SQL hints like this: + +```sql +INSERT INTO tableName /*+ OPTIONS('upsert-enabled'='true') */ +... +``` + +Check out all the options here: [write-options](/flink-configuration#write-options) + +## Notes + +Flink streaming write jobs rely on snapshot summary to keep the last committed checkpoint ID, and +store uncommitted data as temporary files. Therefore, [expiring snapshots](../tables/maintenance#expire-snapshots) +and [deleting orphan files](../tables/maintenance#delete-orphan-files) could possibly corrupt +the state of the Flink job. To avoid that, make sure to keep the last snapshot created by the Flink +job (which can be identified by the `flink.job-id` property in the summary), and only delete +orphan files that are old enough. \ No newline at end of file diff --git a/docs/hive-migration.md b/docs/hive-migration.md new file mode 100644 index 000000000000..21d38d789e71 --- /dev/null +++ b/docs/hive-migration.md @@ -0,0 +1,61 @@ +--- +title: "Hive Migration" +url: hive-migration +menu: + main: + parent: "Migration" + identifier: hive_migration + weight: 200 +--- + + +# Hive Table Migration +Apache Hive supports ORC, Parquet, and Avro file formats that could be migrated to Iceberg. +When migrating data to an Iceberg table, which provides versioning and transactional updates, only the most recent data files need to be migrated. + +Iceberg supports all three migration actions: Snapshot Table, Migrate Table, and Add Files for migrating from Hive tables to Iceberg tables. 
Since Hive tables do not maintain snapshots, +the migration process essentially involves creating a new Iceberg table with the existing schema and committing all data files across all partitions to the new Iceberg table. +After the initial migration, any new data files are added to the new Iceberg table using the Add Files action. + +## Enabling Migration from Hive to Iceberg +The Hive table migration actions are supported by the Spark Integration module via Spark Procedures. +The procedures are bundled in the Spark runtime jar, which is available in the [Iceberg Release Downloads](https://iceberg.apache.org/releases/#downloads). + +## Snapshot Hive Table to Iceberg +To snapshot a Hive table, users can run the following Spark SQL: +```sql +CALL catalog_name.system.snapshot('db.source', 'db.dest') +``` +See [Spark Procedure: snapshot](../spark-procedures/#snapshot) for more details. + +## Migrate Hive Table To Iceberg +To migrate a Hive table to Iceberg, users can run the following Spark SQL: +```sql +CALL catalog_name.system.migrate('db.sample') +``` +See [Spark Procedure: migrate](../spark-procedures/#migrate) for more details. + +## Add Files From Hive Table to Iceberg +To add data files from a Hive table to a given Iceberg table, users can run the following Spark SQL: +```sql +CALL spark_catalog.system.add_files( +table => 'db.tbl', +source_table => 'db.src_tbl' +) +``` +See [Spark Procedure: add_files](../spark-procedures/#add_files) for more details. diff --git a/docs/hive.md b/docs/hive.md new file mode 100644 index 000000000000..adb79e64ef7a --- /dev/null +++ b/docs/hive.md @@ -0,0 +1,603 @@ +--- +title: "Hive" +url: hive +weight: 400 +menu: main +--- + + +# Hive + +Iceberg supports reading and writing Iceberg tables through [Hive](https://hive.apache.org) by using +a [StorageHandler](https://cwiki.apache.org/confluence/display/Hive/StorageHandlers). 
+ +## Feature support + +Iceberg compatibility with Hive 2.x and Hive 3.1.2/3 supports the following features: + +* Creating a table +* Dropping a table +* Reading a table +* Inserting into a table (INSERT INTO) + +{{< hint warning >}} +DML operations work only with MapReduce execution engine. +{{< /hint >}} + +With Hive version 4.0.0-alpha-2 and above, Iceberg integration when using HiveCatalog supports the following additional features: + +* Altering a table with expiring snapshots. +* Create a table like an existing table (CTLT table) +* Support adding parquet compression type via Table properties [Compression types](https://spark.apache.org/docs/2.4.3/sql-data-sources-parquet.html#configuration) +* Altering a table metadata location +* Supporting table rollback +* Honours sort orders on existing tables when writing a table [Sort orders specification](https://iceberg.apache.org/spec/#sort-orders) + +With Hive version 4.0.0-alpha-1 and above, Iceberg integration when using HiveCatalog supports the following additional features: + +* Creating an Iceberg identity-partitioned table +* Creating an Iceberg table with any partition spec, including the various transforms supported by Iceberg +* Creating a table from an existing table (CTAS table) +* Altering a table while keeping Iceberg and Hive schemas in sync +* Altering the partition schema (updating columns) +* Altering the partition schema by specifying partition transforms +* Truncating a table +* Migrating tables in Avro, Parquet, or ORC (Non-ACID) format to Iceberg +* Reading the schema of a table +* Querying Iceberg metadata tables +* Time travel applications +* Inserting into a table (INSERT INTO) +* Inserting data overwriting existing data (INSERT OVERWRITE) + +{{< hint warning >}} +DML operations work only with Tez execution engine. +{{< /hint >}} + +## Enabling Iceberg support in Hive + +Hive 4 comes with `hive-iceberg` that ships Iceberg, so no additional downloads or jars are needed. 
For older versions of Hive a runtime jar has to be added. + +### Hive 4.0.0-beta-1 + +Hive 4.0.0-beta-1 comes with the Iceberg 1.3.0 included. + +### Hive 4.0.0-alpha-2 + +Hive 4.0.0-alpha-2 comes with the Iceberg 0.14.1 included. + +### Hive 4.0.0-alpha-1 + +Hive 4.0.0-alpha-1 comes with the Iceberg 0.13.1 included. + +### Hive 2.3.x, Hive 3.1.x + +In order to use Hive 2.3.x or Hive 3.1.x, you must load the Iceberg-Hive runtime jar and enable Iceberg support, either globally or for an individual table using a table property. + +#### Loading runtime jar + +To enable Iceberg support in Hive, the `HiveIcebergStorageHandler` and supporting classes need to be made available on +Hive's classpath. These are provided by the `iceberg-hive-runtime` jar file. For example, if using the Hive shell, this +can be achieved by issuing a statement like so: + +``` +add jar /path/to/iceberg-hive-runtime.jar; +``` + +There are many others ways to achieve this including adding the jar file to Hive's auxiliary classpath so it is +available by default. Please refer to Hive's documentation for more information. + +#### Enabling support + +If the Iceberg storage handler is not in Hive's classpath, then Hive cannot load or update the metadata for an Iceberg +table when the storage handler is set. To avoid the appearance of broken tables in Hive, Iceberg will not add the +storage handler to a table unless Hive support is enabled. The storage handler is kept in sync (added or removed) every +time Hive engine support for the table is updated, i.e. turned on or off in the table properties. There are two ways to +enable Hive support: globally in Hadoop Configuration and per-table using a table property. + +##### Hadoop configuration + +To enable Hive support globally for an application, set `iceberg.engine.hive.enabled=true` in its Hadoop configuration. +For example, setting this in the `hive-site.xml` loaded by Spark will enable the storage handler for all tables created +by Spark. 
+ +{{< hint danger >}} Starting with Apache Iceberg `0.11.0`, when using Hive with Tez you also have to disable +vectorization (`hive.vectorized.execution.enabled=false`). {{< /hint >}} + +##### Table property configuration + +Alternatively, the property `engine.hive.enabled` can be set to `true` and added to the table properties when creating +the Iceberg table. Here is an example of doing it programmatically: + +```java +Catalog catalog = ...; +Map<String, String> tableProperties = Maps.newHashMap(); +tableProperties.put(TableProperties.ENGINE_HIVE_ENABLED, "true"); // engine.hive.enabled=true +catalog.createTable(tableId, schema, spec, tableProperties); +``` + +The table level configuration overrides the global Hadoop configuration. + +##### Hive on Tez configuration + +To use the Tez engine on Hive `3.1.2` or later, Tez needs to be upgraded to >= `0.10.1` which contains a necessary fix [TEZ-4248](https://issues.apache.org/jira/browse/TEZ-4248). + +To use the Tez engine on Hive `2.3.x`, you will need to manually build Tez from the `branch-0.9` branch due to a +backwards incompatibility issue with Tez `0.10.1`. + +In both cases, you will also need to set the following property in the `tez-site.xml` configuration file: `tez.mrreader.config.update.properties=hive.io.file.readcolumn.names,hive.io.file.readcolumn.ids`. + +## Catalog Management + +### Global Hive catalog + +From the Hive engine's perspective, there is only one global data catalog that is defined in the Hadoop configuration in +the runtime environment. In contrast, Iceberg supports multiple different data catalog types such as Hive, Hadoop, AWS +Glue, or custom catalog implementations. Iceberg also allows loading a table directly based on its path in the file +system. Those tables do not belong to any catalog. Users might want to read these cross-catalog and path-based tables +through the Hive engine for use cases such as joins.
+ +To support this, a table in the Hive metastore can represent three different ways of loading an Iceberg table, depending +on the table's `iceberg.catalog` property: + +1. The table will be loaded using a `HiveCatalog` that corresponds to the metastore configured in the Hive environment + if no `iceberg.catalog` is set +2. The table will be loaded using a custom catalog if `iceberg.catalog` is set to a catalog name (see below) +3. The table can be loaded directly using the table's root location if `iceberg.catalog` is set + to `location_based_table` + +For cases 2 and 3 above, users can create an overlay of an Iceberg table in the Hive metastore, so that different table +types can work together in the same Hive environment. See [CREATE EXTERNAL TABLE](#create-external-table) +and [CREATE TABLE](#create-table) for more details. + +### Custom Iceberg catalogs + +To globally register different catalogs, set the following Hadoop configurations: + +| Config Key | Description | +| --------------------------------------------- | ------------------------------------------------------ | +| iceberg.catalog..type | type of catalog: `hive`, `hadoop`, or left unset if using a custom catalog | +| iceberg.catalog..catalog-impl | catalog implementation, must not be null if type is empty | +| iceberg.catalog.. 
| any config key and value pairs for the catalog | + +Here are some examples using Hive CLI: + +Register a `HiveCatalog` called `another_hive`: + +``` +SET iceberg.catalog.another_hive.type=hive; +SET iceberg.catalog.another_hive.uri=thrift://example.com:9083; +SET iceberg.catalog.another_hive.clients=10; +SET iceberg.catalog.another_hive.warehouse=hdfs://example.com:8020/warehouse; +``` + +Register a `HadoopCatalog` called `hadoop`: + +``` +SET iceberg.catalog.hadoop.type=hadoop; +SET iceberg.catalog.hadoop.warehouse=hdfs://example.com:8020/warehouse; +``` + +Register an AWS `GlueCatalog` called `glue`: + +``` +SET iceberg.catalog.glue.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog; +SET iceberg.catalog.glue.warehouse=s3://my-bucket/my/key/prefix; +SET iceberg.catalog.glue.lock.table=myGlueLockTable; +``` + +## DDL Commands + +Not all the features below are supported with Hive 2.3.x and Hive 3.1.x. Please refer to the +[Feature support](#feature-support) paragraph for further details. + +One generally applicable difference is that Hive 4.0.0-alpha-1 provides the possibility to use +`STORED BY ICEBERG` instead of the old `STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'` + +### CREATE TABLE + +#### Non partitioned tables + +The Hive `CREATE EXTERNAL TABLE` command creates an Iceberg table when you specify the storage handler as follows: + +```sql +CREATE EXTERNAL TABLE x (i int) STORED BY ICEBERG; +``` + +If you want to create external tables using CREATE TABLE, configure the MetaStoreMetadataTransformer on the cluster, +and `CREATE TABLE` commands are transformed to create external tables. For example: + +```sql +CREATE TABLE x (i int) STORED BY ICEBERG; +``` + +You can specify the default file format (Avro, Parquet, ORC) at the time of the table creation. 
+The default is Parquet: + +```sql +CREATE TABLE x (i int) STORED BY ICEBERG STORED AS ORC; +``` + +#### Partitioned tables +You can create Iceberg partitioned tables using a command familiar to those who create non-Iceberg tables: + +```sql +CREATE TABLE x (i int) PARTITIONED BY (j int) STORED BY ICEBERG; +``` + +{{< hint info >}} +The resulting table does not create partitions in HMS, but instead, converts partition data into Iceberg identity partitions. +{{< /hint >}} + +Use the DESCRIBE command to get information about the Iceberg identity partitions: + +```sql +DESCRIBE x; +``` +The result is: + +| col_name | data_type | comment +| ---------------------------------- | -------------- | ------- +| i | int | +| j | int | +| | NULL | NULL +| # Partition Transform Information | NULL | NULL +| # col_name | transform_type | NULL +| j | IDENTITY | NULL + +You can create Iceberg partitions using the following Iceberg partition specification syntax +(supported only from Hive 4.0.0-alpha-1): + +```sql +CREATE TABLE x (i int, ts timestamp) PARTITIONED BY SPEC (month(ts), bucket(2, i)) STORED AS ICEBERG; +DESCRIBE x; +``` +The result is: + +| col_name | data_type | comment +| ---------------------------------- | -------------- | ------- +| i | int | +| ts | timestamp | +| | NULL | NULL +| # Partition Transform Information | NULL | NULL +| # col_name | transform_type | NULL +| ts | MONTH | NULL +| i | BUCKET\[2\] | NULL + +The supported transformations for Hive are the same as for Spark: +* years(ts): partition by year +* months(ts): partition by month +* days(ts) or date(ts): equivalent to dateint partitioning +* hours(ts) or date_hour(ts): equivalent to dateint and hour partitioning +* bucket(N, col): partition by hashed value mod N buckets +* truncate(L, col): partition by value truncated to L + - Strings are truncated to the given length + - Integers and longs truncate to bins: truncate(10, i) produces partitions 0, 10, 20, 30, + +{{< hint info >}} +The resulting table 
does not create partitions in HMS, but instead, converts partition data into Iceberg partitions. +{{< /hint >}} + +### CREATE TABLE AS SELECT + +`CREATE TABLE AS SELECT` operation resembles the native Hive operation with a single important difference. +The Iceberg table and the corresponding Hive table are created at the beginning of the query execution. +The data is inserted / committed when the query finishes. So for a transient period the table already exists but contains no data. + +```sql +CREATE TABLE target PARTITIONED BY SPEC (year(year_field), identity_field) STORED BY ICEBERG AS + SELECT * FROM source; +``` + +### CREATE TABLE LIKE TABLE + +```sql +CREATE TABLE target LIKE source STORED BY ICEBERG; +``` + +### CREATE EXTERNAL TABLE overlaying an existing Iceberg table + +The `CREATE EXTERNAL TABLE` command is used to overlay a Hive table "on top of" an existing Iceberg table. Iceberg +tables are created using either a [`Catalog`](../../../javadoc/{{% icebergVersion +%}}/index.html?org/apache/iceberg/catalog/Catalog.html), or an implementation of the [`Tables`](../../../javadoc/{{% +icebergVersion %}}/index.html?org/apache/iceberg/Tables.html) interface, and Hive needs to be configured accordingly to +operate on these different types of table. + +#### Hive catalog tables + +As described before, tables created by the `HiveCatalog` with Hive engine feature enabled are directly visible by the +Hive engine, so there is no need to create an overlay. + +#### Custom catalog tables + +For a table in a registered catalog, specify the catalog name in the statement using table property `iceberg.catalog`. 
+For example, the SQL below creates an overlay for a table in a `hadoop` type catalog named `hadoop_cat`: + +```sql +SET +iceberg.catalog.hadoop_cat.type=hadoop; +SET +iceberg.catalog.hadoop_cat.warehouse=hdfs://example.com:8020/hadoop_cat; + +CREATE +EXTERNAL TABLE database_a.table_a +STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' +TBLPROPERTIES ('iceberg.catalog'='hadoop_cat'); +``` + +When `iceberg.catalog` is missing from both table properties and the global Hadoop configuration, `HiveCatalog` will be +used as default. + +#### Path-based Hadoop tables + +Iceberg tables created using `HadoopTables` are stored entirely in a directory in a filesystem like HDFS. These tables +are considered to have no catalog. To indicate that, set `iceberg.catalog` property to `location_based_table`. For +example: + +```sql +CREATE +EXTERNAL TABLE table_a +STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' +LOCATION 'hdfs://some_bucket/some_path/table_a' +TBLPROPERTIES ('iceberg.catalog'='location_based_table'); +``` + +#### CREATE TABLE overlaying an existing Iceberg table + +You can also create a new table that is managed by a custom catalog. For example, the following code creates a table in +a custom Hadoop catalog: + +```sql +SET +iceberg.catalog.hadoop_cat.type=hadoop; +SET +iceberg.catalog.hadoop_cat.warehouse=hdfs://example.com:8020/hadoop_cat; + +CREATE TABLE database_a.table_a +( + id bigint, + name string +) PARTITIONED BY ( + dept string +) STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' +TBLPROPERTIES ('iceberg.catalog'='hadoop_cat'); +``` + +{{< hint danger >}} If the table to create already exists in the custom catalog, this will create a managed overlay +table. This means technically you can omit the `EXTERNAL` keyword when creating an overlay table. 
However, this is **not +recommended** because creating managed overlay tables could pose a risk to the shared data files in case of accidental +drop table commands from the Hive side, which would unintentionally remove all the data in the table. {{< /hint >}} + +### ALTER TABLE +#### Table properties +For HiveCatalog tables the Iceberg table properties and the Hive table properties stored in HMS are kept in sync. + +{{< hint info >}} +IMPORTANT: This feature is not available for other Catalog implementations. +{{< /hint >}} +```sql +ALTER TABLE t SET TBLPROPERTIES('...'='...'); +``` + +#### Schema evolution +The Hive table schema is kept in sync with the Iceberg table. If an outside source (Impala/Spark/Java API/etc) +changes the schema, the Hive table immediately reflects the changes. You alter the table schema using Hive commands: + +* Add a column +```sql +ALTER TABLE orders ADD COLUMNS (nickname string); +``` +* Rename a column +```sql +ALTER TABLE orders CHANGE COLUMN item fruit string; +``` +* Reorder columns +```sql +ALTER TABLE orders CHANGE COLUMN quantity quantity int AFTER price; +``` +* Change a column type - only if the Iceberg defined the column type change as safe +```sql +ALTER TABLE orders CHANGE COLUMN price price long; +``` +* Drop column by using REPLACE COLUMN to remove the old column +```sql +ALTER TABLE orders REPLACE COLUMNS (remaining string); +``` +{{< hint info >}} +Note, that dropping columns is only thing REPLACE COLUMNS can be used for +i.e. if columns are specified out-of-order an error will be thrown signalling this limitation. 
+{{< /hint >}} + +#### Partition evolution +You change the partitioning schema using the following commands: +* Change the partitioning schema to new identity partitions: +```sql +ALTER TABLE default.customers SET PARTITION SPEC (last_name); +``` +* Alternatively, provide a partition specification: +```sql +ALTER TABLE order SET PARTITION SPEC (month(ts)); +``` +#### Table migration +You can migrate Avro / Parquet / ORC external tables to Iceberg tables using the following command: +```sql +ALTER TABLE t SET TBLPROPERTIES ('storage_handler'='org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'); +``` +During the migration the data files are not changed, only the appropriate Iceberg metadata files are created. +After the migration, handle the table as a normal Iceberg table. + +### TRUNCATE TABLE +The following command truncates the Iceberg table: +```sql +TRUNCATE TABLE t; +``` +Using a partition specification is not allowed. + +### DROP TABLE + +Tables can be dropped using the `DROP TABLE` command: + +```sql +DROP TABLE [IF EXISTS] table_name [PURGE]; +``` + +### METADATA LOCATION + +The metadata location (snapshot location) only can be changed if the new path contains the exact same metadata json. +It can be done only after migrating the table to Iceberg, the two operation cannot be done in one step. + +```sql +ALTER TABLE t set TBLPROPERTIES ('metadata_location'='/hivemetadata/00003-a1ada2b8-fc86-4b5b-8c91-400b6b46d0f2.metadata.json'); +``` + +## DML Commands + +### SELECT +Select statements work the same on Iceberg tables in Hive. 
You will see the Iceberg benefits over Hive in compilation and execution: +* **No file system listings** - especially important on blob stores, like S3 +* **No partition listing from** the Metastore +* **Advanced partition filtering** - the partition keys are not needed in the queries when they could be calculated +* Could handle **higher number of partitions** than normal Hive tables + +Here are the features highlights for Iceberg Hive read support: +1. **Predicate pushdown**: Pushdown of the Hive SQL `WHERE` clause has been implemented so that these filters are used at the Iceberg `TableScan` level as well as by the Parquet and ORC Readers. +2. **Column projection**: Columns from the Hive SQL `SELECT` clause are projected down to the Iceberg readers to reduce the number of columns read. +3. **Hive query engines**: + - With Hive 2.3.x, 3.1.x both the MapReduce and Tez query execution engines are supported. + - With Hive 4.0.0-alpha-1 Tez query execution engine is supported. + +Some of the advanced / little used optimizations are not yet implemented for Iceberg tables, so you should check your individual queries. +Also currently the statistics stored in the MetaStore are used for query planning. This is something we are planning to improve in the future. + +### INSERT INTO + +Hive supports the standard single-table INSERT INTO operation: + +```sql +INSERT INTO table_a +VALUES ('a', 1); +INSERT INTO table_a +SELECT...; +``` + +Multi-table insert is also supported, but it will not be atomic. Commits occur one table at a time. +Partial changes will be visible during the commit process and failures can leave partial changes committed. +Changes within a single table will remain atomic. 
+ +Here is an example of inserting into multiple tables at once in Hive SQL: + +```sql +FROM customers + INSERT INTO target1 SELECT customer_id, first_name + INSERT INTO target2 SELECT last_name, customer_id; +``` + +### INSERT OVERWRITE +INSERT OVERWRITE can replace data in the table with the result of a query. Overwrites are atomic operations for Iceberg tables. +For non-partitioned tables the content of the table is always removed. For partitioned tables the partitions +that have rows produced by the SELECT query will be replaced. +```sql +INSERT OVERWRITE TABLE target SELECT * FROM source; +``` + +### QUERYING METADATA TABLES +Hive supports querying the Iceberg metadata tables. These tables can be used like normal +Hive tables, so it is possible to use projections / joins / filters / etc. +To reference a metadata table, use the full name of the table, like: +... + +Currently the following metadata tables are available in Hive: +* files +* entries +* snapshots +* manifests +* partitions + +```sql +SELECT * FROM default.table_a.files; +``` + +### TIMETRAVEL +Hive supports time travel queries based on either a snapshot ID or a timestamp. +For these views it is possible to use projections / joins / filters / etc. +The function is available with the following syntax: +```sql +SELECT * FROM table_a FOR SYSTEM_TIME AS OF '2021-08-09 10:35:57'; +SELECT * FROM table_a FOR SYSTEM_VERSION AS OF 1234567; +``` + +You can expire snapshots of an Iceberg table using an ALTER TABLE query from Hive. You should periodically expire snapshots to delete data files that are no longer needed, and to reduce the size of table metadata. + +Each write to an Iceberg table from Hive creates a new snapshot, or version, of a table. Snapshots can be used for time-travel queries, or the table can be rolled back to any valid snapshot. Snapshots accumulate until they are expired by the expire_snapshots operation.
+For example, the following query expires snapshots older than the timestamp `2021-12-09 05:39:18.689000000`: +```sql +ALTER TABLE test_table EXECUTE expire_snapshots('2021-12-09 05:39:18.689000000'); +``` + +### Type compatibility + +Hive and Iceberg support different sets of types. Iceberg can perform type conversion automatically, but not for all +combinations, so you may want to understand how Iceberg converts types before designing the column types of +your tables. You can enable auto-conversion through Hadoop configuration (not enabled by default): + +| Config key | Default | Description | +| -----------------------------------------| --------------------------- | --------------------------------------------------- | +| iceberg.mr.schema.auto.conversion | false | if Hive should perform type auto-conversion | + +### Hive type to Iceberg type + +This type conversion table describes how Hive types are converted to Iceberg types. The conversion applies both when +creating an Iceberg table and when writing to an Iceberg table via Hive. + +| Hive | Iceberg | Notes | +|------------------|-------------------------|-------| +| boolean | boolean | | +| short | integer | auto-conversion | +| byte | integer | auto-conversion | +| integer | integer | | +| long | long | | +| float | float | | +| double | double | | +| date | date | | +| timestamp | timestamp without timezone | | +| timestamplocaltz | timestamp with timezone | Hive 3 only | +| interval_year_month | | not supported | +| interval_day_time | | not supported | +| char | string | auto-conversion | +| varchar | string | auto-conversion | +| string | string | | +| binary | binary | | +| decimal | decimal | | +| struct | struct | | +| list | list | | +| map | map | | +| union | | not supported | + +### Table rollback + +Roll back an Iceberg table's data to the state at an older table snapshot.
+ +Rollback to the last snapshot before a specific timestamp: + +```sql +ALTER TABLE ice_t EXECUTE ROLLBACK('2022-05-12 00:00:00'); +``` + +Rollback to a specific snapshot ID: +```sql +ALTER TABLE ice_t EXECUTE ROLLBACK(1111); +``` diff --git a/docs/java-api-quickstart.md b/docs/java-api-quickstart.md new file mode 100644 index 000000000000..f8c6b3e60942 --- /dev/null +++ b/docs/java-api-quickstart.md @@ -0,0 +1,325 @@ +--- +title: "Java Quickstart" +url: java-api-quickstart +aliases: + - "java/quickstart" +menu: + main: + parent: "API" + identifier: java_api_quickstart + weight: 100 +--- + + +# Java API Quickstart + +## Create a table + +Tables are created using either a [`Catalog`](../../../javadoc/{{% icebergVersion %}}/index.html?org/apache/iceberg/catalog/Catalog.html) or an implementation of the [`Tables`](../../../javadoc/{{% icebergVersion %}}/index.html?org/apache/iceberg/Tables.html) interface. + +### Using a Hive catalog + +The Hive catalog connects to a Hive metastore to keep track of Iceberg tables. +You can initialize a Hive catalog with a name and some properties. +(see: [Catalog properties](../configuration/#catalog-properties)) + +**Note:** Currently, `setConf` is always required for Hive catalogs, but this will change in the future. + +```java +import org.apache.iceberg.hive.HiveCatalog; + +HiveCatalog catalog = new HiveCatalog(); +catalog.setConf(spark.sparkContext().hadoopConfiguration()); // Configure using Spark's Hadoop configuration + +Map<String, String> properties = new HashMap<>(); +properties.put("warehouse", "..."); +properties.put("uri", "..."); + +catalog.initialize("hive", properties); +``` + +The `Catalog` interface defines methods for working with tables, like `createTable`, `loadTable`, `renameTable`, and `dropTable`. `HiveCatalog` implements the `Catalog` interface.
+ +To create a table, pass an `Identifier` and a `Schema` along with other initial metadata: + +```java +import org.apache.iceberg.Table; +import org.apache.iceberg.catalog.TableIdentifier; + +TableIdentifier name = TableIdentifier.of("logging", "logs"); +Table table = catalog.createTable(name, schema, spec); + +// or to load an existing table, use the following line +// Table table = catalog.loadTable(name); +``` + +The logs [schema](#create-a-schema) and [partition spec](#create-a-partition-spec) are created below. + + +### Using a Hadoop catalog + +A Hadoop catalog doesn't need to connect to a Hive MetaStore, but can only be used with HDFS or similar file systems that support atomic rename. Concurrent writes with a Hadoop catalog are not safe with a local FS or S3. To create a Hadoop catalog: + +```java +import org.apache.hadoop.conf.Configuration; +import org.apache.iceberg.hadoop.HadoopCatalog; + +Configuration conf = new Configuration(); +String warehousePath = "hdfs://host:8020/warehouse_path"; +HadoopCatalog catalog = new HadoopCatalog(conf, warehousePath); +``` + +Like the Hive catalog, `HadoopCatalog` implements `Catalog`, so it also has methods for working with tables, like `createTable`, `loadTable`, and `dropTable`. + +This example creates a table with Hadoop catalog: + +```java +import org.apache.iceberg.Table; +import org.apache.iceberg.catalog.TableIdentifier; + +TableIdentifier name = TableIdentifier.of("logging", "logs"); +Table table = catalog.createTable(name, schema, spec); + +// or to load an existing table, use the following line +// Table table = catalog.loadTable(name); +``` + +The logs [schema](#create-a-schema) and [partition spec](#create-a-partition-spec) are created below. + + +### Using Hadoop tables + +Iceberg also supports tables that are stored in a directory in HDFS. Concurrent writes with a Hadoop tables are not safe when stored in the local FS or S3. 
Directory tables don't support all catalog operations, like rename, so they use the `Tables` interface instead of `Catalog`. + +To create a table in HDFS, use `HadoopTables`: + +```java +import org.apache.hadoop.conf.Configuration; +import org.apache.iceberg.hadoop.HadoopTables; +import org.apache.iceberg.Table; + +Configuration conf = new Configuration(); +HadoopTables tables = new HadoopTables(conf); +Table table = tables.create(schema, spec, table_location); + +// or to load an existing table, use the following line +// Table table = tables.load(table_location); +``` + +{{< hint danger >}} +Hadoop tables shouldn't be used with file systems that do not support atomic rename. Iceberg relies on rename to synchronize concurrent commits for directory tables. +{{< /hint >}} + +### Tables in Spark + +Spark uses both `HiveCatalog` and `HadoopTables` to load tables. Hive is used when the identifier passed to `load` or `save` is not a path, otherwise Spark assumes it is a path-based table. + +To read and write to tables from Spark see: + +* [SQL queries in Spark](../spark-queries#querying-with-sql) +* [`INSERT INTO` in Spark](../spark-writes#insert-into) +* [`MERGE INTO` in Spark](../spark-writes#merge-into) + + +## Schemas + +### Create a schema + +This example creates a schema for a `logs` table: + +```java +import org.apache.iceberg.Schema; +import org.apache.iceberg.types.Types; + +Schema schema = new Schema( + Types.NestedField.required(1, "level", Types.StringType.get()), + Types.NestedField.required(2, "event_time", Types.TimestampType.withZone()), + Types.NestedField.required(3, "message", Types.StringType.get()), + Types.NestedField.optional(4, "call_stack", Types.ListType.ofRequired(5, Types.StringType.get())) + ); +``` + +When using the Iceberg API directly, type IDs are required. Conversions from other schema formats, like Spark, Avro, and Parquet will automatically assign new IDs. 
+ +When a table is created, all IDs in the schema are re-assigned to ensure uniqueness. + +### Convert a schema from Avro + +To create an Iceberg schema from an existing Avro schema, use converters in `AvroSchemaUtil`: + +```java +import org.apache.avro.Schema; +import org.apache.avro.Schema.Parser; +import org.apache.iceberg.avro.AvroSchemaUtil; + +Schema avroSchema = new Parser().parse("{\"type\": \"record\" , ... }"); +Schema icebergSchema = AvroSchemaUtil.toIceberg(avroSchema); +``` + +### Convert a schema from Spark + +To create an Iceberg schema from an existing table, use converters in `SparkSchemaUtil`: + +```java +import org.apache.iceberg.spark.SparkSchemaUtil; + +Schema schema = SparkSchemaUtil.schemaForTable(sparkSession, table_name); +``` + +## Partitioning + +### Create a partition spec + +Partition specs describe how Iceberg should group records into data files. Partition specs are created for a table's schema using a builder. + +This example creates a partition spec for the `logs` table that partitions records by the hour of the log event's timestamp and by log level: + +```java +import org.apache.iceberg.PartitionSpec; + +PartitionSpec spec = PartitionSpec.builderFor(schema) + .hour("event_time") + .identity("level") + .build(); +``` + +For more information on the different partition transforms that Iceberg offers, visit [this page](../../../spec#partitioning). + +## Branching and Tagging + +### Creating branches and tags + +New branches and tags can be created via the Java library's ManageSnapshots API. + +```java + +/* Create a branch test-branch which is retained for 1 week, and the latest 2 snapshots on test-branch will always be retained. +Snapshots on test-branch which are created within the last hour will also be retained. 
*/ + +String branch = "test-branch"; +table.manageSnapshots() + .createBranch(branch, 3) + .setMinSnapshotsToKeep(branch, 2) + .setMaxSnapshotAgeMs(branch, 3600000) + .setMaxRefAgeMs(branch, 604800000) + .commit(); + +// Create a tag historical-tag at snapshot 10 which is retained for a day +String tag = "historical-tag"; +table.manageSnapshots() + .createTag(tag, 10) + .setMaxRefAgeMs(tag, 86400000) + .commit(); +``` + +### Committing to branches + +Writing to a branch can be performed by specifying `toBranch` in the operation. For the full list of operations, refer to [UpdateOperations](../../java/api/#update-operations). +```java +// Append FILE_A to branch test-branch +String branch = "test-branch"; + +table.newAppend() + .appendFile(FILE_A) + .toBranch(branch) + .commit(); + + +// Perform row level updates on "test-branch" +table.newRowDelta() + .addRows(DATA_FILE) + .addDeletes(DELETES) + .toBranch(branch) + .commit(); + + +// Perform a rewrite operation replacing small_file_1 and small_file_2 on "test-branch" with compacted_file. +table.newRewrite() + .rewriteFiles(ImmutableSet.of(small_file_1, small_file_2), ImmutableSet.of(compacted_file)) + .toBranch(branch) + .commit(); + +``` + +### Reading from branches and tags +Reading from a branch or tag can be done as usual via the Table Scan API, by passing in a branch or tag in the `useRef` API. When a branch is passed in, the snapshot that's used is the head of the branch. Note that reading from a branch while also specifying an `asOfSnapshotId` in the scan is currently not supported. + +```java +// Read from the head snapshot of test-branch +TableScan branchRead = table.newScan().useRef("test-branch"); + +// Read from the snapshot referenced by audit-tag +TableScan tagRead = table.newScan().useRef("audit-tag"); +``` + +### Replacing and fast forwarding branches and tags + +The snapshots which existing branches and tags point to can be updated via the `replace` APIs. The fast forward operation is similar to Git fast-forwarding.
Fast forward can be used to advance a target branch to the head of a source branch or a tag when the target branch is an ancestor of the source. For both fast forward and replace, retention properties of the target branch are maintained by default. + +```java + +// Update "test-branch" to point to snapshot 4 +table.manageSnapshots() + .replaceBranch(branch, 4) + .commit(); + +String tag = "audit-tag"; +// Replace "audit-tag" to point to snapshot 3 and update its retention +table.manageSnapshots() + .replaceTag(tag, 3) + .setMaxRefAgeMs(tag, 1000) + .commit(); + + +``` + +### Updating retention properties + +Retention properties for branches and tags can be updated as well. +Use `setMaxRefAgeMs` to update the retention property of the branch or tag itself. Branch snapshot retention properties can be updated via the `setMinSnapshotsToKeep` and `setMaxSnapshotAgeMs` APIs. + +```java +String branch = "test-branch"; +// Update retention properties for test-branch +table.manageSnapshots() + .setMinSnapshotsToKeep(branch, 10) + .setMaxSnapshotAgeMs(branch, 7200000) + .setMaxRefAgeMs(branch, 604800000) + .commit(); + +// Update retention properties for test-tag +table.manageSnapshots() + .setMaxRefAgeMs("test-tag", 604800000) + .commit(); +``` + +### Removing branches and tags + +Branches and tags can be removed via the `removeBranch` and `removeTag` APIs, respectively. + +```java +// Remove test-branch +table.manageSnapshots() + .removeBranch("test-branch") + .commit(); + +// Remove test-tag +table.manageSnapshots() + .removeTag("test-tag") + .commit(); +``` diff --git a/docs/java-api.md b/docs/java-api.md new file mode 100644 index 000000000000..62b51e096fb2 --- /dev/null +++ b/docs/java-api.md @@ -0,0 +1,264 @@ +--- +title: "Java API" +url: api +aliases: + - "java/api" +menu: + main: + parent: "API" + identifier: java_api + weight: 200 +--- + + +# Iceberg Java API + +## Tables + +The main purpose of the Iceberg API is to manage table metadata, like schema, partition spec,
metadata, and data files that store table data. + +Table metadata and operations are accessed through the `Table` interface. This interface will return table information. + +### Table metadata + +The [`Table` interface](../../../javadoc/{{% icebergVersion %}}/index.html?org/apache/iceberg/Table.html) provides access to the table metadata: + +* `schema` returns the current table [schema](../schemas) +* `spec` returns the current table partition spec +* `properties` returns a map of key-value [properties](../configuration) +* `currentSnapshot` returns the current table snapshot +* `snapshots` returns all valid snapshots for the table +* `snapshot(id)` returns a specific snapshot by ID +* `location` returns the table's base location + +Tables also provide `refresh` to update the table to the latest version, and expose helpers: + +* `io` returns the `FileIO` used to read and write table files +* `locationProvider` returns a `LocationProvider` used to create paths for data and metadata files + + +### Scanning + +#### File level + +Iceberg table scans start by creating a `TableScan` object with `newScan`. + +```java +TableScan scan = table.newScan(); +``` + +To configure a scan, call `filter` and `select` on the `TableScan` to get a new `TableScan` with those changes. + +```java +TableScan filteredScan = scan.filter(Expressions.equal("id", 5)); +``` + +Calls to configuration methods create a new `TableScan` so that each `TableScan` is immutable and won't change unexpectedly if shared across threads. + +When a scan is configured, `planFiles`, `planTasks`, and `schema` are used to return files, tasks, and the read projection. + +```java +TableScan scan = table.newScan() + .filter(Expressions.equal("id", 5)) + .select("id", "data"); + +Schema projection = scan.schema(); +Iterable<CombinedScanTask> tasks = scan.planTasks(); +``` + +Use `asOfTime` or `useSnapshot` to configure the table snapshot for time travel queries.
+ +#### Row level + +Iceberg table scans start by creating a `ScanBuilder` object with `IcebergGenerics.read`. + +```java +ScanBuilder scanBuilder = IcebergGenerics.read(table); +``` + +To configure a scan, call `where` and `select` on the `ScanBuilder` to get a new `ScanBuilder` with those changes. + +```java +scanBuilder.where(Expressions.equal("id", 5)); +``` + +When a scan is configured, call `build` to execute the scan. `build` returns a `CloseableIterable<Record>`: + +```java +CloseableIterable<Record> result = IcebergGenerics.read(table) + .where(Expressions.lessThan("id", 5)) + .build(); +``` +where `Record` is the Iceberg record class from the iceberg-data module, `org.apache.iceberg.data.Record`. + +### Update operations + +`Table` also exposes operations that update the table. These operations use a builder pattern, [`PendingUpdate`](../../../javadoc/{{% icebergVersion %}}/index.html?org/apache/iceberg/PendingUpdate.html), that commits when `PendingUpdate#commit` is called. + +For example, updating the table schema is done by calling `updateSchema`, adding updates to the builder, and finally calling `commit` to commit the pending changes to the table: + +```java +table.updateSchema() + .addColumn("count", Types.LongType.get()) + .commit(); +``` + +Available operations to update a table are: + +* `updateSchema` -- update the table schema +* `updateProperties` -- update table properties +* `updateLocation` -- update the table's base location +* `newAppend` -- used to append data files +* `newFastAppend` -- used to append data files, will not compact metadata +* `newOverwrite` -- used to append data files and remove files that are overwritten +* `newDelete` -- used to delete data files +* `newRewrite` -- used to rewrite data files; will replace existing files with new versions +* `newTransaction` -- create a new table-level transaction +* `rewriteManifests` -- rewrite manifest data by clustering files, for faster scan planning +* `rollback` -- rollback the table state to a specific snapshot
+ +### Transactions + +Transactions are used to commit multiple table changes in a single atomic operation. A transaction is used to create individual operations using factory methods, like `newAppend`, just like working with a `Table`. Operations created by a transaction are committed as a group when `commitTransaction` is called. + +For example, deleting and appending a file in the same transaction: +```java +Transaction t = table.newTransaction(); + +// commit operations to the transaction +t.newDelete().deleteFromRowFilter(filter).commit(); +t.newAppend().appendFile(data).commit(); + +// commit all the changes to the table +t.commitTransaction(); +``` + +## Types + +Iceberg data types are located in the [`org.apache.iceberg.types` package](../../../javadoc/{{% icebergVersion %}}/index.html?org/apache/iceberg/types/package-summary.html). + +### Primitives + +Primitive type instances are available from static methods in each type class. Types without parameters use `get`, and types like `decimal` use factory methods: + +```java +Types.IntegerType.get() // int +Types.DoubleType.get() // double +Types.DecimalType.of(9, 2) // decimal(9, 2) +``` + +### Nested types + +Structs, maps, and lists are created using factory methods in type classes. + +Like struct fields, map keys or values and list elements are tracked as nested fields. Nested fields track [field IDs](../evolution#correctness) and nullability. + +Struct fields are created using `NestedField.optional` or `NestedField.required`. Map value and list element nullability is set in the map and list factory methods. 
+ +```java +// struct<1 id: int, 2 data: optional string> +StructType struct = StructType.of( + Types.NestedField.required(1, "id", Types.IntegerType.get()), + Types.NestedField.optional(2, "data", Types.StringType.get()) + ); +``` +```java +// map<1 key: int, 2 value: optional string> +MapType map = MapType.ofOptional( + 1, 2, + Types.IntegerType.get(), + Types.StringType.get() + ); +``` +```java +// array<1 element: int> +ListType list = ListType.ofRequired(1, Types.IntegerType.get()); +``` + + +## Expressions + +Iceberg's expressions are used to configure table scans. To create expressions, use the factory methods in [`Expressions`](../../../javadoc/{{% icebergVersion %}}/index.html?org/apache/iceberg/expressions/Expressions.html). + +Supported predicate expressions are: + +* `isNull` +* `notNull` +* `equal` +* `notEqual` +* `lessThan` +* `lessThanOrEqual` +* `greaterThan` +* `greaterThanOrEqual` +* `in` +* `notIn` +* `startsWith` +* `notStartsWith` + +Supported expression operations are: + +* `and` +* `or` +* `not` + +Constant expressions are: + +* `alwaysTrue` +* `alwaysFalse` + +### Expression binding + +When created, expressions are unbound. Before an expression is used, it will be bound to a data type to find the field ID the expression name represents, and to convert predicate literals. + +For example, before using the expression `lessThan("x", 10)`, Iceberg needs to determine which column `"x"` refers to and convert `10` to that column's data type. + +The expression could be bound to the type `struct<1 x: long, 2 y: long>` or to `struct<11 x: int, 12 y: int>`. Binding resolves `"x"` to field ID 1 or 11 and converts `10` to a long or an int, respectively.
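The binding step described above can be illustrated with a small, self-contained model. This is only a sketch: `BindingSketch`, `Field`, `BoundLessThan`, and `bindLessThan` are hypothetical names invented for illustration and are not part of the Iceberg API. The sketch assumes a struct can be modeled as a map from column name to field ID and primitive type name.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of expression binding. These classes are illustrative only
// and are NOT Iceberg's real classes.
final class BindingSketch {

  // A struct field: its field ID plus a primitive type name ("int" or "long").
  static final class Field {
    final int id;
    final String type;

    Field(int id, String type) {
      this.id = id;
      this.type = type;
    }
  }

  // A bound "less than" predicate: refers to a field ID and holds a converted literal.
  static final class BoundLessThan {
    final int fieldId;
    final Object literal;

    BoundLessThan(int fieldId, Object literal) {
      this.fieldId = fieldId;
      this.literal = literal;
    }
  }

  // Binds lessThan(name, value) to a struct: resolve the name to a field ID,
  // then convert the literal to the field's type.
  static BoundLessThan bindLessThan(Map<String, Field> struct, String name, int value) {
    Field field = struct.get(name);
    if (field == null) {
      throw new IllegalArgumentException("Cannot find field: " + name);
    }
    Object literal;
    if ("long".equals(field.type)) {
      literal = Long.valueOf(value);
    } else {
      literal = Integer.valueOf(value);
    }
    return new BoundLessThan(field.id, literal);
  }

  public static void main(String[] args) {
    // Models struct<1 x: long, 2 y: long>
    Map<String, Field> longStruct = new HashMap<>();
    longStruct.put("x", new Field(1, "long"));
    longStruct.put("y", new Field(2, "long"));

    // Models struct<11 x: int, 12 y: int>
    Map<String, Field> intStruct = new HashMap<>();
    intStruct.put("x", new Field(11, "int"));
    intStruct.put("y", new Field(12, "int"));

    BoundLessThan first = bindLessThan(longStruct, "x", 10);
    BoundLessThan second = bindLessThan(intStruct, "x", 10);
    System.out.println(first.fieldId + " " + first.literal.getClass().getSimpleName());
    System.out.println(second.fieldId + " " + second.literal.getClass().getSimpleName());
  }
}
```

The same unbound predicate ends up referring to a different field ID and a different literal type depending on the struct it is bound to, which is why binding has to happen before the expression is evaluated.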
+ +### Expression example + +```java +table.newScan() + .filter(Expressions.greaterThanOrEqual("x", 5)) + .filter(Expressions.lessThan("x", 10)); +``` + + +## Modules + +Iceberg table support is organized in library modules: + +* `iceberg-common` contains utility classes used in other modules +* `iceberg-api` contains the public Iceberg API, including expressions, types, tables, and operations +* `iceberg-arrow` is an implementation of the Iceberg type system for reading and writing data stored in Iceberg tables using Apache Arrow as the in-memory data format +* `iceberg-aws` contains implementations of the Iceberg API to be used with tables stored on AWS S3 and/or for tables defined using the AWS Glue data catalog +* `iceberg-core` contains implementations of the Iceberg API and support for Avro data files, **this is what processing engines should depend on** +* `iceberg-parquet` is an optional module for working with tables backed by Parquet files +* `iceberg-orc` is an optional module for working with tables backed by ORC files (*experimental*) +* `iceberg-hive-metastore` is an implementation of Iceberg tables backed by the Hive metastore Thrift client + +Iceberg also has modules for adding Iceberg support to processing engines and associated tooling: + +* `iceberg-spark` is an implementation of Spark's Datasource V2 API for Iceberg with submodules for each Spark version (use runtime jars for a shaded version) +* `iceberg-flink` is an implementation of Flink's Table and DataStream API for Iceberg (use iceberg-flink-runtime for a shaded version) +* `iceberg-hive3` is an implementation of Hive 3 specific SerDes for Timestamp, TimestampWithZone, and Date object inspectors (use iceberg-hive-runtime for a shaded version).
+* `iceberg-mr` is an implementation of MapReduce and Hive InputFormats and SerDes for Iceberg (use iceberg-hive-runtime for a shaded version for use with Hive) +* `iceberg-nessie` is a module used to integrate Iceberg table metadata history and operations with [Project Nessie](https://projectnessie.org/) +* `iceberg-data` is a client library used to read Iceberg tables from JVM applications +* `iceberg-pig` is an implementation of Pig's LoadFunc API for Iceberg +* `iceberg-runtime` generates a shaded runtime jar for Spark to integrate with iceberg tables + diff --git a/docs/java-custom-catalog.md b/docs/java-custom-catalog.md new file mode 100644 index 000000000000..29a4af1204f1 --- /dev/null +++ b/docs/java-custom-catalog.md @@ -0,0 +1,276 @@ +--- +title: "Java Custom Catalog" +url: custom-catalog +aliases: + - "java/custom-catalog" +menu: + main: + parent: "API" + identifier: java_custom_catalog + weight: 300 +--- + + +# Custom Catalog + +It's possible to read an iceberg table either from an hdfs path or from a hive table. It's also possible to use a custom metastore in place of hive. The steps to do that are as follows. 
+ +- [Custom TableOperations](#custom-table-operations-implementation) +- [Custom Catalog](#custom-catalog-implementation) +- [Custom FileIO](#custom-file-io-implementation) +- [Custom LocationProvider](#custom-location-provider-implementation) +- [Custom IcebergSource](#custom-icebergsource) + +### Custom table operations implementation +Extend `BaseMetastoreTableOperations` to provide implementation on how to read and write metadata + +Example: +```java +class CustomTableOperations extends BaseMetastoreTableOperations { + private String dbName; + private String tableName; + private Configuration conf; + private FileIO fileIO; + + protected CustomTableOperations(Configuration conf, String dbName, String tableName) { + this.conf = conf; + this.dbName = dbName; + this.tableName = tableName; + } + + // The doRefresh method should provide implementation on how to get the metadata location + @Override + public void doRefresh() { + + // Example custom service which returns the metadata location given a dbName and tableName + String metadataLocation = CustomService.getMetadataForTable(conf, dbName, tableName); + + // When updating from a metadata file location, call the helper method + refreshFromMetadataLocation(metadataLocation); + + } + + // The doCommit method should provide implementation on how to update with metadata location atomically + @Override + public void doCommit(TableMetadata base, TableMetadata metadata) { + String oldMetadataLocation = base.location(); + + // Write new metadata using helper method + String newMetadataLocation = writeNewMetadata(metadata, currentVersion() + 1); + + // Example custom service which updates the metadata location for the given db and table atomically + CustomService.updateMetadataLocation(dbName, tableName, oldMetadataLocation, newMetadataLocation); + + } + + // The io method provides a FileIO which is used to read and write the table metadata files + @Override + public FileIO io() { + if (fileIO == null) { + fileIO = new 
HadoopFileIO(conf); + } + return fileIO; + } +} +``` + +A `TableOperations` instance is usually obtained by calling `Catalog.newTableOps(TableIdentifier)`. +See the next section about implementing and loading a custom catalog. + +### Custom catalog implementation +Extend `BaseMetastoreCatalog` to provide default warehouse locations and instantiate `CustomTableOperations` + +Example: +```java +public class CustomCatalog extends BaseMetastoreCatalog { + + private Configuration configuration; + + // must have a no-arg constructor to be dynamically loaded + // initialize(String name, Map properties) will be called to complete initialization + public CustomCatalog() { + } + + public CustomCatalog(Configuration configuration) { + this.configuration = configuration; + } + + @Override + protected TableOperations newTableOps(TableIdentifier tableIdentifier) { + String dbName = tableIdentifier.namespace().level(0); + String tableName = tableIdentifier.name(); + // instantiate the CustomTableOperations + return new CustomTableOperations(configuration, dbName, tableName); + } + + @Override + protected String defaultWarehouseLocation(TableIdentifier tableIdentifier) { + + // Can choose to use any other configuration name + String tableLocation = configuration.get("custom.iceberg.warehouse.location"); + + // Can be an s3 or hdfs path + if (tableLocation == null) { + throw new RuntimeException("custom.iceberg.warehouse.location configuration not set!"); + } + + return String.format( + "%s/%s.db/%s", tableLocation, + tableIdentifier.namespace().levels()[0], + tableIdentifier.name()); + } + + @Override + public boolean dropTable(TableIdentifier identifier, boolean purge) { + // Example service to delete table + CustomService.deleteTable(identifier.namespace().level(0), identifier.name()); + } + + @Override + public void renameTable(TableIdentifier from, TableIdentifier to) { + Preconditions.checkArgument(from.namespace().level(0).equals(to.namespace().level(0)), + "Cannot move 
table between databases"); + // Example service to rename table + CustomService.renameTable(from.namespace().level(0), from.name(), to.name()); + } + + // implement this method to read catalog name and properties during initialization + public void initialize(String name, Map properties) { + } +} +``` + +Catalog implementations can be dynamically loaded in most compute engines. +For Spark and Flink, you can specify the `catalog-impl` catalog property to load it. +Read the [Configuration](../configuration/#catalog-properties) section for more details. +For MapReduce, implement `org.apache.iceberg.mr.CatalogLoader` and set Hadoop property `iceberg.mr.catalog.loader.class` to load it. +If your catalog must read Hadoop configuration to access certain environment properties, make your catalog implement `org.apache.hadoop.conf.Configurable`. + +### Custom file IO implementation + +Extend `FileIO` and provide implementation to read and write data files + +Example: +```java +public class CustomFileIO implements FileIO { + + // must have a no-arg constructor to be dynamically loaded + // initialize(Map properties) will be called to complete initialization + public CustomFileIO() { + } + + @Override + public InputFile newInputFile(String s) { + // you also need to implement the InputFile interface for a custom input file + return new CustomInputFile(s); + } + + @Override + public OutputFile newOutputFile(String s) { + // you also need to implement the OutputFile interface for a custom output file + return new CustomOutputFile(s); + } + + @Override + public void deleteFile(String path) { + Path toDelete = new Path(path); + FileSystem fs = Util.getFs(toDelete); + try { + fs.delete(toDelete, false /* not recursive */); + } catch (IOException e) { + throw new RuntimeIOException(e, "Failed to delete file: %s", path); + } + } + + // implement this method to read catalog properties during initialization + public void initialize(Map properties) { + } +} +``` + +If you are already 
implementing your own catalog, you can implement `TableOperations.io()` to use your custom `FileIO`. +In addition, custom `FileIO` implementations can also be dynamically loaded in `HadoopCatalog` and `HiveCatalog` by specifying the `io-impl` catalog property. +Read the [Configuration](../configuration/#catalog-properties) section for more details. +If your `FileIO` must read Hadoop configuration to access certain environment properties, make your `FileIO` implement `org.apache.hadoop.conf.Configurable`. + +### Custom location provider implementation + +Extend `LocationProvider` and provide implementation to determine the file path to write data + +Example: +```java +public class CustomLocationProvider implements LocationProvider { + + private String tableLocation; + + // must have a 2-arg constructor like this, or a no-arg constructor + public CustomLocationProvider(String tableLocation, Map properties) { + this.tableLocation = tableLocation; + } + + @Override + public String newDataLocation(String filename) { + // can use any custom method to generate a file path given a file name + return String.format("%s/%s/%s", tableLocation, UUID.randomUUID().toString(), filename); + } + + @Override + public String newDataLocation(PartitionSpec spec, StructLike partitionData, String filename) { + // can use any custom method to generate a file path given a partition info and file name + return newDataLocation(filename); + } +} +``` + +If you are already implementing your own catalog, you can override `TableOperations.locationProvider()` to use your custom default `LocationProvider`. 
+To use a different custom location provider for a specific table, specify the implementation when creating the table using table property `write.location-provider.impl` + +Example: +```sql +CREATE TABLE hive.default.my_table ( + id bigint, + data string, + category string) +USING iceberg +OPTIONS ( + 'write.location-provider.impl'='com.my.CustomLocationProvider' +) +PARTITIONED BY (category); +``` + +### Custom IcebergSource +Extend `IcebergSource` and provide implementation to read from `CustomCatalog` + +Example: +```java +public class CustomIcebergSource extends IcebergSource { + + @Override + protected Table findTable(DataSourceOptions options, Configuration conf) { + Optional path = options.get("path"); + Preconditions.checkArgument(path.isPresent(), "Cannot open table: path is not set"); + + // Read table from CustomCatalog + CustomCatalog catalog = new CustomCatalog(conf); + TableIdentifier tableIdentifier = TableIdentifier.parse(path.get()); + return catalog.loadTable(tableIdentifier); + } +} +``` + +Register the `CustomIcebergSource` by updating `META-INF/services/org.apache.spark.sql.sources.DataSourceRegister` with its fully qualified name diff --git a/docs/jdbc.md b/docs/jdbc.md new file mode 100644 index 000000000000..40ee049759be --- /dev/null +++ b/docs/jdbc.md @@ -0,0 +1,76 @@ +--- +title: "JDBC" +url: jdbc +menu: + main: + parent: Integrations + identifier: jdbc_integration + weight: 0 +--- + + +# Iceberg JDBC Integration + +## JDBC Catalog + +Iceberg supports using a table in a relational database to manage Iceberg tables through JDBC. +The database that JDBC connects to must support atomic transaction to allow the JDBC catalog implementation to +properly support atomic Iceberg table commits and read serializable isolation. 
+ +### Configurations + +Because each database and database service provider might require different configurations, +the JDBC catalog allows arbitrary configurations through: + +| Property | Default | Description | +| -------------------- | --------------------------------- | ------------------------------------------------------ | +| uri | | the JDBC connection string | +| jdbc. | | any key value pairs to configure the JDBC connection | + +### Examples + + +#### Spark + +You can start a Spark session with a MySQL JDBC connection using the following configurations: + +```shell +spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{% icebergVersion %}} \ + --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \ + --conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my/key/prefix \ + --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.jdbc.JdbcCatalog \ + --conf spark.sql.catalog.my_catalog.uri=jdbc:mysql://test.1234567890.us-west-2.rds.amazonaws.com:3306/default \ + --conf spark.sql.catalog.my_catalog.jdbc.verifyServerCertificate=true \ + --conf spark.sql.catalog.my_catalog.jdbc.useSSL=true \ + --conf spark.sql.catalog.my_catalog.jdbc.user=admin \ + --conf spark.sql.catalog.my_catalog.jdbc.password=pass +``` + +#### Java API + +```java +Class.forName("com.mysql.cj.jdbc.Driver"); // ensure JDBC driver is at runtime classpath +Map properties = new HashMap<>(); +properties.put(CatalogProperties.CATALOG_IMPL, JdbcCatalog.class.getName()); +properties.put(CatalogProperties.URI, "jdbc:mysql://localhost:3306/test"); +properties.put(JdbcCatalog.PROPERTY_PREFIX + "user", "admin"); +properties.put(JdbcCatalog.PROPERTY_PREFIX + "password", "pass"); +properties.put(CatalogProperties.WAREHOUSE_LOCATION, "s3://warehouse/path"); +Configuration hadoopConf = new Configuration(); // configs if you use HadoopFileIO +JdbcCatalog catalog = CatalogUtil.buildIcebergCatalog("test_jdbc_catalog", properties, hadoopConf); +``` \ 
No newline at end of file diff --git a/docs/maintenance.md b/docs/maintenance.md new file mode 100644 index 000000000000..3e5e2a080bf3 --- /dev/null +++ b/docs/maintenance.md @@ -0,0 +1,165 @@ +--- +title: Maintenance +url: maintenance +aliases: + - "tables/maintenance" +menu: + main: + parent: Tables + identifier: tables_maintenance + weight: 0 +--- + + +# Maintenance + +{{< hint info >}} +Maintenance operations require the `Table` instance. Please refer [Java API quickstart](../java-api-quickstart/#create-a-table) page to refer how to load an existing table. +{{< /hint >}} +## Recommended Maintenance + +### Expire Snapshots + +Each write to an Iceberg table creates a new _snapshot_, or version, of a table. Snapshots can be used for time-travel queries, or the table can be rolled back to any valid snapshot. + +Snapshots accumulate until they are expired by the [`expireSnapshots`](../../../javadoc/{{% icebergVersion %}}/org/apache/iceberg/Table.html#expireSnapshots--) operation. Regularly expiring snapshots is recommended to delete data files that are no longer needed, and to keep the size of table metadata small. + +This example expires snapshots that are older than 1 day: + +```java +Table table = ... +long tsToExpire = System.currentTimeMillis() - (1000 * 60 * 60 * 24); // 1 day +table.expireSnapshots() + .expireOlderThan(tsToExpire) + .commit(); +``` + +See the [`ExpireSnapshots` Javadoc](../../../javadoc/{{% icebergVersion %}}/org/apache/iceberg/ExpireSnapshots.html) to see more configuration options. + +There is also a Spark action that can run table expiration in parallel for large tables: + +```java +Table table = ... +SparkActions + .get() + .expireSnapshots(table) + .expireOlderThan(tsToExpire) + .execute(); +``` + +Expiring old snapshots removes them from metadata, so they are no longer available for time travel queries. 
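The cutoff passed to `expireOlderThan` is an epoch-millisecond timestamp. As a minimal sketch, the same value as `System.currentTimeMillis() - (1000 * 60 * 60 * 24)` could be derived with `java.time`, which avoids hand-written millisecond arithmetic. `ExpiryCutoffSketch` and `cutoffMillis` are hypothetical helper names, not part of Iceberg.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for computing a snapshot expiration cutoff; not Iceberg code.
final class ExpiryCutoffSketch {

  // Returns the epoch-millisecond timestamp that lies maxAge before now.
  // Snapshots older than this value would be candidates for expiration.
  static long cutoffMillis(Instant now, Duration maxAge) {
    return now.minus(maxAge).toEpochMilli();
  }

  public static void main(String[] args) {
    // One-day retention, matching the example above
    long tsToExpire = cutoffMillis(Instant.now(), Duration.ofDays(1));
    System.out.println(tsToExpire);
  }
}
```

The resulting `tsToExpire` can be used wherever the examples above pass a cutoff timestamp.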
+ +{{< hint info >}} +Data files are not deleted until they are no longer referenced by a snapshot that may be used for time travel or rollback. +Regularly expiring snapshots deletes unused data files. +{{< /hint >}} + +### Remove old metadata files + +Iceberg keeps track of table metadata using JSON files. Each change to a table produces a new metadata file to provide atomicity. + +Old metadata files are kept for history by default. Tables with frequent commits, like those written by streaming jobs, may need to regularly clean metadata files. + +To automatically clean metadata files, set `write.metadata.delete-after-commit.enabled=true` in table properties. This will keep some metadata files (up to `write.metadata.previous-versions-max`) and will delete the oldest metadata file after each new one is created. + +| Property | Description | +| -------------------------------------------- |--------------------------------------------------------------------------| +| `write.metadata.delete-after-commit.enabled` | Whether to delete old **tracked** metadata files after each table commit | +| `write.metadata.previous-versions-max` | The number of old metadata files to keep | + +Note that this will only delete metadata files that are **tracked** in the metadata log and will not delete orphaned metadata files. +Example: With `write.metadata.delete-after-commit.enabled=false` and `write.metadata.previous-versions-max=10`, one will have 10 tracked metadata files and 90 orphaned metadata files after 100 commits. +Configuring `write.metadata.delete-after-commit.enabled=true` and `write.metadata.previous-versions-max=20` will not automatically delete metadata files. Tracked metadata files would be deleted again when reaching `write.metadata.previous-versions-max=20`. + +See [table write properties](../configuration/#write-properties) for more details. 
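The interaction of the two properties can be sketched with a small simulation. This is a simplified model written for illustration, not Iceberg code: `MetadataRetentionSketch` and `simulate` are hypothetical names, and the model assumes that the metadata log holds at most `previous-versions-max` entries and that every entry pushed out of the log is either deleted (when `delete-after-commit.enabled=true`) or left behind as an orphan. It does not distinguish the current metadata file from the previous versions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified model of metadata file retention; names are illustrative only.
final class MetadataRetentionSketch {

  // Simulates a series of commits and returns {tracked, orphaned, deleted} file counts.
  static int[] simulate(int commits, int previousVersionsMax, boolean deleteAfterCommit) {
    Deque<Integer> tracked = new ArrayDeque<>();
    int orphaned = 0;
    int deleted = 0;
    for (int version = 1; version <= commits; version++) {
      // Each commit writes one new metadata file and tracks it in the log
      tracked.addLast(version);
      // Entries beyond the configured maximum fall out of the log
      while (tracked.size() > previousVersionsMax) {
        tracked.removeFirst();
        if (deleteAfterCommit) {
          deleted++;
        } else {
          orphaned++;
        }
      }
    }
    return new int[] {tracked.size(), orphaned, deleted};
  }

  public static void main(String[] args) {
    int[] kept = simulate(100, 10, false);
    System.out.println("tracked=" + kept[0] + " orphaned=" + kept[1] + " deleted=" + kept[2]);

    int[] cleaned = simulate(100, 10, true);
    System.out.println("tracked=" + cleaned[0] + " orphaned=" + cleaned[1] + " deleted=" + cleaned[2]);
  }
}
```

Under these assumptions, 100 commits with a maximum of 10 leave 10 tracked files in both cases; the other 90 files stay on disk as orphans when deletion is disabled and are removed when it is enabled.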
+ +### Delete orphan files + +In Spark and other distributed processing engines, task or job failures can leave files that are not referenced by table metadata, and in some cases normal snapshot expiration may not be able to determine a file is no longer needed and delete it. + +To clean up these "orphan" files under a table location, use the `deleteOrphanFiles` action. + +```java +Table table = ... +SparkActions + .get() + .deleteOrphanFiles(table) + .execute(); +``` + +See the [DeleteOrphanFiles Javadoc](../../../javadoc/{{% icebergVersion %}}/org/apache/iceberg/actions/DeleteOrphanFiles.html) to see more configuration options. + +This action may take a long time to finish if you have lots of files in data and metadata directories. It is recommended to execute this periodically, but you may not need to execute this often. + +{{< hint info >}} +It is dangerous to remove orphan files with a retention interval shorter than the time expected for any write to complete because it +might corrupt the table if in-progress files are considered orphaned and are deleted. The default interval is 3 days. +{{< /hint >}} + +{{< hint info >}} +Iceberg uses the string representations of paths when determining which files need to be removed. On some file systems, +the path can change over time, but it still represents the same file. For example, if you change authorities for an HDFS cluster, +none of the old path urls used during creation will match those that appear in a current listing. *This will lead to data loss when +RemoveOrphanFiles is run*. Please be sure the entries in your MetadataTables match those listed by the Hadoop +FileSystem API to avoid unintentional deletion. +{{< /hint >}} + +## Optional Maintenance + +Some tables require additional maintenance. For example, streaming queries may produce small data files that should be [compacted into larger files](#compact-data-files). 
And some tables can benefit from [rewriting manifest files](#rewrite-manifests) to make locating data for queries much faster. + +### Compact data files + +Iceberg tracks each data file in a table. More data files lead to more metadata stored in manifest files, and small data files cause an unnecessary amount of metadata and make queries less efficient because of file open costs. + +Iceberg can compact data files in parallel using Spark with the `rewriteDataFiles` action. This will combine small files into larger files to reduce metadata overhead and runtime file open cost. + +```java +Table table = ... +SparkActions + .get() + .rewriteDataFiles(table) + .filter(Expressions.equal("date", "2020-08-18")) + .option("target-file-size-bytes", Long.toString(500 * 1024 * 1024)) // 500 MB + .execute(); +``` + +The `files` metadata table is useful for inspecting data file sizes and determining when to compact partitions. + +See the [`RewriteDataFiles` Javadoc](../../../javadoc/{{% icebergVersion %}}/org/apache/iceberg/actions/RewriteDataFiles.html) for more configuration options. + +### Rewrite manifests + +Iceberg uses metadata in its manifest list and manifest files to speed up query planning and to prune unnecessary data files. The metadata tree functions as an index over a table's data. + +Manifests in the metadata tree are automatically compacted in the order they are added, which makes queries faster when the write pattern aligns with read filters. For example, writing hourly-partitioned data as it arrives is aligned with time range query filters. + +When a table's write pattern doesn't align with the query pattern, metadata can be rewritten to re-group data files into manifests using `rewriteManifests` or the `rewriteManifests` action (for parallel rewrites using Spark). + +This example rewrites small manifests and groups data files by the first partition field. + +```java +Table table = ...
+SparkActions + .get() + .rewriteManifests(table) + .rewriteIf(file -> file.length() < 10 * 1024 * 1024) // 10 MB + .execute(); +``` + +See the [`RewriteManifests` Javadoc](../../../javadoc/{{% icebergVersion %}}/org/apache/iceberg/actions/RewriteManifests.html) to see more configuration options. diff --git a/docs/metrics-reporting.md b/docs/metrics-reporting.md new file mode 100644 index 000000000000..3805b4894c6c --- /dev/null +++ b/docs/metrics-reporting.md @@ -0,0 +1,174 @@ +--- +title: "Metrics Reporting" +url: metrics-reporting +aliases: + - "tables/metrics-reporting" +menu: + main: + parent: Tables + identifier: metrics_reporting + weight: 0 +--- + + +# Metrics Reporting + +As of 1.1.0 Iceberg supports the [`MetricsReporter`](../../../javadoc/{{% icebergVersion %}}/org/apache/iceberg/metrics/MetricsReporter.html) and the [`MetricsReport`](../../../javadoc/{{% icebergVersion %}}/org/apache/iceberg/metrics/MetricsReport.html) APIs. These two APIs allow expressing different metrics reports while supporting a pluggable way of reporting these reports. + +## Type of Reports + +### ScanReport +A [`ScanReport`](../../../javadoc/{{% icebergVersion %}}/org/apache/iceberg/metrics/ScanReport.html) carries metrics being collected during scan planning against a given table. Amongst some general information about the involved table, such as the snapshot id or the table name, it includes metrics like: +* total scan planning duration +* number of data/delete files included in the result +* number of data/delete manifests scanned/skipped +* number of data/delete files scanned/skipped +* number of equality/positional delete files scanned + + +### CommitReport +A [`CommitReport`](../../../javadoc/{{% icebergVersion %}}/org/apache/iceberg/metrics/CommitReport.html) carries metrics being collected after committing changes to a table (aka producing a snapshot). 
Amongst some general information about the involved table, such as the snapshot id or the table name, it includes metrics like: +* total duration +* number of attempts required for the commit to succeed +* number of added/removed data/delete files +* number of added/removed equality/positional delete files +* number of added/removed equality/positional deletes + + +## Available Metrics Reporters + +### [`LoggingMetricsReporter`](../../../javadoc/{{% icebergVersion %}}/org/apache/iceberg/metrics/LoggingMetricsReporter.html) + +This is the default metrics reporter when nothing else is configured and its purpose is to log results to the log file. Example output would look as shown below: + +``` +INFO org.apache.iceberg.metrics.LoggingMetricsReporter - Received metrics report: +ScanReport{ + tableName=scan-planning-with-eq-and-pos-delete-files, + snapshotId=2, + filter=ref(name="data") == "(hash-27fa7cc0)", + schemaId=0, + projectedFieldIds=[1, 2], + projectedFieldNames=[id, data], + scanMetrics=ScanMetricsResult{ + totalPlanningDuration=TimerResult{timeUnit=NANOSECONDS, totalDuration=PT0.026569404S, count=1}, + resultDataFiles=CounterResult{unit=COUNT, value=1}, + resultDeleteFiles=CounterResult{unit=COUNT, value=2}, + totalDataManifests=CounterResult{unit=COUNT, value=1}, + totalDeleteManifests=CounterResult{unit=COUNT, value=1}, + scannedDataManifests=CounterResult{unit=COUNT, value=1}, + skippedDataManifests=CounterResult{unit=COUNT, value=0}, + totalFileSizeInBytes=CounterResult{unit=BYTES, value=10}, + totalDeleteFileSizeInBytes=CounterResult{unit=BYTES, value=20}, + skippedDataFiles=CounterResult{unit=COUNT, value=0}, + skippedDeleteFiles=CounterResult{unit=COUNT, value=0}, + scannedDeleteManifests=CounterResult{unit=COUNT, value=1}, + skippedDeleteManifests=CounterResult{unit=COUNT, value=0}, + indexedDeleteFiles=CounterResult{unit=COUNT, value=2}, + equalityDeleteFiles=CounterResult{unit=COUNT, value=1}, + positionalDeleteFiles=CounterResult{unit=COUNT, 
value=1}}, + metadata={ + iceberg-version=Apache Iceberg 1.4.0-SNAPSHOT (commit 4868d2823004c8c256a50ea7c25cff94314cc135)}} +``` + +``` +INFO org.apache.iceberg.metrics.LoggingMetricsReporter - Received metrics report: +CommitReport{ + tableName=scan-planning-with-eq-and-pos-delete-files, + snapshotId=1, + sequenceNumber=1, + operation=append, + commitMetrics=CommitMetricsResult{ + totalDuration=TimerResult{timeUnit=NANOSECONDS, totalDuration=PT0.098429626S, count=1}, + attempts=CounterResult{unit=COUNT, value=1}, + addedDataFiles=CounterResult{unit=COUNT, value=1}, + removedDataFiles=null, + totalDataFiles=CounterResult{unit=COUNT, value=1}, + addedDeleteFiles=null, + addedEqualityDeleteFiles=null, + addedPositionalDeleteFiles=null, + removedDeleteFiles=null, + removedEqualityDeleteFiles=null, + removedPositionalDeleteFiles=null, + totalDeleteFiles=CounterResult{unit=COUNT, value=0}, + addedRecords=CounterResult{unit=COUNT, value=1}, + removedRecords=null, + totalRecords=CounterResult{unit=COUNT, value=1}, + addedFilesSizeInBytes=CounterResult{unit=BYTES, value=10}, + removedFilesSizeInBytes=null, + totalFilesSizeInBytes=CounterResult{unit=BYTES, value=10}, + addedPositionalDeletes=null, + removedPositionalDeletes=null, + totalPositionalDeletes=CounterResult{unit=COUNT, value=0}, + addedEqualityDeletes=null, + removedEqualityDeletes=null, + totalEqualityDeletes=CounterResult{unit=COUNT, value=0}}, + metadata={ + iceberg-version=Apache Iceberg 1.4.0-SNAPSHOT (commit 4868d2823004c8c256a50ea7c25cff94314cc135)}} +``` + + +### [`RESTMetricsReporter`](../../../javadoc/{{% icebergVersion %}}/org/apache/iceberg/rest/RESTMetricsReporter.html) + +This is the default when using the [`RESTCatalog`](../../../javadoc/{{% icebergVersion %}}/org/apache/iceberg/rest/RESTCatalog.html) and its purpose is to send metrics to a REST server at the `/v1/{prefix}/namespaces/{namespace}/tables/{table}/metrics` endpoint as defined in the [REST OpenAPI 
spec](https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml). + +Sending metrics via REST can be controlled with the `rest-metrics-reporting-enabled` property (defaults to `true`). + + +## Implementing a custom Metrics Reporter + +Implementing the [`MetricsReporter`](../../../javadoc/{{% icebergVersion %}}/org/apache/iceberg/metrics/MetricsReporter.html) API gives full flexibility in dealing with incoming [`MetricsReport`](../../../javadoc/{{% icebergVersion %}}/org/apache/iceberg/metrics/MetricsReport.html) instances. For example, it would be possible to send results to a Prometheus endpoint or any other observability framework/system. + +Below is a short example illustrating an `InMemoryMetricsReporter` that stores reports in a list and makes them available: +```java +public class InMemoryMetricsReporter implements MetricsReporter { + + private final List<MetricsReport> metricsReports = Lists.newArrayList(); + + @Override + public void report(MetricsReport report) { + metricsReports.add(report); + } + + public List<MetricsReport> reports() { + return metricsReports; + } +} +``` + +## Registering a custom Metrics Reporter + +### Via Catalog Configuration + +The [catalog property](../configuration#catalog-properties) `metrics-reporter-impl` allows registering a given [`MetricsReporter`](../../../javadoc/{{% icebergVersion %}}/org/apache/iceberg/metrics/MetricsReporter.html) by specifying its fully-qualified class name, e.g. `metrics-reporter-impl=org.apache.iceberg.metrics.InMemoryMetricsReporter`. 
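Because the property carries only a class name, the reporter has to be something that can be instantiated from that name alone. The sketch below illustrates the general idea with plain Java reflection. It is not Iceberg's loading code, which this document does not show and which may differ: the `Reporter` interface, the `InMemoryReporter` class, and the `load` helper are all hypothetical stand-ins, and the need for a public no-arg constructor is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these types are Iceberg classes.
final class ReporterLoadingSketch {

  // Stand-in for a reporter interface with a single callback.
  interface Reporter {
    void report(String report);
  }

  // A reporter that keeps reports in memory. The public no-arg constructor
  // is what lets it be created from nothing but its class name.
  public static final class InMemoryReporter implements Reporter {
    private final List<String> reports = new ArrayList<>();

    public InMemoryReporter() {}

    @Override
    public void report(String report) {
      reports.add(report);
    }

    public List<String> reports() {
      return reports;
    }
  }

  // Looks up the class name stored under "metrics-reporter-impl" and
  // instantiates it reflectively through its no-arg constructor.
  static Reporter load(Map<String, String> catalogProperties) {
    String impl = catalogProperties.get("metrics-reporter-impl");
    if (impl == null) {
      throw new IllegalArgumentException("metrics-reporter-impl is not set");
    }
    try {
      Object instance = Class.forName(impl).getDeclaredConstructor().newInstance();
      return (Reporter) instance;
    } catch (ReflectiveOperationException | ClassCastException e) {
      throw new IllegalStateException("Cannot load reporter: " + impl, e);
    }
  }

  public static void main(String[] args) {
    Reporter reporter =
        load(Map.of("metrics-reporter-impl", InMemoryReporter.class.getName()));
    reporter.report("example report");
    System.out.println(((InMemoryReporter) reporter).reports().size());
  }
}
```

In this sketch a misspelled class name, or a class that does not implement the interface, surfaces as an `IllegalStateException` when the property is resolved rather than later when the first report arrives.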
+ +### Via the Java API during Scan planning + +Independently of the [`MetricsReporter`](../../../javadoc/{{% icebergVersion %}}/org/apache/iceberg/metrics/MetricsReporter.html) being registered at the catalog level via the `metrics-reporter-impl` property, it is also possible to supply additional reporters during scan planning as shown below: + +```java +TableScan tableScan = + table + .newScan() + .metricsReporter(customReporterOne) + .metricsReporter(customReporterTwo); + +try (CloseableIterable fileScanTasks = tableScan.planFiles()) { + // ... +} +``` \ No newline at end of file diff --git a/docs/nessie.md b/docs/nessie.md new file mode 100644 index 000000000000..47b91c89198b --- /dev/null +++ b/docs/nessie.md @@ -0,0 +1,168 @@ +--- +title: "Nessie" +url: nessie +menu: + main: + parent: Integrations + identifier: nessie_integration + weight: 0 +--- + + +# Iceberg Nessie Integration + +Iceberg provides integration with Nessie through the `iceberg-nessie` module. +This section describes how to use Iceberg with Nessie. Nessie provides several key features on top of Iceberg: + +* multi-table transactions +* git-like operations (eg branches, tags, commits) +* hive-like metastore capabilities + +See [Project Nessie](https://projectnessie.org) for more information on Nessie. Nessie requires a server to run, see +[Getting Started](https://projectnessie.org/try/) to start a Nessie server. + +## Enabling Nessie Catalog + +The `iceberg-nessie` module is bundled with Spark and Flink runtimes for all versions from `0.11.0`. To get started +with Nessie (with spark-3.3) and Iceberg simply add the Iceberg runtime to your process. Eg: `spark-sql --packages +org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:{{% icebergVersion %}}`. + +## Spark SQL Extensions + +Nessie SQL extensions can be used to manage the Nessie repo as shown below. 
+Example for Spark 3.3 with Scala 2.12: + +``` +bin/spark-sql + --packages "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:{{% icebergVersion %}},org.projectnessie.nessie-integrations:nessie-spark-extensions-3.3_2.12:{{% nessieVersion %}}" + --conf spark.sql.extensions="org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions" + --conf +``` +Please refer to the [Nessie SQL extension document](https://projectnessie.org/tools/sql/) to learn more. + +## Nessie Catalog + +One major feature introduced in release `0.11.0` is the ability to easily interact with a [Custom +Catalog](../custom-catalog) from Spark and Flink. See [Spark Configuration](../spark-configuration#catalog-configuration) + and [Flink Configuration](../flink#custom-catalog) for instructions for adding a custom catalog to Iceberg. + +To use the Nessie Catalog the following properties are used: + +* `warehouse`. Like most other catalogs the warehouse property is a file path to where this catalog should store tables. +* `uri`. This is the Nessie server base URI, e.g. `http://localhost:19120/api/v1`. +* `ref` (optional). This is the Nessie branch or tag you want to work in. 
+ +To run directly in Java this looks like: + +``` java +Map<String, String> options = new HashMap<>(); +options.put("warehouse", "/path/to/warehouse"); +options.put("ref", "main"); +options.put("uri", "http://localhost:19120/api/v1"); +Catalog nessieCatalog = CatalogUtil.loadCatalog("org.apache.iceberg.nessie.NessieCatalog", "nessie", options, hadoopConfig); +``` + +and in Spark: + +``` java +conf.set("spark.sql.catalog.nessie.warehouse", "/path/to/warehouse"); +conf.set("spark.sql.catalog.nessie.uri", "http://localhost:19120/api/v1"); +conf.set("spark.sql.catalog.nessie.ref", "main"); +conf.set("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog"); +conf.set("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog"); +conf.set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions"); +``` +This is how it looks in Flink via the Python API (additional details can be found [here](../flink#preparation-when-using-flinks-python-api)): +```python +import os +from pyflink.datastream import StreamExecutionEnvironment +from pyflink.table import StreamTableEnvironment + +env = StreamExecutionEnvironment.get_execution_environment() +iceberg_flink_runtime_jar = os.path.join(os.getcwd(), "iceberg-flink-runtime-{{% icebergVersion %}}.jar") +env.add_jars("file://{}".format(iceberg_flink_runtime_jar)) +table_env = StreamTableEnvironment.create(env) + +table_env.execute_sql("CREATE CATALOG nessie_catalog WITH (" + "'type'='iceberg', " + "'catalog-impl'='org.apache.iceberg.nessie.NessieCatalog', " + "'uri'='http://localhost:19120/api/v1', " + "'ref'='main', " + "'warehouse'='/path/to/warehouse')") +``` + +There is nothing special above about the `nessie` name. A Spark catalog can have any name; the important parts are the +settings for the `catalog-impl` and the required config to start Nessie correctly. 
+Once you have a Nessie catalog you have access to your entire Nessie repo. You can then perform create/delete/merge +operations on branches and perform commits on branches. Each Iceberg table in a Nessie Catalog is identified by an +arbitrary length namespace and table name (eg `data.base.name.table`). These namespaces must be explicitly created +as mentioned [here](https://projectnessie.org/blog/namespace-enforcement/). +Any transaction on a Nessie enabled Iceberg table is a single commit in Nessie. Nessie commits +can encompass an arbitrary number of actions on an arbitrary number of tables, however in Iceberg this will be limited +to the set of single table transactions currently available. + +Further operations such as merges, viewing the commit log or diffs are performed by direct interaction with the +`NessieClient` in java or by using the python client or cli. See [Nessie CLI](https://projectnessie.org/tools/cli/) for +more details on the CLI and [Spark Guide](https://projectnessie.org/tools/iceberg/spark/) for a more complete description of +Nessie functionality. + +## Nessie and Iceberg + +For most cases Nessie acts just like any other Catalog for Iceberg: providing a logical organization of a set of tables +and providing atomicity to transactions. However, using Nessie opens up other interesting possibilities. When using Nessie with +Iceberg every Iceberg transaction becomes a Nessie commit. This history can be listed, merged or cherry-picked across branches. + +### Loosely coupled transactions + +By creating a branch and performing a set of operations on that branch you can approximate a multi-table transaction. +A sequence of commits can be performed on the newly created branch and then merged back into the main branch atomically. +This gives the appearance of a series of connected changes being exposed to the main branch simultaneously. 
While downstream +consumers will see multiple transactions appear at once this isn't a true multi-table transaction on the database. It is +effectively a fast-forward merge of multiple commits (in git language) and each operation from the branch is its own distinct +transaction and commit. This is different from a real multi-table transaction where all changes would be in the same commit. +This does allow multiple applications to take part in modifying a branch and for this distributed set of transactions to be +exposed to the downstream users simultaneously. + + +### Experimentation + +Changes to a table can be tested in a branch before merging back into main. This is particularly useful when performing +large changes like schema evolution or partition evolution. A partition evolution could be performed in a branch and you +would be able to test out the change (eg performance benchmarks) before merging it. This provides great flexibility in +performing on-line table modifications and testing without interrupting downstream use cases. If the changes are +incorrect or not performant the branch can be dropped without being merged. + +### Further use cases + +Please see the [Nessie Documentation](https://projectnessie.org/features/) for further descriptions of +Nessie features. + +{{< hint danger >}} +Regular table maintenance in Iceberg is complicated when using nessie. Please consult +[Management Services](https://projectnessie.org/features/management/) before performing any +[table maintenance](../maintenance). +{{< /hint >}} + +## Example + +Please have a look at the [Nessie Demos repo](https://github.com/projectnessie/nessie-demos) +for different examples of Nessie and Iceberg in action together. + +## Future Improvements + +* Iceberg multi-table transactions. 
Changes to multiple Iceberg tables in the same transaction, isolation levels etc diff --git a/docs/partitioning.md b/docs/partitioning.md new file mode 100644 index 000000000000..0fddde1ceb6f --- /dev/null +++ b/docs/partitioning.md @@ -0,0 +1,103 @@ +--- +title: Partitioning +url: partitioning +aliases: + - "tables/partitioning" +menu: + main: + parent: Tables + identifier: tables_partitioning + weight: 0 +--- + + +# Partitioning + +## What is partitioning? + +Partitioning is a way to make queries faster by grouping similar rows together when writing. + +For example, queries for log entries from a `logs` table would usually include a time range, like this query for logs between 10 AM and 12 PM: + +```sql +SELECT level, message FROM logs +WHERE event_time BETWEEN '2018-12-01 10:00:00' AND '2018-12-01 12:00:00'; +``` + +Configuring the `logs` table to partition by the date of `event_time` will group log events into files with the same event date. Iceberg keeps track of that date and will use it to skip files for other dates that don't have useful data. + +Iceberg can partition timestamps by year, month, day, and hour granularity. It can also use a categorical column, like `level` in this logs example, to store rows together and speed up queries. + + +## What does Iceberg do differently? + +Other table formats like Hive support partitioning, but Iceberg supports *hidden partitioning*. + +* Iceberg handles the tedious and error-prone task of producing partition values for rows in a table. +* Iceberg avoids reading unnecessary partitions automatically. Consumers don't need to know how the table is partitioned or add extra filters to their queries. +* Iceberg partition layouts can evolve as needed. + +### Partitioning in Hive + +To demonstrate the difference, consider how Hive would handle a `logs` table. + +In Hive, partitions are explicit and appear as a column, so the `logs` table would have a column called `event_date`. 
When writing, an insert needs to supply the data for the `event_date` column: + +```sql +INSERT INTO logs PARTITION (event_date) + SELECT level, message, event_time, format_time(event_time, 'YYYY-MM-dd') + FROM unstructured_log_source; +``` + +Similarly, queries that search through the `logs` table must have an `event_date` filter in addition to an `event_time` filter. + +```sql +SELECT level, count(1) as count FROM logs +WHERE event_time BETWEEN '2018-12-01 10:00:00' AND '2018-12-01 12:00:00' + AND event_date = '2018-12-01'; +``` + +If the `event_date` filter were missing, Hive would scan through every file in the table because it doesn't know that the `event_time` column is related to the `event_date` column. + +### Problems with Hive partitioning + +Hive must be given partition values. In the logs example, it doesn't know the relationship between `event_time` and `event_date`. + +This leads to several problems: + +* Hive can't validate partition values -- it is up to the writer to produce the correct value + - Using the wrong format, `2018-12-01` instead of `20181201`, produces silently incorrect results, not query failures + - Using the wrong source column, like `processing_time`, or time zone also causes incorrect results, not failures +* It is up to the user to write queries correctly + - Using the wrong format also leads to silently incorrect results + - Users that don't understand a table's physical layout get needlessly slow queries -- Hive can't translate filters automatically +* Working queries are tied to the table's partitioning scheme, so partitioning configuration cannot be changed without breaking queries + +### Iceberg's hidden partitioning + +Iceberg produces partition values by taking a column value and optionally transforming it. Iceberg is responsible for converting `event_time` into `event_date`, and keeps track of the relationship. + +Table partitioning is configured using these relationships. 
The `logs` table would be partitioned by `date(event_time)` and `level`. + +Because Iceberg doesn't require user-maintained partition columns, it can hide partitioning. Partition values are produced correctly every time and always used to speed up queries, when possible. Producers and consumers wouldn't even see `event_date`. + +Most importantly, queries no longer depend on a table's physical layout. With a separation between physical and logical, Iceberg tables can evolve partition schemes over time as data volume changes. Misconfigured tables can be fixed without an expensive migration. + +For details about all the supported hidden partition transformations, see the [Partition Transforms](../../../spec/#partition-transforms) section. + +For details about updating a table's partition spec, see the [partition evolution](../evolution/#partition-evolution) section. diff --git a/docs/performance.md b/docs/performance.md new file mode 100644 index 000000000000..f5e87e3a3251 --- /dev/null +++ b/docs/performance.md @@ -0,0 +1,63 @@ +--- +title: Performance +url: performance +aliases: + - "tables/performance" +menu: + main: + parent: Tables + identifier: tables_performance + weight: 0 +--- + + +# Performance + +* Iceberg is designed for huge tables and is used in production where a *single table* can contain tens of petabytes of data. +* Even multi-petabyte tables can be read from a single node, without needing a distributed SQL engine to sift through table metadata. + +## Scan planning + +Scan planning is the process of finding the files in a table that are needed for a query. + +Planning in an Iceberg table fits on a single node because Iceberg's metadata can be used to prune *metadata* files that aren't needed, in addition to filtering *data* files that don't contain matching data. 
+ +Fast scan planning from a single node enables: + +* Lower latency SQL queries -- by eliminating a distributed scan to plan a distributed scan +* Access from any client -- stand-alone processes can read data directly from Iceberg tables + +### Metadata filtering + +Iceberg uses two levels of metadata to track the files in a snapshot. + +* **Manifest files** store a list of data files, along with each data file's partition data and column-level stats +* A **manifest list** stores the snapshot's list of manifests, along with the range of values for each partition field + +For fast scan planning, Iceberg first filters manifests using the partition value ranges in the manifest list. Then, it reads each remaining manifest to get data files. With this scheme, the manifest list acts as an index over the manifest files, making it possible to plan without reading all manifests. + +In addition to partition value ranges, a manifest list also stores the number of files added or deleted in a manifest to speed up operations like snapshot expiration. + +### Data filtering + +Manifest files include a tuple of partition data and column-level stats for each data file. + +During planning, query predicates are automatically converted to predicates on the partition data and applied first to filter data files. Next, column-level value counts, null counts, lower bounds, and upper bounds are used to eliminate files that cannot match the query predicate. + +By using upper and lower bounds to filter data files at planning time, Iceberg uses clustered data to eliminate splits without running tasks. In some cases, this is a [10x performance improvement](https://conferences.oreilly.com/strata/strata-ny-2018/cdn.oreillystatic.com/en/assets/1/event/278/Introducing%20Iceberg_%20Tables%20designed%20for%20object%20stores%20Presentation.pdf +). 
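The two filtering levels described above can be sketched in a few lines of plain Java. This is only an illustration under simplifying assumptions, not Iceberg's planner: the `Manifest` and `DataFile` classes are hypothetical, a single integer partition field and a single numeric column stand in for real partition tuples and per-column stats, and the query is reduced to one range on each.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, heavily simplified metadata model; not Iceberg classes.
final class ScanPlanningSketch {

  // One data file entry: its partition value and column bounds.
  static final class DataFile {
    final String path;
    final int partition;
    final long lowerBound;
    final long upperBound;

    DataFile(String path, int partition, long lowerBound, long upperBound) {
      this.path = path;
      this.partition = partition;
      this.lowerBound = lowerBound;
      this.upperBound = upperBound;
    }
  }

  // One manifest list entry: the partition value range covered by the manifest.
  static final class Manifest {
    final int partitionMin;
    final int partitionMax;
    final List<DataFile> files;

    Manifest(int partitionMin, int partitionMax, List<DataFile> files) {
      this.partitionMin = partitionMin;
      this.partitionMax = partitionMax;
      this.files = files;
    }
  }

  // True when the closed ranges [min, max] and [lo, hi] share at least one value.
  static boolean overlaps(long min, long max, long lo, long hi) {
    return min <= hi && max >= lo;
  }

  // Returns the paths of data files that may contain matching rows.
  static List<String> plan(
      List<Manifest> manifestList, int partLo, int partHi, long valueLo, long valueHi) {
    List<String> result = new ArrayList<>();
    for (Manifest manifest : manifestList) {
      // Level 1: skip whole manifests using the ranges kept in the manifest list.
      if (!overlaps(manifest.partitionMin, manifest.partitionMax, partLo, partHi)) {
        continue;
      }
      for (DataFile file : manifest.files) {
        // Level 2a: the partition predicate is applied first.
        if (file.partition < partLo || file.partition > partHi) {
          continue;
        }
        // Level 2b: column bounds eliminate files that cannot match.
        if (overlaps(file.lowerBound, file.upperBound, valueLo, valueHi)) {
          result.add(file.path);
        }
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Manifest first = new Manifest(1, 2, List.of(
        new DataFile("a.parquet", 1, 0, 10),
        new DataFile("b.parquet", 2, 50, 60)));
    Manifest second = new Manifest(5, 6, List.of(new DataFile("c.parquet", 5, 0, 100)));
    System.out.println(plan(List.of(first, second), 2, 3, 55, 58));
  }
}
```

In the `main` example the second manifest is never opened, because its partition range does not overlap the query. That is the sense in which the manifest list acts as an index over the manifests.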
diff --git a/docs/reliability.md b/docs/reliability.md new file mode 100644 index 000000000000..9a4766841942 --- /dev/null +++ b/docs/reliability.md @@ -0,0 +1,76 @@ +--- +title: Reliability +url: reliability +aliases: + - "tables/reliability" +menu: + main: + parent: Tables + identifier: tables_reliability + weight: 0 +--- + + +# Reliability + +Iceberg was designed to solve correctness problems that affect Hive tables running in S3. + +Hive tables track data files using both a central metastore for partitions and a file system for individual files. This makes atomic changes to a table's contents impossible, and eventually consistent stores like S3 may return incorrect results due to the use of listing files to reconstruct the state of a table. It also requires job planning to make many slow listing calls: O(n) with the number of partitions. + +Iceberg tracks the complete list of data files in each [snapshot](../../../terms#snapshot) using a persistent tree structure. Every write or delete produces a new snapshot that reuses as much of the previous snapshot's metadata tree as possible to avoid high write volumes. + +Valid snapshots in an Iceberg table are stored in the table metadata file, along with a reference to the current snapshot. Commits replace the path of the current table metadata file using an atomic operation. This ensures that all updates to table data and metadata are atomic, and is the basis for [serializable isolation](https://en.wikipedia.org/wiki/Isolation_(database_systems)#Serializable). + +This results in improved reliability guarantees: + +* **Serializable isolation**: All table changes occur in a linear history of atomic table updates +* **Reliable reads**: Readers always use a consistent snapshot of the table without holding a lock +* **Version history and rollback**: Table snapshots are kept as history and tables can roll back if a job produces bad data +* **Safe file-level operations**. 
By supporting atomic changes, Iceberg enables new use cases, like safely compacting small files and safely appending late data to tables + +This design also has performance benefits: + +* **O(1) RPCs to plan**: Instead of listing O(n) directories in a table to plan a job, reading a snapshot requires O(1) RPC calls +* **Distributed planning**: File pruning and predicate push-down is distributed to jobs, removing the metastore as a bottleneck +* **Finer granularity partitioning**: Distributed planning and O(1) RPC calls remove the current barriers to finer-grained partitioning + + +## Concurrent write operations + +Iceberg supports multiple concurrent writes using optimistic concurrency. + +Each writer assumes that no other writers are operating and writes out new table metadata for an operation. Then, the writer attempts to commit by atomically swapping the new table metadata file for the existing metadata file. + +If the atomic swap fails because another writer has committed, the failed writer retries by writing a new metadata tree based on the new current table state. + +### Cost of retries + +Writers avoid expensive retry operations by structuring changes so that work can be reused across retries. + +For example, appends usually create a new manifest file for the appended data files, which can be added to the table without rewriting the manifest on every attempt. + +### Retry validation + +Commits are structured as assumptions and actions. After a conflict, a writer checks that the assumptions are met by the current table state. If the assumptions are met, then it is safe to re-apply the actions and commit. + +For example, a compaction might rewrite `file_a.avro` and `file_b.avro` as `merged.parquet`. This is safe to commit as long as the table still contains both `file_a.avro` and `file_b.avro`. If either file was deleted by a conflicting commit, then the operation must fail. Otherwise, it is safe to remove the source files and add the merged file. 
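The commit loop described above (validate assumptions, re-apply actions, swap atomically, retry on conflict) can be sketched with a compare-and-set on an in-memory reference. This is a minimal sketch under stated assumptions, not Iceberg's commit code: the `OptimisticCommitSketch` class is hypothetical, the table state is reduced to an immutable set of file names, and an `AtomicReference` stands in for the atomic swap of the table metadata file.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for a table whose state is one immutable set of file names.
final class OptimisticCommitSketch {

  // Plays the role of the pointer to the current table metadata file.
  private final AtomicReference<Set<String>> current;

  OptimisticCommitSketch(Set<String> initialFiles) {
    this.current = new AtomicReference<>(Set.copyOf(initialFiles));
  }

  Set<String> files() {
    return current.get();
  }

  // Checks the assumption against the latest state, applies the action, and
  // attempts an atomic swap. A lost race re-reads the state and repeats both steps.
  boolean commit(
      Predicate<Set<String>> assumption, UnaryOperator<Set<String>> action, int maxAttempts) {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      Set<String> base = current.get();
      if (!assumption.test(base)) {
        return false; // a conflicting commit invalidated the assumption
      }
      Set<String> updated = Set.copyOf(action.apply(base));
      if (current.compareAndSet(base, updated)) {
        return true; // the swap succeeded, so the commit is visible
      }
    }
    return false; // gave up after too many conflicts
  }

  // The compaction from the text: safe only while both source files still exist.
  static boolean compact(OptimisticCommitSketch table) {
    return table.commit(
        files -> files.contains("file_a.avro") && files.contains("file_b.avro"),
        files -> {
          Set<String> next = new HashSet<>(files);
          next.remove("file_a.avro");
          next.remove("file_b.avro");
          next.add("merged.parquet");
          return next;
        },
        3);
  }

  public static void main(String[] args) {
    OptimisticCommitSketch table =
        new OptimisticCommitSketch(Set.of("file_a.avro", "file_b.avro"));
    System.out.println(compact(table)); // both source files present
    System.out.println(compact(table)); // source files already replaced
  }
}
```

The expensive part of a real compaction, writing `merged.parquet`, would happen once outside this loop. Only the cheap assumption check and metadata update are repeated on retry, which is the reuse the "Cost of retries" section refers to.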
+ + +## Compatibility + +By avoiding file listing and rename operations, Iceberg tables are compatible with any object store. No consistent listing is required. diff --git a/docs/schemas.md b/docs/schemas.md new file mode 100644 index 000000000000..03805fad81eb --- /dev/null +++ b/docs/schemas.md @@ -0,0 +1,52 @@ +--- +title: Schemas +url: schemas +aliases: + - "tables/schemas" +menu: + main: + parent: Tables + identifier: tables_schema + weight: 0 +--- + + +# Schemas + +Iceberg tables support the following types: + +| Type | Description | Notes | +|--------------------|--------------------------------------------------------------------------|--------------------------------------------------| +| **`boolean`** | True or false | | +| **`int`** | 32-bit signed integers | Can promote to `long` | +| **`long`** | 64-bit signed integers | | +| **`float`** | [32-bit IEEE 754](https://en.wikipedia.org/wiki/IEEE_754) floating point | Can promote to `double` | +| **`double`** | [64-bit IEEE 754](https://en.wikipedia.org/wiki/IEEE_754) floating point | | +| **`decimal(P,S)`** | Fixed-point decimal; precision P, scale S | Scale is fixed and precision must be 38 or less | +| **`date`** | Calendar date without timezone or time | | +| **`time`** | Time of day without date, timezone | Stored as microseconds | +| **`timestamp`** | Timestamp without timezone | Stored as microseconds | +| **`timestamptz`** | Timestamp with timezone | Stored as microseconds | +| **`string`** | Arbitrary-length character sequences | Encoded with UTF-8 | +| **`fixed(L)`** | Fixed-length byte array of length L | | +| **`binary`** | Arbitrary-length byte array | | +| **`struct<...>`** | A record with named fields of any data type | | +| **`list`** | A list with elements of any data type | | +| **`map`** | A map with keys and values of any data type | | + +Iceberg tracks each field in a table schema using an ID that is never reused in a table. 
See [correctness guarantees](../evolution#correctness) for more information. diff --git a/docs/spark-configuration.md b/docs/spark-configuration.md new file mode 100644 index 000000000000..9470acf02754 --- /dev/null +++ b/docs/spark-configuration.md @@ -0,0 +1,214 @@ +--- +title: "Configuration" +url: spark-configuration +aliases: + - "spark/spark-configuration" +menu: + main: + parent: Spark + identifier: spark_configuration + weight: 0 +--- + + +# Spark Configuration + +## Catalogs + +Spark adds an API to plug in table catalogs that are used to load, create, and manage Iceberg tables. Spark catalogs are configured by setting Spark properties under `spark.sql.catalog`. + +This creates an Iceberg catalog named `hive_prod` that loads tables from a Hive metastore: + +```plain +spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog +spark.sql.catalog.hive_prod.type = hive +spark.sql.catalog.hive_prod.uri = thrift://metastore-host:port +# omit uri to use the same URI as Spark: hive.metastore.uris in hive-site.xml +``` + +Below is an example for a REST catalog named `rest_prod` that loads tables from REST URL `http://localhost:8080`: + +```plain +spark.sql.catalog.rest_prod = org.apache.iceberg.spark.SparkCatalog +spark.sql.catalog.rest_prod.type = rest +spark.sql.catalog.rest_prod.uri = http://localhost:8080 +``` + +Iceberg also supports a directory-based catalog in HDFS that can be configured using `type=hadoop`: + +```plain +spark.sql.catalog.hadoop_prod = org.apache.iceberg.spark.SparkCatalog +spark.sql.catalog.hadoop_prod.type = hadoop +spark.sql.catalog.hadoop_prod.warehouse = hdfs://nn:8020/warehouse/path +``` + +{{< hint info >}} +The Hive-based catalog only loads Iceberg tables. To load non-Iceberg tables in the same Hive metastore, use a [session catalog](#replacing-the-session-catalog). 
+{{< /hint >}} + +### Catalog configuration + +A catalog is created and named by adding a property `spark.sql.catalog.(catalog-name)` with an implementation class for its value. + +Iceberg supplies two implementations: + +* `org.apache.iceberg.spark.SparkCatalog` supports a Hive Metastore or a Hadoop warehouse as a catalog +* `org.apache.iceberg.spark.SparkSessionCatalog` adds support for Iceberg tables to Spark's built-in catalog, and delegates to the built-in catalog for non-Iceberg tables + +Both catalogs are configured using properties nested under the catalog name. Common configuration properties for Hive and Hadoop are: + +| Property | Values | Description | +| -------------------------------------------------- | ----------------------------- | -------------------------------------------------------------------- | +| spark.sql.catalog._catalog-name_.type | `hive`, `hadoop` or `rest` | The underlying Iceberg catalog implementation, `HiveCatalog`, `HadoopCatalog`, `RESTCatalog` or left unset if using a custom catalog | +| spark.sql.catalog._catalog-name_.catalog-impl | | The custom Iceberg catalog implementation. If `type` is null, `catalog-impl` must not be null. | +| spark.sql.catalog._catalog-name_.io-impl | | The custom FileIO implementation. | +| spark.sql.catalog._catalog-name_.metrics-reporter-impl | | The custom MetricsReporter implementation. 
| +| spark.sql.catalog._catalog-name_.default-namespace | default | The default current namespace for the catalog | +| spark.sql.catalog._catalog-name_.uri | thrift://host:port | Hive metastore URL for hive typed catalog, REST URL for REST typed catalog | +| spark.sql.catalog._catalog-name_.warehouse | hdfs://nn:8020/warehouse/path | Base path for the warehouse directory | +| spark.sql.catalog._catalog-name_.cache-enabled | `true` or `false` | Whether to enable catalog cache, default value is `true` | +| spark.sql.catalog._catalog-name_.cache.expiration-interval-ms | `30000` (30 seconds) | Duration after which cached catalog entries are expired; Only effective if `cache-enabled` is `true`. `-1` disables cache expiration and `0` disables caching entirely, irrespective of `cache-enabled`. Default is `30000` (30 seconds) | +| spark.sql.catalog._catalog-name_.table-default._propertyKey_ | | Default Iceberg table property value for property key _propertyKey_, which will be set on tables created by this catalog if not overridden | +| spark.sql.catalog._catalog-name_.table-override._propertyKey_ | | Enforced Iceberg table property value for property key _propertyKey_, which cannot be overridden by user | + +Additional properties can be found in common [catalog configuration](../configuration#catalog-properties). + + +### Using catalogs + +Catalog names are used in SQL queries to identify a table. In the examples above, `hive_prod` and `hadoop_prod` can be used to prefix database and table names that will be loaded from those catalogs. + +```sql +SELECT * FROM hive_prod.db.table; -- load db.table from catalog hive_prod +``` + +Spark 3 keeps track of the current catalog and namespace, which can be omitted from table names. + +```sql +USE hive_prod.db; +SELECT * FROM table; -- load db.table from catalog hive_prod +``` + +To see the current catalog and namespace, run `SHOW CURRENT NAMESPACE`. 
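As a quick check, the following sketch switches to the `hive_prod.db` namespace used in the example above and then asks Spark to report it. It assumes the `hive_prod` catalog is configured as shown earlier and that a `db` namespace already exists in it.

```sql
-- assumes the hive_prod catalog from the examples above and an existing db namespace
USE hive_prod.db;
SHOW CURRENT NAMESPACE; -- should report catalog hive_prod and namespace db
```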
+ +### Replacing the session catalog + +To add Iceberg table support to Spark's built-in catalog, configure `spark_catalog` to use Iceberg's `SparkSessionCatalog`. + +```plain +spark.sql.catalog.spark_catalog = org.apache.iceberg.spark.SparkSessionCatalog +spark.sql.catalog.spark_catalog.type = hive +``` + +Spark's built-in catalog supports existing v1 and v2 tables tracked in a Hive Metastore. This configures Spark to use Iceberg's `SparkSessionCatalog` as a wrapper around that session catalog. When a table is not an Iceberg table, the built-in catalog will be used to load it instead. + +This configuration can use the same Hive Metastore for both Iceberg and non-Iceberg tables. + +### Using catalog specific Hadoop configuration values + +Similar to configuring Hadoop properties by using `spark.hadoop.*`, it's possible to set per-catalog Hadoop configuration values when using Spark by adding the property for the catalog with the prefix `spark.sql.catalog.(catalog-name).hadoop.*`. These properties will take precedence over values configured globally using `spark.hadoop.*` and will only affect Iceberg tables. + +```plain +spark.sql.catalog.hadoop_prod.hadoop.fs.s3a.endpoint = http://aws-local:9000 +``` + +### Loading a custom catalog + +Spark supports loading a custom Iceberg `Catalog` implementation by specifying the `catalog-impl` property. Here is an example: + +```plain +spark.sql.catalog.custom_prod = org.apache.iceberg.spark.SparkCatalog +spark.sql.catalog.custom_prod.catalog-impl = com.my.custom.CatalogImpl +spark.sql.catalog.custom_prod.my-additional-catalog-config = my-value +``` + +## SQL Extensions + +Iceberg 0.11.0 and later add an extension module to Spark to add new SQL commands, like `CALL` for stored procedures or `ALTER TABLE ... WRITE ORDERED BY`. 
+ +Using those SQL commands requires adding Iceberg extensions to your Spark environment using the following Spark property: + + +| Spark extensions property | Iceberg extensions implementation | +|---------------------------|---------------------------------------------------------------------| +| `spark.sql.extensions` | `org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions` | + +## Runtime configuration + +### Read options + +Spark read options are passed when configuring the DataFrameReader, like this: + +```scala +// time travel +spark.read + .option("snapshot-id", 10963874102873L) + .table("catalog.db.table") +``` + +| Spark option | Default | Description | +| --------------- | --------------------- | ----------------------------------------------------------------------------------------- | +| snapshot-id | (latest) | Snapshot ID of the table snapshot to read | +| as-of-timestamp | (latest) | A timestamp in milliseconds; the snapshot used will be the snapshot current at this time. 
| +| split-size | As per table property | Overrides this table's read.split.target-size and read.split.metadata-target-size | +| lookback | As per table property | Overrides this table's read.split.planning-lookback | +| file-open-cost | As per table property | Overrides this table's read.split.open-file-cost | +| vectorization-enabled | As per table property | Overrides this table's read.parquet.vectorization.enabled | +| batch-size | As per table property | Overrides this table's read.parquet.vectorization.batch-size | +| stream-from-timestamp | (none) | A timestamp in milliseconds to stream from; if before the oldest known ancestor snapshot, the oldest will be used | + +### Write options + +Spark write options are passed when configuring the DataFrameWriter, like this: + +```scala +// write with Avro instead of Parquet +df.write + .option("write-format", "avro") + .option("snapshot-property.key", "value") + .insertInto("catalog.db.table") +``` + +| Spark option | Default | Description | +| ---------------------- | -------------------------- | ------------------------------------------------------------ | +| write-format | Table write.format.default | File format to use for this write operation; parquet, avro, or orc | +| target-file-size-bytes | As per table property | Overrides this table's write.target-file-size-bytes | +| check-nullability | true | Sets the nullable check on fields | +| snapshot-property._custom-key_ | null | Adds an entry with custom-key and corresponding value in the snapshot summary (the `snapshot-property.` prefix is only required for DSv2) | +| fanout-enabled | false | Overrides this table's write.spark.fanout.enabled | +| check-ordering | true | Checks if input schema and table schema are same | +| isolation-level | null | Desired isolation level for Dataframe overwrite operations. 
`null` => no checks (for idempotent writes), `serializable` => check for concurrent inserts or deletes in destination partitions, `snapshot` => checks for concurrent deletes in destination partitions. | +| validate-from-snapshot-id | null | If isolation level is set, id of base snapshot from which to check concurrent write conflicts into a table. Should be the snapshot before any reads from the table. Can be obtained via [Table API](../../api#table-metadata) or [Snapshots table](../spark-queries#snapshots). If null, the table's oldest known snapshot is used. | +| compression-codec | Table write.(fileformat).compression-codec | Overrides this table's compression codec for this write | +| compression-level | Table write.(fileformat).compression-level | Overrides this table's compression level for Parquet and Avro tables for this write | +| compression-strategy | Table write.orc.compression-strategy | Overrides this table's compression strategy for ORC tables for this write | + +CommitMetadata provides an interface to add custom metadata to a snapshot summary during a SQL execution, which can be beneficial for purposes such as auditing or change tracking. If properties start with `snapshot-property.`, then that prefix will be removed from each property. 
Here is an example: + +```java +import org.apache.iceberg.spark.CommitMetadata; + +Map<String, String> properties = Maps.newHashMap(); +properties.put("property_key", "property_value"); +CommitMetadata.withCommitProperties(properties, + () -> { + spark.sql("DELETE FROM " + tableName + " where id = 1"); + return 0; + }, + RuntimeException.class); +``` \ No newline at end of file diff --git a/docs/spark-ddl.md b/docs/spark-ddl.md new file mode 100644 index 000000000000..0501bede01eb --- /dev/null +++ b/docs/spark-ddl.md @@ -0,0 +1,580 @@ +--- +title: "DDL" +url: spark-ddl +aliases: + - "spark/spark-ddl" +menu: + main: + parent: Spark + identifier: spark_ddl + weight: 0 +--- + + +# Spark DDL + +To use Iceberg in Spark, first configure [Spark catalogs](../spark-configuration). Iceberg uses Apache Spark's DataSourceV2 API for data source and catalog implementations. + +## `CREATE TABLE` + +Spark 3 can create tables in any Iceberg catalog with the clause `USING iceberg`: + +```sql +CREATE TABLE prod.db.sample ( + id bigint NOT NULL COMMENT 'unique id', + data string) +USING iceberg; +``` + +Iceberg converts each Spark column type to the corresponding Iceberg type. See [type compatibility on creating table](../spark-writes#spark-type-to-iceberg-type) for details. + +Table create commands, including CTAS and RTAS, support the full range of Spark create clauses, including: + +* `PARTITIONED BY (partition-expressions)` to configure partitioning +* `LOCATION '(fully-qualified-uri)'` to set the table location +* `COMMENT 'table documentation'` to set a table description +* `TBLPROPERTIES ('key'='value', ...)` to set [table configuration](../configuration) + +Create commands may also set the default format with the `USING` clause. This is only supported for `SparkCatalog` because Spark handles the `USING` clause differently for the built-in catalog. + +`CREATE TABLE ... LIKE ...` syntax is not supported. 
+ +### `PARTITIONED BY` + +To create a partitioned table, use `PARTITIONED BY`: + +```sql +CREATE TABLE prod.db.sample ( + id bigint, + data string, + category string) +USING iceberg +PARTITIONED BY (category); +``` + +The `PARTITIONED BY` clause supports transform expressions to create [hidden partitions](../partitioning). + +```sql +CREATE TABLE prod.db.sample ( + id bigint, + data string, + category string, + ts timestamp) +USING iceberg +PARTITIONED BY (bucket(16, id), days(ts), category); +``` + +Supported transformations are: + +* `year(ts)`: partition by year +* `month(ts)`: partition by month +* `day(ts)` or `date(ts)`: equivalent to dateint partitioning +* `hour(ts)` or `date_hour(ts)`: equivalent to dateint and hour partitioning +* `bucket(N, col)`: partition by hashed value mod N buckets +* `truncate(L, col)`: partition by value truncated to L + * Strings are truncated to the given length + * Integers and longs truncate to bins: `truncate(10, i)` produces partitions 0, 10, 20, 30, ... + +Note: The old syntax of `years(ts)`, `months(ts)`, `days(ts)` and `hours(ts)` is also supported for compatibility. + +## `CREATE TABLE ... AS SELECT` + +Iceberg supports CTAS as an atomic operation when using a [`SparkCatalog`](../spark-configuration#catalog-configuration). CTAS is supported, but is not atomic when using [`SparkSessionCatalog`](../spark-configuration#replacing-the-session-catalog). + +```sql +CREATE TABLE prod.db.sample +USING iceberg +AS SELECT ... +``` + +The newly created table won't inherit the partition spec and table properties from the source table in `SELECT`. Use `PARTITIONED BY` and `TBLPROPERTIES` in CTAS to declare the partition spec and table properties for the new table. + +```sql +CREATE TABLE prod.db.sample +USING iceberg +PARTITIONED BY (part) +TBLPROPERTIES ('key'='value') +AS SELECT ... +``` + +## `REPLACE TABLE ... 
AS SELECT` + +Iceberg supports RTAS as an atomic operation when using a [`SparkCatalog`](../spark-configuration#catalog-configuration). RTAS is supported, but is not atomic when using [`SparkSessionCatalog`](../spark-configuration#replacing-the-session-catalog). + +Atomic table replacement creates a new snapshot with the results of the `SELECT` query, but keeps table history. + +```sql +REPLACE TABLE prod.db.sample +USING iceberg +AS SELECT ... +``` +```sql +REPLACE TABLE prod.db.sample +USING iceberg +PARTITIONED BY (part) +TBLPROPERTIES ('key'='value') +AS SELECT ... +``` +```sql +CREATE OR REPLACE TABLE prod.db.sample +USING iceberg +AS SELECT ... +``` + +The schema and partition spec will be replaced if changed. To avoid modifying the table's schema and partitioning, use `INSERT OVERWRITE` instead of `REPLACE TABLE`. +The new table properties in the `REPLACE TABLE` command will be merged with any existing table properties. Existing table properties are updated if changed; otherwise they are preserved. + +## `DROP TABLE` + +The drop table behavior changed in 0.14. + +Prior to 0.14, running `DROP TABLE` would remove the table from the catalog and delete the table contents as well. + +From 0.14 onwards, `DROP TABLE` only removes the table from the catalog. +To delete the table contents, use `DROP TABLE PURGE`. 
+ +### `DROP TABLE` + +To drop the table from the catalog, run: + +```sql +DROP TABLE prod.db.sample; +``` + +### `DROP TABLE PURGE` + +To drop the table from the catalog and delete the table's contents, run: + +```sql +DROP TABLE prod.db.sample PURGE; +``` + +## `ALTER TABLE` + +Iceberg has full `ALTER TABLE` support in Spark 3, including: + +* Renaming a table +* Setting or removing table properties +* Adding, deleting, and renaming columns +* Adding, deleting, and renaming nested fields +* Reordering top-level columns and nested struct fields +* Widening the type of `int`, `float`, and `decimal` fields +* Making required columns optional + +In addition, [SQL extensions](../spark-configuration#sql-extensions) can be used to add support for partition evolution and setting a table's write order + +### `ALTER TABLE ... RENAME TO` + +```sql +ALTER TABLE prod.db.sample RENAME TO prod.db.new_name; +``` + +### `ALTER TABLE ... SET TBLPROPERTIES` + +```sql +ALTER TABLE prod.db.sample SET TBLPROPERTIES ( + 'read.split.target-size'='268435456' +); +``` + +Iceberg uses table properties to control table behavior. For a list of available properties, see [Table configuration](../configuration). + +`UNSET` is used to remove properties: + +```sql +ALTER TABLE prod.db.sample UNSET TBLPROPERTIES ('read.split.target-size'); +``` + +`SET TBLPROPERTIES` can also be used to set the table comment (description): + +```sql +ALTER TABLE prod.db.sample SET TBLPROPERTIES ( + 'comment' = 'A table comment.' +); +``` + +### `ALTER TABLE ... ADD COLUMN` + +To add a column to Iceberg, use the `ADD COLUMNS` clause with `ALTER TABLE`: + +```sql +ALTER TABLE prod.db.sample +ADD COLUMNS ( + new_column string comment 'new_column docs' +); +``` + +Multiple columns can be added at the same time, separated by commas. 
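For instance, a single statement along these lines should add two columns at once. This is only a sketch: the column names `first_new_column` and `second_new_column` are placeholders, not columns from the examples above.

```sql
-- hypothetical column names, for illustration only
ALTER TABLE prod.db.sample
ADD COLUMNS (
    first_new_column string comment 'first new column docs',
    second_new_column bigint comment 'second new column docs'
);
```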
+ +Nested columns should be identified using the full column name: + +```sql +-- create a struct column +ALTER TABLE prod.db.sample +ADD COLUMN point struct<x: double, y: double>; + +-- add a field to the struct +ALTER TABLE prod.db.sample +ADD COLUMN point.z double; +``` + +```sql +-- create a nested array column of struct +ALTER TABLE prod.db.sample +ADD COLUMN points array<struct<x: double, y: double>>; + +-- add a field to the struct within an array. Using keyword 'element' to access the array's element column. +ALTER TABLE prod.db.sample +ADD COLUMN points.element.z double; +``` + +```sql +-- create a map column of struct key and struct value +ALTER TABLE prod.db.sample +ADD COLUMN points map<struct<x: int>, struct<a: int>>; + +-- add a field to the value struct in a map. Using keyword 'value' to access the map's value column. +ALTER TABLE prod.db.sample +ADD COLUMN points.value.b int; +``` + +Note: Altering a map 'key' column by adding columns is not allowed. Only map values can be updated. + +Add columns in any position by adding `FIRST` or `AFTER` clauses: + +```sql +ALTER TABLE prod.db.sample +ADD COLUMN new_column bigint AFTER other_column; +``` + +```sql +ALTER TABLE prod.db.sample +ADD COLUMN nested.new_column bigint FIRST; +``` + +### `ALTER TABLE ... RENAME COLUMN` + +Iceberg allows any field to be renamed. To rename a field, use `RENAME COLUMN`: + +```sql +ALTER TABLE prod.db.sample RENAME COLUMN data TO payload; +ALTER TABLE prod.db.sample RENAME COLUMN location.lat TO latitude; +``` + +Note that nested rename commands only rename the leaf field. The above command renames `location.lat` to `location.latitude`. + +### `ALTER TABLE ... ALTER COLUMN` + +Alter column is used to widen types, make a field optional, set comments, and reorder fields. + +Iceberg allows updating column types if the update is safe. 
Safe updates are: + +* `int` to `bigint` +* `float` to `double` +* `decimal(P,S)` to `decimal(P2,S)` when P2 > P (scale cannot change) + +```sql +ALTER TABLE prod.db.sample ALTER COLUMN measurement TYPE double; +``` + +To add or remove columns from a struct, use `ADD COLUMN` or `DROP COLUMN` with a nested column name. + +Column comments can also be updated using `ALTER COLUMN`: + +```sql +ALTER TABLE prod.db.sample ALTER COLUMN measurement TYPE double COMMENT 'unit is bytes per second'; +ALTER TABLE prod.db.sample ALTER COLUMN measurement COMMENT 'unit is kilobytes per second'; +``` + +Iceberg allows reordering top-level columns or columns in a struct using `FIRST` and `AFTER` clauses: + +```sql +ALTER TABLE prod.db.sample ALTER COLUMN col FIRST; +``` +```sql +ALTER TABLE prod.db.sample ALTER COLUMN nested.col AFTER other_col; +``` + +Nullability for a non-nullable column can be changed using `DROP NOT NULL`: + +```sql +ALTER TABLE prod.db.sample ALTER COLUMN id DROP NOT NULL; +``` + +{{< hint info >}} +It is not possible to change a nullable column to a non-nullable column with `SET NOT NULL` because Iceberg doesn't know whether there is existing data with null values. +{{< /hint >}} + + +{{< hint info >}} +`ALTER COLUMN` is not used to update `struct` types. Use `ADD COLUMN` and `DROP COLUMN` to add or remove struct fields. +{{< /hint >}} + + +### `ALTER TABLE ... DROP COLUMN` + +To drop columns, use `ALTER TABLE ... DROP COLUMN`: + +```sql +ALTER TABLE prod.db.sample DROP COLUMN id; +ALTER TABLE prod.db.sample DROP COLUMN point.z; +``` + +## `ALTER TABLE` SQL extensions + +These commands are available in Spark 3 when using Iceberg [SQL extensions](../spark-configuration#sql-extensions). + +### `ALTER TABLE ... 
ADD PARTITION FIELD` + +Iceberg supports adding new partition fields to a spec using `ADD PARTITION FIELD`: + +```sql +ALTER TABLE prod.db.sample ADD PARTITION FIELD catalog; -- identity transform +``` + +[Partition transforms](#partitioned-by) are also supported: + +```sql +ALTER TABLE prod.db.sample ADD PARTITION FIELD bucket(16, id); +ALTER TABLE prod.db.sample ADD PARTITION FIELD truncate(4, data); +ALTER TABLE prod.db.sample ADD PARTITION FIELD year(ts); +-- use optional AS keyword to specify a custom name for the partition field +ALTER TABLE prod.db.sample ADD PARTITION FIELD bucket(16, id) AS shard; +``` + +Adding a partition field is a metadata operation and does not change any of the existing table data. New data will be written with the new partitioning, but existing data will remain in the old partition layout. Old data files will have null values for the new partition fields in metadata tables. + +Dynamic partition overwrite behavior will change when the table's partitioning changes because dynamic overwrite replaces partitions implicitly. To overwrite explicitly, use the new `DataFrameWriterV2` API. + +{{< hint note >}} +To migrate from daily to hourly partitioning with transforms, it is not necessary to drop the daily partition field. Keeping the field ensures existing metadata table queries continue to work. +{{< /hint >}} + +{{< hint danger >}} +**Dynamic partition overwrite behavior will change** when partitioning changes +For example, if you partition by days and move to partitioning by hours, overwrites will overwrite hourly partitions but not days anymore. +{{< /hint >}} + +### `ALTER TABLE ... 
DROP PARTITION FIELD` + +Partition fields can be removed using `DROP PARTITION FIELD`: + +```sql +ALTER TABLE prod.db.sample DROP PARTITION FIELD catalog; +ALTER TABLE prod.db.sample DROP PARTITION FIELD bucket(16, id); +ALTER TABLE prod.db.sample DROP PARTITION FIELD truncate(4, data); +ALTER TABLE prod.db.sample DROP PARTITION FIELD year(ts); +ALTER TABLE prod.db.sample DROP PARTITION FIELD shard; +``` + +Note that although the partition is removed, the column will still exist in the table schema. + +Dropping a partition field is a metadata operation and does not change any of the existing table data. New data will be written with the new partitioning, but existing data will remain in the old partition layout. + +{{< hint danger >}} +**Dynamic partition overwrite behavior will change** when partitioning changes +For example, if you partition by days and move to partitioning by hours, overwrites will overwrite hourly partitions but not days anymore. +{{< /hint >}} + +{{< hint danger >}} +Be careful when dropping a partition field because it will change the schema of metadata tables, like `files`, and may cause metadata queries to fail or produce different results. +{{< /hint >}} + +### `ALTER TABLE ... REPLACE PARTITION FIELD` + +A partition field can be replaced by a new partition field in a single metadata update by using `REPLACE PARTITION FIELD`: + +```sql +ALTER TABLE prod.db.sample REPLACE PARTITION FIELD ts_day WITH day(ts); +-- use optional AS keyword to specify a custom name for the new partition field +ALTER TABLE prod.db.sample REPLACE PARTITION FIELD ts_day WITH day(ts) AS day_of_ts; +``` + +### `ALTER TABLE ... WRITE ORDERED BY` + +Iceberg tables can be configured with a sort order that is used to automatically sort data that is written to the table in some engines. For example, `MERGE INTO` in Spark will use the table ordering. 
+ +To set the write order for a table, use `WRITE ORDERED BY`: + +```sql +ALTER TABLE prod.db.sample WRITE ORDERED BY category, id +-- use optional ASC/DESC keyword to specify sort order of each field (default ASC) +ALTER TABLE prod.db.sample WRITE ORDERED BY category ASC, id DESC +-- use optional NULLS FIRST/NULLS LAST keyword to specify null order of each field (default FIRST) +ALTER TABLE prod.db.sample WRITE ORDERED BY category ASC NULLS LAST, id DESC NULLS FIRST +``` + +{{< hint info >}} +Table write order does not guarantee data order for queries. It only affects how data is written to the table. +{{< /hint >}} + +`WRITE ORDERED BY` sets a global ordering where rows are ordered across tasks, like using `ORDER BY` in an `INSERT` command: + +```sql +INSERT INTO prod.db.sample +SELECT id, data, category, ts FROM another_table +ORDER BY ts, category +``` + +To order within each task, not across tasks, use `LOCALLY ORDERED BY`: + +```sql +ALTER TABLE prod.db.sample WRITE LOCALLY ORDERED BY category, id +``` + +To unset the sort order of the table, use `UNORDERED`: + +```sql +ALTER TABLE prod.db.sample WRITE UNORDERED +``` + +### `ALTER TABLE ... WRITE DISTRIBUTED BY PARTITION` + +`WRITE DISTRIBUTED BY PARTITION` requests that each partition is handled by one writer. The default implementation is hash distribution. + +```sql +ALTER TABLE prod.db.sample WRITE DISTRIBUTED BY PARTITION +``` + +`DISTRIBUTED BY PARTITION` and `LOCALLY ORDERED BY` may be used together to distribute by partition and locally order rows within each task. + +```sql +ALTER TABLE prod.db.sample WRITE DISTRIBUTED BY PARTITION LOCALLY ORDERED BY category, id +``` + +### `ALTER TABLE ... SET IDENTIFIER FIELDS` + +Iceberg supports setting [identifier fields](https://iceberg.apache.org/spec/#identifier-field-ids) to a spec using `SET IDENTIFIER FIELDS`: +A Spark table can support the Flink SQL upsert operation if the table has identifier fields. 
+ +```sql +ALTER TABLE prod.db.sample SET IDENTIFIER FIELDS id +-- single column +ALTER TABLE prod.db.sample SET IDENTIFIER FIELDS id, data +-- multiple columns +``` + +Identifier fields must be `NOT NULL` columns when they are created or added. +A later `ALTER` statement overwrites the previous setting. + +### `ALTER TABLE ... DROP IDENTIFIER FIELDS` + +Identifier fields can be removed using `DROP IDENTIFIER FIELDS`: + +```sql +ALTER TABLE prod.db.sample DROP IDENTIFIER FIELDS id +-- single column +ALTER TABLE prod.db.sample DROP IDENTIFIER FIELDS id, data +-- multiple columns +``` + +Note that although the identifier is removed, the column will still exist in the table schema. + +### Branching and Tagging DDL + +#### `ALTER TABLE ... CREATE BRANCH` + +Branches can be created via the `CREATE BRANCH` statement with the following options: +* Do not fail if the branch already exists with `IF NOT EXISTS` +* Update the branch if it already exists with `CREATE OR REPLACE` +* Create at a snapshot +* Create with retention + +```sql +-- CREATE audit-branch at current snapshot with default retention. +ALTER TABLE prod.db.sample CREATE BRANCH `audit-branch` + +-- CREATE audit-branch at current snapshot with default retention if it doesn't exist. +ALTER TABLE prod.db.sample CREATE BRANCH IF NOT EXISTS `audit-branch` + +-- CREATE audit-branch at current snapshot with default retention or REPLACE it if it already exists. +ALTER TABLE prod.db.sample CREATE OR REPLACE BRANCH `audit-branch` + +-- CREATE audit-branch at snapshot 1234 with default retention. +ALTER TABLE prod.db.sample CREATE BRANCH `audit-branch` +AS OF VERSION 1234 + +-- CREATE audit-branch at snapshot 1234, retain audit-branch for 30 days, and retain the latest 3 snapshots and 2 days worth of snapshots. +ALTER TABLE prod.db.sample CREATE BRANCH `audit-branch` +AS OF VERSION 1234 RETAIN 30 DAYS +WITH SNAPSHOT RETENTION 3 SNAPSHOTS 2 DAYS +``` + +#### `ALTER TABLE ... 
CREATE TAG` + +Tags can be created via the `CREATE TAG` statement with the following options: +* Do not fail if the tag already exists with `IF NOT EXISTS` +* Update the tag if it already exists with `CREATE OR REPLACE` +* Create at a snapshot +* Create with retention + +```sql +-- CREATE historical-tag at current snapshot with default retention. +ALTER TABLE prod.db.sample CREATE TAG `historical-tag` + +-- CREATE historical-tag at current snapshot with default retention if it doesn't exist. +ALTER TABLE prod.db.sample CREATE TAG IF NOT EXISTS `historical-tag` + +-- CREATE historical-tag at current snapshot with default retention or REPLACE it if it already exists. +ALTER TABLE prod.db.sample CREATE OR REPLACE TAG `historical-tag` + +-- CREATE historical-tag at snapshot 1234 with default retention. +ALTER TABLE prod.db.sample CREATE TAG `historical-tag` AS OF VERSION 1234 + +-- CREATE historical-tag at snapshot 1234 and retain it for 1 year. +ALTER TABLE prod.db.sample CREATE TAG `historical-tag` +AS OF VERSION 1234 RETAIN 365 DAYS +``` + +#### `ALTER TABLE ... REPLACE BRANCH` + +The snapshot which a branch references can be updated via +the `REPLACE BRANCH` sql. Retention can also be updated in this statement. + +```sql +-- REPLACE audit-branch to reference snapshot 4567 and update the retention to 60 days. +ALTER TABLE prod.db.sample REPLACE BRANCH `audit-branch` +AS OF VERSION 4567 RETAIN 60 DAYS +``` + +#### `ALTER TABLE ... REPLACE TAG` + +The snapshot which a tag references can be updated via +the `REPLACE TAG` sql. Retention can also be updated in this statement. + +```sql +-- REPLACE historical-tag to reference snapshot 4567 and update the retention to 60 days. +ALTER TABLE prod.db.sample REPLACE TAG `historical-tag` +AS OF VERSION 4567 RETAIN 60 DAYS +``` + +#### `ALTER TABLE ... DROP BRANCH` + +Branches can be removed via the `DROP BRANCH` sql + +```sql +ALTER TABLE prod.db.sample DROP BRANCH `audit-branch` +``` + +#### `ALTER TABLE ... 
DROP TAG` + +Tags can be removed via the `DROP TAG` statement: + +```sql +ALTER TABLE prod.db.sample DROP TAG `historical-tag` +``` \ No newline at end of file diff --git a/docs/spark-getting-started.md b/docs/spark-getting-started.md new file mode 100644 index 000000000000..5789d5b704d2 --- /dev/null +++ b/docs/spark-getting-started.md @@ -0,0 +1,147 @@ +--- +title: "Getting Started" +weight: 200 +url: getting-started +aliases: + - "spark/getting-started" +menu: + main: + parent: Spark + identifier: spark_getting_started + weight: 0 +--- + + +# Getting Started + +The latest version of Iceberg is [{{% icebergVersion %}}](../../../releases). + +Spark is currently the most feature-rich compute engine for Iceberg operations. +We recommend getting started with Spark to understand Iceberg concepts and features with examples. +You can also view documentation on using Iceberg with other compute engines under the [Multi-Engine Support](https://iceberg.apache.org/multi-engine-support) page. + +## Using Iceberg in Spark 3 + +To use Iceberg in a Spark shell, use the `--packages` option: + +```sh +spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{% icebergVersion %}} +``` + +{{< hint info >}} +If you want to include Iceberg in your Spark installation, add the [`iceberg-spark-runtime-3.5_2.12` Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/{{% icebergVersion %}}/iceberg-spark-runtime-3.5_2.12-{{% icebergVersion %}}.jar) to Spark's `jars` folder. +{{< /hint >}} + +### Adding catalogs + +Iceberg comes with [catalogs](../spark-configuration#catalogs) that enable SQL commands to manage tables and load them by name. Catalogs are configured using properties under `spark.sql.catalog.(catalog_name)`. 
+ +This command creates a path-based catalog named `local` for tables under `$PWD/warehouse` and adds support for Iceberg tables to Spark's built-in catalog: + +```sh +spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{{% icebergVersion %}}\ + --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \ + --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \ + --conf spark.sql.catalog.spark_catalog.type=hive \ + --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \ + --conf spark.sql.catalog.local.type=hadoop \ + --conf spark.sql.catalog.local.warehouse=$PWD/warehouse +``` + +### Creating a table + +To create your first Iceberg table in Spark, use the `spark-sql` shell or `spark.sql(...)` to run a [`CREATE TABLE`](../spark-ddl#create-table) command: + +```sql +-- local is the path-based catalog defined above +CREATE TABLE local.db.table (id bigint, data string) USING iceberg; +``` + +Iceberg catalogs support the full range of SQL DDL commands, including: + +* [`CREATE TABLE ... PARTITIONED BY`](../spark-ddl#create-table) +* [`CREATE TABLE ... 
AS SELECT`](../spark-ddl#create-table--as-select) +* [`ALTER TABLE`](../spark-ddl#alter-table) +* [`DROP TABLE`](../spark-ddl#drop-table) + +### Writing + +Once your table is created, insert data using [`INSERT INTO`](../spark-writes#insert-into): + +```sql +INSERT INTO local.db.table VALUES (1, 'a'), (2, 'b'), (3, 'c'); +INSERT INTO local.db.table SELECT id, data FROM source WHERE length(data) = 1; +``` + +Iceberg also adds row-level SQL updates to Spark, [`MERGE INTO`](../spark-writes#merge-into) and [`DELETE FROM`](../spark-writes#delete-from): + +```sql +MERGE INTO local.db.target t USING (SELECT * FROM updates) u ON t.id = u.id +WHEN MATCHED THEN UPDATE SET t.count = t.count + u.count +WHEN NOT MATCHED THEN INSERT *; +``` + +Iceberg supports writing DataFrames using the new [v2 DataFrame write API](../spark-writes#writing-with-dataframes): + +```scala +spark.table("source").select("id", "data") + .writeTo("local.db.table").append() +``` + +The old `write` API is supported, but _not_ recommended. + +### Reading + +To read with SQL, use the Iceberg table's name in a `SELECT` query: + +```sql +SELECT count(1) as count, data +FROM local.db.table +GROUP BY data; +``` + +SQL is also the recommended way to [inspect tables](../spark-queries#inspecting-tables). To view all snapshots in a table, use the `snapshots` metadata table: +```sql +SELECT * FROM local.db.table.snapshots; +``` +``` ++-------------------------+----------------+-----------+-----------+----------------------------------------------------+-----+ +| committed_at | snapshot_id | parent_id | operation | manifest_list | ... | ++-------------------------+----------------+-----------+-----------+----------------------------------------------------+-----+ +| 2019-02-08 03:29:51.215 | 57897183625154 | null | append | s3://.../table/metadata/snap-57897183625154-1.avro | ... | +| | | | | | ... | +| | | | | | ... | +| ... | ... | ... | ... | ... | ... 
| ++-------------------------+----------------+-----------+-----------+----------------------------------------------------+-----+ +``` + +[DataFrame reads](../spark-queries#querying-with-dataframes) are supported and can now reference tables by name using `spark.table`: + +```scala +val df = spark.table("local.db.table") +df.count() +``` + +### Next steps + +Next, you can learn more about Iceberg tables in Spark: + +* [DDL commands](../spark-ddl): `CREATE`, `ALTER`, and `DROP` +* [Querying data](../spark-queries): `SELECT` queries and metadata tables +* [Writing data](../spark-writes): `INSERT INTO` and `MERGE INTO` +* [Maintaining tables](../spark-procedures) with stored procedures diff --git a/docs/spark-procedures.md b/docs/spark-procedures.md new file mode 100644 index 000000000000..ee25de8f3349 --- /dev/null +++ b/docs/spark-procedures.md @@ -0,0 +1,889 @@ +--- +title: "Procedures" +url: spark-procedures +aliases: + - "spark/spark-procedures" +menu: + main: + parent: Spark + identifier: spark_procedures + weight: 0 +--- + + +# Spark Procedures + +To use Iceberg in Spark, first configure [Spark catalogs](../spark-configuration). Stored procedures are only available when using [Iceberg SQL extensions](../spark-configuration#sql-extensions) in Spark 3. + +## Usage + +Procedures can be used from any configured Iceberg catalog with `CALL`. All procedures are in the namespace `system`. + +`CALL` supports passing arguments by name (recommended) or by position. Mixing position and named arguments is not supported. + +### Named arguments + +All procedure arguments are named. When passing arguments by name, arguments can be in any order and any optional argument can be omitted. + +```sql +CALL catalog_name.system.procedure_name(arg_name_2 => arg_2, arg_name_1 => arg_1); +``` + +### Positional arguments + +When passing arguments by position, only the ending arguments may be omitted if they are optional. 
+ +```sql +CALL catalog_name.system.procedure_name(arg_1, arg_2, ... arg_n); +``` + +## Snapshot management + +### `rollback_to_snapshot` + +Roll back a table to a specific snapshot ID. + +To roll back to a specific time, use [`rollback_to_timestamp`](#rollback_to_timestamp). + +{{< hint info >}} +This procedure invalidates all cached Spark plans that reference the affected table. +{{< /hint >}} + +#### Usage + +| Argument Name | Required? | Type | Description | +|---------------|-----------|------|-------------| +| `table` | ✔️ | string | Name of the table to update | +| `snapshot_id` | ✔️ | long | Snapshot ID to rollback to | + +#### Output + +| Output Name | Type | Description | +| ------------|------|-------------| +| `previous_snapshot_id` | long | The current snapshot ID before the rollback | +| `current_snapshot_id` | long | The new current snapshot ID | + +#### Example + +Roll back table `db.sample` to snapshot ID `1`: + +```sql +CALL catalog_name.system.rollback_to_snapshot('db.sample', 1); +``` + +### `rollback_to_timestamp` + +Roll back a table to the snapshot that was current at some time. + +{{< hint info >}} +This procedure invalidates all cached Spark plans that reference the affected table. +{{< /hint >}} + +#### Usage + +| Argument Name | Required? | Type | Description | +|---------------|-----------|------|-------------| +| `table` | ✔️ | string | Name of the table to update | +| `timestamp` | ✔️ | timestamp | A timestamp to rollback to | + +#### Output + +| Output Name | Type | Description | +| ------------|------|-------------| +| `previous_snapshot_id` | long | The current snapshot ID before the rollback | +| `current_snapshot_id` | long | The new current snapshot ID | + +#### Example + +Roll back `db.sample` to a specific day and time. +```sql +CALL catalog_name.system.rollback_to_timestamp('db.sample', TIMESTAMP '2021-06-30 00:00:00.000'); +``` + +### `set_current_snapshot` + +Sets the current snapshot ID for a table. 
+ +Unlike rollback, the snapshot is not required to be an ancestor of the current table state. + +{{< hint info >}} +This procedure invalidates all cached Spark plans that reference the affected table. +{{< /hint >}} + +#### Usage + +| Argument Name | Required? | Type | Description | +|---------------|-----------|------|-------------| +| `table` | ✔️ | string | Name of the table to update | +| `snapshot_id` | | long | Snapshot ID to set as current | +| `ref` | | string | Snapshot Reference (branch or tag) to set as current | + +Either `snapshot_id` or `ref` must be provided but not both. + +#### Output + +| Output Name | Type | Description | +| ------------|------|-------------| +| `previous_snapshot_id` | long | The current snapshot ID before the rollback | +| `current_snapshot_id` | long | The new current snapshot ID | + +#### Example + +Set the current snapshot for `db.sample` to 1: +```sql +CALL catalog_name.system.set_current_snapshot('db.sample', 1); +``` + +Set the current snapshot for `db.sample` to tag `s1`: +```sql +CALL catalog_name.system.set_current_snapshot(table => 'db.sample', ref => 's1'); +``` + +### `cherrypick_snapshot` + +Cherry-picks changes from a snapshot into the current table state. + +Cherry-picking creates a new snapshot from an existing snapshot without altering or removing the original. + +Only append and dynamic overwrite snapshots can be cherry-picked. + +{{< hint info >}} +This procedure invalidates all cached Spark plans that reference the affected table. +{{< /hint >}} + +#### Usage + +| Argument Name | Required? 
| Type | Description | +|---------------|-----------|------|-------------| +| `table` | ✔️ | string | Name of the table to update | +| `snapshot_id` | ✔️ | long | The snapshot ID to cherry-pick | + +#### Output + +| Output Name | Type | Description | +| ------------|------|-------------| +| `source_snapshot_id` | long | The table's current snapshot before the cherry-pick | +| `current_snapshot_id` | long | The snapshot ID created by applying the cherry-pick | + +#### Examples + +Cherry-pick snapshot 1 +```sql +CALL catalog_name.system.cherrypick_snapshot('my_table', 1); +``` + +Cherry-pick snapshot 1 with named args +```sql +CALL catalog_name.system.cherrypick_snapshot(snapshot_id => 1, table => 'my_table' ); +``` + +### `publish_changes` + +Publish changes from a staged WAP ID into the current table state. + +publish_changes creates a new snapshot from an existing snapshot without altering or removing the original. + +Only append and dynamic overwrite snapshots can be successfully published. + +{{< hint info >}} +This procedure invalidates all cached Spark plans that reference the affected table. +{{< /hint >}} + +#### Usage + +| Argument Name | Required? 
| Type | Description | +|---------------|-----------|------|-------------| +| `table` | ✔️ | string | Name of the table to update | +| `wap_id` | ✔️ | string | The wap_id to be published from stage to prod | + +#### Output + +| Output Name | Type | Description | +| ------------|------|-------------| +| `source_snapshot_id` | long | The table's current snapshot before publishing the change | +| `current_snapshot_id` | long | The snapshot ID created by applying the change | + +#### Examples + +publish_changes with WAP ID 'wap_id_1' +```sql +CALL catalog_name.system.publish_changes('my_table', 'wap_id_1'); +``` + +publish_changes with named args +```sql +CALL catalog_name.system.publish_changes(wap_id => 'wap_id_2', table => 'my_table'); +``` + +### `fast_forward` + +Fast-forward the current snapshot of one branch to the latest snapshot of another. + +#### Usage + +| Argument Name | Required? | Type | Description | +|---------------|-----------|------|-------------| +| `table` | ✔️ | string | Name of the table to update | +| `branch` | ✔️ | string | Name of the branch to fast-forward | +| `to` | ✔️ | string | Name of the branch to be fast-forwarded to | + +#### Output + +| Output Name | Type | Description | +| ------------|------|-------------| +| `branch_updated` | string | Name of the branch that has been fast-forwarded | +| `previous_ref` | long | The snapshot ID before applying fast-forward | +| `updated_ref` | long | The current snapshot ID after applying fast-forward | + +#### Examples + +Fast-forward the main branch to the head of `audit-branch` +```sql +CALL catalog_name.system.fast_forward('my_table', 'main', 'audit-branch'); +``` + + + +## Metadata management + +Many [maintenance actions](../maintenance) can be performed using Iceberg stored procedures. + +### `expire_snapshots` + +Each write/update/delete/upsert/compaction in Iceberg produces a new snapshot while keeping the old data and metadata +around for snapshot isolation and time travel. 
The `expire_snapshots` procedure can be used to remove older snapshots +and their files which are no longer needed. + +This procedure will remove old snapshots and data files which are uniquely required by those old snapshots. This means +the `expire_snapshots` procedure will never remove files which are still required by a non-expired snapshot. + +#### Usage + +| Argument Name | Required? | Type | Description | +|---------------|-----------|------|-------------| +| `table` | ✔️ | string | Name of the table to update | +| `older_than` | ️ | timestamp | Timestamp before which snapshots will be removed (Default: 5 days ago) | +| `retain_last` | | int | Number of ancestor snapshots to preserve regardless of `older_than` (defaults to 1) | +| `max_concurrent_deletes` | | int | Size of the thread pool used for delete file actions (by default, no thread pool is used) | +| `stream_results` | | boolean | When true, deletion files will be sent to Spark driver by RDD partition (by default, all the files will be sent to Spark driver). This option is recommended to set to `true` to prevent Spark driver OOM from large file size | +| `snapshot_ids` | | array of long | Array of snapshot IDs to expire. | + +If `older_than` and `retain_last` are omitted, the table's [expiration properties](../configuration/#table-behavior-properties) will be used. +Snapshots that are still referenced by branches or tags won't be removed. By default, branches and tags never expire, but their retention policy can be changed with the table property `history.expire.max-ref-age-ms`. The `main` branch never expires. 
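+
+As a minimal sketch of combining the optional arguments above by name (the catalog name, table name, timestamp, and retention count are illustrative values, not recommendations):
+
+```sql
+-- Expire snapshots older than the given timestamp, but always keep the
+-- 10 most recent ancestor snapshots. stream_results => true sends the
+-- files to delete to the driver by RDD partition to reduce the risk of driver OOM.
+CALL catalog_name.system.expire_snapshots(
+  table => 'db.sample',
+  older_than => TIMESTAMP '2021-06-30 00:00:00.000',
+  retain_last => 10,
+  stream_results => true
+);
+```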
+ +#### Output + +| Output Name | Type | Description | +| ------------|------|-------------| +| `deleted_data_files_count` | long | Number of data files deleted by this operation | +| `deleted_position_delete_files_count` | long | Number of position delete files deleted by this operation | +| `deleted_equality_delete_files_count` | long | Number of equality delete files deleted by this operation | +| `deleted_manifest_files_count` | long | Number of manifest files deleted by this operation | +| `deleted_manifest_lists_count` | long | Number of manifest List files deleted by this operation | + +#### Examples + +Remove snapshots older than specific day and time, but retain the last 100 snapshots: + +```sql +CALL hive_prod.system.expire_snapshots('db.sample', TIMESTAMP '2021-06-30 00:00:00.000', 100); +``` + +Remove snapshots with snapshot ID `123` (note that this snapshot ID should not be the current snapshot): + +```sql +CALL hive_prod.system.expire_snapshots(table => 'db.sample', snapshot_ids => ARRAY(123)); +``` + +### `remove_orphan_files` + +Used to remove files which are not referenced in any metadata files of an Iceberg table and can thus be considered "orphaned". + +#### Usage + +| Argument Name | Required? 
| Type | Description | +|---------------|-----------|------|-------------| +| `table` | ✔️ | string | Name of the table to clean | +| `older_than` | ️ | timestamp | Remove orphan files created before this timestamp (Defaults to 3 days ago) | +| `location` | | string | Directory to look for files in (defaults to the table's location) | +| `dry_run` | | boolean | When true, don't actually remove files (defaults to false) | +| `max_concurrent_deletes` | | int | Size of the thread pool used for delete file actions (by default, no thread pool is used) | + +#### Output + +| Output Name | Type | Description | +| ------------|------|-------------| +| `orphan_file_location` | String | The path to each file determined to be an orphan by this command | + +#### Examples + +List all the files that are candidates for removal by performing a dry run of the `remove_orphan_files` command on this table without actually removing them: +```sql +CALL catalog_name.system.remove_orphan_files(table => 'db.sample', dry_run => true); +``` + +Remove any files in the `tablelocation/data` folder which are not known to the table `db.sample`. +```sql +CALL catalog_name.system.remove_orphan_files(table => 'db.sample', location => 'tablelocation/data'); +``` + +### `rewrite_data_files` + +Iceberg tracks each data file in a table. More data files lead to more metadata stored in manifest files, and small data files cause an unnecessary amount of metadata and less efficient queries because of file open costs. + +Iceberg can compact data files in parallel using Spark with the `rewriteDataFiles` action. This will combine small files into larger files to reduce metadata overhead and runtime file open cost. + +#### Usage + +| Argument Name | Required? | Type | Description | +|---------------|-----------|------|-------------| +| `table` | ✔️ | string | Name of the table to update | +| `strategy` | | string | Name of the strategy - binpack or sort. 
Defaults to binpack strategy | +| `sort_order` | | string | For Zorder use a comma separated list of columns within zorder(). Example: zorder(c1,c2,c3).
Otherwise, use comma-separated sort orders in the format (ColumnName SortDirection NullOrder).
Where SortDirection can be ASC or DESC. NullOrder can be NULLS FIRST or NULLS LAST.
Defaults to the table's sort order | +| `options` | ️ | map | Options to be used for actions| +| `where` | ️ | string | predicate as a string used for filtering the files. Note that all files that may contain data matching the filter will be selected for rewriting| + +#### Options + +##### General Options +| Name | Default Value | Description | +|------|---------------|-------------| +| `max-concurrent-file-group-rewrites` | 5 | Maximum number of file groups to be simultaneously rewritten | +| `partial-progress.enabled` | false | Enable committing groups of files prior to the entire rewrite completing | +| `partial-progress.max-commits` | 10 | Maximum amount of commits that this rewrite is allowed to produce if partial progress is enabled | +| `use-starting-sequence-number` | true | Use the sequence number of the snapshot at compaction start time instead of that of the newly produced snapshot | +| `rewrite-job-order` | none | Force the rewrite job order based on the value.
  • If rewrite-job-order=bytes-asc, then rewrite the smallest job groups first.
  • If rewrite-job-order=bytes-desc, then rewrite the largest job groups first.
  • If rewrite-job-order=files-asc, then rewrite the job groups with the least files first.
  • If rewrite-job-order=files-desc, then rewrite the job groups with the most files first.
  • If rewrite-job-order=none, then rewrite job groups in the order they were planned (no specific ordering).
| +| `target-file-size-bytes` | 536870912 (512 MB, default value of `write.target-file-size-bytes` from [table properties](../configuration/#write-properties)) | Target output file size | +| `min-file-size-bytes` | 75% of target file size | Files under this threshold will be considered for rewriting regardless of any other criteria | +| `max-file-size-bytes` | 180% of target file size | Files with sizes above this threshold will be considered for rewriting regardless of any other criteria | +| `min-input-files` | 5 | Any file group exceeding this number of files will be rewritten regardless of other criteria | +| `rewrite-all` | false | Force rewriting of all provided files overriding other options | +| `max-file-group-size-bytes` | 107374182400 (100GB) | Largest amount of data that should be rewritten in a single file group. The entire rewrite operation is broken down into pieces based on partitioning and within partitions based on size into file-groups. This helps with breaking down the rewriting of very large partitions which may not be rewritable otherwise due to the resource constraints of the cluster. | +| `delete-file-threshold` | 2147483647 | Minimum number of deletes that needs to be associated with a data file for it to be considered for rewriting | + + +##### Options for sort strategy + +| Name | Default Value | Description | +|------|---------------|-------------| +| `compression-factor` | 1.0 | The number of shuffle partitions and consequently the number of output files created by the Spark sort is based on the size of the input data files used in this file rewriter. Due to compression, the disk file sizes may not accurately represent the size of files in the output. This parameter lets the user adjust the file size used for estimating actual output data size. A factor greater than 1.0 would generate more files than we would expect based on the on-disk file size. 
A value less than 1.0 would create fewer files than we would expect based on the on-disk size. | +| `shuffle-partitions-per-file` | 1 | Number of shuffle partitions to use for each output file. Iceberg will use a custom coalesce operation to stitch these sorted partitions back together into a single sorted file. | + +##### Options for sort strategy with zorder sort_order + +| Name | Default Value | Description | +|------|---------------|-------------| +| `var-length-contribution` | 8 | Number of bytes considered from an input column of a type with variable length (String, Binary) | +| `max-output-size` | 2147483647 | Amount of bytes interleaved in the ZOrder algorithm | + +#### Output + +| Output Name | Type | Description | +| ------------|------|-------------| +| `rewritten_data_files_count` | int | Number of data files which were rewritten by this command | +| `added_data_files_count` | int | Number of new data files which were written by this command | +| `rewritten_bytes_count` | long | Number of bytes which were written by this command | +| `failed_data_files_count` | int | Number of data files that failed to be rewritten when `partial-progress.enabled` is true | + +#### Examples + +Rewrite the data files in table `db.sample` using the default rewrite algorithm of bin-packing to combine small files +and also split large files according to the default write size of the table. +```sql +CALL catalog_name.system.rewrite_data_files('db.sample'); +``` + +Rewrite the data files in table `db.sample` by sorting all the data on id and name +using the same defaults as bin-pack to determine which files to rewrite. +```sql +CALL catalog_name.system.rewrite_data_files(table => 'db.sample', strategy => 'sort', sort_order => 'id DESC NULLS LAST,name ASC NULLS FIRST'); +``` + +Rewrite the data files in table `db.sample` by Z-ordering on columns c1 and c2, +using the same defaults as bin-pack to determine which files to rewrite. 
+```sql +CALL catalog_name.system.rewrite_data_files(table => 'db.sample', strategy => 'sort', sort_order => 'zorder(c1,c2)'); +``` + +Rewrite the data files in table `db.sample` using bin-pack strategy in any partition where 2 or more files need to be rewritten. +```sql +CALL catalog_name.system.rewrite_data_files(table => 'db.sample', options => map('min-input-files','2')); +``` + +Rewrite the data files in table `db.sample` and select the files that may contain data matching the filter (id = 3 and name = "foo") to be rewritten. +```sql +CALL catalog_name.system.rewrite_data_files(table => 'db.sample', where => 'id = 3 and name = "foo"'); +``` + +### `rewrite_manifests` + +Rewrite manifests for a table to optimize scan planning. + +Data files in manifests are sorted by fields in the partition spec. This procedure runs in parallel using a Spark job. + +{{< hint info >}} +This procedure invalidates all cached Spark plans that reference the affected table. +{{< /hint >}} + +#### Usage + +| Argument Name | Required? | Type | Description | +|---------------|-----------|------|---------------------------------------------------------------| +| `table` | ✔️ | string | Name of the table to update | +| `use_caching` | ️ | boolean | Use Spark caching during operation (defaults to true) | +| `spec_id` | ️ | int | Spec id of the manifests to rewrite (defaults to current spec id) | + +#### Output + +| Output Name | Type | Description | +| ------------|------|-------------| +| `rewritten_manifests_count` | int | Number of manifests which were rewritten by this command | +| `added_manifests_count` | int | Number of new manifest files which were written by this command | + +#### Examples + +Rewrite the manifests in table `db.sample` and align manifest files with table partitioning. +```sql +CALL catalog_name.system.rewrite_manifests('db.sample'); +``` + +Rewrite the manifests in table `db.sample` and disable the use of Spark caching. 
This could be done to avoid memory issues on executors. +```sql +CALL catalog_name.system.rewrite_manifests('db.sample', false); +``` + +### `rewrite_position_delete_files` + +Iceberg can rewrite position delete files, which serves two purposes: +* Minor Compaction: Compact small position delete files into larger ones. This reduces the size of metadata stored in manifest files and overhead of opening small delete files. +* Remove Dangling Deletes: Filter out position delete records that refer to data files that are no longer live. After rewrite_data_files, position delete records pointing to the rewritten data files are not always marked for removal, and can remain tracked by the table's live snapshot metadata. This is known as the 'dangling delete' problem. + +#### Usage + +| Argument Name | Required? | Type | Description | +|---------------|-----------|------|----------------------------------| +| `table` | ✔️ | string | Name of the table to update | +| `options` | ️ | map | Options to be used for procedure | + +Dangling deletes are always filtered out during rewriting. + +#### Options + +| Name | Default Value | Description | +|------|---------------|-------------| +| `max-concurrent-file-group-rewrites` | 5 | Maximum number of file groups to be simultaneously rewritten | +| `partial-progress.enabled` | false | Enable committing groups of files prior to the entire rewrite completing | +| `partial-progress.max-commits` | 10 | Maximum amount of commits that this rewrite is allowed to produce if partial progress is enabled | +| `rewrite-job-order` | none | Force the rewrite job order based on the value.
  • If rewrite-job-order=bytes-asc, then rewrite the smallest job groups first.
  • If rewrite-job-order=bytes-desc, then rewrite the largest job groups first.
  • If rewrite-job-order=files-asc, then rewrite the job groups with the least files first.
  • If rewrite-job-order=files-desc, then rewrite the job groups with the most files first.
  • If rewrite-job-order=none, then rewrite job groups in the order they were planned (no specific ordering).
| +| `target-file-size-bytes` | 67108864 (64MB, default value of `write.delete.target-file-size-bytes` from [table properties](../configuration/#write-properties)) | Target output file size | +| `min-file-size-bytes` | 75% of target file size | Files under this threshold will be considered for rewriting regardless of any other criteria | +| `max-file-size-bytes` | 180% of target file size | Files with sizes above this threshold will be considered for rewriting regardless of any other criteria | +| `min-input-files` | 5 | Any file group exceeding this number of files will be rewritten regardless of other criteria | +| `rewrite-all` | false | Force rewriting of all provided files overriding other options | +| `max-file-group-size-bytes` | 107374182400 (100GB) | Largest amount of data that should be rewritten in a single file group. The entire rewrite operation is broken down into pieces based on partitioning and within partitions based on size into file-groups. This helps with breaking down the rewriting of very large partitions which may not be rewritable otherwise due to the resource constraints of the cluster. | + +#### Output + +| Output Name | Type | Description | +|--------------------------------|------|----------------------------------------------------------------------------| +| `rewritten_delete_files_count` | int | Number of delete files which were removed by this command | +| `added_delete_files_count` | int | Number of delete files which were added by this command | +| `rewritten_bytes_count` | long | Count of bytes across delete files which were removed by this command | +| `added_bytes_count` | long | Count of bytes across all new delete files which were added by this command | + + +#### Examples + +Rewrite position delete files in table `db.sample`. This selects position delete files that fit default rewrite criteria, and writes new files of target size `target-file-size-bytes`. Dangling deletes are removed from rewritten delete files. 
+```sql +CALL catalog_name.system.rewrite_position_delete_files('db.sample'); +``` + +Rewrite all position delete files in table `db.sample`, writing new files of target size `target-file-size-bytes`. Dangling deletes are removed from rewritten delete files. +```sql +CALL catalog_name.system.rewrite_position_delete_files(table => 'db.sample', options => map('rewrite-all', 'true')); +``` + +Rewrite position delete files in table `db.sample`. This selects position delete files in partitions where 2 or more position delete files need to be rewritten based on size criteria. Dangling deletes are removed from rewritten delete files. +```sql +CALL catalog_name.system.rewrite_position_delete_files(table => 'db.sample', options => map('min-input-files','2')); +``` + +## Table migration + +The `snapshot` and `migrate` procedures help test and migrate existing Hive or Spark tables to Iceberg. + +### `snapshot` + +Create a light-weight temporary copy of a table for testing, without changing the source table. + +The newly created table can be changed or written to without affecting the source table, but the snapshot uses the original table's data files. + +When inserts or overwrites run on the snapshot, new files are placed in the snapshot table's location rather than the original table location. + +When finished testing a snapshot table, clean it up by running `DROP TABLE`. + +{{< hint info >}} +Because tables created by `snapshot` are not the sole owners of their data files, they are prohibited from +actions like `expire_snapshots` which would physically delete data files. Iceberg deletes, which only affect metadata, +are still allowed. In addition, any operations which affect the original data files will disrupt the snapshot's +integrity. DELETE statements executed against the original Hive table will remove original data files and the +`snapshot` table will no longer be able to access them. +{{< /hint >}} + +See [`migrate`](#migrate) to replace an existing table with an Iceberg table. 
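+
+A minimal sketch of the test-and-clean-up workflow described above, assuming `catalog_name` is the Iceberg catalog that will hold the new table and `db.snap` is an unused table name (both are illustrative):
+
+```sql
+-- Create a snapshot table that references the data files of db.sample
+CALL catalog_name.system.snapshot(source_table => 'db.sample', table => 'db.snap');
+
+-- Run test reads (or writes) against the snapshot table; the source table is not changed
+SELECT count(*) FROM catalog_name.db.snap;
+
+-- Drop the snapshot table once testing is finished
+DROP TABLE catalog_name.db.snap;
+```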
+ +#### Usage + +| Argument Name | Required? | Type | Description | +|---------------|-----------|------|-------------| +| `source_table`| ✔️ | string | Name of the table to snapshot | +| `table` | ✔️ | string | Name of the new Iceberg table to create | +| `location` | | string | Table location for the new table (delegated to the catalog by default) | +| `properties` | ️ | map | Properties to add to the newly created table | + +#### Output + +| Output Name | Type | Description | +| ------------|------|-------------| +| `imported_files_count` | long | Number of files added to the new table | + +#### Examples + +Make an isolated Iceberg table named `db.snap` which references table `db.sample`, at the +catalog's default location for `db.snap`. +```sql +CALL catalog_name.system.snapshot('db.sample', 'db.snap'); +``` + +Make an isolated Iceberg table named `db.snap` which references table `db.sample`, at +a manually specified location `/tmp/temptable/`. +```sql +CALL catalog_name.system.snapshot('db.sample', 'db.snap', '/tmp/temptable/'); +``` + +### `migrate` + +Replace a table with an Iceberg table, loaded with the source's data files. + +Table schema, partitioning, properties, and location will be copied from the source table. + +Migrate will fail if any table partition uses an unsupported format. Supported formats are Avro, Parquet, and ORC. +Existing data files are added to the Iceberg table's metadata and can be read using a name-to-id mapping created from the original table schema. + +To leave the original table intact while testing, use [`snapshot`](#snapshot) to create a new temporary table that shares source data files and schema. + +By default, the original table is retained with the name `table_BACKUP_`. + +#### Usage + +| Argument Name | Required? 
| Type | Description | +|---------------|-----------|------|-------------| +| `table` | ✔️ | string | Name of the table to migrate | +| `properties` | ️ | map | Properties for the new Iceberg table | +| `drop_backup` | | boolean | When true, the original table will not be retained as backup (defaults to false) | +| `backup_table_name` | | string | Name of the table that will be retained as backup (defaults to `table_BACKUP_`) | + +#### Output + +| Output Name | Type | Description | +| ------------|------|-------------| +| `migrated_files_count` | long | Number of files appended to the Iceberg table | + +#### Examples + +Migrate the table `db.sample` in Spark's default catalog to an Iceberg table and add a property 'foo' set to 'bar': + +```sql +CALL catalog_name.system.migrate('spark_catalog.db.sample', map('foo', 'bar')); +``` + +Migrate `db.sample` in the current catalog to an Iceberg table without adding any additional properties: +```sql +CALL catalog_name.system.migrate('db.sample'); +``` + +### `add_files` + +Attempts to directly add files from a Hive or file based table into a given Iceberg table. Unlike `migrate` or +`snapshot`, `add_files` can import files from a specific partition or partitions and does not create a new Iceberg table. +This command will create metadata for the new files and will not move them. This procedure will not analyze the schema +of the files to determine if they actually match the schema of the Iceberg table. Upon completion, the Iceberg table +will then treat these files as if they are part of the set of files owned by Iceberg. This means any subsequent +`expire_snapshots` calls will be able to physically delete the added files. This method should not be used if +`migrate` or `snapshot` are possible. + +{{< hint warning >}} +Keep in mind the `add_files` procedure will fetch the Parquet metadata from each file being added just once. 
If you're using tiered storage (such as [Amazon S3 Intelligent-Tiering storage class](https://aws.amazon.com/s3/storage-classes/intelligent-tiering/)), the underlying file will be retrieved from the archive and will remain on a higher tier for a set period of time. +{{< /hint >}} + +#### Usage + +| Argument Name | Required? | Type | Description | +|-------------------------|-----------|---------------------|-----------------------------------------------------------------------------------------------------| +| `table` | ✔️ | string | Table which will have files added to | +| `source_table` | ✔️ | string | Table where files should come from, paths are also possible in the form of \`file_format\`.\`path\` | +| `partition_filter` | ️ | map | A map of partitions in the source table to import from | +| `check_duplicate_files` | ️ | boolean | Whether to prevent files existing in the table from being added (defaults to true) | +| `parallelism` | | int | Number of threads to use for file reading (defaults to 1) | + +Warning: The schema is not validated; adding files with a different schema to the Iceberg table will cause issues. + +Warning: Files added by this method can be physically deleted by Iceberg operations. + +#### Output + +| Output Name | Type | Description | +|---------------------------|------|---------------------------------------------------| +| `added_files_count` | long | The number of files added by this command | +| `changed_partition_count` | long | The number of partitions changed by this command (if known) | + +{{< hint warning >}} +`changed_partition_count` will be NULL when table property `compatibility.snapshot-id-inheritance.enabled` is set to true or if the table format version is > 1. +{{< /hint >}} +#### Examples + +Add the files from table `db.src_tbl`, a Hive or Spark table registered in the session catalog, to Iceberg table +`db.tbl`. Only add files that exist within partitions where `part_col_1` is equal to `A`. 
+```sql +CALL spark_catalog.system.add_files( +table => 'db.tbl', +source_table => 'db.src_tbl', +partition_filter => map('part_col_1', 'A') +); +``` + +Add files from a `parquet` file based table at location `path/to/table` to the Iceberg table `db.tbl`. Add all +files regardless of what partition they belong to. +```sql +CALL spark_catalog.system.add_files( + table => 'db.tbl', + source_table => '`parquet`.`path/to/table`' +); +``` + +### `register_table` + +Creates a catalog entry for a metadata.json file which already exists but does not have a corresponding catalog identifier. + +#### Usage + +| Argument Name | Required? | Type | Description | +|---------------|-----------|------|-------------| +| `table` | ✔️ | string | Table which is to be registered | +| `metadata_file`| ✔️ | string | Metadata file which is to be registered as a new catalog identifier | + +{{< hint warning >}} +Having the same metadata.json registered in more than one catalog can lead to missing updates, loss of data, and table corruption. +Only use this procedure when the table is no longer registered in an existing catalog, or you are moving a table between catalogs. +{{< /hint >}} + +#### Output + +| Output Name | Type | Description | +| ------------|------|-------------| +| `current_snapshot_id` | long | The current snapshot ID of the newly registered Iceberg table | +| `total_records_count` | long | Total records count of the newly registered Iceberg table | +| `total_data_files_count` | long | Total data files count of the newly registered Iceberg table | + +#### Examples + +Register a new table as `db.tbl` to `spark_catalog` pointing to metadata.json file `path/to/metadata/file.json`. +```sql +CALL spark_catalog.system.register_table( + table => 'db.tbl', + metadata_file => 'path/to/metadata/file.json' +); +``` + +## Metadata information + +### `ancestors_of` + +Report the live snapshot IDs of parents of a specified snapshot + +#### Usage + +| Argument Name | Required? 
| Type | Description | +|---------------|-----------|------|-------------| +| `table` | ✔️ | string | Name of the table to report live snapshot IDs | +| `snapshot_id` | ️ | long | Use a specified snapshot to get the live snapshot IDs of parents | + +> tip : Using snapshot_id +> +> Given snapshots history with roll back to B and addition of C' -> D' +> ```shell +> A -> B - > C -> D +> \ -> C' -> (D') +> ``` +> Not specifying the snapshot ID would return A -> B -> C' -> D', while providing the snapshot ID of +> D as an argument would return A-> B -> C -> D + +#### Output + +| Output Name | Type | Description | +| ------------|------|-------------| +| `snapshot_id` | long | the ancestor snapshot id | +| `timestamp` | long | snapshot creation time | + +#### Examples + +Get all the snapshot ancestors of current snapshots(default) +```sql +CALL spark_catalog.system.ancestors_of('db.tbl'); +``` + +Get all the snapshot ancestors by a particular snapshot +```sql +CALL spark_catalog.system.ancestors_of('db.tbl', 1); +CALL spark_catalog.system.ancestors_of(snapshot_id => 1, table => 'db.tbl'); +``` + +## Change Data Capture + +### `create_changelog_view` + +Creates a view that contains the changes from a given table. + +#### Usage + +| Argument Name | Required? | Type | Description | +|----------------------|-----------|---------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `table` | ✔️ | string | Name of the source table for the changelog | +| `changelog_view` | | string | Name of the view to create | +| `options` | | map | A map of Spark read options to use | +| `net_changes` | | boolean | Whether to output net changes (see below for more information). Defaults to false. | +| `compute_updates` | | boolean | Whether to compute pre/post update images (see below for more information). Defaults to false. 
| +| `identifier_columns` | | array | The list of identifier columns to compute updates. If the argument `compute_updates` is set to true and `identifier_columns` are not provided, the table’s current identifier fields will be used. | + +Here is a list of commonly used Spark read options: +* `start-snapshot-id`: the exclusive start snapshot ID. If not provided, it reads from the table’s first snapshot inclusively. +* `end-snapshot-id`: the inclusive end snapshot id, default to table's current snapshot. +* `start-timestamp`: the exclusive start timestamp. If not provided, it reads from the table’s first snapshot inclusively. +* `end-timestamp`: the inclusive end timestamp, default to table's current snapshot. + +#### Output +| Output Name | Type | Description | +| ------------|------|----------------------------------------| +| `changelog_view` | string | The name of the created changelog view | + +#### Examples + +Create a changelog view `tbl_changes` based on the changes that happened between snapshot `1` (exclusive) and `2` (inclusive). +```sql +CALL spark_catalog.system.create_changelog_view( + table => 'db.tbl', + options => map('start-snapshot-id','1','end-snapshot-id', '2') +); +``` + +Create a changelog view `my_changelog_view` based on the changes that happened between timestamp `1678335750489` (exclusive) and `1678992105265` (inclusive). +```sql +CALL spark_catalog.system.create_changelog_view( + table => 'db.tbl', + options => map('start-timestamp','1678335750489','end-timestamp', '1678992105265'), + changelog_view => 'my_changelog_view' +); +``` + +Create a changelog view that computes updates based on the identifier columns `id` and `name`. +```sql +CALL spark_catalog.system.create_changelog_view( + table => 'db.tbl', + options => map('start-snapshot-id','1','end-snapshot-id', '2'), + identifier_columns => array('id', 'name') +) +``` + +Once the changelog view is created, you can query the view to see the changes that happened between the snapshots. 
+```sql +SELECT * FROM tbl_changes; +``` +```sql +SELECT * FROM tbl_changes WHERE _change_type = 'INSERT' AND id = 3 ORDER BY _change_ordinal; +``` +Please note that the changelog view includes Change Data Capture (CDC) metadata columns +that provide additional information about the changes being tracked. These columns are: +- `_change_type`: the type of change. It has one of the following values: `INSERT`, `DELETE`, `UPDATE_BEFORE`, or `UPDATE_AFTER`. +- `_change_ordinal`: the order of changes +- `_commit_snapshot_id`: the snapshot ID where the change occurred + +Here is an example of corresponding results. It shows that the first snapshot inserted 2 records, and the +second snapshot deleted 1 record. + +| id | name |_change_type | _change_ordinal | _commit_snapshot_id | +|---|--------|---|---|---| +|1 | Alice |INSERT |0 |5390529835796506035| +|2 | Bob |INSERT |0 |5390529835796506035| +|1 | Alice |DELETE |1 |8764748981452218370| + +Create a changelog view that computes net changes. It removes intermediate changes and only outputs the net changes. +```sql +CALL spark_catalog.system.create_changelog_view( + table => 'db.tbl', + options => map('end-snapshot-id', '8764748981452218370'), + net_changes => true +); +``` + +With the net changes, the above changelog view only contains the following row since Alice was inserted in the first snapshot and deleted in the second snapshot. + +| id | name |_change_type | _change_ordinal | _commit_snapshot_id | +|---|--------|---|---|---| +|2 | Bob |INSERT |0 |5390529835796506035| + + +#### Carry-over Rows + +The procedure removes the carry-over rows by default. Carry-over rows are the result of row-level operations (`MERGE`, `UPDATE` and `DELETE`) +when using copy-on-write. For example, consider a file that contains row1 `(id=1, name='Alice')` and row2 `(id=2, name='Bob')`. +A copy-on-write delete of row2 would require erasing this file and preserving row1 in a new file. 
The changelog table +reports this as the following pair of rows, despite it not being an actual change to the table. + +| id | name | _change_type | +|-----|-------|--------------| +| 1 | Alice | DELETE | +| 1 | Alice | INSERT | + +To see carry-over rows, query `SparkChangelogTable` as follows: +```sql +SELECT * FROM spark_catalog.db.tbl.changes; +``` + +#### Pre/Post Update Images + +The procedure computes the pre/post update images if configured. Pre/post update images are converted from a +pair of a delete row and an insert row. Identifier columns are used for determining whether an insert and a delete record +refer to the same row. If the two records share the same values for the identifier columns, they are considered to be the before +and after states of the same row. You can either set identifier fields in the table schema or pass them as procedure parameters. + +The following example shows pre/post update image computation with an identifier column (`id`), where a row deletion +and an insertion with the same `id` are treated as a single update operation. 
Specifically, suppose we have the following pair of rows: + +| id | name | _change_type | +|-----|--------|--------------| +| 3 | Robert | DELETE | +| 3 | Dan | INSERT | + +In this case, the procedure marks the row before the update as an `UPDATE_BEFORE` image and the row after the update +as an `UPDATE_AFTER` image, resulting in the following pre/post update images: + +| id | name | _change_type | +|-----|--------|--------------| +| 3 | Robert | UPDATE_BEFORE| +| 3 | Dan | UPDATE_AFTER | diff --git a/docs/spark-queries.md b/docs/spark-queries.md new file mode 100644 index 000000000000..c55d70e9e4b8 --- /dev/null +++ b/docs/spark-queries.md @@ -0,0 +1,478 @@ +--- +title: "Queries" +url: spark-queries +aliases: + - "spark/spark-queries" +menu: + main: + parent: Spark + identifier: spark_queries + weight: 0 +--- + + +# Spark Queries + +To use Iceberg in Spark, first configure [Spark catalogs](../spark-configuration). Iceberg uses Apache Spark's DataSourceV2 API for data source and catalog implementations. + +## Querying with SQL + +In Spark 3, tables use identifiers that include a [catalog name](../spark-configuration#using-catalogs). + +```sql +SELECT * FROM prod.db.table; -- catalog: prod, namespace: db, table: table +``` + +Metadata tables, like `history` and `snapshots`, can use the Iceberg table name as a namespace. 
+ +For example, to read from the `files` metadata table for `prod.db.table`: + +```sql +SELECT * FROM prod.db.table.files; +``` +|content|file_path |file_format|spec_id|partition|record_count|file_size_in_bytes|column_sizes |value_counts |null_value_counts|nan_value_counts|lower_bounds |upper_bounds |key_metadata|split_offsets|equality_ids|sort_order_id| +| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | +| 0 | s3:/.../table/data/00000-3-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET | 0 | {1999-01-01, 01} | 1 | 597 | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0] | [] | [1 -> , 2 -> c] | [1 -> , 2 -> c] | null | [4] | null | null | +| 0 | s3:/.../table/data/00001-4-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET | 0 | {1999-01-01, 02} | 1 | 597 | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0] | [] | [1 -> , 2 -> b] | [1 -> , 2 -> b] | null | [4] | null | null | +| 0 | s3:/.../table/data/00002-5-8d6d60e8-d427-4809-bcf0-f5d45a4aad96.parquet | PARQUET | 0 | {1999-01-01, 03} | 1 | 597 | [1 -> 90, 2 -> 62] | [1 -> 1, 2 -> 1] | [1 -> 0, 2 -> 0] | [] | [1 -> , 2 -> a] | [1 -> , 2 -> a] | null | [4] | null | null | + +## Querying with DataFrames + +To load a table as a DataFrame, use `table`: + +```scala +val df = spark.table("prod.db.table") +``` + +### Catalogs with DataFrameReader + +Paths and table names can be loaded with Spark's `DataFrameReader` interface. How tables are loaded depends on how +the identifier is specified. When using `spark.read.format("iceberg").load(table)` or `spark.table(table)` the `table` +variable can take a number of forms as listed below: + +* `file:///path/to/table`: loads a HadoopTable at given path +* `tablename`: loads `currentCatalog.currentNamespace.tablename` +* `catalog.tablename`: loads `tablename` from the specified catalog. 
+* `namespace.tablename`: loads `namespace.tablename` from current catalog +* `catalog.namespace.tablename`: loads `namespace.tablename` from the specified catalog. +* `namespace1.namespace2.tablename`: loads `namespace1.namespace2.tablename` from current catalog + +The above list is in order of priority. For example: a matching catalog will take priority over any namespace resolution. + + +### Time travel + +#### SQL + +Spark 3.3 and later supports time travel in SQL queries using `TIMESTAMP AS OF` or `VERSION AS OF` clauses. +The `VERSION AS OF` clause can contain a long snapshot ID or a string branch or tag name. + +{{< hint info >}} +Note: If the name of a branch or tag is the same as a snapshot ID, then the snapshot which is selected for time travel is the snapshot +with the given snapshot ID. For example, consider the case where there is a tag named '1' and it references snapshot with ID 2. +If the version travel clause is `VERSION AS OF '1'`, time travel will be done to the snapshot with ID 1. +If this is not desired, rename the tag or branch with a well-defined prefix such as 'snapshot-1'. 
+{{< /hint >}} + +```sql +-- time travel to October 26, 1986 at 01:21:00 +SELECT * FROM prod.db.table TIMESTAMP AS OF '1986-10-26 01:21:00'; + +-- time travel to snapshot with id 10963874102873L +SELECT * FROM prod.db.table VERSION AS OF 10963874102873; + +-- time travel to the head snapshot of audit-branch +SELECT * FROM prod.db.table VERSION AS OF 'audit-branch'; + +-- time travel to the snapshot referenced by the tag historical-snapshot +SELECT * FROM prod.db.table VERSION AS OF 'historical-snapshot'; +``` + +In addition, `FOR SYSTEM_TIME AS OF` and `FOR SYSTEM_VERSION AS OF` clauses are also supported: + +```sql +SELECT * FROM prod.db.table FOR SYSTEM_TIME AS OF '1986-10-26 01:21:00'; +SELECT * FROM prod.db.table FOR SYSTEM_VERSION AS OF 10963874102873; +SELECT * FROM prod.db.table FOR SYSTEM_VERSION AS OF 'audit-branch'; +SELECT * FROM prod.db.table FOR SYSTEM_VERSION AS OF 'historical-snapshot'; +``` + +Timestamps may also be supplied as a Unix timestamp, in seconds: + +```sql +-- timestamp in seconds +SELECT * FROM prod.db.table TIMESTAMP AS OF 499162860; +SELECT * FROM prod.db.table FOR SYSTEM_TIME AS OF 499162860; +``` + +The branch or tag may also be specified using a similar syntax to metadata tables, with `branch_` or `tag_`: + +```sql +SELECT * FROM prod.db.table.`branch_audit-branch`; +SELECT * FROM prod.db.table.`tag_historical-snapshot`; +``` + +(Identifiers with "-" are not valid, and so must be escaped using back quotes.) + +Note that the identifier with branch or tag may not be used in combination with `VERSION AS OF`. + +#### DataFrame + +To select a specific table snapshot or the snapshot at some time in the DataFrame API, Iceberg supports four Spark read options: + +* `snapshot-id` selects a specific table snapshot +* `as-of-timestamp` selects the current snapshot at a timestamp, in milliseconds +* `branch` selects the head snapshot of the specified branch. Note that currently branch cannot be combined with as-of-timestamp. 
+* `tag` selects the snapshot associated with the specified tag. Tags cannot be combined with `as-of-timestamp`. + +```scala +// time travel to October 26, 1986 at 01:21:00 +spark.read + .option("as-of-timestamp", "499162860000") + .format("iceberg") + .load("path/to/table") +``` + +```scala +// time travel to snapshot with ID 10963874102873L +spark.read + .option("snapshot-id", 10963874102873L) + .format("iceberg") + .load("path/to/table") +``` + +```scala +// time travel to tag historical-snapshot +spark.read + .option(SparkReadOptions.TAG, "historical-snapshot") + .format("iceberg") + .load("path/to/table") +``` + +```scala +// time travel to the head snapshot of audit-branch +spark.read + .option(SparkReadOptions.BRANCH, "audit-branch") + .format("iceberg") + .load("path/to/table") +``` + +{{< hint info >}} +Spark 3.0 and earlier versions do not support using `option` with `table` in DataFrameReader commands. All options will be silently +ignored. Do not use `table` when attempting to time-travel or use other options. See [SPARK-32592](https://issues.apache.org/jira/browse/SPARK-32592). +{{< /hint >}} + + +### Incremental read + +To read appended data incrementally, use: + +* `start-snapshot-id` Start snapshot ID used in incremental scans (exclusive). +* `end-snapshot-id` End snapshot ID used in incremental scans (inclusive). This is optional. Omitting it will default to the current snapshot. + +```scala +// get the data added after start-snapshot-id (10963874102873L) until end-snapshot-id (63874143573109L) +spark.read + .format("iceberg") + .option("start-snapshot-id", "10963874102873") + .option("end-snapshot-id", "63874143573109") + .load("path/to/table") +``` + +{{< hint info >}} +Currently gets only the data from `append` operation. Cannot support `replace`, `overwrite`, `delete` operations. +Incremental read works with both V1 and V2 format-version. +Incremental read is not supported by Spark's SQL syntax. 
+{{< /hint >}} + +## Inspecting tables + +To inspect a table's history, snapshots, and other metadata, Iceberg supports metadata tables. + +Metadata tables are identified by adding the metadata table name after the original table name. For example, history for `db.table` is read using `db.table.history`. + +### History + +To show table history: + +```sql +SELECT * FROM prod.db.table.history; +``` + +| made_current_at | snapshot_id | parent_id | is_current_ancestor | +| -- | -- | -- | -- | +| 2019-02-08 03:29:51.215 | 5781947118336215154 | NULL | true | +| 2019-02-08 03:47:55.948 | 5179299526185056830 | 5781947118336215154 | true | +| 2019-02-09 16:24:30.13 | 296410040247533544 | 5179299526185056830 | false | +| 2019-02-09 16:32:47.336 | 2999875608062437330 | 5179299526185056830 | true | +| 2019-02-09 19:42:03.919 | 8924558786060583479 | 2999875608062437330 | true | +| 2019-02-09 19:49:16.343 | 6536733823181975045 | 8924558786060583479 | true | + +{{< hint info >}} +**This shows a commit that was rolled back.** The example has two snapshots with the same parent, and one is *not* an ancestor of the current table state. 
+{{< /hint >}} + +### Metadata Log Entries + +To show table metadata log entries: + +```sql +SELECT * from prod.db.table.metadata_log_entries; +``` + +| timestamp | file | latest_snapshot_id | latest_schema_id | latest_sequence_number | +| -- | -- | -- | -- | -- | +| 2022-07-28 10:43:52.93 | s3://.../table/metadata/00000-9441e604-b3c2-498a-a45a-6320e8ab9006.metadata.json | null | null | null | +| 2022-07-28 10:43:57.487 | s3://.../table/metadata/00001-f30823df-b745-4a0a-b293-7532e0c99986.metadata.json | 170260833677645300 | 0 | 1 | +| 2022-07-28 10:43:58.25 | s3://.../table/metadata/00002-2cc2837a-02dc-4687-acc1-b4d86ea486f4.metadata.json | 958906493976709774 | 0 | 2 | + +### Snapshots + +To show the valid snapshots for a table: + +```sql +SELECT * FROM prod.db.table.snapshots; +``` + +| committed_at | snapshot_id | parent_id | operation | manifest_list | summary | +| -- | -- | -- | -- | -- | -- | +| 2019-02-08 03:29:51.215 | 57897183625154 | null | append | s3://.../table/metadata/snap-57897183625154-1.avro | { added-records -> 2478404, total-records -> 2478404, added-data-files -> 438, total-data-files -> 438, spark.app.id -> application_1520379288616_155055 } | + +You can also join snapshots to table history. 
For example, this query will show table history, with the application ID that wrote each snapshot: + +```sql +select + h.made_current_at, + s.operation, + h.snapshot_id, + h.is_current_ancestor, + s.summary['spark.app.id'] +from prod.db.table.history h +join prod.db.table.snapshots s + on h.snapshot_id = s.snapshot_id +order by made_current_at; +``` + +| made_current_at | operation | snapshot_id | is_current_ancestor | summary[spark.app.id] | +| -- | -- | -- | -- | -- | +| 2019-02-08 03:29:51.215 | append | 57897183625154 | true | application_1520379288616_155055 | +| 2019-02-09 16:24:30.13 | delete | 29641004024753 | false | application_1520379288616_151109 | +| 2019-02-09 16:32:47.336 | append | 57897183625154 | true | application_1520379288616_155055 | +| 2019-02-08 03:47:55.948 | overwrite | 51792995261850 | true | application_1520379288616_152431 | + +### Entries + +To show all the table's current manifest entries for both data and delete files. + +```sql +SELECT * FROM prod.db.table.entries; +``` + +| status | snapshot_id | sequence_number | file_sequence_number | data_file | readable_metrics | +| -- | -- | -- | -- | -- | -- | +| 2 | 57897183625154 | 0 | 0 | {"content":0,"file_path":"s3:/.../table/data/00047-25-833044d0-127b-415c-b874-038a4f978c29-00612.parquet","file_format":"PARQUET","spec_id":0,"record_count":15,"file_size_in_bytes":473,"column_sizes":{1:103},"value_counts":{1:15},"null_value_counts":{1:0},"nan_value_counts":{},"lower_bounds":{1:},"upper_bounds":{1:},"key_metadata":null,"split_offsets":[4],"equality_ids":null,"sort_order_id":0} | {"c1":{"column_size":103,"value_count":15,"null_value_count":0,"nan_value_count":null,"lower_bound":1,"upper_bound":3}} | + +### Files + +To show a table's current files: + +```sql +SELECT * FROM prod.db.table.files; +``` + +| content | file_path | file_format | spec_id | record_count | file_size_in_bytes | column_sizes | value_counts | null_value_counts | nan_value_counts | lower_bounds | upper_bounds | 
key_metadata | split_offsets | equality_ids | sort_order_id | readable_metrics | +| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | +| 0 | s3:/.../table/data/00042-3-a9aa8b24-20bc-4d56-93b0-6b7675782bb5-00001.parquet | PARQUET | 0 | 1 | 652 | {1:52,2:48} | {1:1,2:1} | {1:0,2:0} | {} | {1:,2:d} | {1:,2:d} | NULL | [4] | NULL | 0 | {"data":{"column_size":48,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":"d","upper_bound":"d"},"id":{"column_size":52,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":1,"upper_bound":1}} | +| 0 | s3:/.../table/data/00000-0-f9709213-22ca-4196-8733-5cb15d2afeb9-00001.parquet | PARQUET | 0 | 1 | 643 | {1:46,2:48} | {1:1,2:1} | {1:0,2:0} | {} | {1:,2:a} | {1:,2:a} | NULL | [4] | NULL | 0 | {"data":{"column_size":48,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":"a","upper_bound":"a"},"id":{"column_size":46,"value_count":1,"null_value_count":0,"nan_value_count":null,"lower_bound":1,"upper_bound":1}} | +| 0 | s3:/.../table/data/00001-1-f9709213-22ca-4196-8733-5cb15d2afeb9-00001.parquet | PARQUET | 0 | 2 | 644 | {1:49,2:51} | {1:2,2:2} | {1:0,2:0} | {} | {1:,2:b} | {1:,2:c} | NULL | [4] | NULL | 0 | {"data":{"column_size":51,"value_count":2,"null_value_count":0,"nan_value_count":null,"lower_bound":"b","upper_bound":"c"},"id":{"column_size":49,"value_count":2,"null_value_count":0,"nan_value_count":null,"lower_bound":2,"upper_bound":3}} | +| 1 | s3:/.../table/data/00081-4-a9aa8b24-20bc-4d56-93b0-6b7675782bb5-00001-deletes.parquet | PARQUET | 0 | 1 | 1560 | {2147483545:46,2147483546:152} | {2147483545:1,2147483546:1} | {2147483545:0,2147483546:0} | {} | {2147483545:,2147483546:s3:/.../table/data/00000-0-f9709213-22ca-4196-8733-5cb15d2afeb9-00001.parquet} | {2147483545:,2147483546:s3:/.../table/data/00000-0-f9709213-22ca-4196-8733-5cb15d2afeb9-00001.parquet} | NULL | [4] | NULL | NULL | 
{"data":{"column_size":null,"value_count":null,"null_value_count":null,"nan_value_count":null,"lower_bound":null,"upper_bound":null},"id":{"column_size":null,"value_count":null,"null_value_count":null,"nan_value_count":null,"lower_bound":null,"upper_bound":null}} | +| 2 | s3:/.../table/data/00047-25-833044d0-127b-415c-b874-038a4f978c29-00612.parquet | PARQUET | 0 | 126506 | 28613985 | {100:135377,101:11314} | {100:126506,101:126506} | {100:105434,101:11} | {} | {100:0,101:17} | {100:404455227527,101:23} | NULL | NULL | [1] | 0 | {"id":{"column_size":135377,"value_count":126506,"null_value_count":105434,"nan_value_count":null,"lower_bound":0,"upper_bound":404455227527},"data":{"column_size":11314,"value_count":126506,"null_value_count": 11,"nan_value_count":null,"lower_bound":17,"upper_bound":23}} | + +{{< hint info >}} +Content refers to type of content stored by the data file: + 0 Data + 1 Position Deletes + 2 Equality Deletes +{{< /hint >}} + +To show only data files or delete files, query `prod.db.table.data_files` and `prod.db.table.delete_files` respectively. +To show all files, data files and delete files across all tracked snapshots, query `prod.db.table.all_files`, `prod.db.table.all_data_files` and `prod.db.table.all_delete_files` respectively. + +### Manifests + +To show a table's current file manifests: + +```sql +SELECT * FROM prod.db.table.manifests; +``` + +| path | length | partition_spec_id | added_snapshot_id | added_data_files_count | existing_data_files_count | deleted_data_files_count | partition_summaries | +| -- | -- | -- | -- | -- | -- | -- | -- | +| s3://.../table/metadata/45b5290b-ee61-4788-b324-b1e2735c0e10-m0.avro | 4479 | 0 | 6668963634911763636 | 8 | 0 | 0 | [[false,null,2019-05-13,2019-05-15]] | + +Note: +1. 
Fields within `partition_summaries` column of the manifests table correspond to `field_summary` structs within [manifest list](../../../spec#manifest-lists), with the following order: + - `contains_null` + - `contains_nan` + - `lower_bound` + - `upper_bound` +2. `contains_nan` could return null, which indicates that this information is not available from the file's metadata. + This usually occurs when reading from V1 table, where `contains_nan` is not populated. + +### Partitions + +To show a table's current partitions: + +```sql +SELECT * FROM prod.db.table.partitions; +``` + +| partition | spec_id | record_count | file_count | total_data_file_size_in_bytes | position_delete_record_count | position_delete_file_count | equality_delete_record_count | equality_delete_file_count | last_updated_at(μs) | last_updated_snapshot_id | +| -------------- |---------|---------------|------------|--------------------------|------------------------------|----------------------------|------------------------------|----------------------------|---------------------|--------------------------| +| {20211001, 11} | 0 | 1 | 1 | 100 | 2 | 1 | 0 | 0 | 1633086034192000 | 9205185327307503337 | +| {20211002, 11} | 0 | 4 | 3 | 500 | 1 | 1 | 0 | 0 | 1633172537358000 | 867027598972211003 | +| {20211001, 10} | 0 | 7 | 4 | 700 | 0 | 0 | 0 | 0 | 1633082598716000 | 3280122546965981531 | +| {20211002, 10} | 0 | 3 | 2 | 400 | 0 | 0 | 1 | 1 | 1633169159489000 | 6941468797545315876 | + +Note: +1. For unpartitioned tables, the partitions table will not contain the partition and spec_id fields. + +2. The partitions metadata table shows partitions with data files or delete files in the current snapshot. However, delete files are not applied, and so in some cases partitions may be shown even though all their data rows are marked deleted by delete files. 
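Because the partitions table reports delete files without applying them, it can help to see which partitions carry any. The query below is a minimal sketch that assumes the same example table, `prod.db.table`, and uses only columns shown in the sample output above:

```sql
-- List partitions that have position or equality delete files attached.
-- record_count still counts rows that those delete files mark as deleted.
SELECT
  partition,
  record_count,
  file_count,
  position_delete_file_count,
  equality_delete_file_count
FROM prod.db.table.partitions
WHERE position_delete_file_count > 0
   OR equality_delete_file_count > 0;
```

A partition returned here may contain fewer live rows than its `record_count` suggests.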
+ +### Positional Delete Files + +To show all positional delete files from the current snapshot of table: + +```sql +SELECT * from prod.db.table.position_deletes; +``` + +| file_path | pos | row | spec_id | delete_file_path | +| -- | -- | -- | -- | -- | +| s3:/.../table/data/00042-3-a9aa8b24-20bc-4d56-93b0-6b7675782bb5-00001.parquet | 1 | 0 | 0 | s3:/.../table/data/00191-1933-25e9f2f3-d863-4a69-a5e1-f9aeeebe60bb-00001-deletes.parquet | + +### All Metadata Tables + +These tables are unions of the metadata tables specific to the current snapshot, and return metadata across all snapshots. + +{{< hint danger >}} +The "all" metadata tables may produce more than one row per data file or manifest file because metadata files may be part of more than one table snapshot. +{{< /hint >}} + +#### All Data Files + +To show all of the table's data files and each file's metadata: + +```sql +SELECT * FROM prod.db.table.all_data_files; +``` + +| content | file_path | file_format | partition | record_count | file_size_in_bytes | column_sizes| value_counts | null_value_counts | nan_value_counts| lower_bounds| upper_bounds|key_metadata|split_offsets|equality_ids|sort_order_id| +| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | +| 0|s3://.../dt=20210102/00000-0-756e2512-49ae-45bb-aae3-c0ca475e7879-00001.parquet| PARQUET|{20210102}| 14| 2444|{1 -> 94, 2 -> 17}|{1 -> 14, 2 -> 14}| {1 -> 0, 2 -> 0}| {}|{1 -> 1, 2 -> 20210102}|{1 -> 2, 2 -> 20210102}| null| [4]| null| 0| +| 0|s3://.../dt=20210103/00000-0-26222098-032f-472b-8ea5-651a55b21210-00001.parquet| PARQUET|{20210103}| 14| 2444|{1 -> 94, 2 -> 17}|{1 -> 14, 2 -> 14}| {1 -> 0, 2 -> 0}| {}|{1 -> 1, 2 -> 20210103}|{1 -> 3, 2 -> 20210103}| null| [4]| null| 0| +| 0|s3://.../dt=20210104/00000-0-a3bb1927-88eb-4f1c-bc6e-19076b0d952e-00001.parquet| PARQUET|{20210104}| 14| 2444|{1 -> 94, 2 -> 17}|{1 -> 14, 2 -> 14}| {1 -> 0, 2 -> 0}| {}|{1 -> 1, 2 -> 20210104}|{1 -> 3, 2 -> 20210104}| null| [4]| null| 0| + +#### 
All Delete Files + +To show the table's delete files and each file's metadata from all the snapshots: + +```sql +SELECT * FROM prod.db.table.all_delete_files; +``` + +| content | file_path | file_format | spec_id | record_count | file_size_in_bytes | column_sizes | value_counts | null_value_counts | nan_value_counts | lower_bounds | upper_bounds | key_metadata | split_offsets | equality_ids | sort_order_id | readable_metrics | +| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | +| 1 | s3:/.../table/data/00081-4-a9aa8b24-20bc-4d56-93b0-6b7675782bb5-00001-deletes.parquet | PARQUET | 0 | 1 | 1560 | {2147483545:46,2147483546:152} | {2147483545:1,2147483546:1} | {2147483545:0,2147483546:0} | {} | {2147483545:,2147483546:s3:/.../table/data/00000-0-f9709213-22ca-4196-8733-5cb15d2afeb9-00001.parquet} | {2147483545:,2147483546:s3:/.../table/data/00000-0-f9709213-22ca-4196-8733-5cb15d2afeb9-00001.parquet} | NULL | [4] | NULL | NULL | {"data":{"column_size":null,"value_count":null,"null_value_count":null,"nan_value_count":null,"lower_bound":null,"upper_bound":null},"id":{"column_size":null,"value_count":null,"null_value_count":null,"nan_value_count":null,"lower_bound":null,"upper_bound":null}} | +| 2 | s3:/.../table/data/00047-25-833044d0-127b-415c-b874-038a4f978c29-00612.parquet | PARQUET | 0 | 126506 | 28613985 | {100:135377,101:11314} | {100:126506,101:126506} | {100:105434,101:11} | {} | {100:0,101:17} | {100:404455227527,101:23} | NULL | NULL | [1] | 0 | {"id":{"column_size":135377,"value_count":126506,"null_value_count":105434,"nan_value_count":null,"lower_bound":0,"upper_bound":404455227527},"data":{"column_size":11314,"value_count":126506,"null_value_count": 11,"nan_value_count":null,"lower_bound":17,"upper_bound":23}} | + +#### All Entries + +To show the table's manifest entries from all the snapshots for both data and delete files: + +```sql +SELECT * FROM prod.db.table.all_entries; +``` + +| status | snapshot_id | sequence_number 
| file_sequence_number | data_file | readable_metrics | +| -- | -- | -- | -- | -- | -- | +| 2 | 57897183625154 | 0 | 0 | {"content":0,"file_path":"s3:/.../table/data/00047-25-833044d0-127b-415c-b874-038a4f978c29-00612.parquet","file_format":"PARQUET","spec_id":0,"record_count":15,"file_size_in_bytes":473,"column_sizes":{1:103},"value_counts":{1:15},"null_value_counts":{1:0},"nan_value_counts":{},"lower_bounds":{1:},"upper_bounds":{1:},"key_metadata":null,"split_offsets":[4],"equality_ids":null,"sort_order_id":0} | {"c1":{"column_size":103,"value_count":15,"null_value_count":0,"nan_value_count":null,"lower_bound":1,"upper_bound":3}} | + +#### All Manifests + +To show all of the table's manifest files: + +```sql +SELECT * FROM prod.db.table.all_manifests; +``` + +| path | length | partition_spec_id | added_snapshot_id | added_data_files_count | existing_data_files_count | deleted_data_files_count| partition_summaries| +| -- | -- | -- | -- | -- | -- | -- | -- | +| s3://.../metadata/a85f78c5-3222-4b37-b7e4-faf944425d48-m0.avro | 6376 | 0 | 6272782676904868561 | 2 | 0 | 0 |[{false, false, 20210101, 20210101}]| + +Note: +1. Fields within `partition_summaries` column of the manifests table correspond to `field_summary` structs within [manifest list](../../../spec#manifest-lists), with the following order: + - `contains_null` + - `contains_nan` + - `lower_bound` + - `upper_bound` +2. `contains_nan` could return null, which indicates that this information is not available from the file's metadata. + This usually occurs when reading from V1 table, where `contains_nan` is not populated. 
+ +### References + +To show a table's known snapshot references: + +```sql +SELECT * FROM prod.db.table.refs; +``` + +| name | type | snapshot_id | max_reference_age_in_ms | min_snapshots_to_keep | max_snapshot_age_in_ms | +| -- | -- | -- | -- | -- | -- | +| main | BRANCH | 4686954189838128572 | 10 | 20 | 30 | +| testTag | TAG | 4686954189838128572 | 10 | null | null | + +### Inspecting with DataFrames + +Metadata tables can be loaded using the DataFrameReader API: + +```scala +// named metastore table +spark.read.format("iceberg").load("db.table.files") +// Hadoop path table +spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table#files") +``` + +### Time Travel with Metadata Tables + +To inspect a table's metadata with the time travel feature: + +```sql +-- get the table's file manifests at timestamp Sep 20, 2021 08:00:00 +SELECT * FROM prod.db.table.manifests TIMESTAMP AS OF '2021-09-20 08:00:00'; + +-- get the table's partitions with snapshot id 10963874102873 +SELECT * FROM prod.db.table.partitions VERSION AS OF 10963874102873; +``` + +Metadata tables can also be inspected with time travel using the DataFrameReader API: + +```scala +// load the table's file metadata at snapshot-id 10963874102873 as DataFrame +spark.read.format("iceberg").option("snapshot-id", 10963874102873L).load("db.table.files") +``` diff --git a/docs/spark-structured-streaming.md b/docs/spark-structured-streaming.md new file mode 100644 index 000000000000..b4aed332e684 --- /dev/null +++ b/docs/spark-structured-streaming.md @@ -0,0 +1,128 @@ +--- +title: "Structured Streaming" +url: spark-structured-streaming +aliases: + - "spark/spark-structured-streaming" +menu: + main: + parent: Spark + identifier: spark_structured_streaming + weight: 0 +--- + + +# Spark Structured Streaming + +Iceberg uses Apache Spark's DataSourceV2 API for data source and catalog implementations. Spark DSv2 is an evolving API with different levels of support in Spark versions. 
+ +## Streaming Reads + +Iceberg supports processing incremental data in Spark Structured Streaming jobs, starting from a historical timestamp: + +```scala +val df = spark.readStream + .format("iceberg") + .option("stream-from-timestamp", streamStartTimestamp.toString) + .load("database.table_name") +``` + +{{< hint warning >}} +Iceberg only supports reading data from append snapshots. Overwrite snapshots cannot be processed and will cause an exception by default. Overwrites may be ignored by setting `streaming-skip-overwrite-snapshots=true`. Similarly, delete snapshots will cause an exception by default, and deletes may be ignored by setting `streaming-skip-delete-snapshots=true`. +{{< /hint >}} + +## Streaming Writes + +To write values from a streaming query to an Iceberg table, use `DataStreamWriter`: + +```scala +data.writeStream + .format("iceberg") + .outputMode("append") + .trigger(Trigger.ProcessingTime(1, TimeUnit.MINUTES)) + .option("checkpointLocation", checkpointPath) + .toTable("database.table_name") +``` + +If you're using Spark 3.0 or earlier, you need to use `.option("path", "database.table_name").start()`, instead of `.toTable("database.table_name")`. + +In the case of the directory-based Hadoop catalog: + +```scala +data.writeStream + .format("iceberg") + .outputMode("append") + .trigger(Trigger.ProcessingTime(1, TimeUnit.MINUTES)) + .option("path", "hdfs://nn:8020/path/to/table") + .option("checkpointLocation", checkpointPath) + .start() +``` + +Iceberg supports `append` and `complete` output modes: + +* `append`: appends the rows of every micro-batch to the table +* `complete`: replaces the table contents every micro-batch + +Before starting the streaming query, ensure that the table has been created. Refer to the [SQL create table](../spark-ddl/#create-table) documentation to learn how to create the Iceberg table. 
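Because the table must exist before the query starts, a job can create it up front and then begin writing. The sketch below illustrates that ordering under several assumptions that are not part of the original text: a catalog is already configured so that `database.table_name` resolves to an Iceberg table, Spark 3.1 or later is used (for `toTable`), the input is Spark's built-in `rate` test source, and the column names and checkpoint path are placeholders.

```scala
import java.util.concurrent.TimeUnit

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object CreateThenStreamSketch {
  def main(args: Array[String]): Unit = {
    // Assumes the Iceberg runtime and catalog settings are supplied externally.
    val spark = SparkSession.builder().appName("iceberg-streaming-sketch").getOrCreate()

    // Create the target table first; the columns here are hypothetical.
    spark.sql(
      """CREATE TABLE IF NOT EXISTS database.table_name (
        |  ts timestamp,
        |  value bigint)
        |USING iceberg""".stripMargin)

    // The built-in `rate` source emits (timestamp, value) rows for testing.
    val data = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()
      .selectExpr("timestamp AS ts", "value")

    val query = data.writeStream
      .format("iceberg")
      .outputMode("append")
      .trigger(Trigger.ProcessingTime(1, TimeUnit.MINUTES))
      .option("checkpointLocation", "/tmp/checkpoints/table_name") // placeholder path
      .toTable("database.table_name")

    query.awaitTermination()
  }
}
```

Using `CREATE TABLE IF NOT EXISTS` keeps the job restartable, since a restart from the checkpoint finds the table already in place.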
+ +Iceberg doesn't support experimental [continuous processing](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#continuous-processing), as it doesn't provide the interface to "commit" the output. + +### Partitioned table + +Iceberg requires data to be sorted by partition within each task before it is written to a partitioned table. In Spark, tasks are split by Spark partition. +For batch queries you're encouraged to do an explicit sort to fulfill the requirement +(see [here](../spark-writes/#writing-to-partitioned-tables)), but this approach adds latency, as +repartition and sort are considered heavy operations for streaming workloads. To avoid the additional latency, you can +enable the fanout writer to eliminate the requirement. + +```scala +data.writeStream + .format("iceberg") + .outputMode("append") + .trigger(Trigger.ProcessingTime(1, TimeUnit.MINUTES)) + .option("fanout-enabled", "true") + .option("checkpointLocation", checkpointPath) + .toTable("database.table_name") +``` + +The fanout writer opens a file per partition value and doesn't close these files until the write task finishes. Avoid using the fanout writer for batch writing, as an explicit sort of the output rows is cheap for batch workloads. + +## Maintenance for streaming tables + +Streaming writes can create new table versions quickly, creating lots of table metadata to track those versions. +Maintaining metadata by tuning the rate of commits, expiring old snapshots, and automatically cleaning up metadata files +is highly recommended. + +### Tune the rate of commits + +A high rate of commits produces many data files, manifests, and snapshots, which leads to additional maintenance. It is recommended to use a trigger interval of at least 1 minute and to increase the interval if needed. 
+ +The triggers section in [Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers) +documents how to configure the interval. + +### Expire old snapshots + +Each batch written to a table produces a new snapshot. Iceberg tracks snapshots in table metadata until they are expired. Snapshots accumulate quickly with frequent commits, so it is highly recommended that tables written by streaming queries are [regularly maintained](../maintenance#expire-snapshots). [Snapshot expiration](../spark-procedures/#expire_snapshots) is the procedure of removing the metadata and any data files that are no longer needed. By default, the procedure will expire the snapshots older than five days. + +### Compacting data files + +The amount of data written from a streaming process is typically small, which can cause the table metadata to track lots of small files. [Compacting small files into larger files](../maintenance#compact-data-files) reduces the metadata needed by the table, and increases query efficiency. Iceberg and Spark [come with the `rewrite_data_files` procedure](../spark-procedures/#rewrite_data_files). + +### Rewrite manifests + +To optimize write latency on a streaming workload, Iceberg can write the new snapshot with a "fast" append that does not automatically compact manifests. +This can lead to many small manifest files. Iceberg can [rewrite manifests to improve query performance](../maintenance#rewrite-manifests). Iceberg and Spark [come with the `rewrite_manifests` procedure](../spark-procedures/#rewrite_manifests). 
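The three maintenance steps above can be run together from a periodic batch job. This is a rough sketch, not a recommended schedule: it assumes the Iceberg SQL extensions are enabled (the `CALL` syntax depends on them), that the catalog is named `prod`, and that the table is `db.table`. All three procedures are invoked with their default arguments only.

```scala
import org.apache.spark.sql.SparkSession

object StreamingTableMaintenanceSketch {
  def main(args: Array[String]): Unit = {
    // Assumes the Iceberg runtime, SQL extensions, and the `prod` catalog are configured externally.
    val spark = SparkSession.builder().appName("iceberg-maintenance-sketch").getOrCreate()

    // Expire snapshots using the procedure's default age threshold.
    spark.sql("CALL prod.system.expire_snapshots(table => 'db.table')").show(false)

    // Compact small data files produced by frequent micro-batch commits.
    spark.sql("CALL prod.system.rewrite_data_files(table => 'db.table')").show(false)

    // Consolidate the small manifests left behind by fast appends.
    spark.sql("CALL prod.system.rewrite_manifests(table => 'db.table')").show(false)
  }
}
```

Each call returns a small result set summarizing what was removed or rewritten, which is why the sketch prints it rather than discarding it.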
diff --git a/docs/spark-writes.md b/docs/spark-writes.md new file mode 100644 index 000000000000..b9435674609b --- /dev/null +++ b/docs/spark-writes.md @@ -0,0 +1,477 @@ +--- +title: "Writes" +url: spark-writes +aliases: + - "spark/spark-writes" +menu: + main: + parent: Spark + identifier: spark_writes + weight: 0 +--- + + +# Spark Writes + +To use Iceberg in Spark, first configure [Spark catalogs](../spark-configuration). + +Some plans are only available when using [Iceberg SQL extensions](../spark-configuration#sql-extensions) in Spark 3. + +Iceberg uses Apache Spark's DataSourceV2 API for data source and catalog implementations. Spark DSv2 is an evolving API with different levels of support in Spark versions: + +| Feature support | Spark 3 | Notes | +|--------------------------------------------------|-----------|-----------------------------------------------------------------------------| +| [SQL insert into](#insert-into) | ✔️ | ⚠ Requires `spark.sql.storeAssignmentPolicy=ANSI` (default since Spark 3.0) | +| [SQL merge into](#merge-into) | ✔️ | ⚠ Requires Iceberg Spark extensions | +| [SQL insert overwrite](#insert-overwrite) | ✔️ | ⚠ Requires `spark.sql.storeAssignmentPolicy=ANSI` (default since Spark 3.0) | +| [SQL delete from](#delete-from) | ✔️ | ⚠ Row-level delete requires Iceberg Spark extensions | +| [SQL update](#update) | ✔️ | ⚠ Requires Iceberg Spark extensions | +| [DataFrame append](#appending-data) | ✔️ | | +| [DataFrame overwrite](#overwriting-data) | ✔️ | | +| [DataFrame CTAS and RTAS](#creating-tables) | ✔️ | ⚠ Requires DSv2 API | + + +## Writing with SQL + +Spark 3 supports SQL `INSERT INTO`, `MERGE INTO`, and `INSERT OVERWRITE`, as well as the new `DataFrameWriterV2` API. + +### `INSERT INTO` + +To append new data to a table, use `INSERT INTO`. + +```sql +INSERT INTO prod.db.table VALUES (1, 'a'), (2, 'b') +``` +```sql +INSERT INTO prod.db.table SELECT ... 
+``` + +### `MERGE INTO` + +Spark 3 added support for `MERGE INTO` queries that can express row-level updates. + +Iceberg supports `MERGE INTO` by rewriting data files that contain rows that need to be updated in an `overwrite` commit. + +**`MERGE INTO` is recommended instead of `INSERT OVERWRITE`** because Iceberg can replace only the affected data files, and because the data overwritten by a dynamic overwrite may change if the table's partitioning changes. + + +#### `MERGE INTO` syntax + +`MERGE INTO` updates a table, called the _target_ table, using a set of updates from another query, called the _source_. The update for a row in the target table is found using the `ON` clause that is like a join condition. + +```sql +MERGE INTO prod.db.target t -- a target table +USING (SELECT ...) s -- the source updates +ON t.id = s.id -- condition to find updates for target rows +WHEN ... -- updates +``` + +Updates to rows in the target table are listed using `WHEN MATCHED ... THEN ...`. Multiple `MATCHED` clauses can be added with conditions that determine when each match should be applied. The first matching expression is used. + +```sql +WHEN MATCHED AND s.op = 'delete' THEN DELETE +WHEN MATCHED AND t.count IS NULL AND s.op = 'increment' THEN UPDATE SET t.count = 0 +WHEN MATCHED AND s.op = 'increment' THEN UPDATE SET t.count = t.count + 1 +``` + +Source rows (updates) that do not match can be inserted: + +```sql +WHEN NOT MATCHED THEN INSERT * +``` + +Inserts also support additional conditions: + +```sql +WHEN NOT MATCHED AND s.event_time > still_valid_threshold THEN INSERT (id, count) VALUES (s.id, 1) +``` + +Only one record in the source data can update any given row of the target table, or else an error will be thrown. + + +### `INSERT OVERWRITE` + +`INSERT OVERWRITE` can replace data in the table with the result of a query. Overwrites are atomic operations for Iceberg tables. 
+ +The partitions that will be replaced by `INSERT OVERWRITE` depends on Spark's partition overwrite mode and the partitioning of a table. `MERGE INTO` can rewrite only affected data files and has more easily understood behavior, so it is recommended instead of `INSERT OVERWRITE`. + +#### Overwrite behavior + +Spark's default overwrite mode is **static**, but **dynamic overwrite mode is recommended when writing to Iceberg tables.** Static overwrite mode determines which partitions to overwrite in a table by converting the `PARTITION` clause to a filter, but the `PARTITION` clause can only reference table columns. + +Dynamic overwrite mode is configured by setting `spark.sql.sources.partitionOverwriteMode=dynamic`. + +To demonstrate the behavior of dynamic and static overwrites, consider a `logs` table defined by the following DDL: + +```sql +CREATE TABLE prod.my_app.logs ( + uuid string NOT NULL, + level string NOT NULL, + ts timestamp NOT NULL, + message string) +USING iceberg +PARTITIONED BY (level, hours(ts)) +``` + +#### Dynamic overwrite + +When Spark's overwrite mode is dynamic, partitions that have rows produced by the `SELECT` query will be replaced. + +For example, this query removes duplicate log events from the example `logs` table. + +```sql +INSERT OVERWRITE prod.my_app.logs +SELECT uuid, first(level), first(ts), first(message) +FROM prod.my_app.logs +WHERE cast(ts as date) = '2020-07-01' +GROUP BY uuid +``` + +In dynamic mode, this will replace any partition with rows in the `SELECT` result. Because the date of all rows is restricted to 1 July, only hours of that day will be replaced. + +#### Static overwrite + +When Spark's overwrite mode is static, the `PARTITION` clause is converted to a filter that is used to delete from the table. If the `PARTITION` clause is omitted, all partitions will be replaced. 
+ +Because there is no `PARTITION` clause in the query above, it will drop all existing rows in the table when run in static mode, but will only write the logs from 1 July. + +To overwrite just the partitions that were loaded, add a `PARTITION` clause that aligns with the `SELECT` query filter: + +```sql +INSERT OVERWRITE prod.my_app.logs +PARTITION (level = 'INFO') +SELECT uuid, first(level), first(ts), first(message) +FROM prod.my_app.logs +WHERE level = 'INFO' +GROUP BY uuid +``` + +Note that this mode cannot replace hourly partitions like the dynamic example query because the `PARTITION` clause can only reference table columns, not hidden partitions. + +### `DELETE FROM` + +Spark 3 added support for `DELETE FROM` queries to remove data from tables. + +Delete queries accept a filter to match rows to delete. + +```sql +DELETE FROM prod.db.table +WHERE ts >= '2020-05-01 00:00:00' and ts < '2020-06-01 00:00:00' + +DELETE FROM prod.db.all_events +WHERE session_time < (SELECT min(session_time) FROM prod.db.good_events) + +DELETE FROM prod.db.orders AS t1 +WHERE EXISTS (SELECT oid FROM prod.db.returned_orders WHERE t1.oid = oid) +``` + +If the delete filter matches entire partitions of the table, Iceberg will perform a metadata-only delete. If the filter matches individual rows of a table, then Iceberg will rewrite only the affected data files. + +### `UPDATE` + +Update queries accept a filter to match rows to update. + +```sql +UPDATE prod.db.table +SET c1 = 'update_c1', c2 = 'update_c2' +WHERE ts >= '2020-05-01 00:00:00' and ts < '2020-06-01 00:00:00' + +UPDATE prod.db.all_events +SET session_time = 0, ignored = true +WHERE session_time < (SELECT min(session_time) FROM prod.db.good_events) + +UPDATE prod.db.orders AS t1 +SET order_status = 'returned' +WHERE EXISTS (SELECT oid FROM prod.db.returned_orders WHERE t1.oid = oid) +``` + +For more complex row-level updates based on incoming data, see the section on `MERGE INTO`. 
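For reference, the clauses shown in the `MERGE INTO` section can be assembled into a single statement. The sketch below does that from Scala; the source table `prod.db.updates` and its `id` and `op` columns are hypothetical stand-ins for the elided source query, and the Iceberg SQL extensions are assumed to be enabled.

```scala
import org.apache.spark.sql.SparkSession

object MergeIntoSketch {
  def main(args: Array[String]): Unit = {
    // Assumes the Iceberg runtime, SQL extensions, and the `prod` catalog are configured externally.
    val spark = SparkSession.builder().appName("iceberg-merge-sketch").getOrCreate()

    // `prod.db.updates` is a hypothetical source of (id, op) change records.
    spark.sql(
      """MERGE INTO prod.db.target t
        |USING (SELECT id, op FROM prod.db.updates) s
        |ON t.id = s.id
        |WHEN MATCHED AND s.op = 'delete' THEN DELETE
        |WHEN MATCHED AND t.count IS NULL AND s.op = 'increment' THEN UPDATE SET t.count = 0
        |WHEN MATCHED AND s.op = 'increment' THEN UPDATE SET t.count = t.count + 1
        |WHEN NOT MATCHED THEN INSERT (id, count) VALUES (s.id, 1)""".stripMargin)
  }
}
```

As noted in the `MERGE INTO` section, the source should contain at most one record per target row, otherwise the statement fails.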
+ +## Writing to Branches +Branch writes can be performed via SQL by providing a branch identifier, `branch_yourBranch`, in the operation. +Branch writes can also be performed as part of a write-audit-publish (WAP) workflow by specifying the `spark.wap.branch` config. +Note that the WAP branch and a branch identifier cannot both be specified. +Also, the branch must exist before performing the write. +The operation does **not** create the branch if it does not exist. +For more information on branches, please refer to [branches](../tables/branching). + +```sql +-- INSERT (1, 'a'), (2, 'b') into the audit branch. +INSERT INTO prod.db.table.branch_audit VALUES (1, 'a'), (2, 'b'); + +-- MERGE INTO audit branch +MERGE INTO prod.db.table.branch_audit t +USING (SELECT ...) s +ON t.id = s.id +WHEN ... + +-- UPDATE audit branch +UPDATE prod.db.table.branch_audit AS t1 +SET val = 'c'; + +-- DELETE FROM audit branch +DELETE FROM prod.db.table.branch_audit WHERE id = 2; + +-- WAP Branch write +SET spark.wap.branch = audit-branch; +INSERT INTO prod.db.table VALUES (3, 'c'); +``` + +## Writing with DataFrames + +Spark 3 introduced the new `DataFrameWriterV2` API for writing to tables using data frames. The v2 API is recommended for several reasons: + +* CTAS, RTAS, and overwrite by filter are supported +* All operations consistently write columns to a table by name +* Hidden partition expressions are supported in `partitionedBy` +* Overwrite behavior is explicit, either dynamic or by a user-supplied filter +* The behavior of each operation corresponds to SQL statements + - `df.writeTo(t).create()` is equivalent to `CREATE TABLE AS SELECT` + - `df.writeTo(t).replace()` is equivalent to `REPLACE TABLE AS SELECT` + - `df.writeTo(t).append()` is equivalent to `INSERT INTO` + - `df.writeTo(t).overwritePartitions()` is equivalent to dynamic `INSERT OVERWRITE` + +The v1 DataFrame `write` API is still supported, but is not recommended. 
+ +{{< hint danger >}} +When writing with the v1 DataFrame API in Spark 3, use `saveAsTable` or `insertInto` to load tables with a catalog. +Using `format("iceberg")` loads an isolated table reference that will not automatically refresh tables used by queries. +{{< /hint >}} + + +### Appending data + +To append a dataframe to an Iceberg table, use `append`: + +```scala +val data: DataFrame = ... +data.writeTo("prod.db.table").append() +``` + +### Overwriting data + +To overwrite partitions dynamically, use `overwritePartitions()`: + +```scala +val data: DataFrame = ... +data.writeTo("prod.db.table").overwritePartitions() +``` + +To explicitly overwrite partitions, use `overwrite` to supply a filter: + +```scala +data.writeTo("prod.db.table").overwrite($"level" === "INFO") +``` + +### Creating tables + +To run a CTAS or RTAS, use `create`, `replace`, or `createOrReplace` operations: + +```scala +val data: DataFrame = ... +data.writeTo("prod.db.table").create() +``` + +If you have replaced the default Spark catalog (`spark_catalog`) with Iceberg's `SparkSessionCatalog`, do: + +```scala +val data: DataFrame = ... +data.writeTo("db.table").using("iceberg").create() +``` + +Create and replace operations support table configuration methods, like `partitionedBy` and `tableProperty`: + +```scala +data.writeTo("prod.db.table") + .tableProperty("write.format.default", "orc") + .partitionedBy($"level", days($"ts")) + .createOrReplace() +``` + +The Iceberg table location can also be specified by the `location` table property: + +```scala +data.writeTo("prod.db.table") + .tableProperty("location", "/path/to/location") + .createOrReplace() +``` + +### Schema Merge + +When inserting or updating, Iceberg can resolve schema mismatches at runtime. If configured, Iceberg performs automatic schema evolution as follows: + + +* A new column is present in the source but not in the target table. + + The new column is added to the target table. 
Column values are set to `NULL` in all the rows already present in the table + +* A column is present in the target but not in the source. + + The target column value is set to `NULL` when inserting or left unchanged when updating the row. + +The target table must be configured to accept any schema change by setting the property `write.spark.accept-any-schema` to `true`. + +```sql +ALTER TABLE prod.db.sample SET TBLPROPERTIES ( + 'write.spark.accept-any-schema'='true' +) +``` +The writer must enable the `mergeSchema` option. + +```scala +data.writeTo("prod.db.sample").option("mergeSchema","true").append() +``` + + +## Writing Distribution Modes + +Iceberg's default Spark writers require that the data in each spark task is clustered by partition values. This +distribution is required to minimize the number of file handles that are held open while writing. By default, starting +in Iceberg 1.2.0, Iceberg also requests that Spark pre-sort data to be written to fit this distribution. The +request to Spark is done through the table property `write.distribution-mode` with the value `hash`. Spark doesn't respect +distribution mode in CTAS/RTAS before 3.5.0. + +Let's go through writing the data against below sample table: + +```sql +CREATE TABLE prod.db.sample ( + id bigint, + data string, + category string, + ts timestamp) +USING iceberg +PARTITIONED BY (days(ts), category) +``` + +To write data to the sample table, data needs to be sorted by `days(ts), category` but this is taken care +of automatically by the default `hash` distribution. Previously this would have required manually sorting, but this +is no longer the case. + +```sql +INSERT INTO prod.db.sample +SELECT id, data, category, ts FROM another_table +``` + + +There are 3 options for `write.distribution-mode` + +* `none` - This is the previous default for Iceberg. +This mode does not request any shuffles or sort to be performed automatically by Spark. 
Because no work is done +automatically by Spark, the data must be *manually* sorted by partition value. The data must be sorted either within +each spark task, or globally within the entire dataset. A global sort will minimize the number of output files. +A sort can be avoided by using the Spark [write fanout](#write-properties) property but this will cause all +file handles to remain open until each write task has completed. +* `hash` - This mode is the new default and requests that Spark uses a hash-based exchange to shuffle the incoming +write data before writing. +Practically, this means that each row is hashed based on the row's partition value and then placed +in a corresponding Spark task based upon that value. Further division and coalescing of tasks may take place because of +[Spark's Adaptive Query planning](#controlling-file-sizes). +* `range` - This mode requests that Spark perform a range based exchange to shuffle the data before writing. +This is a two stage procedure which is more expensive than the `hash` mode. The first stage samples the data to +be written based on the partition and sort columns. The second stage uses the range information to shuffle the input data into Spark +tasks. Each task gets an exclusive range of the input data which clusters the data by partition and also globally sorts. +While this is more expensive than the hash distribution, the global ordering can be beneficial for read performance if +sorted columns are used during queries. This mode is used by default if a table is created with a +sort-order. Further division and coalescing of tasks may take place because of +[Spark's Adaptive Query planning](#controlling-file-sizes). + + +## Controlling File Sizes + +When writing data to Iceberg with Spark, it's important to note that Spark cannot write a file larger than a Spark +task and a file cannot span an Iceberg partition boundary. 
This means that although Iceberg will always roll over a file +when it grows to [`write.target-file-size-bytes`](../configuration/#write-properties), that will not happen unless the Spark task is +large enough. The size of the file created on disk will also be much smaller than the Spark task +since the on disk data will be both compressed and in columnar format as opposed to Spark's uncompressed row +representation. This means a 100 megabyte Spark task will create a file much smaller than 100 megabytes even if that +task is writing to a single Iceberg partition. If the task writes to multiple partitions, the files will be even +smaller than that. + +To control what data ends up in each Spark task use a [`write distribution mode`](#writing-distribution-modes) +or manually repartition the data. + +To adjust Spark's task size it is important to become familiar with Spark's various Adaptive Query Execution (AQE) +parameters. When the `write.distribution-mode` is not `none`, AQE will control the coalescing and splitting of Spark +tasks during the exchange to try to create tasks of `spark.sql.adaptive.advisoryPartitionSizeInBytes` size. These +settings will also affect any user performed re-partitions or sorts. +It is important again to note that this is the in-memory Spark row size and not the on disk +columnar-compressed size, so a larger value than the target file size will need to be specified. The ratio of +in-memory size to on disk size is data dependent. Future work in Spark should allow Iceberg to automatically adjust this +parameter at write time to match the `write.target-file-size-bytes`. + +## Type compatibility + +Spark and Iceberg support different sets of types. Iceberg does the type conversion automatically, but not for all combinations, +so you may want to understand how Iceberg converts types before designing the column types of your tables. 
+ +### Spark type to Iceberg type + +This type conversion table describes how Spark types are converted to the Iceberg types. The conversion applies both when creating an Iceberg table and when writing to an Iceberg table via Spark. + +| Spark | Iceberg | Notes | +|-----------------|----------------------------|-------| +| boolean | boolean | | +| short | integer | | +| byte | integer | | +| integer | integer | | +| long | long | | +| float | float | | +| double | double | | +| date | date | | +| timestamp | timestamp with timezone | | +| timestamp_ntz | timestamp without timezone | | +| char | string | | +| varchar | string | | +| string | string | | +| binary | binary | | +| decimal | decimal | | +| struct | struct | | +| array | list | | +| map | map | | + +{{< hint info >}} +The table above describes the conversions applied when creating a table. Writes support a broader set of conversions: + +* Iceberg numeric types (`integer`, `long`, `float`, `double`, `decimal`) support promotion during writes. For example, you can write Spark types `short`, `byte`, `integer`, `long` to Iceberg type `long`. +* You can write to Iceberg `fixed` type using Spark `binary` type. Note that the length of the value is validated on write. +{{< /hint >}} + +### Iceberg type to Spark type + +This type conversion table describes how Iceberg types are converted to the Spark types. The conversion applies when reading from an Iceberg table via Spark. 
+ +| Iceberg | Spark | Note | +|----------------------------|-------------------------|---------------| +| boolean | boolean | | +| integer | integer | | +| long | long | | +| float | float | | +| double | double | | +| date | date | | +| time | | Not supported | +| timestamp with timezone | timestamp | | +| timestamp without timezone | timestamp_ntz | | +| string | string | | +| uuid | string | | +| fixed | binary | | +| binary | binary | | +| decimal | decimal | | +| struct | struct | | +| list | array | | +| map | map | | + diff --git a/docs/table-migration.md b/docs/table-migration.md new file mode 100644 index 000000000000..0eacd519e136 --- /dev/null +++ b/docs/table-migration.md @@ -0,0 +1,80 @@ +--- +title: "Overview" +url: table-migration +menu: + main: + parent: "Migration" + identifier: table_migration + weight: 100 +--- + + +# Table Migration +Apache Iceberg supports converting existing tables in other formats to Iceberg tables. This section introduces the general concept of table migration, its approaches, and existing implementations in Iceberg. + +## Migration Approaches +There are two methods for executing table migration: full data migration and in-place metadata migration. + +Full data migration involves copying all data files from the source table to the new Iceberg table. This method makes the new table fully isolated from the source table, but is slower and doubles the space. +In practice, users can use operations like [Create-Table-As-Select](../spark-ddl/#create-table--as-select), [INSERT](../spark-writes/#insert-into), and Change-Data-Capture pipelines to perform such migration. + +In-place metadata migration preserves the existing data files while incorporating Iceberg metadata on top of them. +This method is not only faster but also eliminates the need for data duplication. However, the new table and the source table are not fully isolated. 
In other words, if any processes vacuum data files from the source table, the new table will also be affected. + +This document focuses on in-place metadata migration. + +![In-Place Metadata Migration](../../../img/iceberg-in-place-metadata-migration.png) + +Apache Iceberg supports the in-place metadata migration approach, which includes three important actions: **Snapshot Table**, **Migrate Table**, and **Add Files**. + +## Snapshot Table +The Snapshot Table action creates a new Iceberg table with a different name and with the same schema and partitioning as the source table, leaving the source table unchanged during and after the action. + +- Create a new Iceberg table with the same metadata (schema, partition spec, etc.) as the source table and a different name. Readers and Writers on the source table can continue to work. + +![Snapshot Table Step 1](../../../img/iceberg-snapshotaction-step1.png) + +- Commit all data files across all partitions to the new Iceberg table. The source table remains unchanged. Readers can be switched to the new Iceberg table. + +![Snapshot Table Step 2](../../../img/iceberg-snapshotaction-step2.png) + +- Eventually, all writers can be switched to the new Iceberg table. Once all writers are transitioned to the new Iceberg table, the migration process will be considered complete. + +## Migrate Table +The Migrate Table action also creates a new Iceberg table with the same schema and partitioning as the source table. However, during the action execution, it locks and drops the source table from the catalog. +Consequently, Migrate Table requires all writes to the source table to be stopped before the action is performed. + +- Stop all writers interacting with the source table. Readers that also support Iceberg may continue reading. + +![Migrate Table Step 1](../../../img/iceberg-migrateaction-step1.png) + +- Create a new Iceberg table with the same identifier and metadata (schema, partition spec, etc.) 
as the source table. Rename the source table for a backup in case of failure and rollback. + +![Migrate Table Step 2](../../../img/iceberg-migrateaction-step2.png) + +- Commit all data files across all partitions to the new Iceberg table. Drop the source table. Writers can start writing to the new Iceberg table. + +![Migrate Table Step 3](../../../img/iceberg-migrateaction-step3.png) + +## Add Files +After the initial step (either Snapshot Table or Migrate Table), it is common to find some data files that have not been migrated. These files often originate from concurrent writers who continue writing to the source table during or after the migration process. +In practice, these files can be new data files in Hive tables or new snapshots (versions) of Delta Lake tables. The Add Files action is essential for incorporating these files into the Iceberg table. + +# Migrating From Different Table Formats +* [From Hive to Iceberg](../hive-migration) +* [From Delta Lake to Iceberg](../delta-lake-migration) diff --git a/site/dev/common.sh b/site/dev/common.sh index ab17df7b8f10..42974a1e28be 100755 --- a/site/dev/common.sh +++ b/site/dev/common.sh @@ -178,6 +178,8 @@ pull_versioned_docs () { # Ensure the remote repository for documentation exists and is up-to-date create_or_update_docs_remote + rm -r docs/docs + # Add local worktrees for documentation and javadoc from the remote repository git worktree add -f docs/docs "${REMOTE}/docs" git worktree add -f docs/javadoc "${REMOTE}/javadoc" @@ -204,6 +206,8 @@ clean () { git worktree remove docs/docs &> /dev/null git worktree remove docs/javadoc &> /dev/null + git restore docs/docs + # Remove any additional temporary artifacts (e.g., 'site/' directory) rm -rf site/ &> /dev/null diff --git a/docs/docs/api.md b/site/docs/docs/nightly/docs/api.md similarity index 100% rename from docs/docs/api.md rename to site/docs/docs/nightly/docs/api.md diff --git a/docs/docs/assets/images/audit-branch.png 
b/site/docs/docs/nightly/docs/assets/images/audit-branch.png similarity index 100% rename from docs/docs/assets/images/audit-branch.png rename to site/docs/docs/nightly/docs/assets/images/audit-branch.png diff --git a/docs/docs/assets/images/historical-snapshot-tag.png b/site/docs/docs/nightly/docs/assets/images/historical-snapshot-tag.png similarity index 100% rename from docs/docs/assets/images/historical-snapshot-tag.png rename to site/docs/docs/nightly/docs/assets/images/historical-snapshot-tag.png diff --git a/docs/docs/assets/images/iceberg-in-place-metadata-migration.png b/site/docs/docs/nightly/docs/assets/images/iceberg-in-place-metadata-migration.png similarity index 100% rename from docs/docs/assets/images/iceberg-in-place-metadata-migration.png rename to site/docs/docs/nightly/docs/assets/images/iceberg-in-place-metadata-migration.png diff --git a/docs/docs/assets/images/iceberg-migrateaction-step1.png b/site/docs/docs/nightly/docs/assets/images/iceberg-migrateaction-step1.png similarity index 100% rename from docs/docs/assets/images/iceberg-migrateaction-step1.png rename to site/docs/docs/nightly/docs/assets/images/iceberg-migrateaction-step1.png diff --git a/docs/docs/assets/images/iceberg-migrateaction-step2.png b/site/docs/docs/nightly/docs/assets/images/iceberg-migrateaction-step2.png similarity index 100% rename from docs/docs/assets/images/iceberg-migrateaction-step2.png rename to site/docs/docs/nightly/docs/assets/images/iceberg-migrateaction-step2.png diff --git a/docs/docs/assets/images/iceberg-migrateaction-step3.png b/site/docs/docs/nightly/docs/assets/images/iceberg-migrateaction-step3.png similarity index 100% rename from docs/docs/assets/images/iceberg-migrateaction-step3.png rename to site/docs/docs/nightly/docs/assets/images/iceberg-migrateaction-step3.png diff --git a/docs/docs/assets/images/iceberg-snapshotaction-step1.png b/site/docs/docs/nightly/docs/assets/images/iceberg-snapshotaction-step1.png similarity index 100% rename from 
docs/docs/assets/images/iceberg-snapshotaction-step1.png rename to site/docs/docs/nightly/docs/assets/images/iceberg-snapshotaction-step1.png diff --git a/docs/docs/assets/images/iceberg-snapshotaction-step2.png b/site/docs/docs/nightly/docs/assets/images/iceberg-snapshotaction-step2.png similarity index 100% rename from docs/docs/assets/images/iceberg-snapshotaction-step2.png rename to site/docs/docs/nightly/docs/assets/images/iceberg-snapshotaction-step2.png diff --git a/docs/docs/assets/images/partition-spec-evolution.png b/site/docs/docs/nightly/docs/assets/images/partition-spec-evolution.png similarity index 100% rename from docs/docs/assets/images/partition-spec-evolution.png rename to site/docs/docs/nightly/docs/assets/images/partition-spec-evolution.png diff --git a/docs/docs/aws.md b/site/docs/docs/nightly/docs/aws.md similarity index 100% rename from docs/docs/aws.md rename to site/docs/docs/nightly/docs/aws.md diff --git a/docs/docs/branching.md b/site/docs/docs/nightly/docs/branching.md similarity index 100% rename from docs/docs/branching.md rename to site/docs/docs/nightly/docs/branching.md diff --git a/docs/docs/configuration.md b/site/docs/docs/nightly/docs/configuration.md similarity index 100% rename from docs/docs/configuration.md rename to site/docs/docs/nightly/docs/configuration.md diff --git a/docs/docs/custom-catalog.md b/site/docs/docs/nightly/docs/custom-catalog.md similarity index 100% rename from docs/docs/custom-catalog.md rename to site/docs/docs/nightly/docs/custom-catalog.md diff --git a/docs/docs/dell.md b/site/docs/docs/nightly/docs/dell.md similarity index 100% rename from docs/docs/dell.md rename to site/docs/docs/nightly/docs/dell.md diff --git a/docs/docs/delta-lake-migration.md b/site/docs/docs/nightly/docs/delta-lake-migration.md similarity index 100% rename from docs/docs/delta-lake-migration.md rename to site/docs/docs/nightly/docs/delta-lake-migration.md diff --git a/docs/docs/evolution.md 
b/site/docs/docs/nightly/docs/evolution.md similarity index 100% rename from docs/docs/evolution.md rename to site/docs/docs/nightly/docs/evolution.md diff --git a/docs/docs/flink-actions.md b/site/docs/docs/nightly/docs/flink-actions.md similarity index 100% rename from docs/docs/flink-actions.md rename to site/docs/docs/nightly/docs/flink-actions.md diff --git a/docs/docs/flink-configuration.md b/site/docs/docs/nightly/docs/flink-configuration.md similarity index 100% rename from docs/docs/flink-configuration.md rename to site/docs/docs/nightly/docs/flink-configuration.md diff --git a/docs/docs/flink-connector.md b/site/docs/docs/nightly/docs/flink-connector.md similarity index 100% rename from docs/docs/flink-connector.md rename to site/docs/docs/nightly/docs/flink-connector.md diff --git a/docs/docs/flink-ddl.md b/site/docs/docs/nightly/docs/flink-ddl.md similarity index 100% rename from docs/docs/flink-ddl.md rename to site/docs/docs/nightly/docs/flink-ddl.md diff --git a/docs/docs/flink-queries.md b/site/docs/docs/nightly/docs/flink-queries.md similarity index 100% rename from docs/docs/flink-queries.md rename to site/docs/docs/nightly/docs/flink-queries.md diff --git a/docs/docs/flink-writes.md b/site/docs/docs/nightly/docs/flink-writes.md similarity index 100% rename from docs/docs/flink-writes.md rename to site/docs/docs/nightly/docs/flink-writes.md diff --git a/docs/docs/flink.md b/site/docs/docs/nightly/docs/flink.md similarity index 100% rename from docs/docs/flink.md rename to site/docs/docs/nightly/docs/flink.md diff --git a/docs/docs/hive-migration.md b/site/docs/docs/nightly/docs/hive-migration.md similarity index 100% rename from docs/docs/hive-migration.md rename to site/docs/docs/nightly/docs/hive-migration.md diff --git a/docs/docs/hive.md b/site/docs/docs/nightly/docs/hive.md similarity index 100% rename from docs/docs/hive.md rename to site/docs/docs/nightly/docs/hive.md diff --git a/docs/docs/index.md b/site/docs/docs/nightly/docs/index.md 
similarity index 100% rename from docs/docs/index.md rename to site/docs/docs/nightly/docs/index.md diff --git a/docs/docs/java-api-quickstart.md b/site/docs/docs/nightly/docs/java-api-quickstart.md similarity index 100% rename from docs/docs/java-api-quickstart.md rename to site/docs/docs/nightly/docs/java-api-quickstart.md diff --git a/docs/docs/jdbc.md b/site/docs/docs/nightly/docs/jdbc.md similarity index 100% rename from docs/docs/jdbc.md rename to site/docs/docs/nightly/docs/jdbc.md diff --git a/docs/docs/maintenance.md b/site/docs/docs/nightly/docs/maintenance.md similarity index 100% rename from docs/docs/maintenance.md rename to site/docs/docs/nightly/docs/maintenance.md diff --git a/docs/docs/metrics-reporting.md b/site/docs/docs/nightly/docs/metrics-reporting.md similarity index 100% rename from docs/docs/metrics-reporting.md rename to site/docs/docs/nightly/docs/metrics-reporting.md diff --git a/docs/docs/nessie.md b/site/docs/docs/nightly/docs/nessie.md similarity index 100% rename from docs/docs/nessie.md rename to site/docs/docs/nightly/docs/nessie.md diff --git a/docs/docs/partitioning.md b/site/docs/docs/nightly/docs/partitioning.md similarity index 100% rename from docs/docs/partitioning.md rename to site/docs/docs/nightly/docs/partitioning.md diff --git a/docs/docs/performance.md b/site/docs/docs/nightly/docs/performance.md similarity index 100% rename from docs/docs/performance.md rename to site/docs/docs/nightly/docs/performance.md diff --git a/docs/docs/reliability.md b/site/docs/docs/nightly/docs/reliability.md similarity index 100% rename from docs/docs/reliability.md rename to site/docs/docs/nightly/docs/reliability.md diff --git a/docs/docs/schemas.md b/site/docs/docs/nightly/docs/schemas.md similarity index 100% rename from docs/docs/schemas.md rename to site/docs/docs/nightly/docs/schemas.md diff --git a/docs/docs/spark-configuration.md b/site/docs/docs/nightly/docs/spark-configuration.md similarity index 100% rename from 
docs/docs/spark-configuration.md rename to site/docs/docs/nightly/docs/spark-configuration.md diff --git a/docs/docs/spark-ddl.md b/site/docs/docs/nightly/docs/spark-ddl.md similarity index 100% rename from docs/docs/spark-ddl.md rename to site/docs/docs/nightly/docs/spark-ddl.md diff --git a/docs/docs/spark-getting-started.md b/site/docs/docs/nightly/docs/spark-getting-started.md similarity index 100% rename from docs/docs/spark-getting-started.md rename to site/docs/docs/nightly/docs/spark-getting-started.md diff --git a/docs/docs/spark-procedures.md b/site/docs/docs/nightly/docs/spark-procedures.md similarity index 100% rename from docs/docs/spark-procedures.md rename to site/docs/docs/nightly/docs/spark-procedures.md diff --git a/docs/docs/spark-queries.md b/site/docs/docs/nightly/docs/spark-queries.md similarity index 100% rename from docs/docs/spark-queries.md rename to site/docs/docs/nightly/docs/spark-queries.md diff --git a/docs/docs/spark-structured-streaming.md b/site/docs/docs/nightly/docs/spark-structured-streaming.md similarity index 100% rename from docs/docs/spark-structured-streaming.md rename to site/docs/docs/nightly/docs/spark-structured-streaming.md diff --git a/docs/docs/spark-writes.md b/site/docs/docs/nightly/docs/spark-writes.md similarity index 100% rename from docs/docs/spark-writes.md rename to site/docs/docs/nightly/docs/spark-writes.md diff --git a/docs/docs/table-migration.md b/site/docs/docs/nightly/docs/table-migration.md similarity index 100% rename from docs/docs/table-migration.md rename to site/docs/docs/nightly/docs/table-migration.md diff --git a/docs/mkdocs.yml b/site/docs/docs/nightly/mkdocs.yml similarity index 100% rename from docs/mkdocs.yml rename to site/docs/docs/nightly/mkdocs.yml