HiveCatalog client slow in some requests #12024

jotarada · 2025-01-21T18:39:38Z

Apache Iceberg version

1.4.3

Query engine

Spark

Please describe the bug 🐞

We have a schema that contains a very large number of tables (8k+), and we see timeouts when listing its tables through the Iceberg HiveCatalog implementation, while Spark's default catalog is very fast.
Example:
If we start a Spark session with this configuration:

pyspark --master yarn \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.4.3 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive \
  --conf spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.iceberg.type=hive

and run spark.sql("show tables in some_schema").show(), it takes about 15 seconds, because it goes through Spark's own implementation to access the Hive tables. We can see this in our metastore logs:

INFO 2025-01-21T18:25:24.565000057Z map[class:HiveMetaStore.audit log:ugi=jorge.arada ip=10.123.123.123 cmd=source:123.123.123.123 get_database: some_schema thread:pool-12-thread-19]
INFO 2025-01-21T18:25:24.565000057Z map[class:metastore.HiveMetaStore log:26: source:10.123.123.123 get_database: some_schema thread:pool-12-thread-19]
INFO 2025-01-21T18:25:24.571000099Z map[class:metastore.HiveMetaStore log:26: source:10.123.123.123 get_database: some_schema thread:pool-12-thread-19]
INFO 2025-01-21T18:25:24.571000099Z map[class:HiveMetaStore.audit log:ugi=jorge.arada ip=10.123.123.123 cmd=source:123.123.123.123 get_database: some_schema thread:pool-12-thread-19]
INFO 2025-01-21T18:25:24.579999923Z map[class:HiveMetaStore.audit log:ugi=jorge.arada ip=123.123.123.123 cmd=source:123.123.123.123 get_tables: db=some_schema pat=* thread:pool-12-thread-19]
INFO 2025-01-21T18:25:24.579999923Z map[class:metastore.HiveMetaStore log:26: source:123.123.123.123 get_tables: db=some_schema pat=* thread:pool-12-thread-19]

But if we run spark.sql("show tables in iceberg.some_schema").show(), it takes up to 5 minutes, and the logs show that a different metastore method was called (get_all_tables instead of get_tables):

INFO 2025-01-21T18:29:49.118000030Z map[class:HiveMetaStore.audit log:ugi=jorge.arada ip=123.123.123.123 cmd=source:123.123.123.123 get_all_tables: db=some_schema thread:pool-12-thread-129]
INFO 2025-01-21T18:29:49.118000030Z map[class:metastore.HiveMetaStore log:135: source:123.123.123.123 get_all_tables: db=some_schema thread:pool-12-thread-129]
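
To make the two timings easier to reproduce, here is a minimal sketch of a timing helper. The names `time_listing` and `run_query` are made up for illustration and are not part of Spark or Iceberg; the helper only assumes that `run_query` takes a SQL string and returns a list of rows.

```python
import time


def time_listing(run_query, sql):
    """Run `sql` through `run_query` and return (row_count, elapsed_seconds).

    `run_query` is a hypothetical callable supplied by the caller; it must
    return something with a length, such as a list of rows.
    """
    start = time.perf_counter()
    rows = run_query(sql)
    elapsed = time.perf_counter() - start
    return len(rows), elapsed


# Possible usage inside the pyspark session from this report (not run here):
#   run = lambda q: spark.sql(q).collect()
#   print(time_listing(run, "show tables in some_schema"))
#   print(time_listing(run, "show tables in iceberg.some_schema"))
```

Using collect() rather than show() in the usage sketch is a deliberate choice: show() only prints the first rows, while collect() forces the full table list to be returned, so the two catalogs are compared on the same amount of work.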

Tested on Spark 3.3 and 3.5.
From what I could read in the Iceberg code, the behaviour seems to be the same in Iceberg 1.7.X.

Willingness to contribute

I can contribute a fix for this bug independently
I would be willing to contribute a fix for this bug with guidance from the Iceberg community
I cannot contribute a fix for this bug at this time


jotarada · 2025-01-21T18:46:34Z

I also tested Trino 454 and noticed different behaviour again: it calls get_table_metas.

INFO 2025-01-21T18:41:02.275000095Z map[class:metastore.HiveMetaStore log:143: source:123.123.123.123 get_database: some_schema thread:pool-12-thread-138]
INFO 2025-01-21T18:41:02.276000022Z map[class:HiveMetaStore.audit log:ugi=hive ip=123.123.123.123 cmd=source:123.123.123.123 get_database: some_schema thread:pool-12-thread-138]
INFO 2025-01-21T18:41:02.430999994Z map[class:metastore.HiveMetaStore log:185: source:123.123.123.123 get_table_metas : tbl=hive.some_schema.* thread:pool-12-thread-179]
INFO 2025-01-21T18:41:02.431999921Z map[class:HiveMetaStore.audit log:ugi=hive ip=123.123.123.123 cmd=source:123.123.123.123 get_table_metas : tbl=hive.some_schema.* thread:pool-12-thread-179]
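
To compare which metastore methods each client triggers, the audit lines can be tallied with a small script. This is only a sketch: it assumes the log format shown in this issue (a `HiveMetaStore.audit` entry containing `cmd=source:<ip> <method>:`), and `count_metastore_calls` is a hypothetical helper name, not an existing tool.

```python
import re
from collections import Counter

# Captures the method name that follows "cmd=source:<ip> " in an audit line,
# e.g. "get_tables:" or "get_table_metas :" (note the optional space).
CALL_RE = re.compile(r"cmd=source:\S+\s+(\w+)\s*:")


def count_metastore_calls(lines):
    """Count metastore method names found in HiveMetaStore.audit log lines."""
    counts = Counter()
    for line in lines:
        # Each call is logged twice (audit and non-audit); count audit only.
        if "HiveMetaStore.audit" not in line:
            continue
        match = CALL_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Fed with the metastore log captured during each "show tables" run, this would show get_tables for spark_catalog, get_all_tables for the Iceberg catalog, and get_table_metas for Trino.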

jotarada · 2025-01-24T10:48:43Z

In our fork I tested using org.apache.hadoop.hive.ql.metadata.Hive, as Spark does, and it works like a charm, but it requires a dependency on hive-exec. Do you think this could be the way to go? I can create a PR with the change.

jotarada added the "bug" (Something isn't working) label on Jan 21, 2025
