HiveCatalog client slow in some requests #12024

Open
1 of 3 tasks
jotarada opened this issue Jan 21, 2025 · 2 comments
Labels
bug Something isn't working

jotarada commented Jan 21, 2025

Apache Iceberg version

1.4.3

Query engine

Spark

Please describe the bug 🐞

We have a schema that contains a very large number of tables (8k+), and we see timeouts when listing its tables through Iceberg's HiveCatalog implementation, while Spark's default implementation is very fast.
For example, if we start a Spark session with this configuration:

pyspark --master yarn \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.4.3 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive \
  --conf spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.iceberg.type=hive

and run spark.sql("show tables in some_schema").show(), it takes about 15 seconds, because it uses the Spark implementation to access the Hive tables. We can see that in our metastore logs:

INFO 2025-01-21T18:25:24.565000057Z map[class:HiveMetaStore.audit log:ugi=jorge.arada ip=10.123.123.123 cmd=source:123.123.123.123 get_database: some_schema thread:pool-12-thread-19]
INFO 2025-01-21T18:25:24.565000057Z map[class:metastore.HiveMetaStore log:26: source:10.123.123.123 get_database: some_schema thread:pool-12-thread-19]
INFO 2025-01-21T18:25:24.571000099Z map[class:metastore.HiveMetaStore log:26: source:10.123.123.123 get_database: some_schema thread:pool-12-thread-19]
INFO 2025-01-21T18:25:24.571000099Z map[class:HiveMetaStore.audit log:ugi=jorge.arada ip=10.123.123.123 cmd=source:123.123.123.123 get_database: some_schema thread:pool-12-thread-19]
INFO 2025-01-21T18:25:24.579999923Z map[class:HiveMetaStore.audit log:ugi=jorge.arada ip=123.123.123.123 cmd=source:123.123.123.123 get_tables: db=some_schema pat=* thread:pool-12-thread-19]
INFO 2025-01-21T18:25:24.579999923Z map[class:metastore.HiveMetaStore log:26: source:123.123.123.123 get_tables: db=some_schema pat=* thread:pool-12-thread-19]

But if we run spark.sql("show tables in iceberg.some_schema").show(), it takes up to 5 minutes, and the logs show that a different metastore method was called (get_all_tables instead of get_tables):

INFO 2025-01-21T18:29:49.118000030Z map[class:HiveMetaStore.audit log:ugi=jorge.arada ip=123.123.123.123 cmd=source:123.123.123.123 get_all_tables: db=some_schema thread:pool-12-thread-129]
INFO 2025-01-21T18:29:49.118000030Z map[class:metastore.HiveMetaStore log:135: source:123.123.123.123 get_all_tables: db=some_schema thread:pool-12-thread-129]
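To reproduce the timing gap between the two statements above, a small helper along these lines could be used. This is only a sketch: `time_statements` and the labels are hypothetical names introduced here, and the commented usage assumes the `spark` session that the pyspark shell creates with the configuration shown earlier.

```python
import time
from typing import Callable, Dict


def time_statements(run_sql: Callable[[str], object],
                    statements: Dict[str, str]) -> Dict[str, float]:
    """Run each statement once and return the elapsed wall-clock seconds per label."""
    timings = {}
    for label, statement in statements.items():
        start = time.perf_counter()
        run_sql(statement)
        timings[label] = time.perf_counter() - start
    return timings


# Usage inside the pyspark shell (assumes `spark` exists there):
# timings = time_statements(
#     lambda q: spark.sql(q).collect(),
#     {
#         "session catalog": "show tables in some_schema",
#         "iceberg catalog": "show tables in iceberg.some_schema",
#     },
# )
# print(timings)
```

Each statement runs only once here, so metastore-side caching or load from other users can skew a single measurement; repeating the run a few times gives a fairer comparison.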

Tested on Spark 3.3 and 3.5.
From what I could read in the Iceberg code, the behaviour seems to be the same in Iceberg 1.7.X.

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
@jotarada jotarada added the bug Something isn't working label Jan 21, 2025
jotarada (Author) commented

I also tested Trino 454 and noticed different behaviour; it calls get_table_metas:

INFO 2025-01-21T18:41:02.275000095Z map[class:metastore.HiveMetaStore log:143: source:123.123.123.123 get_database: some_schema thread:pool-12-thread-138]
INFO 2025-01-21T18:41:02.276000022Z map[class:HiveMetaStore.audit log:ugi=hive ip=123.123.123.123 cmd=source:123.123.123.123 get_database: some_schema thread:pool-12-thread-138]
INFO 2025-01-21T18:41:02.430999994Z map[class:metastore.HiveMetaStore log:185: source:123.123.123.123 get_table_metas : tbl=hive.some_schema.* thread:pool-12-thread-179]
INFO 2025-01-21T18:41:02.431999921Z map[class:HiveMetaStore.audit log:ugi=hive ip=123.123.123.123 cmd=source:123.123.123.123 get_table_metas : tbl=hive.some_schema.* thread:pool-12-thread-179]
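To compare which metastore calls each engine issues, the log excerpts in this thread can be tallied with a short script. This is a sketch that assumes the log line layout shown above (`source:<ip> <call>`); `count_metastore_calls` is a hypothetical helper name. Each call appears twice in the logs (once from HiveMetaStore.audit, once from metastore.HiveMetaStore), so only the audit lines are counted.

```python
import re
from collections import Counter

# Captures the metastore call name that follows "source:<ip> " in a log line,
# for example get_tables, get_all_tables or get_table_metas.
CALL_PATTERN = re.compile(r"source:\S+\s+(get_\w+)")


def count_metastore_calls(lines):
    """Count metastore calls per method name, using only the audit log lines."""
    counts = Counter()
    for line in lines:
        if "HiveMetaStore.audit" not in line:
            continue  # skip the duplicate entry from the non-audit logger
        match = CALL_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Usage, assuming the metastore log was saved to a local file:
# with open("metastore.log") as handle:
#     print(count_metastore_calls(handle))
```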

jotarada (Author) commented

In our fork I tested using org.apache.hadoop.hive.ql.metadata.Hive, as Spark does, and it works like a charm, but it requires a dependency on hive-exec. Do you think this could be the way to go? I can create a PR with the change.
