We have a schema that contains a huge number of tables (8k+), and we see timeouts when listing them through Iceberg's HiveCatalog implementation, whereas Spark's default catalog is very fast.
Example:
If we run a spark session with this conf:
and run spark.sql("show tables in some_schema").show(), it takes roughly 15 seconds, because it uses Spark's own implementation to access Hive tables. We can see that in our metastore logs:
INFO 2025-01-21T18:25:24.565000057Z map[class:HiveMetaStore.audit log:ugi=jorge.arada ip=10.123.123.123 cmd=source:123.123.123.123 get_database: some_schema thread:pool-12-thread-19]
INFO 2025-01-21T18:25:24.565000057Z map[class:metastore.HiveMetaStore log:26: source:10.123.123.123 get_database: some_schema thread:pool-12-thread-19]
INFO 2025-01-21T18:25:24.571000099Z map[class:metastore.HiveMetaStore log:26: source:10.123.123.123 get_database: some_schema thread:pool-12-thread-19]
INFO 2025-01-21T18:25:24.571000099Z map[class:HiveMetaStore.audit log:ugi=jorge.arada ip=10.123.123.123 cmd=source:123.123.123.123 get_database: some_schema thread:pool-12-thread-19]
INFO 2025-01-21T18:25:24.579999923Z map[class:HiveMetaStore.audit log:ugi=jorge.arada ip=123.123.123.123 cmd=source:123.123.123.123 get_tables: db=some_schema pat=* thread:pool-12-thread-19]
INFO 2025-01-21T18:25:24.579999923Z map[class:metastore.HiveMetaStore log:26: source:123.123.123.123 get_tables: db=some_schema pat=* thread:pool-12-thread-19]
But if we run spark.sql("show tables in iceberg.some_schema").show(), it takes up to 5 minutes, and the logs show that a different metastore method was called (get_all_tables instead of get_tables):
INFO 2025-01-21T18:29:49.118000030Z map[class:HiveMetaStore.audit log:ugi=jorge.arada ip=123.123.123.123 cmd=source:123.123.123.123 get_all_tables: db=some_schema thread:pool-12-thread-129]
INFO 2025-01-21T18:29:49.118000030Z map[class:metastore.HiveMetaStore log:135: source:123.123.123.123 get_all_tables: db=some_schema thread:pool-12-thread-129]
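To make the difference between the two code paths easier to spot in a longer log, the metastore command names can be pulled out of the audit lines. The following is a minimal sketch, assuming the audit lines keep the exact shape shown above; `metastore_calls` and the regular expression are hypothetical helpers, not part of Spark, Hive, or Iceberg.

```python
import re

# Matches the command name in HiveMetaStore.audit lines shaped like the ones
# above, e.g. "... cmd=source:123.123.123.123 get_tables: db=some_schema ...".
# Assumption: the log format is exactly as in the excerpts from this issue.
AUDIT_CMD = re.compile(r"class:HiveMetaStore\.audit\b.*?cmd=source:\S+\s+(\w+):")


def metastore_calls(log_lines):
    """Return the metastore command names found in audit lines, in order."""
    calls = []
    for line in log_lines:
        match = AUDIT_CMD.search(line)
        if match:
            calls.append(match.group(1))
    return calls


# Audit lines copied from the two runs above (one per distinct call).
default_catalog_log = [
    "INFO 2025-01-21T18:25:24.565000057Z map[class:HiveMetaStore.audit log:ugi=jorge.arada ip=10.123.123.123 cmd=source:123.123.123.123 get_database: some_schema thread:pool-12-thread-19]",
    "INFO 2025-01-21T18:25:24.579999923Z map[class:HiveMetaStore.audit log:ugi=jorge.arada ip=123.123.123.123 cmd=source:123.123.123.123 get_tables: db=some_schema pat=* thread:pool-12-thread-19]",
]
iceberg_catalog_log = [
    "INFO 2025-01-21T18:29:49.118000030Z map[class:HiveMetaStore.audit log:ugi=jorge.arada ip=123.123.123.123 cmd=source:123.123.123.123 get_all_tables: db=some_schema thread:pool-12-thread-129]",
]

print(metastore_calls(default_catalog_log))  # ['get_database', 'get_tables']
print(metastore_calls(iceberg_catalog_log))  # ['get_all_tables']
```

Lines from the non-audit logger (class:metastore.HiveMetaStore) are skipped on purpose, so each call is counted once.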
Tested on Spark 3.3 and 3.5.
From what I could read in the Iceberg code, the behaviour seems to be the same in Iceberg 1.7.X.
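For anyone who wants to compare the two paths in their own environment, a small timing wrapper is enough. This is only a sketch: `timed` is a hypothetical helper, and the commented usage assumes an existing `spark` session with a catalog named iceberg, as in the queries above.

```python
import time


def timed(fn):
    """Run fn() once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Usage against a live session (not executed here; `spark` is assumed to exist):
#   _, default_secs = timed(lambda: spark.sql("show tables in some_schema").show())
#   _, iceberg_secs = timed(lambda: spark.sql("show tables in iceberg.some_schema").show())
#   print(f"default: {default_secs:.1f}s, iceberg: {iceberg_secs:.1f}s")
```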
I have tested a change in our fork that uses org.apache.hadoop.hive.ql.metadata.Hive, as Spark does, and it works like a charm, but it requires a dependency on hive-exec. Do you think this could be the way to go? I can create a PR with the change.
Apache Iceberg version
1.4.3
Query engine
Spark