
[Bug report] federation query using 2 hive metastores does not work when using gravitino #4932

Open
foryou7242 opened this issue Sep 13, 2024 · 10 comments

Version

main branch

Describe what's wrong

I want to run federated queries using Hive metastores stored in two Hadoop clusters.

So we added two Hive catalogs to the metalake.

The location path reported by show create table differs from the table's actual location when querying with spark-sql.

(screenshot showing the location mismatch)

It seems to be an effect of the spark.sql.metastore.uris option used by the actual spark-sql query, so I'm wondering whether it's possible to run federated queries across two Hive metastores.

Error message and/or stacktrace

>  show create table   portal_test_schema;
CREATE TABLE portal_test_schema (
...
  month INT,
  day INT,
  hour INT
)
PARTITIONED BY (month, day, hour)
LOCATION 'hdfs://test1/test1'
TBLPROPERTIES (
  'bucketing_version' = '2',
  'discover.partitions' = 'true',
  'input-format' = 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',

Explain query:

spark-sql ()> EXPLAIN show create table   portal_test_schema;
== Physical Plan ==
ShowCreateTable [createtab_stmt#0], HiveTable(org.apache.spark.sql.SparkSession@14144cc9,CatalogTable(
Database: ladp
Table: portal_test_schema
Created Time: Thu Jan 26 18:40:15 JST 2023
Last Access: UNKNOWN
Created By: Spark 2.2 or prior
Type: EXTERNAL
Provider: hive
Table Properties: [bucketing_version=2, numFilesErasureCoded=0, transient_lastDdlTime=1725947686]
Location: hdfs://test2/portal_test_schema
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Storage Properties: [serialization.format=1]
Partition Provider: Catalog
Schema: root
...
),org.apache.kyuubi.spark.connector.hive.HiveTableCatalog@64cbc28e)

How to reproduce

Gravitino main branch

Additional context

No response

foryou7242 added the bug label on Sep 13, 2024
jerqi (Collaborator) commented Sep 14, 2024

@FANNG1

FANNG1 (Contributor) commented Sep 14, 2024

@foryou7242, could you help clarify the questions below?

  1. In your environment, is there only one Hive metastore but two HDFS clusters?
  2. The main problem is that you created a table with location hdfs://hdfs1/xxx, but show create table shows the location as hdfs://hdfs2/xxx, yes?
  3. Could you share the catalog properties used when you created the Hive catalogs, and the Spark configuration used with Spark SQL? (A rough sketch of what these might look like is given below.)
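For reference, registering the two Hive catalogs in a Gravitino metalake might look roughly like the sketch below. This is only an illustration: the metalake, catalog, and host names are placeholders, not values taken from this issue.

# Hypothetical sketch: one Gravitino Hive catalog per metastore, created via the REST API.
curl -X POST -H "Content-Type: application/json" \
  http://gravitino-host:8090/api/metalakes/TEST/catalogs -d '{
    "name": "hive_test1",
    "type": "RELATIONAL",
    "provider": "hive",
    "comment": "catalog backed by the test1 metastore",
    "properties": { "metastore.uris": "thrift://test1:9083" }
  }'

curl -X POST -H "Content-Type: application/json" \
  http://gravitino-host:8090/api/metalakes/TEST/catalogs -d '{
    "name": "hive_test2",
    "type": "RELATIONAL",
    "provider": "hive",
    "comment": "catalog backed by the test2 metastore",
    "properties": { "metastore.uris": "thrift://test2:9083" }
  }'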

foryou7242 (Author) commented Sep 19, 2024

@FANNG1

  1. No, there are two Hive metastores and two HDFS clusters.
  2. Yes.

test1 cluster catalog
  • metastore.uris: thrift://test1:9083

test2 cluster catalog
  • metastore.uris: thrift://test2:9083

Spark SQL configuration:
spark-sql     --master yarn     --queue batch  --deploy-mode client    --conf spark.executor.cores=2  --conf spark.executor.instances=10 \
 --conf spark.plugins="org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin" \
 --conf spark.sql.gravitino.uri=http://gravitino.stage.com \
 --conf spark.sql.gravitino.metalake=TEST \
 --jars hdfs://test1/user/spark/application/gravitino-spark-connector-runtime-3.4_2.12-0.7.0-incubating-SNAPSHOT.jar

My suspicion is that Gravitino uses the Kyuubi Hive connector (KyuubiHiveTable) for the Hive metastore connection, but Kyuubi only supports connecting to a single Hive metastore, which seems to be the problem. Am I right?

FANNG1 (Contributor) commented Sep 19, 2024

My suspicion is that Gravitino uses the Kyuubi Hive connector (KyuubiHiveTable) for the Hive metastore connection, but Kyuubi only supports connecting to a single Hive metastore, which seems to be the problem. Am I right?

The Kyuubi Hive connector can support multiple Hive metastores, because Gravitino creates a separate Kyuubi Hive catalog instance for each catalog, each configured with its own Hive metastore URI. I tested two Hive metastores sharing one HDFS cluster during the initial POC phase and it worked well. A hypothetical sketch of such a federated query is shown below.
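For illustration only (the catalog, schema, and table names here are placeholders, not taken from this issue), once both catalogs are registered in the metalake, a single Spark SQL session should be able to read from both and even join across them:

-- hypothetical catalogs hive_test1 / hive_test2, each backed by a different metastore
SELECT * FROM hive_test1.db1.t1 LIMIT 10;
SELECT * FROM hive_test2.db2.t2 LIMIT 10;

-- a federated join across the two metastores
SELECT a.id, b.value
FROM hive_test1.db1.t1 a
JOIN hive_test2.db2.t2 b ON a.id = b.id;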

Also, could you share the SQL used to create the table? Does querying the data work well?

foryou7242 (Author) commented:

Also, could you share the SQL used to create the table? Does querying the data work well?

The table is the same as the one in the issue description, because it's an existing table.

>  show create table   portal_test_schema;
CREATE TABLE portal_test_schema (
...
  month INT,
  day INT,
  hour INT
)
PARTITIONED BY (month, day, hour)
LOCATION 'hdfs://test1/test1'
TBLPROPERTIES (
  'bucketing_version' = '2',
  'discover.partitions' = 'true',
  'input-format' = 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',

FANNG1 (Contributor) commented Sep 19, 2024

I set up two Hive metastores with separate HDFS clusters and couldn't reproduce this issue with the following SQLs in either of the two catalogs. @foryou7242, could you try with a simple SQL like the following?

create table a(a int) location 'hdfs://localhost:9000/user/hive/warehouse/t1.db/a';
show create table a;
CREATE TABLE t1.a (
  a INT)
LOCATION 'hdfs://localhost:9000/user/hive/warehouse/t1.db/a'
TBLPROPERTIES (
  'input-format' = 'org.apache.hadoop.mapred.TextInputFormat',
  'output-format' = 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
  'serde-lib' = 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe',
  'serde-name' = 'a',
  'table-type' = 'MANAGED_TABLE',
  'transient_lastDdlTime' = '1726756779')
explain show create table a;
explain show create table a
== Physical Plan ==
ShowCreateTable [createtab_stmt#36], HiveTable(org.apache.spark.sql.SparkSession@7752f9fe,CatalogTable(
Database: t1
Table: a
Owner: hive
Created Time: Thu Sep 19 22:39:39 CST 2024
Last Access: UNKNOWN
Created By: Spark 2.2 or prior
Type: MANAGED
Provider: hive
Table Properties: [gravitino.identifier=gravitino.v1.uid1812375099371418513, owner=hive, transient_lastDdlTime=1726756779]
Location: hdfs://localhost:9000/user/hive/warehouse/t1.db/a
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Partition Provider: Catalog
Schema: root
 |-- a: integer (nullable = true)
),org.apache.kyuubi.spark.connector.hive.HiveTableCatalog@61563a91)

foryou7242 (Author) commented Sep 23, 2024

@FANNG1

I set up two Hive metastores with separate HDFS clusters and couldn't reproduce this issue with the following SQLs in either of the two catalogs.

Did you take into account that spark-sql uses an HDFS cluster that belongs to a different metastore, rather than localhost?

The above query results in the following error:

spark-sql (test)> create table a(a int) location 'hdfs://test1/warehouse/tablespace/external/hive/test.db/a';
24/09/23 18:07:58 ERROR SparkSQLDriver: Failed in [create table a(a int) location 'hdfs://test1/warehouse/tablespace/external/hive/test.db/a']
java.lang.RuntimeException: Failed to load the real sparkTable: test.a
	at org.apache.gravitino.spark.connector.catalog.BaseCatalog.loadSparkTable(BaseCatalog.java:459)
	at org.apache.gravitino.spark.connector.catalog.BaseCatalog.createTable(BaseCatalog.java:222)
	at org.apache.spark.sql.connector.catalog.TableCatalog.createTable(TableCatalog.java:199)
	at org.apache.spark.sql.execution.datasources.v2.CreateTableExec.run(CreateTableExec.scala:44)
	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:512)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:31)


Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: [TABLE_OR_VIEW_NOT_FOUND] The table or view `test`.`a` cannot be found. Verify the spelling and correctness of the schema and catalog.
If you did not qualify the name with a schema, verify the current_schema() output, or qualify the name with the correct schema and catalog.
To tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF EXISTS.
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:256)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableRawMetadata(SessionCatalog.scala:541)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:526)
	at org.apache.kyuubi.spark.connector.hive.HiveTableCatalog.$anonfun$loadTable$1(HiveTableCatalog.scala:166)
	at org.apache.kyuubi.spark.connector.hive.HiveConnectorUtils$.withSQLConf(HiveConnectorUtils.scala:274)
	at org.apache.kyuubi.spark.connector.hive.HiveTableCatalog.loadTable(HiveTableCatalog.scala:166)
	at org.apache.gravitino.spark.connector.catalog.BaseCatalog.loadSparkTable(BaseCatalog.java:456)
	... 60 more

FANNG1 (Contributor) commented Sep 23, 2024

You should set spark.bypass.spark.sql.hive.metastore.jars explicitly in the Hive catalog properties if you are using the Spark SQL client, or use the Spark shell instead. Please refer to https://gravitino.apache.org/docs/0.6.0-incubating/spark-connector/spark-catalog-hive

foryou7242 (Author) commented Sep 23, 2024

@FANNG1 Thank you so much for your help.

First of all, the spark.bypass.spark.sql.hive.metastore.jars setting is already set to builtin.

The root cause is that hive.metastore.uris was set to the test2 Hive metastore in hive-site.xml. I deleted that setting and confirmed that queries work normally with spark-shell.

However, I still have the same problem with spark-sql. Am I right in understanding that this is an issue that will be fixed in the future?

scala> val df = spark.table("test2.test").filter($"month" === "202009" && $"day" === "20200928").limit(100)
24/09/23 23:20:02 INFO HiveConf: Found configuration file file:/opt/spark/conf/hive-site.xml
24/09/23 23:20:02 INFO metastore: Trying to connect to metastore with URI thrift://test1:9083
24/09/23 23:20:02 INFO metastore: Opened a connection to metastore, current connections: 1
24/09/23 23:20:02 INFO metastore: Connected to metastore.
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [test1: string, month: string ... 1 more field]

scala> df.show(false)
+------------------------------------+------+--------+
|test1                            |month |day     |
+------------------------------------+------+--------+
|1231231235555555557|202009|20200928|
|1231231235555555557|202009|20200928|
|1231231235555555557|202009|20200928|
spark-sql (default)> show create table test2.test;
24/09/23 23:17:41 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
24/09/23 23:17:42 WARN ObjectStore: Failed to get database test2, returning NoSuchObjectException
[TABLE_OR_VIEW_NOT_FOUND] The table or view `test2`.`test` cannot be found. Verify the spelling and correctness of the schema and catalog.
If you did not qualify the name with a schema, verify the current_schema() output, or qualify the name with the correct schema and catalog.
To tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF EXISTS.; line 1 pos 18;
'ShowCreateTable false, [createtab_stmt#3]
+- 'UnresolvedTableOrView [test2, test], SHOW CREATE TABLE, false

FANNG1 (Contributor) commented Sep 23, 2024

First of all, the spark.bypass.spark.sql.hive.metastore.jars setting is already set to builtin.

spark.bypass.spark.sql.hive.metastore.jars should be set to a value other than builtin, for example:

        "spark.bypass.spark.sql.hive.metastore.jars":"path",
        "spark.bypass.spark.sql.hive.metastore.jars.path":"file:///xx/hive/lib/*"

The root cause is that hive.metastore.uris was set to the test2 Hive metastore in hive-site.xml. I deleted that setting and confirmed that queries work normally with spark-shell.

Is hive.metastore.uris marked as a final setting in hive-site.xml? (See the hypothetical snippet below for what that looks like.)
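A final setting in hive-site.xml would look roughly like the following hypothetical snippet; when a property is marked final, client-side or per-catalog overrides of the metastore URI are ignored:

<!-- hypothetical hive-site.xml entry; the URI value is a placeholder -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://test2:9083</value>
  <!-- "final" prevents jobs and catalogs from overriding this value -->
  <final>true</final>
</property>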
