HIVE-28095: Hive Query History #5613

abstractdog · 2025-01-16T08:42:47Z

What changes were proposed in this pull request?

This change introduces the Hive Query History service. Technically, the service runs in HS2 and stores historical data about queries in an Iceberg table (the implementation / table type is pluggable).

1. Introduced the core classes that implement the service: QueryHistoryService, Record, Schema, QueryHistoryRepository, etc.
2. Implemented and tested 2 different persistence strategies, based on batch size and on memory consumption: records are written out once at least x records are held in memory or the records held in memory consume approximately y bytes. This avoids very small files while keeping HS2 from running out of memory in case of huge SQL strings / plans / execution summaries.
3. Introduced unit tests for the service, affected classes: TestIcebergRepository, TestQueryHistoryRecord, TestQueryHistoryService, TestSemanticAnalyzer
4. HiveConfForTest: disabled the service by default for all unit tests that use this configuration class (optimized for testing)
5. HiveIcebergStorageHandler: extended storageHandlerCommit to receive a task attempt context, in which case it commits; this is needed because, as opposed to the tez/llap runtime, we write and commit records immediately in HS2
6. Disabled the history service one-by-one in unit test classes where it caused problems (not product issues, only testing problems)
7. Added hive-iceberg-handler as a test scope dependency to be able to test from hive-unit (the query_history table is Iceberg by default)
8. Introduced QTestQueryHistoryHandler, by which the service can be enabled for qtests; also added the query_history.q qtest for basic coverage
9. Changed MiniHS2 Builder to receive the query history service
10. Added an option to StartMiniHS2Cluster to enable the service for MiniHS2 tests, see details in the testing section
11. ql package changes to integrate the service into hive core:
11a: Compiler: set the query type after analysis, when the full plan is available
11b: Driver: hook the history service, this is the entry point of handling records (QueryHistoryService.handle(driverContext))
11c: DriverContext: several changes, as DriverContext is the most usable object to maintain and encapsulate full query related data, hence it was improved to fulfill this requirement
11d: QueryInfo: this object is extensively used in this service and other parts, so introduced a convenience creator method to prevent null QueryInfos in many places of the hive code
11e: QueryProperties: holds the queryType and ddlType fields, assembled by the semantic analyzer
11f: TezTask: takes care of filling the RuntimeContext, which is an info source of the QueryHistory too; also uses the console object in TezJobMonitor to be able to capture the counters string in the query history
11g: TezJobMonitor: collects the summary for the exec_summary field of query_history
12. Hive: implement AutoCloseable: this is used in the QueryHistoryService and makes the ancient Hive object more Java-like :)
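
To illustrate the two persistence strategies described above (flush on record count, flush on approximate memory use), here is a minimal sketch. It is not the code from this PR: the class name BufferedHistoryWriter, its fields, the use of plain strings as records, and the UTF-8 byte count as a memory estimate are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Hypothetical buffer (not the PR's QueryHistoryService): keeps records in memory
 * and hands them to a sink once a record-count or estimated-byte threshold is hit.
 */
final class BufferedHistoryWriter {
  private final int maxRecords;               // "x records" threshold
  private final long maxBytes;                // "approximately y bytes" threshold
  private final Consumer<List<String>> sink;  // stands in for the repository write
  private final List<String> buffer = new ArrayList<>();
  private long estimatedBytes;

  BufferedHistoryWriter(int maxRecords, long maxBytes, Consumer<List<String>> sink) {
    this.maxRecords = maxRecords;
    this.maxBytes = maxBytes;
    this.sink = sink;
  }

  /** Buffers one record and flushes if either threshold is reached. */
  synchronized void add(String record) {
    buffer.add(record);
    // Rough size estimate (assumption): UTF-8 length of the record text.
    estimatedBytes += record.getBytes(StandardCharsets.UTF_8).length;
    if (buffer.size() >= maxRecords || estimatedBytes >= maxBytes) {
      flush();
    }
  }

  /** Writes out whatever is buffered; also meant to be called on shutdown. */
  synchronized void flush() {
    if (buffer.isEmpty()) {
      return;
    }
    sink.accept(new ArrayList<>(buffer)); // pass a copy so the sink owns its batch
    buffer.clear();
    estimatedBytes = 0;
  }

  /** Number of records currently held in memory. */
  synchronized int pending() {
    return buffer.size();
  }
}
```

The count threshold keeps batches large enough to avoid tiny files, while the byte threshold caps memory when single records are very large; calling flush() explicitly covers the shutdown case.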

Why are the changes needed?

Because this is a cool thing. Query history provides useful information about finished queries on all levels: user, support, devs.

Does this PR introduce any user-facing change?

Yes. After this change, as the history service is enabled by default, the query_history table appears in the sys database. Both the sys database and the query_history table are automatically created on HS2 startup.

Is the change a dependency upgrade?

No.

How was this patch tested?

Unit tests added for almost the complete lifecycle of saving and persisting a query history record

mvn install -Dtest.groups= -DfailIfNoTests=true -Dtest.output.overwrite=true -Pitests,iceberg -Denforcer.skip=true -nsu -Dtest=TestIcebergRepository,TestQueryHistoryRecord,TestQueryHistorySchema,TestQueryHistoryService -pl ./itests/hive-unit

mvn install -Dtest.groups= -DfailIfNoTests=true -Dtest.output.overwrite=true -Pitests,iceberg -Denforcer.skip=true -nsu -Dtest=TestQueryProperties,TestSemanticAnalyzer -pl ql

mvn install -Dtest.groups= -DfailIfNoTests=true -Dtest.output.overwrite=true -Pitests,iceberg -Denforcer.skip=true -nsu -pl itests/qtest -Dtest=TestMiniLlapLocalCliDriver -Dqfile=query_history.q

Improved MiniHS2 to make it easy to try this out, see -DminiHS2.queryHistory; this is how I use it locally:

mvn clean install -Dtest=StartMiniHS2Cluster -DminiHS2.clusterType=llap -DminiHS2.conf="target/testconf/llap/hive-site.xml"  -DminiHS2.run=true -DminiHS2.usePortsFromConf=true -Dpackaging.minimizeJar=false -T 1C -DskipShade -Dremoteresources.skip=true -Dmaven.javadoc.skip=true -Denforcer.skip=true -pl itests/hive-unit -pl itests/util -Pitests -nsu -DminiHS2.queryHistory

and then:

0: jdbc:hive2://localhost:10000/default> show tables in sys;

+----------------+
|    tab_name    |
+----------------+
| query_history  |
+----------------+

DESCRIBE FORMATTED sys.query_history:
https://issues.apache.org/jira/secure/attachment/13074633/describe_formatted_query_history.log

or:

set hive.fetch.task.conversion=none;
select query_id, session_id, tez_dag_id from sys.query_history;

+-----------------------------------------------------------------+---------------------------------------+---------------------------+
|                            query_id                             |              session_id               |        tez_dag_id         |
+-----------------------------------------------------------------+---------------------------------------+---------------------------+
| laszlobodor_20250116014234_67f70be0-34dc-4826-b50f-f19488110f1e | 39e3babf-8990-451e-9636-ac4e5152734b  | dag_1737020261367_0001_1  |
| laszlobodor_20250116014039_02490436-32d8-4d30-9d1a-927ce2a611e5 | 1e3dac96-294f-4750-9213-4e5771caf97e  | NULL                      |
| laszlobodor_20250116013834_77c4054f-5237-4076-b982-1a4a2ab9f3a8 | 1e3dac96-294f-4750-9213-4e5771caf97e  | NULL                      |
| laszlobodor_20250116014304_5cfbdf45-ef45-4512-a49c-8a9cec996904 | 39e3babf-8990-451e-9636-ac4e5152734b  | NULL                      |
| laszlobodor_20250116014226_2292d871-fffc-4a6b-aadc-97705569efc8 | 39e3babf-8990-451e-9636-ac4e5152734b  | NULL                      |
| laszlobodor_20250116014230_72ef549d-9091-4608-b655-99121c733e9e | 39e3babf-8990-451e-9636-ac4e5152734b  | NULL                      |
+-----------------------------------------------------------------+---------------------------------------+---------------------------+
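
The same query can also be issued programmatically. The sketch below uses only the standard java.sql API and assumes a Hive JDBC driver is on the classpath and that HS2 is reachable at the URL used in the session above. The class QueryHistoryReader and its method names are hypothetical; the column names and the fetch-task setting are taken from the example above.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/** Hypothetical helper that reads a few columns from sys.query_history over JDBC. */
final class QueryHistoryReader {

  /** Builds the query text; the limit is validated because it is concatenated in. */
  static String buildQuery(int limit) {
    if (limit <= 0) {
      throw new IllegalArgumentException("limit must be positive: " + limit);
    }
    return "select query_id, session_id, tez_dag_id from sys.query_history limit " + limit;
  }

  /** Prints one line per history record; needs a running HS2 and a JDBC driver. */
  static void printHistory(String jdbcUrl, int limit) throws SQLException {
    try (Connection conn = DriverManager.getConnection(jdbcUrl);
         Statement stmt = conn.createStatement()) {
      // Mirrors the session setting used in the beeline example above.
      stmt.execute("set hive.fetch.task.conversion=none");
      try (ResultSet rs = stmt.executeQuery(buildQuery(limit))) {
        while (rs.next()) {
          System.out.printf("%s | %s | %s%n",
              rs.getString("query_id"),
              rs.getString("session_id"),
              rs.getString("tez_dag_id")); // null for queries that ran no Tez DAG
        }
      }
    }
  }
}
```

A call such as QueryHistoryReader.printHistory("jdbc:hive2://localhost:10000/default", 10) would correspond to the beeline session shown above, under the stated assumptions.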

3a) ran a lot of sanity tests on Cloudera Data Warehouse: enabled the query history while I was staging data for the tpcds tables and while running the tpcds tests
3b) tested that records (in memory) are persisted according to different persistence strategies (records limit, memory limit)
3c) tested that records (in memory) are persisted when HS2 shuts down

InvisibleProgrammer · 2025-01-20T08:49:30Z

Why is it turned on by default?

abstractdog · 2025-01-20T08:59:17Z

Why is it turned on by default?

it's a new feature that works out of the box, so I thought it was eligible to be turned on by default; please let me know if you think otherwise
in general, I agree that keeping new features turned off by default can reduce the risk of bumping into early failures

deniskuzZ · 2025-01-21T10:22:30Z

What is the use case for that service? Can't I check the query history in HUE or DAS (removed for some storage reason), etc.?
Please take a look at #5319 which is being worked on by rtrivedi12
I think it provides some extra details for active queries
cc @nrg4878

abstractdog · 2025-01-21T10:51:45Z

What is the use case for that service? can't I check the query history in HUE or DAS (removed for some storage reason), etc Please take a look at #5319 which is being worked on by rtrivedi12 I think it provides some extra details for an active queries cc @nrg4878

looks like #5319 is completely different: it uses the well-known SHOW PROCESSLIST for live queries (live == recent == present in HS2 memory), whereas the Query History Service is meant to be a scalable historical query service, scalable in the sense that it uses the Iceberg table format

HUE/DAS might work from different sources, like the protobuf history, whose data source is also created by a query hook, but this service aims to redesign the way data is persisted while trying to use the same or similar field names that have already been implemented by Impala

the current HiveProtoLoggingHook contains a lot of storage-related detail (e.g. rolling over files and so on), which makes it look a bit less modern compared to e.g. the Iceberg format, with which we gain everything (performance, for instance) that we achieve by integrating Iceberg into our product

sonarqubecloud · 2025-02-11T22:42:20Z

Quality Gate passed

Issues
95 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

abstractdog · 2025-02-12T05:51:50Z

precommit is green, about to merge soon

asf-ci-hive added the tests pending label Jan 16, 2025

abstractdog changed the title ~~[DRAFT] HIVE-28095: Hive Query History~~ HIVE-28095: Hive Query History Jan 16, 2025

asf-ci-hive added the tests failed label and removed the tests pending label Jan 16, 2025

abstractdog force-pushed the HIVE-28095 branch from 22f1654 to bf75848 Compare January 16, 2025 16:23

asf-ci-hive added tests pending, tests passed and removed tests failed, tests pending labels Jan 16, 2025