HIVE-28095: Hive Query History #5613

Merged on Feb 12, 2025 (17 commits)

@abstractdog (Contributor) commented Jan 16, 2025

What changes were proposed in this pull request?

This change introduces the Hive Query History service. The service runs in HS2 and stores historical data about queries in an Iceberg table (the implementation / table type is pluggable).

  1. Introduced the core classes that implement the service: QueryHistoryService, Record, Schema, QueryHistoryRepository, etc.
  2. Implemented and tested 2 different persistence strategies, one based on batch size and one on memory consumption: records are written out once at least x records are held in memory or once the records held in memory consume approximately y bytes. This avoids very small files while keeping HS2 from running out of memory in case of huge SQL strings / plans / execution summaries
  3. Introduced unit tests for the service; affected classes: TestIcebergRepository, TestQueryHistoryRecord, TestQueryHistoryService, TestSemanticAnalyzer
  4. HiveConfForTest: disabled the service by default for all unit tests that use this configuration class (which is optimized for testing)
  5. HiveIcebergStorageHandler: extended storageHandlerCommit to accept a task attempt context, in which case it commits. This is needed because, unlike in the tez/llap runtime, we write and commit records immediately in HS2
  6. Disabled the history service one by one in the unit test classes where it caused problems (not product issues, only testing problems)
  7. added hive-iceberg-handler test scope dependency to be able to test from hive-unit (query_history table is iceberg by default)
  8. Introduced QTestQueryHistoryHandler, by which the service can be enabled for qtests, also query_history.q qtest added for basic coverage
  9. Changed MiniHS2 Builder to receive the query history service
  10. Added option to StartMiniHS2Cluster to enable the service for MiniHS2 tests, see details in testing section
  11. ql package changes to integrate the service to hive core:
    11a: Compiler: set query type after analysis when the full plan is available
    11b: Driver: hook in the history service; this is the entry point for handling records (QueryHistoryService.handle(driverContext))
    11c: DriverContext: several changes; DriverContext is the most suitable object to maintain and encapsulate all query-related data, so it was improved to fulfill this requirement
    11d: QueryInfo: this object is used extensively in this service and elsewhere, so a convenience creator method was introduced to prevent null QueryInfos in many places in the Hive code
    11e: QueryProperties: holds queryType and ddlType fields, assembled by the semantic analyzer
    11f: TezTask: takes care of filling the RuntimeContext, which is also an information source for the query history; also uses the console object in TezJobMonitor to capture the counters string in the query history
    11g: TezJobMonitor: collects the summary for the exec_summary field of query_history
  12. Hive: implements AutoCloseable; this is used in the QueryHistoryService and makes the ancient Hive object more Java-like :)
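
To illustrate the two persistence strategies from item 2, here is a minimal sketch of a buffer that flushes when either a record-count threshold or an approximate byte threshold is reached. It is not the actual QueryHistoryService code: the class name `BufferedHistoryWriter`, its constructor parameters, the `Consumer` sink and the UTF-8 size estimate are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch, not Hive code: buffers records and hands them to a sink in batches.
final class BufferedHistoryWriter {
  private final int maxRecords;               // flush once at least this many records are buffered
  private final long maxBytes;                // ...or once the estimated buffered size reaches this
  private final Consumer<List<String>> sink;  // stands in for whatever persists a batch
  private final List<String> buffer = new ArrayList<>();
  private long estimatedBytes = 0;

  BufferedHistoryWriter(int maxRecords, long maxBytes, Consumer<List<String>> sink) {
    this.maxRecords = maxRecords;
    this.maxBytes = maxBytes;
    this.sink = sink;
  }

  // Adds one record and flushes if either threshold is reached.
  synchronized void add(String record) {
    buffer.add(record);
    // Assumption: the UTF-8 length is a good enough approximation of memory use.
    estimatedBytes += record.getBytes(StandardCharsets.UTF_8).length;
    if (buffer.size() >= maxRecords || estimatedBytes >= maxBytes) {
      flush();
    }
  }

  // Writes out everything currently buffered as a single batch; no-op when empty.
  synchronized void flush() {
    if (buffer.isEmpty()) {
      return;
    }
    sink.accept(new ArrayList<>(buffer));
    buffer.clear();
    estimatedBytes = 0;
  }
}
```

The count threshold keeps batches large enough to avoid many tiny files, while the byte threshold caps memory use when individual records (SQL text, plans, summaries) are very large.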

Why are the changes needed?

Because this is a cool thing. Query history provides useful information about finished queries on all levels: user, support, devs.

Does this PR introduce any user-facing change?

Yes. Because the history service is enabled by default, the query_history table appears in the sys database after this change. Both the sys database and the query_history table are created automatically on HS2 startup.

Is the change a dependency upgrade?

No.

How was this patch tested?

  1. Unit tests added for almost the complete lifecycle of saving and persisting a query history record
mvn install -Dtest.groups= -DfailIfNoTests=true -Dtest.output.overwrite=true -Pitests,iceberg -Denforcer.skip=true -nsu -Dtest=TestIcebergRepository,TestQueryHistoryRecord,TestQueryHistorySchema,TestQueryHistoryService -pl ./itests/hive-unit
mvn install -Dtest.groups= -DfailIfNoTests=true -Dtest.output.overwrite=true -Pitests,iceberg -Denforcer.skip=true -nsu -Dtest=TestQueryProperties,TestSemanticAnalyzer -pl ql
mvn install -Dtest.groups= -DfailIfNoTests=true -Dtest.output.overwrite=true -Pitests,iceberg -Denforcer.skip=true -nsu -pl itests/qtest -Dtest=TestMiniLlapLocalCliDriver -Dqfile=query_history.q
  2. Improved MiniHS2 so the service can be tried out locally (see -DminiHS2.queryHistory); this is how I use it:
mvn clean install -Dtest=StartMiniHS2Cluster -DminiHS2.clusterType=llap -DminiHS2.conf="target/testconf/llap/hive-site.xml"  -DminiHS2.run=true -DminiHS2.usePortsFromConf=true -Dpackaging.minimizeJar=false -T 1C -DskipShade -Dremoteresources.skip=true -Dmaven.javadoc.skip=true -Denforcer.skip=true -pl itests/hive-unit -pl itests/util -Pitests -nsu -DminiHS2.queryHistory

and then:

0: jdbc:hive2://localhost:10000/default> show tables in sys;

+----------------+
|    tab_name    |
+----------------+
| query_history  |
+----------------+

DESCRIBE FORMATTED sys.query_history:
https://issues.apache.org/jira/secure/attachment/13074633/describe_formatted_query_history.log

or:

set hive.fetch.task.conversion=none;
select query_id, session_id, tez_dag_id from sys.query_history;

+------------------------------------------------------------------+---------------------------------------+---------------------------+
|                             query_id                             |              session_id               |        tez_dag_id         |
+------------------------------------------------------------------+---------------------------------------+---------------------------+
| laszlobodor_20250116014234_67f70be0-34dc-4826-b50f-f19488110f1e  | 39e3babf-8990-451e-9636-ac4e5152734b  | dag_1737020261367_0001_1  |
| laszlobodor_20250116014039_02490436-32d8-4d30-9d1a-927ce2a611e5  | 1e3dac96-294f-4750-9213-4e5771caf97e  | NULL                      |
| laszlobodor_20250116013834_77c4054f-5237-4076-b982-1a4a2ab9f3a8  | 1e3dac96-294f-4750-9213-4e5771caf97e  | NULL                      |
| laszlobodor_20250116014304_5cfbdf45-ef45-4512-a49c-8a9cec996904  | 39e3babf-8990-451e-9636-ac4e5152734b  | NULL                      |
| laszlobodor_20250116014226_2292d871-fffc-4a6b-aadc-97705569efc8  | 39e3babf-8990-451e-9636-ac4e5152734b  | NULL                      |
| laszlobodor_20250116014230_72ef549d-9091-4608-b655-99121c733e9e  | 39e3babf-8990-451e-9636-ac4e5152734b  | NULL                      |
+------------------------------------------------------------------+---------------------------------------+---------------------------+

  3. Manual testing:
    3a: ran many sanity tests on Cloudera Data Warehouse: enabled the query history while staging data for the tpcds tables and while running the tpcds tests
    3b: tested that in-memory records are persisted according to the different persistence strategies (records limit, memory limit)
    3c: tested that in-memory records are persisted when HS2 shuts down
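
The shutdown behavior tested above can be pictured with the following sketch, which drains pending records in `close()` and registers that method as a JVM shutdown hook. How HS2 actually triggers the final flush is not shown in this PR description, so the class `InMemoryHistory`, its methods and the in-memory `persisted` list are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch, not Hive's QueryHistoryService.
final class InMemoryHistory implements AutoCloseable {
  private final ConcurrentLinkedQueue<String> pending = new ConcurrentLinkedQueue<>();
  final List<String> persisted = new ArrayList<>(); // stands in for the backing table

  // Keeps a record in memory until the next flush.
  void record(String entry) {
    pending.add(entry);
  }

  // Drains everything still held in memory; safe to call more than once.
  @Override
  public synchronized void close() {
    String entry;
    while ((entry = pending.poll()) != null) {
      persisted.add(entry);
    }
  }

  // Registers a JVM shutdown hook so that a normal process exit triggers close().
  void flushOnShutdown() {
    Runtime.getRuntime().addShutdownHook(new Thread(this::close, "history-flush"));
  }
}
```

A shutdown hook only runs on an orderly JVM exit, so records buffered at the time of a hard kill would still be lost in a design like this one.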

@abstractdog changed the title from "[DRAFT] HIVE-28095: Hive Query History" to "HIVE-28095: Hive Query History" on Jan 16, 2025
@InvisibleProgrammer
Contributor

Why is it turned on by default?

@abstractdog
Contributor Author

> Why is it turned on by default?

It's a new feature that works out of the box, so I thought it was eligible to be turned on by default. Please let me know if you think otherwise.
In general, I agree that keeping new features turned off by default can reduce the risk of bumping into early failures.

@deniskuzZ
Member

deniskuzZ commented Jan 21, 2025

What is the use case for that service? Can't I check the query history in HUE or DAS (removed for some storage reason), etc.?
Please take a look at #5319, which is being worked on by rtrivedi12.
I think it provides some extra details for active queries.
cc @nrg4878

@abstractdog
Contributor Author

> What is the use case for that service? Can't I check the query history in HUE or DAS (removed for some storage reason), etc.? Please take a look at #5319, which is being worked on by rtrivedi12. I think it provides some extra details for active queries. cc @nrg4878

It looks like #5319 is completely different: it uses the well-known SHOW PROCESSLIST for live queries (live == recent == present in HS2 memory), whereas the Query History Service is meant to be a scalable historical query service, scalable in the sense that it uses the Iceberg table format.

HUE/DAS might work from different sources, like the protobuf history, whose data source is also created by a query hook. This service aims to redesign the way the data is persisted, while trying to use the same or similar field names that have already been implemented by Impala.

The current HiveProtoLoggingHook contains a lot of logic about storage details (e.g. rolling over files), which makes it look a bit less modern than, for example, the Iceberg format. With Iceberg, the history service benefits (in terms of performance, for instance) from everything we achieve by integrating Iceberg into our product.

@abstractdog
Contributor Author

precommit is green, about to merge soon

@abstractdog merged commit 242ba8c into apache:master on Feb 12, 2025
6 checks passed