HIVE-28095: Hive Query History #5613
Why is it turned on by default?
It's a new feature that works out of the box, so I thought it was eligible to be turned on by default. Please let me know if you think otherwise.
What is the use case for that service? Can't I check the query history in HUE or DAS (removed for some storage reason), etc.?
Looks like #5319 is completely different: it uses the well-known SHOW PROCESSLIST for live queries (live == recent == present in HS2 memory), whereas the Query History Service is meant to be a scalable historical query service, scalable in the sense that it uses the Iceberg table format.
HUE/DAS might work from different sources, like the protobuf history, whose data source is also created by a query hook. This service aims to redesign the way the data is persisted while trying to use the same or similar field names that have already been implemented by Impala.
The current HiveProtoLoggingHook contains a lot of storage-specific detail (e.g. rolling over files), which makes it look a bit less modern compared to e.g. the Iceberg format. With Iceberg we gain everything (in terms of performance, for instance) that we achieve by working on integrating Iceberg into our product.
Precommit is green, about to merge soon.
What changes were proposed in this pull request?
This change introduces the Hive Query History service. Technically, the service runs in HS2 and stores historical data about queries in an Iceberg table (the implementation / table type is pluggable).
11a: Compiler: set the query type after analysis, when the full plan is available
11b: Driver: hook in the history service; this is the entry point for handling records (QueryHistoryService.handle(driverContext))
11c: DriverContext: several changes, as DriverContext is the most suitable object to maintain and encapsulate all query-related data, so it was improved to fulfill this requirement
11d: QueryInfo: this object is used extensively in this service and elsewhere, so introduce a convenience creator method to prevent null QueryInfos in many places of the Hive code
11e: QueryProperties: holds the queryType and ddlType fields, assembled by the semantic analyzer
11f: TezTask: takes care of filling the RuntimeContext, which is also an information source for the query history, and uses the console object in TezJobMonitor to capture the counters string in the query history
11g: TezJobMonitor: collects the summary for the exec_summary field of query_history
Why are the changes needed?
Because this is a cool thing. Query history provides useful information about finished queries at all levels: users, support, and developers.
Does this PR introduce any user-facing change?
Yes. After this change, as the history service is enabled by default, the query_history table appears in the sys database. Both the sys database and the query_history table are automatically created on HS2 startup.
Is the change a dependency upgrade?
No.
How was this patch tested?
and then:
DESCRIBE FORMATTED sys.query_history:
https://issues.apache.org/jira/secure/attachment/13074633/describe_formatted_query_history.log
or:
3a) ran many sanity tests on Cloudera Data Warehouse: enabled the query history while staging data for the TPC-DS tables and running the TPC-DS tests
3b) tested that records (in memory) are persisted according to different persistence strategies (records limit, memory limit)
3c) tested that records (in memory) are persisted when HS2 shuts down
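To illustrate what 3b) and 3c) exercise, here is a minimal sketch of an in-memory record buffer that flushes when either a record-count limit or a memory limit is reached, and that can be flushed explicitly on shutdown. This is not the code from this PR: the class name BufferedHistoryWriter, its constructor parameters, and the use of plain strings as records are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Hypothetical buffer for query history records (illustration only,
 * not the actual Hive implementation).
 */
class BufferedHistoryWriter {
  private final int maxRecords;               // assumed "records limit" strategy
  private final long maxBytes;                // assumed "memory limit" strategy
  private final Consumer<List<String>> sink;  // stands in for the table writer
  private final List<String> buffer = new ArrayList<>();
  private long bufferedBytes = 0;

  BufferedHistoryWriter(int maxRecords, long maxBytes, Consumer<List<String>> sink) {
    this.maxRecords = maxRecords;
    this.maxBytes = maxBytes;
    this.sink = sink;
  }

  /** Buffers a record and flushes as soon as either limit is reached. */
  synchronized void add(String record) {
    buffer.add(record);
    bufferedBytes += record.getBytes(StandardCharsets.UTF_8).length;
    if (buffer.size() >= maxRecords || bufferedBytes >= maxBytes) {
      flush();
    }
  }

  /** Hands all buffered records to the sink; also meant to be called on shutdown. */
  synchronized void flush() {
    if (buffer.isEmpty()) {
      return;
    }
    sink.accept(new ArrayList<>(buffer)); // pass a copy so the sink owns its batch
    buffer.clear();
    bufferedBytes = 0;
  }

  /** Number of records still waiting in memory. */
  synchronized int pending() {
    return buffer.size();
  }
}
```

In a setup like this, a shutdown path would call flush() once before the process exits (for example from a shutdown hook), so that records still held in memory are not lost.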