[opt](hive) support orc generated from hive 1.x for all file scan node #28806

Merged (3 commits) on Jan 6, 2024

@morningman morningman commented Dec 21, 2023

Proposed changes

Previously, ORC files generated by Hive 1.x were only handled in HiveScanNode,
and only if the user explicitly set "hive.version" = "1.1.0".

This handling should apply to all kinds of FileScanNode,
with no need to set hive.version = 1.1.0 explicitly.
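
As a rough illustration of what detecting such files without a version setting could look like, the sketch below treats a file as Hive 1.x style when every column name matches the `_colN` pattern, and then pairs table columns with file columns by position. The function names (`looks_like_hive1_name`, `all_hive1_names`, `map_table_to_file_columns`), the "every column must match" rule, and the positional mapping are all assumptions made for illustration; they are not taken from the PR's code.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: true for "_col" followed by one or more decimal digits.
static bool looks_like_hive1_name(const std::string& name) {
    const std::size_t prefix_len = 4; // length of "_col"
    if (name.size() <= prefix_len || name.compare(0, prefix_len, "_col") != 0) {
        return false;
    }
    return name.find_first_not_of("0123456789", prefix_len) == std::string::npos;
}

// Assumption: a file counts as Hive 1.x style only if it has at least one
// column and every column name is a placeholder such as _col0, _col1, ...
static bool all_hive1_names(const std::vector<std::string>& file_cols) {
    if (file_cols.empty()) {
        return false;
    }
    return std::all_of(file_cols.begin(), file_cols.end(), looks_like_hive1_name);
}

// Returns a map from table column name to the file column it should read.
static std::unordered_map<std::string, std::string> map_table_to_file_columns(
        const std::vector<std::string>& table_cols,
        const std::vector<std::string>& file_cols) {
    std::unordered_map<std::string, std::string> mapping;
    if (all_hive1_names(file_cols)) {
        // Placeholder names carry no meaning, so pair columns by position.
        const std::size_t n = std::min(table_cols.size(), file_cols.size());
        for (std::size_t i = 0; i < n; ++i) {
            mapping[table_cols[i]] = file_cols[i];
        }
        return mapping;
    }
    // Otherwise match by name and skip table columns missing from the file.
    for (const auto& col : table_cols) {
        if (std::find(file_cols.begin(), file_cols.end(), col) != file_cols.end()) {
            mapping[col] = col;
        }
    }
    return mapping;
}
```

With table columns `id, name` and file columns `_col0, _col1`, this sketch maps `id` to `_col0` and `name` to `_col1`; with real names in the file it falls back to matching by name.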

In this PR,


@morningman
Contributor Author

run buildall

@morningman morningman changed the title [opt] support orc generated from hive 1.x for all file scan node [opt](hive) support orc generated from hive 1.x for all file scan node Dec 21, 2023
@doris-robot

(From new machine) TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 43.93 seconds
stream load tsv: 567 seconds loaded 74807831229 Bytes, about 125 MB/s
stream load json: 19 seconds loaded 2358488459 Bytes, about 118 MB/s
stream load orc: 66 seconds loaded 1101869774 Bytes, about 15 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.8 seconds inserted 10000000 Rows, about 347K ops/s
storage size: 17183811648 Bytes

@xiaokang xiaokang added the usercase Important user case type label label Dec 21, 2023
@doris-robot

TPC-H test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'

Tpch sf100 test result on commit 40bd7aabffb9b2beeac7b35485a0b3c32a39f6fd, data reload: false

run tpch-sf100 query with default conf and session variables
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	4722	4436	4419	4419
q2	365	152	158	152
q3	1491	1222	1236	1222
q4	1119	958	918	918
q5	3139	3188	3161	3161
q6	248	130	128	128
q7	989	486	477	477
q8	2209	2237	2213	2213
q9	6699	6658	6707	6658
q10	3229	3278	3289	3278
q11	308	181	185	181
q12	361	210	211	210
q13	4596	3803	3788	3788
q14	245	212	211	211
q15	563	528	521	521
q16	441	385	386	385
q17	1015	589	535	535
q18	7212	6991	6977	6977
q19	1520	1377	1405	1377
q20	544	308	313	308
q21	3056	2636	2696	2636
q22	347	273	282	273
Total cold run time: 44418 ms
Total hot run time: 40028 ms
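
The two totals can be reproduced from the rows. Judging from the numbers, the first numeric column is the cold run, the next two are hot runs, and the last is the smaller of the two hot runs; the cold total sums the first column and the hot total sums the last. The sketch below encodes that reading. The column interpretation is inferred from the data rather than stated by the bot, and the type and function names (`QueryTiming`, `Totals`, `sum_timings`) are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One benchmark row: a cold run followed by two hot runs, all in milliseconds.
struct QueryTiming {
    std::int64_t cold_ms;
    std::int64_t hot1_ms;
    std::int64_t hot2_ms;

    // Inferred meaning of the last column: the faster of the two hot runs.
    std::int64_t best_hot_ms() const { return std::min(hot1_ms, hot2_ms); }
};

struct Totals {
    std::int64_t cold_ms = 0;
    std::int64_t hot_ms = 0;
};

// Sums the cold runs and the best hot runs over all rows.
static Totals sum_timings(const std::vector<QueryTiming>& rows) {
    Totals totals;
    for (const auto& row : rows) {
        totals.cold_ms += row.cold_ms;
        totals.hot_ms += row.best_hot_ms();
    }
    return totals;
}
```

Feeding all 22 rows of the table above into `sum_timings` gives 44418 ms cold and 40028 ms hot, matching the reported totals.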

run tpch-sf100 query with default conf and set session variable runtime_filter_mode=off
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	4354	4320	4315	4315
q2	269	163	175	163
q3	3535	3519	3520	3519
q4	2382	2372	2371	2371
q5	5757	5763	5755	5755
q6	244	123	125	123
q7	2390	1891	1873	1873
q8	3530	3528	3535	3528
q9	9015	9044	8979	8979
q10	3887	4003	3996	3996
q11	495	368	370	368
q12	761	593	604	593
q13	4273	3538	3570	3538
q14	285	253	266	253
q15	573	520	523	520
q16	516	472	467	467
q17	1885	1846	1824	1824
q18	8606	8275	8263	8263
q19	1730	1745	1726	1726
q20	2256	1949	1935	1935
q21	6513	6136	6170	6136
q22	502	417	442	417
Total cold run time: 63758 ms
Total hot run time: 60662 ms

@morningman morningman removed the usercase Important user case type label label Dec 22, 2023
@morningman
Contributor Author

run buildall

Contributor

@github-actions github-actions bot left a comment

clang-tidy made some suggestions

@@ -483,6 +484,19 @@ class OrcReader : public GenericReader {
int64_t get_remaining_rows() { return _remaining_rows; }
void set_remaining_rows(int64_t rows) { _remaining_rows = rows; }

// check if the given name is like _col0, _col1, ...
bool inline _is_hive1_col_name(const std::string& name) {

warning: method '_is_hive1_col_name' can be made static [readability-convert-member-functions-to-static]

Suggested change
- bool inline _is_hive1_col_name(const std::string& name) {
+ static bool inline _is_hive1_col_name(const std::string& name) {
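
The diff shows only the signature and the comment "check if the given name is like _col0, _col1, ...", not the body. A minimal sketch of a check with that behavior, written as a free function that needs no object state (which is why clang-tidy suggests `static` for the member version), might look like this. The exact rule used here (the prefix `_col` followed by at least one digit and nothing else) and the name `is_hive1_col_name` are assumptions, not the PR's implementation.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Returns true if `name` is "_col" followed by one or more decimal digits,
// for example _col0 or _col12. This is an assumed reading of "like _col0, _col1".
static inline bool is_hive1_col_name(const std::string& name) {
    static const std::string kPrefix = "_col";
    if (name.size() <= kPrefix.size()) {
        return false; // at least one digit must follow the prefix
    }
    if (name.compare(0, kPrefix.size(), kPrefix) != 0) {
        return false; // wrong prefix
    }
    for (std::size_t i = kPrefix.size(); i < name.size(); ++i) {
        // Cast to unsigned char before isdigit to avoid undefined behavior
        // on negative char values.
        if (!std::isdigit(static_cast<unsigned char>(name[i]))) {
            return false;
        }
    }
    return true;
}
```

Because the function reads only its argument, it can be called without an `OrcReader` instance, which also makes it straightforward to unit test.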

@doris-robot

TPC-H test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G', run with scripts in https://github.com/apache/doris/tree/master/tools/tpch-tools

Tpch sf100 test result on commit c7019d1674e884937e5d03a35ab5a15de8cea5ce, data reload: false

------ Round 1 ----------------------------------
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	17646	5183	5099	5099
q2	2024	154	138	138
q3	10534	1091	1076	1076
q4	10176	782	795	782
q5	7789	2959	2883	2883
q6	204	132	132	132
q7	909	539	517	517
q8	9301	2003	2017	2003
q9	6803	6379	6367	6367
q10	8253	3034	3001	3001
q11	421	201	227	201
q12	384	230	231	230
q13	18000	3607	3631	3607
q14	246	217	202	202
q15	593	542	527	527
q16	445	408	384	384
q17	958	480	524	480
q18	7254	6584	6573	6573
q19	1575	1334	1337	1334
q20	730	355	346	346
q21	2786	2400	2414	2400
q22	376	329	317	317
Total cold run time: 107407 ms
Total hot run time: 38599 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	5090	5129	5056	5056
q2	332	246	241	241
q3	3301	3279	3240	3240
q4	2090	2015	2009	2009
q5	5810	5742	5756	5742
q6	213	124	125	124
q7	2288	1903	1919	1903
q8	3366	3429	3443	3429
q9	8779	8730	8703	8703
q10	3775	3825	3804	3804
q11	571	485	499	485
q12	820	646	674	646
q13	9556	3201	3216	3201
q14	288	280	264	264
q15	597	539	525	525
q16	543	523	513	513
q17	1927	1745	1712	1712
q18	8650	8327	8322	8322
q19	1621	1620	1628	1620
q20	2215	1955	1946	1946
q21	5527	5255	5180	5180
q22	526	483	477	477
Total cold run time: 67885 ms
Total hot run time: 59142 ms

@doris-robot

TPC-H test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G', run with scripts in https://github.com/apache/doris/tree/master/tools/tpch-tools

Tpch sf100 test result on commit c7019d1674e884937e5d03a35ab5a15de8cea5ce, data reload: false

run tpch-sf100 query with default conf and session variables
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	5509	5158	5170	5158
q2	390	173	159	159
q3	1463	1197	1154	1154
q4	1092	839	821	821
q5	3132	3100	3079	3079
q6	225	133	140	133
q7	963	570	530	530
q8	2153	2272	2238	2238
q9	6661	6658	6625	6625
q10	3168	3103	3160	3103
q11	347	222	206	206
q12	380	236	236	236
q13	4400	3642	3622	3622
q14	254	219	221	219
q15	625	551	552	551
q16	453	420	421	420
q17	1032	565	537	537
q18	7087	6674	6781	6674
q19	1654	1495	1449	1449
q20	577	323	329	323
q21	2893	2407	2459	2407
q22	395	332	323	323
Total cold run time: 44853 ms
Total hot run time: 39967 ms

run tpch-sf100 query with default conf and set session variable runtime_filter_mode=off
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	5143	5058	5167	5058
q2	340	253	254	253
q3	3386	3328	3303	3303
q4	2141	1995	1992	1992
q5	5965	5920	5907	5907
q6	228	130	122	122
q7	2395	1958	1908	1908
q8	3564	3665	3686	3665
q9	9020	8989	8964	8964
q10	3895	3874	3885	3874
q11	575	470	476	470
q12	801	640	691	640
q13	3873	3190	3170	3170
q14	294	284	280	280
q15	622	555	542	542
q16	569	533	524	524
q17	2027	1806	1786	1786
q18	8800	8222	9262	8222
q19	1747	1679	1679	1679
q20	2290	2007	1975	1975
q21	5786	5357	5397	5357
q22	566	509	501	501
Total cold run time: 64027 ms
Total hot run time: 60192 ms

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 36.62% (8614/23521)
Line Coverage: 28.69% (70019/244090)
Region Coverage: 27.66% (36232/130968)
Branch Coverage: 24.37% (18519/75980)
Coverage Report: http://coverage.selectdb-in.cc/coverage/c7019d1674e884937e5d03a35ab5a15de8cea5ce_c7019d1674e884937e5d03a35ab5a15de8cea5ce/report/index.html

@doris-robot

TPC-DS test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G', run with scripts in https://github.com/apache/doris/tree/master/tools/tpcds-tools

TPC-DS sf100 test result on commit c7019d1674e884937e5d03a35ab5a15de8cea5ce, data reload: false

run tpcds-sf100 query with default conf and session variables
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
query1	935	354	339	339
query2	6455	1869	1833	1833
query3	6645	208	202	202
query4	27541	22386	22437	22386
query5	5590	521	544	521
query6	272	182	182	182
query7	4580	282	257	257
query8	239	194	197	194
query9	8244	2568	2506	2506
query10	436	245	258	245
query11	16385	15653	15626	15626
query12	131	81	78	78
query13	1647	321	336	321
query14	11909	7143	7195	7143
query15	238	190	189	189
query16	6420	264	269	264
query17	1836	494	494	494
query18	1924	267	272	267
query19	275	143	140	140
query20	85	77	81	77
query21	193	97	94	94
query22	5136	4891	4356	4356
query23	31933	31165	31285	31165
query24	11760	2790	2811	2790
query25	581	345	343	343
query26	1711	143	144	143
query27	2832	281	273	273
query28	7038	1947	1938	1938
query29	2053	397	405	397
query30	286	150	147	147
query31	963	775	768	768
query32	87	60	61	60
query33	737	281	268	268
query34	846	451	434	434
query35	874	773	689	689
query36	1314	1262	1269	1262
query37	108	76	79	76
query38	3354	3313	3298	3298
query39	1332	1278	1277	1277
query40	307	94	91	91
query41	37	35	34	34
query42	99	88	96	88
query43	525	510	505	505
query44	1062	700	713	700
query45	200	181	181	181
query46	1073	648	656	648
query47	1710	1575	1574	1574
query48	337	262	265	262
query49	1213	325	321	321
query50	753	328	358	328
query51	5304	5217	5281	5217
query52	95	91	88	88
query53	218	154	163	154
query54	1358	580	590	580
query55	103	96	87	87
query56	210	201	205	201
query57	1042	944	972	944
query58	231	206	207	206
query59	2909	2613	2637	2613
query60	256	239	246	239
query61	86	84	84	84
query62	666	454	503	454
query63	165	151	153	151
query64	5909	1721	1706	1706
query65	3336	3276	3233	3233
query66	1289	345	336	336
query67	15813	15286	15592	15286
query68	12681	538	516	516
query69	533	253	252	252
query70	1648	1549	1486	1486
query71	495	234	230	230
query72	5602	3574	3574	3574
query73	2957	315	312	312
query74	6952	6401	6459	6401
query75	5276	2251	2286	2251
query76	6324	1105	1141	1105
query77	663	301	292	292
query78	9118	8863	8604	8604
query79	1084	511	500	500
query80	625	375	372	372
query81	444	210	215	210
query82	220	105	98	98
query83	164	134	134	134
query84	246	62	51	51
query85	920	280	276	276
query86	383	361	389	361
query87	3574	3401	3328	3328
query88	3120	2238	2242	2238
query89	339	266	255	255
query90	1854	209	219	209
query91	120	88	97	88
query92	62	53	55	53
query93	1798	494	442	442
query94	795	188	187	187
query95	465	416	409	409
query96	632	319	316	316
query97	4295	4157	4169	4157
query98	208	204	190	190
query99	1140	868	815	815
Total cold run time: 295968 ms
Total hot run time: 179065 ms

@doris-robot

(From new machine) TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 47.42 seconds
stream load tsv: 582 seconds loaded 74807831229 Bytes, about 122 MB/s
stream load json: 19 seconds loaded 2358488459 Bytes, about 118 MB/s
stream load orc: 66 seconds loaded 1101869774 Bytes, about 15 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.3 seconds inserted 10000000 Rows, about 353K ops/s
storage size: 17183853552 Bytes

@morningman
Contributor Author

run buildall

Contributor

github-actions bot commented Jan 1, 2024

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TPC-H test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G', run with scripts in https://github.com/apache/doris/tree/master/tools/tpch-tools

Tpch sf100 test result on commit 50463bf5fe9e18828a52514ebcd7a8f0519aadcd, data reload: false

run tpch-sf100 query with default conf and session variables
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	5580	5153	5217	5153
q2	393	162	159	159
q3	1466	1208	1246	1208
q4	1101	846	848	846
q5	3105	2992	3153	2992
q6	248	140	133	133
q7	1006	549	509	509
q8	2150	2280	2242	2242
q9	6709	6635	6666	6635
q10	3178	3119	3156	3119
q11	349	216	220	216
q12	407	236	230	230
q13	4397	3665	3664	3664
q14	257	220	208	208
q15	605	540	579	540
q16	452	402	433	402
q17	1041	538	491	491
q18	7096	6806	6781	6781
q19	1650	1571	1443	1443
q20	588	358	762	358
q21	2935	2441	2502	2441
q22	383	320	327	320
Total cold run time: 45096 ms
Total hot run time: 40090 ms

run tpch-sf100 query with default conf and set session variable runtime_filter_mode=off
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	5086	5107	5059	5059
q2	331	259	238	238
q3	3375	3330	3300	3300
q4	2134	2007	1982	1982
q5	5943	5912	5928	5912
q6	228	127	128	127
q7	2369	1922	1939	1922
q8	3576	3635	3705	3635
q9	9097	8981	8969	8969
q10	3896	3906	3926	3906
q11	593	470	472	470
q12	790	641	672	641
q13	3882	3229	3164	3164
q14	297	266	277	266
q15	614	544	550	544
q16	537	513	533	513
q17	2032	1823	1835	1823
q18	8740	8330	8370	8330
q19	1742	1696	1657	1657
q20	2284	1996	1983	1983
q21	5732	5366	5326	5326
q22	560	510	522	510
Total cold run time: 63838 ms
Total hot run time: 60277 ms

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 36.62% (8612/23520)
Line Coverage: 28.67% (69993/244091)
Region Coverage: 27.66% (36231/130970)
Branch Coverage: 24.36% (18510/75982)
Coverage Report: http://coverage.selectdb-in.cc/coverage/50463bf5fe9e18828a52514ebcd7a8f0519aadcd_50463bf5fe9e18828a52514ebcd7a8f0519aadcd/report/index.html

@doris-robot

(From new machine) TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 47.26 seconds
stream load tsv: 563 seconds loaded 74807831229 Bytes, about 126 MB/s
stream load json: 19 seconds loaded 2358488459 Bytes, about 118 MB/s
stream load orc: 66 seconds loaded 1101869774 Bytes, about 15 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.1 seconds inserted 10000000 Rows, about 355K ops/s
storage size: 17183453965 Bytes

@doris-robot

TPC-H test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G', run with scripts in https://github.com/apache/doris/tree/master/tools/tpch-tools

Tpch sf100 test result on commit 50463bf5fe9e18828a52514ebcd7a8f0519aadcd, data reload: false

------ Round 1 ----------------------------------
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	17645	5089	5075	5075
q2	2015	153	142	142
q3	10550	1109	1183	1109
q4	10188	780	819	780
q5	7796	2955	2965	2955
q6	211	134	134	134
q7	922	545	491	491
q8	9257	1959	1996	1959
q9	6865	6423	6353	6353
q10	8431	3022	3074	3022
q11	427	206	215	206
q12	380	231	248	231
q13	18003	3647	3596	3596
q14	245	215	213	213
q15	582	539	519	519
q16	453	403	407	403
q17	953	495	452	452
q18	7173	6674	6587	6587
q19	1581	1268	1359	1268
q20	748	340	345	340
q21	2764	2392	2394	2392
q22	378	329	335	329
Total cold run time: 107567 ms
Total hot run time: 38556 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	5127	5088	5053	5053
q2	343	252	251	251
q3	3297	3267	3230	3230
q4	2096	1980	1974	1974
q5	5780	5754	5729	5729
q6	215	124	120	120
q7	2307	1890	1889	1889
q8	3363	3442	3458	3442
q9	8796	8702	8679	8679
q10	3792	3796	3816	3796
q11	562	514	476	476
q12	798	632	642	632
q13	7460	3218	3192	3192
q14	295	269	263	263
q15	611	512	526	512
q16	572	495	493	493
q17	1937	1781	1749	1749
q18	8618	8317	8370	8317
q19	1621	1564	1595	1564
q20	2214	1960	1937	1937
q21	5644	5217	5251	5217
q22	574	496	470	470
Total cold run time: 66022 ms
Total hot run time: 58985 ms

@doris-robot

TPC-DS test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G', run with scripts in https://github.com/apache/doris/tree/master/tools/tpcds-tools

TPC-DS sf100 test result on commit 50463bf5fe9e18828a52514ebcd7a8f0519aadcd, data reload: false

run tpcds-sf100 query with default conf and session variables
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
query1	939	361	333	333
query2	6420	1962	1862	1862
query3	6648	201	201	201
query4	27027	22586	22208	22208
query5	4313	505	516	505
query6	263	181	178	178
query7	4577	273	267	267
query8	231	213	194	194
query9	8300	2680	2658	2658
query10	405	236	256	236
query11	16203	15439	15567	15439
query12	141	79	75	75
query13	1635	321	331	321
query14	11783	7058	7206	7058
query15	242	184	192	184
query16	6510	270	275	270
query17	1845	508	499	499
query18	1960	265	267	265
query19	247	145	140	140
query20	80	78	77	77
query21	185	101	92	92
query22	5151	4702	4577	4577
query23	31905	31351	31380	31351
query24	12033	2803	2833	2803
query25	590	352	359	352
query26	1705	139	142	139
query27	2911	275	276	275
query28	7108	1975	1964	1964
query29	2040	401	391	391
query30	287	142	149	142
query31	946	766	769	766
query32	86	58	57	57
query33	722	271	271	271
query34	882	439	452	439
query35	898	778	767	767
query36	1311	1209	1259	1209
query37	190	72	76	72
query38	3395	3329	3287	3287
query39	1318	1281	1288	1281
query40	312	91	90	90
query41	37	34	35	34
query42	96	94	91	91
query43	499	472	502	472
query44	1137	704	714	704
query45	198	185	181	181
query46	1072	640	645	640
query47	1690	1526	1561	1526
query48	333	255	256	255
query49	1214	334	316	316
query50	743	327	325	325
query51	5270	5244	5256	5244
query52	91	91	85	85
query53	207	146	144	144
query54	1379	549	587	549
query55	97	87	84	84
query56	206	201	205	201
query57	1008	944	960	944
query58	224	195	207	195
query59	2817	2553	2579	2553
query60	245	234	228	228
query61	90	88	88	88
query62	645	452	435	435
query63	164	147	149	147
query64	5894	1752	1757	1752
query65	3341	3263	3271	3263
query66	1386	335	335	335
query67	15629	15243	15406	15243
query68	12590	540	534	534
query69	524	251	248	248
query70	1746	1558	1540	1540
query71	491	229	217	217
query72	5745	3641	3609	3609
query73	2944	308	310	308
query74	7074	6371	6427	6371
query75	5230	2262	2257	2257
query76	6343	1140	1158	1140
query77	651	276	293	276
query78	9087	8697	8587	8587
query79	1023	508	509	508
query80	551	370	362	362
query81	459	207	207	207
query82	204	102	104	102
query83	161	136	134	134
query84	249	54	55	54
query85	970	300	285	285
query86	388	382	389	382
query87	3549	3393	3367	3367
query88	3071	2246	2243	2243
query89	333	257	254	254
query90	1931	208	219	208
query91	121	95	95	95
query92	59	53	52	52
query93	1460	505	441	441
query94	799	195	188	188
query95	457	418	406	406
query96	630	313	313	313
query97	4316	4170	4182	4170
query98	213	201	187	187
query99	1078	866	827	827
Total cold run time: 293903 ms
Total hot run time: 179201 ms

Contributor

@kaka11chen kaka11chen left a comment

LGTM

Contributor

github-actions bot commented Jan 2, 2024

PR approved by anyone and no changes requested.

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Jan 4, 2024
Contributor

github-actions bot commented Jan 4, 2024

PR approved by at least one committer and no changes requested.

@yiguolei yiguolei merged commit 2adb0fc into apache:master Jan 6, 2024
30 of 31 checks passed
HappenLee pushed a commit to HappenLee/incubator-doris that referenced this pull request Jan 12, 2024
Labels
approved Indicates a PR has been approved by one committer. dev/2.0.4 dev/3.0.0-merged reviewed