[Enhance](fe) Iceberg table in HMS catalog supports broker scan #28107

Merged: 2 commits, Jan 3, 2024

Contributor

@WinkerDu WinkerDu commented Dec 7, 2023

Proposed changes

Issue Number: close #xxx

My organization uses the HMS catalog to accelerate lake queries. Since we have a custom distributed file system that is hard to integrate into FE / BE, we introduced HMS Catalog broker scan support (#24830) and implemented the custom distributed file system adaptation in the broker.

We want to expand the scope of use to Iceberg table scans in HMS Catalog. This PR introduces the broker-scan-related classes `IcebergBrokerIO`, `BrokerInputFile`, and `BrokerInputStream` for Iceberg table scans.
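
For readers unfamiliar with the pattern, here is a minimal sketch of how a seekable input stream can be layered on top of positioned reads, which is roughly the shape a broker-backed stream for Iceberg needs (Iceberg reads files through a seekable stream with `getPos` and `seek`). This is not the PR's `BrokerInputStream`: `PositionedReader`, `SeekableBrokerStream`, and `InMemoryReader` are hypothetical names, and the real broker RPC calls are replaced by an in-memory stand-in.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical stand-in for a broker client; the actual Doris broker API is not shown here.
interface PositionedReader {
    // Returns up to `length` bytes starting at `offset`; an empty array signals end of file.
    byte[] pread(long offset, int length) throws IOException;
}

// Illustrative seekable stream: every read is a positioned read at the current offset.
class SeekableBrokerStream extends InputStream {
    private final PositionedReader reader;
    private long pos = 0;

    SeekableBrokerStream(PositionedReader reader) {
        this.reader = reader;
    }

    long getPos() {
        return pos;
    }

    void seek(long newPos) {
        if (newPos < 0) {
            throw new IllegalArgumentException("negative position: " + newPos);
        }
        pos = newPos;
    }

    @Override
    public int read() throws IOException {
        byte[] one = reader.pread(pos, 1);
        if (one.length == 0) {
            return -1; // end of file
        }
        pos++;
        return one[0] & 0xFF;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        byte[] chunk = reader.pread(pos, len);
        if (chunk.length == 0) {
            return -1; // end of file
        }
        System.arraycopy(chunk, 0, buf, off, chunk.length);
        pos += chunk.length;
        return chunk.length;
    }
}

// In-memory reader used only to exercise the sketch without a real broker.
class InMemoryReader implements PositionedReader {
    private final byte[] data;

    InMemoryReader(byte[] data) {
        this.data = data;
    }

    @Override
    public byte[] pread(long offset, int length) {
        if (offset >= data.length) {
            return new byte[0];
        }
        int end = (int) Math.min(data.length, offset + length);
        return Arrays.copyOfRange(data, (int) offset, end);
    }
}
```

Keeping the stream stateless apart from the offset makes `seek` a simple assignment; a real implementation would likely add buffering and close the broker file handle.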

@WinkerDu
Contributor Author

WinkerDu commented Dec 7, 2023

run buildall

@WinkerDu
Contributor Author

WinkerDu commented Dec 7, 2023

@morningman @chenlinzhong @luozenglin Please have a review, thx

@doris-robot

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 44.26 seconds
stream load tsv: 581 seconds loaded 74807831229 Bytes, about 122 MB/s
stream load json: 19 seconds loaded 2358488459 Bytes, about 118 MB/s
stream load orc: 66 seconds loaded 1101869774 Bytes, about 15 MB/s
stream load parquet: 33 seconds loaded 861443392 Bytes, about 24 MB/s
insert into select: 28.6 seconds inserted 10000000 Rows, about 349K ops/s
storage size: 17195047947 Bytes

@doris-robot

TPC-H test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
Tpch sf100 test result on commit 6add64693b64451f6cefba44941a227cccf5887f, data reload: false

run tpch-sf100 query with default conf and session variables
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	4693	4463	4528	4463
q2	379	187	157	157
q3	1464	1242	1214	1214
q4	1129	981	917	917
q5	3194	3156	3164	3156
q6	249	128	130	128
q7	991	503	488	488
q8	2238	2217	2192	2192
q9	6720	6700	6704	6700
q10	3213	3254	3264	3254
q11	327	210	202	202
q12	353	210	211	210
q13	4570	3782	3798	3782
q14	245	215	219	215
q15	562	524	519	519
q16	447	387	391	387
q17	1006	564	581	564
q18	7394	7921	7108	7108
q19	1509	1365	1401	1365
q20	519	338	616	338
q21	3075	2725	2683	2683
q22	358	287	296	287
Total cold run time: 44635 ms
Total hot run time: 40329 ms

run tpch-sf100 query with default conf and set session variable runtime_filter_mode=off
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	4399	4407	4431	4407
q2	269	162	172	162
q3	3539	3524	3536	3524
q4	2378	2365	2363	2363
q5	5724	5734	5721	5721
q6	242	120	123	120
q7	2355	1867	1857	1857
q8	3547	3517	3518	3517
q9	9107	9079	9017	9017
q10	3892	4006	3966	3966
q11	504	370	389	370
q12	762	586	594	586
q13	4294	3537	3541	3537
q14	289	262	259	259
q15	571	523	526	523
q16	501	471	477	471
q17	1867	1848	1879	1848
q18	8765	8227	8323	8227
q19	1732	1729	1726	1726
q20	2271	1951	1935	1935
q21	6476	6141	6131	6131
q22	498	433	419	419
Total cold run time: 63982 ms
Total hot run time: 60686 ms

@WinkerDu
Contributor Author

WinkerDu commented Dec 7, 2023

run p0


if (props.containsKey(HMSExternalCatalog.BIND_BROKER_NAME)) {
    // Set the Iceberg FileIO implementation to `IcebergBrokerIO` when the catalog specifies a bound broker.
    props.put("io-impl", "org.apache.doris.datasource.iceberg.broker.IcebergBrokerIO");
Contributor

I think this property should be added when creating catalog?

Contributor Author

OK, I'll edit the docs to illustrate the usage of this property
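
To make the behavior under discussion concrete, here is a small self-contained sketch of the property injection: when the catalog properties name a bound broker, the Iceberg `io-impl` property is pointed at `IcebergBrokerIO`. The class `IcebergCatalogProps`, the method `withBrokerIo`, and the literal key `"broker.name"` are assumptions for illustration; the diff above only refers to the key as `HMSExternalCatalog.BIND_BROKER_NAME`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; it mirrors the reviewed snippet but is not the PR's code.
class IcebergCatalogProps {
    // Assumed key value; the PR text only shows the constant name BIND_BROKER_NAME.
    static final String BIND_BROKER_NAME = "broker.name";
    // Iceberg catalog property that selects the FileIO implementation.
    static final String IO_IMPL = "io-impl";
    static final String BROKER_IO = "org.apache.doris.datasource.iceberg.broker.IcebergBrokerIO";

    // Returns a copy of the catalog properties, adding io-impl only when a broker is bound.
    static Map<String, String> withBrokerIo(Map<String, String> catalogProps) {
        Map<String, String> props = new HashMap<>(catalogProps);
        if (props.containsKey(BIND_BROKER_NAME)) {
            props.put(IO_IMPL, BROKER_IO);
        }
        return props;
    }
}
```

With this shape, users only set the broker name when creating the catalog and the FileIO choice is derived from it. If an explicit user-supplied `io-impl` should take precedence, `putIfAbsent` could replace `put`; whether that is desirable is a design decision this sketch does not settle.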

@WinkerDu
Contributor Author

WinkerDu commented Jan 1, 2024

run buildall

@WinkerDu WinkerDu force-pushed the master-iceberg-broker-scan branch from 181af7b to 1479075 Compare January 1, 2024 17:49
@WinkerDu
Contributor Author

WinkerDu commented Jan 1, 2024

run buildall

@doris-robot

TPC-H test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G', run with scripts in https://github.com/apache/doris/tree/master/tools/tpch-tools

Tpch sf100 test result on commit 147907567984d29a3a308bcd29a1ae22e0371941, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17748	5178	5112	5112
q2	2025	165	145	145
q3	10594	1082	1177	1082
q4	10197	825	864	825
q5	7795	2967	2953	2953
q6	214	135	134	134
q7	918	535	559	535
q8	9297	2039	2005	2005
q9	6885	6378	6384	6378
q10	8305	3021	3083	3021
q11	444	228	209	209
q12	390	231	239	231
q13	18016	3662	3674	3662
q14	244	215	218	215
q15	590	551	547	547
q16	493	431	393	393
q17	962	530	435	435
q18	7441	6733	6702	6702
q19	1592	1357	1384	1357
q20	670	349	331	331
q21	2767	2347	2416	2347
q22	386	322	333	322
Total cold run time: 107973 ms
Total hot run time: 38941 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	5102	5112	5125	5112
q2	340	242	261	242
q3	3331	3280	3235	3235
q4	2084	2017	2012	2012
q5	5816	5815	5780	5780
q6	215	126	127	126
q7	2316	1913	1963	1913
q8	3410	3412	3443	3412
q9	8827	8771	8754	8754
q10	3789	3879	3865	3865
q11	572	467	479	467
q12	809	723	662	662
q13	16272	3234	3172	3172
q14	297	281	279	279
q15	596	534	518	518
q16	548	503	513	503
q17	1964	1801	1736	1736
q18	8730	8334	8283	8283
q19	1619	1617	1580	1580
q20	2222	1975	1953	1953
q21	5654	5306	5286	5286
q22	551	495	492	492
Total cold run time: 75064 ms
Total hot run time: 59382 ms

@doris-robot

TPC-DS test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G', run with scripts in https://github.com/apache/doris/tree/master/tools/tpcds-tools

TPC-DS sf100 test result on commit 147907567984d29a3a308bcd29a1ae22e0371941, data reload: false

run tpcds-sf100 query with default conf and session variables
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
query1	858	374	348	348
query2	4134	2018	2048	2018
query3	4264	217	204	204
query4	26745	22375	22628	22375
query5	2759	584	543	543
query6	235	190	186	186
query7	3498	267	269	267
query8	236	204	211	204
query9	6311	2819	2740	2740
query10	339	263	272	263
query11	16253	15561	15602	15561
query12	136	81	79	79
query13	988	333	323	323
query14	10018	7242	7227	7227
query15	223	189	195	189
query16	5230	287	273	273
query17	1380	509	502	502
query18	1584	274	270	270
query19	187	144	145	144
query20	84	79	81	79
query21	171	105	99	99
query22	4899	4578	4520	4520
query23	32297	31246	31581	31246
query24	12313	2882	2890	2882
query25	624	364	346	346
query26	1810	151	151	151
query27	2821	284	288	284
query28	7160	1973	1965	1965
query29	2561	395	385	385
query30	295	154	151	151
query31	1096	796	781	781
query32	93	64	61	61
query33	740	279	274	274
query34	1008	464	458	458
query35	888	812	767	767
query36	1284	1165	1244	1165
query37	176	75	76	75
query38	3419	3317	3281	3281
query39	1330	1309	1288	1288
query40	280	96	95	95
query41	38	36	38	36
query42	100	92	93	92
query43	570	512	484	484
query44	1088	713	731	713
query45	200	197	186	186
query46	1080	666	663	663
query47	1704	1565	1580	1565
query48	354	271	260	260
query49	1224	345	342	342
query50	766	347	337	337
query51	5463	5385	5300	5300
query52	98	94	92	92
query53	229	160	151	151
query54	2279	567	585	567
query55	97	91	94	91
query56	214	199	202	199
query57	1001	978	942	942
query58	243	219	214	214
query59	2915	2674	2651	2651
query60	242	232	242	232
query61	84	82	80	80
query62	684	446	468	446
query63	172	155	155	155
query64	5677	1728	1742	1728
query65	3350	3286	3291	3286
query66	1387	352	335	335
query67	15687	15680	15620	15620
query68	10760	525	538	525
query69	456	270	267	267
query70	1557	1547	1399	1399
query71	375	233	232	232
query72	5411	3515	3583	3515
query73	2248	320	313	313
query74	7000	6399	6518	6399
query75	4750	2325	2270	2270
query76	5143	1147	1139	1139
query77	651	274	286	274
query78	9103	9037	8628	8628
query79	1026	528	521	521
query80	564	361	371	361
query81	475	212	217	212
query82	208	110	97	97
query83	171	142	140	140
query84	238	54	55	54
query85	891	278	262	262
query86	378	370	375	370
query87	3551	3428	3378	3378
query88	2781	2248	2285	2248
query89	353	275	265	265
query90	1818	213	197	197
query91	124	93	91	91
query92	59	59	57	57
query93	1469	516	441	441
query94	749	181	190	181
query95	485	434	412	412
query96	627	317	316	316
query97	4299	4175	4189	4175
query98	215	202	193	193
query99	1083	835	793	793
Total cold run time: 276691 ms
Total hot run time: 180561 ms

@doris-robot

TPC-H test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G', run with scripts in https://github.com/apache/doris/tree/master/tools/tpch-tools

Tpch sf100 test result on commit 147907567984d29a3a308bcd29a1ae22e0371941, data reload: false

run tpch-sf100 query with default conf and session variables
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	5489	5226	5187	5187
q2	399	165	158	158
q3	1457	1220	1179	1179
q4	1092	853	842	842
q5	3113	3139	3114	3114
q6	225	131	129	129
q7	993	541	514	514
q8	2161	2275	2270	2270
q9	6707	6648	6702	6648
q10	3184	3085	3077	3077
q11	349	215	216	215
q12	392	238	239	238
q13	4391	3638	3635	3635
q14	262	227	227	227
q15	607	559	539	539
q16	456	404	399	399
q17	1041	627	538	538
q18	7091	6713	6723	6713
q19	1640	1529	1449	1449
q20	582	345	327	327
q21	2860	2475	2451	2451
q22	400	324	342	324
Total cold run time: 44891 ms
Total hot run time: 40173 ms

run tpch-sf100 query with default conf and set session variable runtime_filter_mode=off
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	5108	5072	5025	5025
q2	340	267	251	251
q3	3388	3342	3300	3300
q4	2113	2015	2000	2000
q5	5953	5912	5916	5912
q6	230	125	123	123
q7	2395	1885	1943	1885
q8	3547	3668	3660	3660
q9	9046	8959	8974	8959
q10	3868	3929	3940	3929
q11	586	487	502	487
q12	801	643	628	628
q13	3871	3198	3185	3185
q14	297	288	274	274
q15	600	541	529	529
q16	550	497	520	497
q17	2115	1849	1783	1783
q18	8788	8396	8440	8396
q19	1741	1686	1697	1686
q20	2294	1993	1984	1984
q21	5564	5379	5298	5298
q22	587	527	491	491
Total cold run time: 63782 ms
Total hot run time: 60282 ms

@doris-robot

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 47.48 seconds
stream load tsv: 562 seconds loaded 74807831229 Bytes, about 126 MB/s
stream load json: 19 seconds loaded 2358488459 Bytes, about 118 MB/s
stream load orc: 66 seconds loaded 1101869774 Bytes, about 15 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.4 seconds inserted 10000000 Rows, about 352K ops/s
storage size: 17183563601 Bytes

@WinkerDu
Contributor Author

WinkerDu commented Jan 2, 2024

run p0

@morningman morningman force-pushed the master-iceberg-broker-scan branch from 1479075 to 57432eb Compare January 2, 2024 16:35
@morningman
Contributor

run buildall

@doris-robot

TPC-H test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G', run with scripts in https://github.com/apache/doris/tree/master/tools/tpch-tools

Tpch sf100 test result on commit 57432ebd2ecc4c7d9567246c22c1fccaf74f03c9, data reload: false

run tpch-sf100 query with default conf and session variables
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	5449	5185	5151	5151
q2	390	165	159	159
q3	1456	1162	1200	1162
q4	1087	802	814	802
q5	3090	2967	3077	2967
q6	224	137	139	137
q7	957	550	526	526
q8	2144	2218	2218	2218
q9	6725	6677	6675	6675
q10	3162	3093	3130	3093
q11	351	228	225	225
q12	391	236	242	236
q13	4455	3625	3670	3625
q14	252	219	223	219
q15	606	570	559	559
q16	465	422	408	408
q17	1044	554	516	516
q18	7044	6802	6774	6774
q19	1638	1503	1598	1503
q20	558	341	336	336
q21	2889	2448	2543	2448
q22	408	334	323	323
Total cold run time: 44785 ms
Total hot run time: 40062 ms

run tpch-sf100 query with default conf and set session variable runtime_filter_mode=off
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	5190	5173	5144	5144
q2	342	253	241	241
q3	3351	3326	3324	3324
q4	2138	2026	2045	2026
q5	5980	5960	5928	5928
q6	227	127	124	124
q7	2402	1945	1908	1908
q8	3602	3659	3658	3658
q9	9120	9030	9003	9003
q10	3864	3963	3948	3948
q11	576	481	484	481
q12	804	640	667	640
q13	3900	3195	3192	3192
q14	302	269	275	269
q15	624	552	549	549
q16	593	540	486	486
q17	2039	1863	1813	1813
q18	8760	8379	8495	8379
q19	1744	1682	1697	1682
q20	2283	1994	1988	1988
q21	5826	5413	5357	5357
q22	570	468	492	468
Total cold run time: 64237 ms
Total hot run time: 60608 ms

@doris-robot

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 47.28 seconds
stream load tsv: 578 seconds loaded 74807831229 Bytes, about 123 MB/s
stream load json: 19 seconds loaded 2358488459 Bytes, about 118 MB/s
stream load orc: 66 seconds loaded 1101869774 Bytes, about 15 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.2 seconds inserted 10000000 Rows, about 354K ops/s
storage size: 17188079796 Bytes

Contributor

@wsjz wsjz left a comment

LGTM

Contributor

github-actions bot commented Jan 3, 2024

PR approved by anyone and no changes requested.

@morningman morningman merged commit 08353f6 into apache:master Jan 3, 2024
27 of 28 checks passed
seawinde pushed a commit to seawinde/doris that referenced this pull request Jan 3, 2024
…he#28107)

HappenLee pushed a commit to HappenLee/incubator-doris that referenced this pull request Jan 12, 2024
…he#28107)

5 participants