
[EXP][C++] Deduplicate schemas when scanning Dataset #45340

Draft

pitrou wants to merge 1 commit into main from exp_deduplicate_schema
Conversation

@pitrou (Member) commented Jan 23, 2025

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@pitrou (Member, Author) commented Jan 23, 2025

@ursabot please benchmark

@ursabot commented Jan 23, 2025

Benchmark runs are scheduled for commit f681035. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@pitrou force-pushed the exp_deduplicate_schema branch 2 times, most recently from 9610573 to a0503f3, on January 23, 2025 at 17:23
@pitrou (Member, Author) commented Jan 23, 2025

@ursabot please benchmark

@ursabot commented Jan 23, 2025

Benchmark runs are scheduled for commit a0503f3. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@pitrou (Member, Author) commented Jan 23, 2025

@icexelloss This is a quick experiment that you might want to try out on a real-world use case. I don't seem to get any tangible benefits on a synthetic dataset, though it might be due to memory fragmentation.
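
For context, the idea behind "deduplicating schemas" is to have fragments whose schemas compare equal share a single arrow::Schema instance rather than each carrying its own copy of a wide schema. A minimal sketch of that idea (an illustration only, not the actual change in this PR):

```cpp
// Sketch: hand out one canonical Schema instance per distinct schema, so
// hundreds of fragments with identical wide schemas share a single copy.
#include <memory>
#include <vector>

#include <arrow/type.h>  // arrow::Schema

class SchemaDeduplicator {
 public:
  // Return a previously seen Schema equal to `schema`, or remember and
  // return `schema` itself if it has not been seen before.
  std::shared_ptr<arrow::Schema> Deduplicate(std::shared_ptr<arrow::Schema> schema) {
    for (const auto& seen : seen_) {
      if (seen->Equals(*schema, /*check_metadata=*/true)) {
        return seen;  // reuse the existing allocation
      }
    }
    seen_.push_back(schema);
    return schema;
  }

 private:
  std::vector<std::shared_ptr<arrow::Schema>> seen_;
};
```

Whether this saves anything in practice depends on how many schema copies a scan actually keeps alive at once, which is what the benchmarks in this thread try to measure.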

Thanks for your patience. Conbench analyzed the 0 benchmarking runs that have been run so far on PR commit f681035.

None of the specified runs were found on the Conbench server.

The full Conbench report has more details.

Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit a0503f3.

There were 775 benchmark results indicating a performance regression.

The full Conbench report has more details.

@pitrou (Member, Author) commented Jan 24, 2025

Ok, there are so many unrelated "regressions" in the report above that I'm going to launch another benchmarking run, as it's likely that external factors have influenced that run.

@pitrou (Member, Author) commented Jan 24, 2025

@ursabot please benchmark

@ursabot commented Jan 24, 2025

Commit a0503f3 already has scheduled benchmark runs.

@pitrou force-pushed the exp_deduplicate_schema branch from a0503f3 to 335a46b on January 24, 2025 at 13:49
@pitrou (Member, Author) commented Jan 24, 2025

@ursabot please benchmark

@ursabot commented Jan 24, 2025

Benchmark runs are scheduled for commit 335a46b. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit 335a46b.

There were 23 benchmark results indicating a performance regression.

The full Conbench report has more details.

@pitrou force-pushed the exp_deduplicate_schema branch from 335a46b to 11b7ad8 on January 27, 2025 at 14:39
@icexelloss (Contributor) commented:

@pitrou Thanks! cc @timothydijamco

@timothydijamco commented Feb 5, 2025

Thanks for this PR, seems like a good idea to try out.

In response to #45287 (comment): I didn't observe any difference between a version of Arrow with the metadata-clearing patch (#45330) and a version with the metadata-clearing patch plus the patch from this PR.

Synthetic data

500 files, each with 1 row and 10,000 columns with 200-character-long column names
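
A dataset of that shape could be generated with something like the sketch below (an assumed reconstruction, not the commenter's actual script; the int64 column type and the 'x'-padded column names are placeholders chosen only to match the stated file count, row count, and name length):

```cpp
// Sketch: write `num_files` Parquet files, each holding one row of
// `num_cols` int64 columns whose names are `name_len` characters long.
#include <memory>
#include <string>
#include <vector>

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/writer.h>

arrow::Status WriteSyntheticDataset(const std::string& dir, int num_files = 500,
                                    int num_cols = 10000, std::size_t name_len = 200) {
  std::vector<std::shared_ptr<arrow::Field>> fields;
  std::vector<std::shared_ptr<arrow::Array>> columns;
  for (int i = 0; i < num_cols; ++i) {
    std::string name = std::to_string(i);
    name.resize(name_len, 'x');  // pad each column name to 200 characters
    fields.push_back(arrow::field(name, arrow::int64()));
    arrow::Int64Builder builder;
    ARROW_RETURN_NOT_OK(builder.Append(i));
    ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Array> column, builder.Finish());
    columns.push_back(column);
  }
  auto table = arrow::Table::Make(arrow::schema(fields), columns);

  // Write the same 1-row, wide table to every file (assumes `dir` exists).
  for (int f = 0; f < num_files; ++f) {
    std::string path = dir + "/part-" + std::to_string(f) + ".parquet";
    ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
    ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(*table, arrow::default_memory_pool(),
                                                   sink, /*chunk_size=*/1));
  }
  return arrow::Status::OK();
}
```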

Peak memory

Performing a "scan" or loading the table into memory:

| | Only metadata-clearing (#45330) | With metadata-clearing (#45330) and schema deduplication (#45340) |
| --- | --- | --- |
| One "scan" (pull batches from scanner->RecordBatchReader() until exhausted) | 1.49 GB | 1.48 GB |
| One scanner->ToTable() | 1.57 GB | 1.55 GB |

Memory profiles

Performing two "scans":

[Memory profile screenshots, Feb 5, 2025: only metadata-clearing (#45330) vs. metadata-clearing (#45330) plus schema deduplication (#45340).]

Real data

I ran on a variety of real datasets we have internally (data size varying from under 1 GB to 40 GB, column counts from hundreds to thousands, file counts from 1 to hundreds) in both the "scan" and "load table" use cases, and likewise did not observe any memory usage difference between the two Arrow versions.
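
For reference, the "one scan" pattern measured in the table above corresponds roughly to the following use of the Arrow C++ Dataset API (a sketch against a local directory of Parquet files, not the actual benchmark harness; peak-memory measurement is omitted):

```cpp
// Sketch: discover a directory of Parquet files as a dataset, then pull
// record batches from the scanner until the reader is exhausted.
#include <memory>
#include <string>

#include <arrow/api.h>
#include <arrow/dataset/api.h>
#include <arrow/dataset/file_parquet.h>
#include <arrow/filesystem/api.h>

namespace ds = arrow::dataset;
namespace fs = arrow::fs;

arrow::Status ScanOnce(const std::string& root_dir) {
  // Build a FileSystemDataset from every file under `root_dir`.
  auto filesystem = std::make_shared<fs::LocalFileSystem>();
  fs::FileSelector selector;
  selector.base_dir = root_dir;
  selector.recursive = true;
  auto format = std::make_shared<ds::ParquetFileFormat>();
  ARROW_ASSIGN_OR_RAISE(auto factory,
                        ds::FileSystemDatasetFactory::Make(filesystem, selector, format,
                                                           ds::FileSystemFactoryOptions{}));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());

  // One "scan": read batches until the RecordBatchReader is exhausted.
  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
  ARROW_ASSIGN_OR_RAISE(auto reader, scanner->ToRecordBatchReader());
  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
  }
  // The scanner->ToTable() case instead materializes everything in one call:
  //   ARROW_ASSIGN_OR_RAISE(auto table, scanner->ToTable());
  return arrow::Status::OK();
}
```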
