Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Update to arrow-rs / parquet 54.1.0 #14328

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jan 27, 2025

Which issue does this PR close?

Rationale for this change

Keep up with dependencies

What changes are included in this PR?

  1. Update version of arrow-rs used in DataFusion (currently pinned until release)
  2. Update for API changes

Are these changes tested?

By existing CI

Are there any user-facing changes?

@github-actions github-actions bot added optimizer Optimizer rules core Core DataFusion crate labels Jan 27, 2025
@@ -369,8 +370,8 @@ fn try_cast_numeric_literal(
// Different precision for decimal128 can store different range of value.
// For example, the precision is 3, the max of value is `999` and the min
// value is `-999`
MIN_DECIMAL_FOR_EACH_PRECISION[*precision as usize - 1],
MAX_DECIMAL_FOR_EACH_PRECISION[*precision as usize - 1],
MIN_DECIMAL128_FOR_EACH_PRECISION[*precision as usize],
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -311,7 +311,7 @@ impl FileOpener for ArrowOpener {
{
decoder.read_dictionary(
dict_block,
&Buffer::from_bytes(dict_result.into()),
&Buffer::from(dict_result),

This comment was marked as outdated.

dict_block,
&Buffer::from_bytes(dict_result.into()),
)?;
decoder
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb
Copy link
Contributor Author

alamb commented Jan 27, 2025

https://github.com/apache/datafusion/actions/runs/12996259121/job/36244947022?pr=14328 appears to be a real difference

     Running bin/sqllogictests.rs (target/ci/deps/sqllogictests-71f9d985f37bce43)
External error: query result mismatch:
[SQL] SELECT
  column1,
  lag(column2) OVER (order by column1),
  lead(column2) OVER (order by column1)
FROM array_data;
[Diff] (-expected|+actual)
    a NULL [4, 5, 6]
    b [1, 2, 3] [7, 8, 9]
-   c [4, 5, 6] NULL
+   c [1, 2, 3] NULL
at test_files/window.slt:4853

External error: query result mismatch:
[SQL] select array_append(column1, column2) from nested_arrays;
[Diff] (-expected|+actual)
-   [[1, 2, 3], [2, 9, 1], [7, 8, 9], [1, 2, 3], [1, 7, 4], [4, 5, 6], [7, 8, 9]]
-   [[4, 5, 6], [10, 11, 12], [4, 9, 8], [7, 8, 9], [10, 11, 12], [1, 8, 7], [10, 11, 12]]
+   [[1, 2, 3], [2, 9, 1], [7, 8, 9], [1, 2, 3], [1, 7, 4], [4, 5, 6], [4, 5, 6]]
+   [[10, 11, 12], [4, 9, 8], [7, 8, 9], [10, 11, 12], [1, 8, 7], [7, 8, 9], [10, 11, 12]]
at test_files/array.slt:2424

I am git bisecting to figure out what is going on

@alamb
Copy link
Contributor Author

alamb commented Jan 27, 2025

According to git bisect the error was introduced in this PR

andrewlamb@Andrews-MacBook-Pro-2:~/Software/arrow-rs$ git bisect bad
27992680d5677b629d7b4ca11d3f94c2fe905f1a is the first bad commit
commit 27992680d5677b629d7b4ca11d3f94c2fe905f1a (HEAD)
Author: Raz Luvaton <[email protected]>
Date:   Sat Jan 4 12:24:48 2025 +0200

    feat(arrow-select): `concat` kernel will merge dictionary values for list of dictionaries (#6893)

    * feat(arrow-select): make list of dictionary merge dictionary keys

    TODO:
    - [ ] Add support to nested lists
    - [ ] Add more tests
    - [ ] Fix failing test

    * fix concat lists of dictionaries

    * format

    * remove unused import

    * improve test helper

    * feat: add merge offset buffers into one

    * format

    * add reproduction tst

    * recommit

    * fix clippy

    * fix clippy

    * fix clippy

    * improve offsets code according to code review

    * use concat dictionaries

    * add specialize code to concat lists to be able to use the concat dictionary logic

    * remove the use of ArrayData
    * 

You can verify this by applying a diff such as this locally:

diff --git a/Cargo.toml b/Cargo.toml
index 1c8f9f0a1..8025d32b8 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -190,15 +190,15 @@ unexpected_cfgs = { level = "warn", check-cfg = ["cfg(tarpaulin)"] }
 unused_qualifications = "deny"

 [patch.crates-io]
-arrow = { git = "https://github.com/apache/arrow-rs.git", rev = "6aaff7e38a573f797b31f89f869c3706cbe26e37" }
-arrow-array = { git = "https://github.com/apache/arrow-rs.git", rev = "6aaff7e38a573f797b31f89f869c3706cbe26e37" }
-arrow-buffer = { git = "https://github.com/apache/arrow-rs.git", rev = "6aaff7e38a573f797b31f89f869c3706cbe26e37" }
-arrow-cast = { git = "https://github.com/apache/arrow-rs.git", rev = "6aaff7e38a573f797b31f89f869c3706cbe26e37" }
-arrow-data = { git = "https://github.com/apache/arrow-rs.git", rev = "6aaff7e38a573f797b31f89f869c3706cbe26e37" }
-arrow-ipc = { git = "https://github.com/apache/arrow-rs.git", rev = "6aaff7e38a573f797b31f89f869c3706cbe26e37" }
-arrow-schema = { git = "https://github.com/apache/arrow-rs.git", rev = "6aaff7e38a573f797b31f89f869c3706cbe26e37" }
-arrow-select = { git = "https://github.com/apache/arrow-rs.git", rev = "6aaff7e38a573f797b31f89f869c3706cbe26e37" }
-arrow-string = { git = "https://github.com/apache/arrow-rs.git", rev = "6aaff7e38a573f797b31f89f869c3706cbe26e37" }
-arrow-ord = { git = "https://github.com/apache/arrow-rs.git", rev = "6aaff7e38a573f797b31f89f869c3706cbe26e37" }
-arrow-flight = { git = "https://github.com/apache/arrow-rs.git", rev = "6aaff7e38a573f797b31f89f869c3706cbe26e37" }
-parquet = { git = "https://github.com/apache/arrow-rs.git", rev = "6aaff7e38a573f797b31f89f869c3706cbe26e37" }
+arrow = { path="/Users/andrewlamb/Software/arrow-rs/arrow" }
+arrow-array = { path="/Users/andrewlamb/Software/arrow-rs/arrow-array" }
+arrow-buffer = { path="/Users/andrewlamb/Software/arrow-rs/arrow-buffer" }
+arrow-cast = { path="/Users/andrewlamb/Software/arrow-rs/arrow-cast" }
+arrow-data = { path="/Users/andrewlamb/Software/arrow-rs/arrow-data" }
+arrow-ipc = { path="/Users/andrewlamb/Software/arrow-rs/arrow-ipc" }
+arrow-schema = { path="/Users/andrewlamb/Software/arrow-rs/arrow-schema" }
+arrow-select = { path="/Users/andrewlamb/Software/arrow-rs/arrow-select" }
+arrow-string = { path="/Users/andrewlamb/Software/arrow-rs/arrow-string" }
+arrow-ord = { path="/Users/andrewlamb/Software/arrow-rs/arrow-ord" }
+arrow-flight = { path="/Users/andrewlamb/Software/arrow-rs/arrow-flight" }
+parquet = { path="/Users/andrewlamb/Software/arrow-rs/parquet" }

And then running

$ cargo test --test sqllogictests -- window

From the list of commits on https://github.com/apache/arrow-rs/commits/main/?after=6aaff7e38a573f797b31f89f869c3706cbe26e37+34 you can see:

@alamb alamb force-pushed the alamb/update_arrow_44.1 branch from a0097a2 to 7758936 Compare January 29, 2025 12:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate optimizer Optimizer rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant