feat: support 'col IN (a, b, c)' type expressions #652

roeap · 2025-01-18T15:48:35Z

What changes are proposed in this pull request?

Currently, evaluation of expressions of the form col IN (a, b, c) is missing an implementation. While this might be the exact case @scovich cautioned us about, where the rhs might get significant in size and should really be handled as EngineData, I hope that we at least do not make things worse here. Unfortunately, delta-rs already supports these types of expressions, so the main intent right now is to retain feature parity over there while migrating.

How was this change tested?

Added tests for this specific expression flavor.

codecov · 2025-01-18T15:52:56Z

Codecov Report

Attention: Patch coverage is 77.12766% with 43 lines in your changes missing coverage. Please review.

Project coverage is 84.00%. Comparing base (6751838) to head (def21c1).

Files with missing lines	Patch %	Lines
kernel/src/engine/arrow_expression.rs	82.08%	21 Missing and 3 partials ⚠️
kernel/src/expressions/scalars.rs	66.03%	18 Missing ⚠️
kernel/src/engine/arrow_conversion.rs	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #652      +/-   ##
==========================================
- Coverage   84.08%   84.00%   -0.08%     
==========================================
  Files          76       76              
  Lines       17526    17713     +187     
  Branches    17526    17713     +187     
==========================================
+ Hits        14736    14880     +144     
- Misses       2077     2115      +38     
- Partials      713      718       +5

Signed-off-by: Robert Pack <[email protected]>

roeap · 2025-01-18T16:31:25Z

kernel/src/engine/arrow_conversion.rs

@@ -208,7 +208,7 @@ impl TryFrom<&ArrowDataType> for DataType {
            ArrowDataType::Date64 => Ok(DataType::DATE),
            ArrowDataType::Timestamp(TimeUnit::Microsecond, None) => Ok(DataType::TIMESTAMP_NTZ),
            ArrowDataType::Timestamp(TimeUnit::Microsecond, Some(tz))
-                if tz.eq_ignore_ascii_case("utc") =>
+                if tz.eq_ignore_ascii_case("utc") || tz.eq_ignore_ascii_case("+00:00") =>


The data in arrow arrays should always represent a timestamp in UTC, so is this check even necessary?

https://github.com/apache/arrow-rs/blob/af777cd53e56f8382382137b6e08af249c475397/arrow-schema/src/datatype.rs#L179-L182

scovich · 2025-01-20T12:59:39Z

kernel/src/engine/arrow_expression.rs

+                    })?;
+
+                fn op(
+                    col: impl Iterator<Item = Option<impl Into<Scalar>>>,


nit: Why not impl IntoIterator? Avoids having to call iter() at the call site?

scovich · 2025-01-20T13:03:55Z

kernel/src/engine/arrow_expression.rs

+                }
+
+                // safety: as_* methods on arrow arrays can panic, but we checked the data type before applying.
+                let arr: BooleanArray = match (column.data_type(), data_type) {


nit: with the strongly typed op, we shouldn't need a type annotation here?

Suggested change

let arr: BooleanArray = match (column.data_type(), data_type) {

let arr = match (column.data_type(), data_type) {

scovich · 2025-01-20T13:42:29Z

kernel/src/engine/arrow_expression.rs

+                fn op(
+                    col: impl Iterator<Item = Option<impl Into<Scalar>>>,
+                    ad: &ArrayData,
+                ) -> BooleanArray {
+                    #[allow(deprecated)]
+                    let res = col.map(|val| val.map(|v| ad.array_elements().contains(&v.into())));
+                    BooleanArray::from_iter(res)
+                }


I think it might help readability to factor out op a tad differently (playground):

    #[allow(deprecated)]
    let inlist = ad.array_elements();

    fn op<T>(
        inlist: &[Scalar],
        values: impl IntoIterator<Item = Option<T>>,
        from: fn(T) -> Scalar,
    ) -> BooleanArray {
        values
            .into_iter()
            .map(|v| Some(inlist.contains(&from(v?))))
            .collect()
    }

Then the primitive array cases simplify to e.g.:

    (ArrowDataType::Float64, PrimitiveType::Double) => {
        op(inlist, column.as_primitive::<Float64Type>(), Scalar::from)
    }

and e.g. the TimestampNtz case simplifies to:

    let array = column.as_primitive::<TimestampMicrosecondType>();
    op(inlist, array, Scalar::TimestampNtz)

(might need to tweak the type signature of values a bit to match whatever the primitive column iterator actually returns -- but the general approach seems to work)

We should also be able to make T: ArrowPrimitiveType and derive the expected primitive rust type from there:

    fn op<T: ArrowPrimitiveType>(
        inlist: &[Scalar],
        values: &dyn Array,
        from: fn(T::Native) -> Scalar,
    ) -> BooleanArray {
        values
            .as_primitive::<T>()
            .into_iter()
            .map(|v| Some(inlist.contains(&from(v?))))
            .collect()
    }

which allows pulling the as_primitive call inside op:

    (ArrowDataType::Float64, PrimitiveType::Double) => {
        op::<Float64Type>(inlist, column, Scalar::from)
    }

and e.g. the TimestampNtz case simplifies to:

op::<TimestampMicrosecondType>(inlist, column, Scalar::TimestampNtz)
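For readers who want to try the shape of the first `op` variant without pulling in arrow, here is a minimal sketch using only the standard library. The two-variant `Scalar` enum is a hypothetical stand-in for the kernel type, and `Vec<Option<bool>>` stands in for `BooleanArray`; neither is the PR's actual code. It keeps the `contains` lookup from this comment as-is.

```rust
// Hypothetical stand-in for the kernel's `Scalar`; two variants are enough here.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Long(i64),
    TimestampNtz(i64),
}

impl From<i64> for Scalar {
    fn from(v: i64) -> Self {
        Scalar::Long(v)
    }
}

// Same signature shape as the suggested `op`, but returning a Vec
// (assumption: a plain Vec replaces arrow's BooleanArray for illustration).
fn op<T>(
    inlist: &[Scalar],
    values: impl IntoIterator<Item = Option<T>>,
    from: fn(T) -> Scalar,
) -> Vec<Option<bool>> {
    values
        .into_iter()
        // `v?` short-circuits to None when the column value is NULL.
        .map(|v| Some(inlist.contains(&from(v?))))
        .collect()
}
```

Passing the conversion as a plain `fn` pointer is what lets both `Scalar::from` and a tuple-variant constructor such as `Scalar::TimestampNtz` be used at the call site.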

yeah - this turns out much nicer! Needed to make some adjustments, but I hope I kept the spirit!

scovich · 2025-01-20T20:04:37Z

kernel/src/engine/arrow_expression.rs

+                    ad: &ArrayData,
+                ) -> BooleanArray {
+                    #[allow(deprecated)]
+                    let res = col.map(|val| val.map(|v| ad.array_elements().contains(&v.into())));


I don't think this handles NULL values correctly? See e.g. https://spark.apache.org/docs/3.5.1/sql-ref-null-semantics.html#innot-in-subquery-:

TRUE is returned when the non-NULL value in question is found in the list

FALSE is returned when the non-NULL value is not found in the list and the list does not contain NULL values

UNKNOWN is returned when the value is NULL, or the non-NULL value is not found in the list and the list contains at least one NULL value

I think, instead of calling contains, you could borrow the code from PredicateEvaluatorDefaults::finish_eval_variadic, with true as the "dominator" value.

Actually, I think you could just invoke that method directly, with a properly crafted iterator?

    // `v IN (k1, ..., kN)` is logically equivalent to `v = k1 OR ... OR v = kN`, so evaluate
    // it as such, ensuring correct handling of NULL inputs (including `Scalar::Null`).
    col.map(|v| {
        PredicateEvaluatorDefaults::finish_eval_variadic(
            VariadicOperator::Or,
            inlist.iter().map(Some(Scalar::partial_cmp(v?, k?)? == Ordering::Equal)),
            false,
        )
    })

Was I correct in thinking that None - no dominant value, but found Null - should just be false in this case?

scovich · 2025-01-20T20:12:42Z

kernel/src/engine/arrow_expression.rs

+                    ad: &ArrayData,
+                ) -> BooleanArray {
+                    #[allow(deprecated)]
+                    let res = col.map(|val| val.map(|v| ad.array_elements().contains(&v.into())));


Aside: We actually have a lurking bug -- Scalar derives PartialEq which will allow two Scalar::Null to compare equal. But SQL semantics dictate that NULL doesn't compare equal to anything -- not even itself.

Our manual impl of PartialOrd for Scalar does this correctly, but it breaks the rules for PartialEq:

If PartialOrd or Ord are also implemented for Self and Rhs, their methods must also be consistent with PartialEq (see the documentation of those traits for the exact requirements). It’s easy to accidentally make them disagree by deriving some of the traits and manually implementing others.

Looks like we'll need to define a manual impl PartialEq for Scalar that follows the same approach.

This is indeed not covered. Added an implementation for PartialEq that mirrors PartialOrd.

Signed-off-by: Robert Pack <[email protected]>

scovich · 2025-01-25T04:21:44Z

kernel/src/engine/arrow_expression.rs

+                                // None is returned when no dominant value (true) is found and there is at least one NULL
+                                // In th case of IN, this is equivalent to false


Rescuing #652 (comment):

https://spark.apache.org/docs/3.5.1/sql-ref-null-semantics.html#innot-in-subquery-:

TRUE is returned when the non-NULL value in question is found in the list

FALSE is returned when the non-NULL value is not found in the list and the list does not contain NULL values

UNKNOWN (NULL) is returned when the value is NULL, or the non-NULL value is not found in the list and the list contains at least one NULL value

If I understand the above correctly:

    NULL IN (1, 2, NULL)  -- NULL because the value to search for was NULL
    10 IN (1, 2, NULL)    -- NULL because no direct match and in-list contains a NULL
    10 IN (1, 2)          -- FALSE because no match and also no NULL anywhere
    10 IN (10, 20, NULL)  -- TRUE in spite of the NULL because of direct match

So instead of Some(<expr>.unwrap_or(false)) we should just return <expr> directly and let the NULL propagate.
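The four cases above can be reproduced with a small standalone sketch. Everything here is an assumption made for illustration: the `Scalar` enum is a hypothetical two-variant stand-in, and `eval_or` is a hand-written three-valued OR used in place of `PredicateEvaluatorDefaults::finish_eval_variadic`, whose exact signature is not shown in this thread.

```rust
use std::cmp::Ordering;

// Hypothetical stand-in for the kernel's `Scalar`.
#[derive(Debug)]
enum Scalar {
    Integer(i32),
    Null,
}

impl Scalar {
    // NULL-aware comparison: None whenever either side is NULL.
    fn partial_cmp(&self, other: &Scalar) -> Option<Ordering> {
        match (self, other) {
            (Scalar::Integer(a), Scalar::Integer(b)) => a.partial_cmp(b),
            _ => None,
        }
    }
}

// Three-valued OR: `true` dominates; otherwise any NULL input makes the result NULL.
fn eval_or(inputs: impl IntoIterator<Item = Option<bool>>) -> Option<bool> {
    let mut saw_null = false;
    for input in inputs {
        match input {
            Some(true) => return Some(true),
            Some(false) => {}
            None => saw_null = true,
        }
    }
    if saw_null { None } else { Some(false) }
}

// `v IN (k1, ..., kN)` evaluated as `v = k1 OR ... OR v = kN`, letting NULL propagate.
fn eval_in(value: &Scalar, inlist: &[Scalar]) -> Option<bool> {
    eval_or(inlist.iter().map(|k| Some(value.partial_cmp(k)? == Ordering::Equal)))
}
```

Returning the `Option<bool>` unchanged, rather than wrapping it in `Some(... .unwrap_or(false))`, is what keeps the second case NULL instead of FALSE.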

scovich · 2025-01-25T04:29:02Z

kernel/src/engine/arrow_expression.rs

+                            Some(
+                                PredicateEvaluatorDefaults::finish_eval_variadic(
+                                    VariadicOperator::Or,
+                                    inlist.iter().map(|k| v.as_ref().map(|vv| vv == k)),


This isn't correct -- we need comparisons against Scalar::Null to return None. That's why I had previously recommended using Scalar::partial_cmp instead of ==.

Also, can we not use ? to unwrap the various options here?

Suggested change

inlist.iter().map(|k| v.as_ref().map(|vv| vv == k)),

inlist.iter().map(Some(Scalar::partial_cmp(v?, k?)? == Ordering::Equal)),

Unpacking that -- if the value we search for is NULL, or if the inlist entry is NULL, or if the two values are incomparable, then return None for that pair. Otherwise, return Some boolean indicating whether the values compared equal or not. That automatically covers the various required cases, and also makes us robust to any type mismatches that might creep in.

Note: If we wanted to be a tad more efficient, we could also unpack v outside the inner loop:

    values.map(|v| {
        let v = v?;
        PredicateEvaluatorDefaults::finish_eval_variadic(...)
    })

Hmm -- empty in-lists pose a corner case with respect to unpacking v:

NULL IN ()

Operator OR with zero inputs normally produces FALSE (which is correct if you stop to think about it) -- but unpacking a NULL v first makes the operator return NULL instead (which is also correct if you squint, because NULL input always produces NULL output).

Unfortunately, the only clear docs I could find -- https://spark.apache.org/docs/3.5.1/sql-ref-null-semantics.html#innot-in-subquery- -- are also ambiguous:

Conceptually a IN expression is semantically equivalent to a set of equality condition separated by a disjunctive operator (OR).

... suggests FALSE while

UNKNOWN is returned when the value is NULL

... suggests NULL

The difference matters for NOT IN, because NULL NOT IN () would either return TRUE (keep rows) or NULL (drop row).

NOTE: SQL engines normally forbid statically empty in-list but do not forbid subqueries from producing an empty result.

I tried the following expression on three engines (sqlite, mysql, postgres):

SELECT 1 WHERE NULL NOT IN (SELECT 1 WHERE FALSE)

And all three returned 1. So OR semantics prevail, and we must NOT unpack v outside the loop.
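The difference between the two strategies on an empty in-list can be seen in a minimal sketch. It models values as `Option<i32>` (None meaning NULL) and uses a hand-written three-valued OR named `eval_or`; both function names below are hypothetical and exist only for this illustration.

```rust
// Three-valued OR: `true` dominates; otherwise any NULL input makes the result NULL.
// With zero inputs the loop never runs and the result is Some(false).
fn eval_or(inputs: impl IntoIterator<Item = Option<bool>>) -> Option<bool> {
    let mut saw_null = false;
    for input in inputs {
        match input {
            Some(true) => return Some(true),
            Some(false) => {}
            None => saw_null = true,
        }
    }
    if saw_null { None } else { Some(false) }
}

// OR semantics: the searched value is unpacked per comparison, inside the loop.
fn in_or_semantics(value: Option<i32>, inlist: &[Option<i32>]) -> Option<bool> {
    eval_or(inlist.iter().map(|k| Some(value? == (*k)?)))
}

// Early unpacking: a NULL value returns NULL before the in-list is even inspected.
fn in_unpack_first(value: Option<i32>, inlist: &[Option<i32>]) -> Option<bool> {
    let v = value?;
    eval_or(inlist.iter().map(|k| Some(v == (*k)?)))
}
```

The two functions agree whenever the in-list is non-empty. For `NULL IN ()` the first yields FALSE (so `NOT IN` yields TRUE and the row is kept), while the second yields NULL, which is the behavior the three engines tested above do not exhibit.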

scovich · 2025-01-25T04:38:42Z

kernel/src/engine/arrow_expression.rs

+                }
+
+                fn str_op<'a>(
+                    column: impl Iterator<Item = Option<&'a str>> + 'a,


Suggested change

column: impl Iterator<Item = Option<&'a str>> + 'a,

column: impl IntoIterator<Item = Option<&'a str>> + 'a,

(avoids the need for callers to invoke iter() -- we can call into_iter() once here instead)

(the column has type e.g. &GenericByteArray, whose impl IntoIterator is equivalent to calling iter())
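A small sketch of why `impl IntoIterator` is more convenient at the call site. `StrColumn` is a hypothetical stand-in for an arrow string array whose reference type implements `IntoIterator`, and this `str_op` returns a plain `Vec` rather than whatever the PR's helper returns; both are assumptions for illustration only.

```rust
// Hypothetical column type; stands in for e.g. a string array from arrow.
struct StrColumn(Vec<Option<String>>);

// Iterating a reference yields Option<&str>, mirroring what the comment describes.
impl<'a> IntoIterator for &'a StrColumn {
    type Item = Option<&'a str>;
    type IntoIter = Box<dyn Iterator<Item = Option<&'a str>> + 'a>;

    fn into_iter(self) -> Self::IntoIter {
        Box::new(self.0.iter().map(|v| v.as_deref()))
    }
}

// Accepting IntoIterator means `into_iter()` is called once here,
// instead of `.iter()` at every call site.
fn str_op<'a>(column: impl IntoIterator<Item = Option<&'a str>>) -> Vec<Option<String>> {
    column.into_iter().map(|v| v.map(str::to_owned)).collect()
}
```

Because every `Iterator` is also an `IntoIterator`, existing callers that already pass an iterator keep compiling, while new callers can pass `&column` directly.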

scovich · 2025-01-25T04:41:45Z

kernel/src/engine/arrow_expression.rs

+                    (ArrowDataType::Utf8, PrimitiveType::String) => op_in(inlist, str_op(column.as_string::<i32>().iter())),
+                    (ArrowDataType::LargeUtf8, PrimitiveType::String) => op_in(inlist, str_op(column.as_string::<i64>().iter())),
+                    (ArrowDataType::Utf8View, PrimitiveType::String) => op_in(inlist, str_op(column.as_string_view().iter())),
+                    (ArrowDataType::Int8, PrimitiveType::Byte) => op_in(inlist,op::<Int8Type>( column.as_ref(), Scalar::from)),
+                    (ArrowDataType::Int16, PrimitiveType::Short) => op_in(inlist,op::<Int16Type>(column.as_ref(), Scalar::from)),
+                    (ArrowDataType::Int32, PrimitiveType::Integer) => op_in(inlist,op::<Int32Type>(column.as_ref(), Scalar::from)),
+                    (ArrowDataType::Int64, PrimitiveType::Long) => op_in(inlist,op::<Int64Type>(column.as_ref(), Scalar::from)),
+                    (ArrowDataType::Float32, PrimitiveType::Float) => op_in(inlist,op::<Float32Type>(column.as_ref(), Scalar::from)),
+                    (ArrowDataType::Float64, PrimitiveType::Double) => op_in(inlist,op::<Float64Type>(column.as_ref(), Scalar::from)),
+                    (ArrowDataType::Date32, PrimitiveType::Date) => op_in(inlist,op::<Date32Type>(column.as_ref(), Scalar::Date)),


These are all a lot longer than 100 chars... why doesn't the fmt check blow up??

scovich · 2025-01-25T04:48:13Z

kernel/src/engine/arrow_expression.rs

@@ -280,6 +281,84 @@ fn evaluate_expression(
                    (ArrowDataType::Decimal256(_, _), Decimal256Type)
                }
            }
+            (Column(name), Literal(Scalar::Array(ad))) => {
+                fn op<T: ArrowPrimitiveType>(
+                    values: &dyn Array,


Suggested change

values: &dyn Array,

values: ArrayRef,

(avoids the need for .as_ref() at the call site)

scovich · 2025-01-25T04:58:58Z

kernel/src/expressions/scalars.rs

+        // NOTE: We intentionally do two match arms for each variant to avoid a catch-all, so
+        // that new variants trigger compilation failures instead of being silently ignored.
+        match (self, other) {


Looking at the requirements for PartialEq and PartialOrd, I think it would be much safer (and more compact) to move the current PartialOrd::partial_cmp to Scalar::partial_cmp. Then PartialOrd::partial_cmp is just a thin wrapper around self.partial_cmp, and PartialEq::eq is self.partial_cmp(...) == Some(Ordering::Equal)

See e.g. https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=dc8e783e26f94c98da88258a4854dbd9
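A minimal sketch of that arrangement, assuming a hypothetical three-variant `Scalar` (the kernel's enum has many more variants, and this sketch uses a catch-all arm for brevity, which the real implementation deliberately avoids):

```rust
use std::cmp::Ordering;

// Hypothetical stand-in for the kernel's `Scalar`.
#[derive(Debug, Clone)]
enum Scalar {
    Integer(i32),
    Long(i64),
    Null,
}

impl Scalar {
    // Single source of truth: NULL and mismatched types are incomparable.
    fn partial_cmp(&self, other: &Scalar) -> Option<Ordering> {
        match (self, other) {
            (Scalar::Integer(a), Scalar::Integer(b)) => a.partial_cmp(b),
            (Scalar::Long(a), Scalar::Long(b)) => a.partial_cmp(b),
            _ => None,
        }
    }
}

impl PartialOrd for Scalar {
    // Thin wrapper; the path `Scalar::partial_cmp` resolves to the inherent method.
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Scalar::partial_cmp(self, other)
    }
}

impl PartialEq for Scalar {
    // Equal only when the shared comparison says so, so NULL != NULL.
    fn eq(&self, other: &Self) -> bool {
        Scalar::partial_cmp(self, other) == Some(Ordering::Equal)
    }
}
```

With both traits routed through one function, `a == b` holds exactly when `partial_cmp(a, b) == Some(Equal)`, which is the consistency requirement quoted from the `PartialEq` documentation earlier in this thread.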

scovich · 2025-01-25T05:08:54Z

kernel/src/expressions/scalars.rs

+    #[test]
+    fn test_partial_cmp() {
+        let a = Scalar::Integer(1);
+        let b = Scalar::Integer(2);


probably needs a Scalar::Null as well, to exercise None result?

(ditto below, to ensure that comparing == against Scalar::Null produces a false result)

github-actions bot assigned roeap Jan 18, 2025

roeap force-pushed the feat/col-in-arr branch 2 times, most recently from 007b4e2 to 6977db9 Compare January 18, 2025 16:19

feat: support 'col IN (a, b, c)' type expressions

28a5648

Signed-off-by: Robert Pack <[email protected]>

roeap force-pushed the feat/col-in-arr branch from 6977db9 to 28a5648 Compare January 18, 2025 16:22

roeap commented Jan 18, 2025

roeap requested review from nicklan, scovich, zachschuermann and OussamaSaoudi January 18, 2025 16:32

scovich reviewed Jan 21, 2025

roeap added 4 commits January 24, 2025 21:32

Merge branch 'main' into feat/col-in-arr

d6e3730

fix: PR feedback

848ef11

Signed-off-by: Robert Pack <[email protected]>

chore: clippy

290d65d

Signed-off-by: Robert Pack <[email protected]>

chore: fmt

def21c1

Signed-off-by: Robert Pack <[email protected]>

roeap requested a review from scovich January 24, 2025 23:55

scovich reviewed Jan 25, 2025
