feat: Support Spark ArraySort with lambda function #10138
Conversation
  return rewritten;
}

VELOX_USER_FAIL(kNotSupported, lambda->toString())
Do we need to throw an error if the rewrite is not possible for Spark? I followed Presto's logic, but I'm not sure it's necessary for Spark.
Since Spark has a different comparison implementation than Presto, could we add a block to the PR description describing the semantic difference?
Thanks!
@rui-mo Done, addressed the comments, please have a look again.
@boneanxs I wonder if we are tackling the difference in NaN semantics in this PR. There is a plan in Velox to adjust its semantics, and some PRs have been merged. Perhaps we can fix the Presto function directly; see #7237.
@rui-mo We might still need special Spark rewrite arraySort logic even after the NaN semantics difference is fixed, e.g. for the expression.
Do we have a plan to fix all Spark comparison functions after the NaN semantics is unified? Also, could other comparison functions have semantic differences beyond NaN? I'm thinking it might be a long-term effort to address.
Thanks.
Force-pushed from b209dd1 to 901aae4.
Thanks. Added several comments.
Thanks. Added several questions.
    prefix + "array_sort", arraySortSignatures(), makeArraySort);
    prefix + "array_sort", arraySortSignatures(true), makeArraySortAsc);
exec::registerStatefulVectorFunction(
    prefix + "array_sort_desc", arraySortDescSignatures(), makeArraySortDesc);
Do we have a corresponding function for array_sort_desc in Spark?
We don't have array_sort_desc in Spark, but it is required since rewriteArraySort needs it:
velox/velox/functions/lib/ArraySort.cpp, line 559 in 0093ee9:
    : prefix + "array_sort_desc";
Force-pushed from 5c0e1f4 to 80fa6c3.
This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!
@boneanxs Would you like to update this PR? Thanks.
Oh, forgot it. Sure, will update it soon.
Some comments on the documentation.
velox/docs/functions/spark/array.rst (Outdated)

:noindex:

Returns the array sorted by values computed using specified lambda in ascending
order. ``U`` must be an orderable type. If the value from the lambda function is NULL, the element will be placed at the end. ::
NULL or NaN for floating type?
Currently, NaN is not handled for values returned by the lambda function. Do we need to handle NaN, given that it shouldn't be returned by lambda functions?
> Do we need to handle NaN since it shouldn't be returned by lambda functions

Hi @boneanxs, could you provide more details on why NaN shouldn't be returned?
Oh, sorry, I overlooked this before. For array_sort with lambda functions, sorting with NaN is supported in SimpleVector.comparePrimitiveAsc and follows the logic that NaN is placed before NULL. I also added a test to cover this.
Let me explain more here. After looking into the Presto/Spark implementations, they both say:

> It returns -1, 0, or 1 as the first nullable element is less than, equal to, or greater than the second nullable element. If the comparator function returns other values (including NULL), the query will fail and raise an error.

See Presto and Spark (though Spark says it doesn't support returning null values, it doesn't throw an error for a query like SELECT array_sort(ARRAY ('bc', 'ab', 'dc'), (x, y) -> IF(x < y, 1, IF(x = y, 0, null))) in Spark 3.2, which might be a bug).

So null values and NaN shouldn't be returned by the lambda function function(T,T, int), and in SimpleComparisonMatcher we require that the return value is int. SimpleComparisonMatcher could optimize function(T,T, int) to function(T, U) where U is orderable (not limited to int), so it's possible that it creates float values. For example, function(float, float, int): IF(x > y, 1, IF(x < y, -1, 0)) will be optimized to function(float, float): x -> x. At that point the behavior should still be the same, since both go into SimpleVector.compare to do the comparison (except that NULLs are filtered out in ArraySort.sortElements in advance to respect the nullsFirst flag). And inside SimpleVector.compare, NaN is smaller than Null.
I tried SELECT array_sort(ARRAY ('bc', 'ab', 'dc'), (x, y) -> IF(x < y, 1, IF(x = y, 0, null))) in Spark 3.5 and got the exception below. Would you like to add a unit test for this case to make sure the exception is thrown?

Caused by: org.apache.spark.SparkException: [COMPARATOR_RETURNS_NULL] The comparator has returned a NULL for a comparison between dc and dc. It should return a positive integer for "greater than", 0 for "equal" and a negative integer for "less than". To revert to deprecated behavior where NULL is treated as 0 (equal), you must set "spark.sql.legacy.allowNullComparisonResultInArraySort" to "true".

I also noticed Spark requires the function to return integer type; would you like to confirm?
https://github.com/apache/spark/blob/branch-3.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala#L412-L421
Sorry to miss this comment. Yes, for the integer type handling, Spark itself will throw the exception if the return type is not integer (it handles the return type in the analyze stage). Also, Velox can't rewrite the lambda function if it doesn't match; I added a test in ArraySortTest.unsupporteLambda to ensure it.
As for null returned by the comparator, it was fixed by apache/spark#36812 since Spark 3.2.2 (I used 3.2.1, so it could pass). Spark 3.5 doesn't have this issue.
Thanks for iterating. Some minors and the others look good!
@boneanxs Thanks for iterating. Would you also rebase this PR?
velox/docs/functions/spark/array.rst (Outdated)

:noindex:

Returns the array sorted by values computed using specified lambda in ascending order. ``U`` must be an orderable type.
Null/NaN elements returned by the lambda function will be placed at the end of the returned array, with NaN elements appearing before Null elements. This function is not supported in Spark and is only used inside Velox. ::
Perhaps clarify the purpose in the document:
used inside velox for rewriting :spark:func:`xxx` as :spark:func:`xxx`.
@@ -35,6 +36,24 @@ class ArraySortTest : public SparkFunctionBaseTest {
    assertEqualVectors(expected, result);
  }

  void testArraySort(
      const std::string& lamdaExpr,
      const bool asc,
nit: drop const when passing by value.
velox/docs/functions/spark/array.rst (Outdated)

Returns the array sorted by values computed using specified lambda in ascending
order. ``U`` must be an orderable type. If the value from the lambda function is NULL, the element will be placed at the end.
The function attempts to analyze the lambda function and rewrite it into a simpler call that
specifies the sort-by expression (like :spark:func:`array_sort(array(T), function(T,U)) -> array(T)`). For example, ``(left, right) -> if(length(left) > length(right), 1, if(length(left) < length(right), -1, 0))`` will be rewritten to ``x -> length(x)``. ::
Perhaps clarify the behavior when rewrite is not possible.
@@ -140,5 +163,50 @@ TEST_F(ArraySortTest, constant) {
  expected = makeConstantArray<int64_t>(size, {6, 6, 6, 6});
  assertEqualVectors(expected, result);
}

TEST_F(ArraySortTest, lambda) {
Would you like to add test for the case when rewriting is not possible?
@rui-mo See apache/incubator-gluten#8526; the added tests pass. Also, it will fall back if the lambda can't be rewritten:
2025-01-14T04:03:49.5216547Z - array_sort with lambda functions
2025-01-14T04:03:49.5478095Z 04:03:49.547 WARN org.apache.spark.sql.execution.GlutenFallbackReporter: Validation failed for plan: Project, due to: Native validation failed:
2025-01-14T04:03:49.5481453Z Validation failed due to exception caught at file:SubstraitToVeloxPlanValidator.cc line:1380 function:validate, thrown from file:ArraySort.cpp line:573 function:rewriteArraySortCall, reason:array_sort with comparator lambda that cannot be rewritten into a transform is not supported: lambda ROW<x_12:MAP<VARCHAR,INTEGER>,y_13:MAP<VARCHAR,INTEGER>> -> subtract(size("x_12",true),size("y_13",true)).
2025-01-14T04:03:49.5542142Z 04:03:49.553 WARN org.apache.spark.sql.execution.GlutenFallbackReporter: Validation failed for plan: Project, due to: Native validation failed:
2025-01-14T04:03:49.5545407Z Validation failed due to exception caught at file:SubstraitToVeloxPlanValidator.cc line:1380 function:validate, thrown from file:ArraySort.cpp line:573 function:rewriteArraySortCall, reason:array_sort with comparator lambda that cannot be rewritten into a transform is not supported: lambda ROW<x_12:MAP<VARCHAR,INTEGER>,y_13:MAP<VARCHAR,INTEGER>> -> subtract(size("x_12",true),size("y_13",true)).
2025-01-14T04:03:49.6574296Z 04:03:49.656 WARN org.apache.spark.sql.execution.GlutenFallbackReporter: Validation failed for plan: Project, due to: Native validation failed:
2025-01-14T04:03:49.6577446Z Validation failed due to exception caught at file:SubstraitToVeloxPlanValidator.cc line:1380 function:validate, thrown from file:ArraySort.cpp line:573 function:rewriteArraySortCall, reason:array_sort with comparator lambda that cannot be rewritten into a transform is not supported: lambda ROW<x:MAP<VARCHAR,INTEGER>,y:MAP<VARCHAR,INTEGER>> -> subtract(size("x",true),size("y",true)).
2025-01-14T04:03:49.6660157Z 04:03:49.665 WARN org.apache.spark.sql.execution.GlutenFallbackReporter: Validation failed for plan: Project, due to: Native validation failed:
2025-01-14T04:03:49.6663312Z Validation failed due to exception caught at file:SubstraitToVeloxPlanValidator.cc line:1380 function:validate, thrown from file:ArraySort.cpp line:573 function:rewriteArraySortCall, reason:array_sort with comparator lambda that cannot be rewritten into a transform is not supported: lambda ROW<x:MAP<VARCHAR,INTEGER>,y:MAP<VARCHAR,INTEGER>> -> subtract(size("x",true),size("y",true)).
Force-pushed from d950987 to f800969.
Hey @rui-mo, any more comments for this?
Just reviewed the added doc. Could you compile the rst file to check that the added content (including the hyperlink) displays well in the generated doc? Thanks!
velox/docs/functions/spark/array.rst (Outdated)

:noindex:

Returns the array sorted by values computed using specified lambda in ascending
order. ``U`` must be an orderable type. If the value from the lambda function is NULL, the element will be placed at the end.
Suggestion (if my understanding is right):
If the value from the lambda function is NULL, the element will be placed at the end.
->
If the lambda function returns NULL, the corresponding element will be placed at the end.
Oh, it's not quite the same: it means the value returned from the rewritten function, such as x -> length(x) in the example. I updated the text here to make it clearer.
Hi @boneanxs, I added some nits. And do we have feedback for #10138 (comment)?
@boneanxs, could you file a PR in Gluten to enable this function and also to see its CI feedback? That way we can make sure the Spark UTs pass. In the Gluten PR, your personal Velox branch with this patch should be referenced.
Hey @PHILO-HE, thanks for mentioning it. Yes, it displays well.
There's a PR that tested it before, apache/incubator-gluten#8526, and it all passes.
@rui-mo Oh, sorry I missed that before; added a comment now. Could you help review it again? :)
Support Spark array_sort to allow a lambda function to sort elements. Since Spark has a different comparison implementation than Presto (see #5569), we can't directly reuse the Presto array_sort logic to rewrite the lambda function to a simple comparator where possible. This PR tries to:

- Move array_sort to velox/functions/lib so that both Presto and Spark can use it, adding nullsFirst to support nulls being placed at the start of the array (to support the Spark function sort_array).
- Abstract SimpleComparisonMatcher and move it to velox/functions/lib, and create a different SimpleComparisonChecker for Spark and Presto to do the comparison match (e.g., = is eq in Presto, but equalto in Spark).