Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize scalar string functions #14833

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

bziobrowski
Copy link
Contributor

@bziobrowski bziobrowski commented Jan 17, 2025

PR optimizes a number of string scalar functions, including:

  • ltrim
  • rtrim
  • unique_ngrams
  • concat
  • concat_ws
  • replace
    , changes other functions assuming that pattern is constant:
  • regexp_replace
  • regexp_like
  • regexp_extract
  • like
    while keeping existing generic implementations with _var suffix:
  • regexp_replace_var
  • regexp_like_var
  • regexp_extract_var
  • like_var

All of the functions mentioned above have been changed to initialize temporary objects and clear/reuse them in each call.
As can be seen in the following benchmark output, this change can speed up a raw function call even 4+ times.

Benchmark                                          (_regex)  Mode  Cnt    Score    Error  Units
BenchmarkRegexpReplace.testRegexpReplaceConst  q.[aeiou]c.*  avgt    3   25.720 ±  0.262  us/op
BenchmarkRegexpReplace.testRegexpReplaceConst           .*a  avgt    3   92.530 ±  2.315  us/op
BenchmarkRegexpReplace.testRegexpReplaceConst           b.*  avgt    3   34.444 ±  3.076  us/op
BenchmarkRegexpReplace.testRegexpReplaceConst            .*  avgt    3   42.251 ±  1.791  us/op
BenchmarkRegexpReplace.testRegexpReplaceConst        .*ated  avgt    3  121.553 ±  1.767  us/op
BenchmarkRegexpReplace.testRegexpReplaceConst        .*ba.*  avgt    3  130.567 ±  1.258  us/op
BenchmarkRegexpReplace.testRegexpReplaceOld    q.[aeiou]c.*  avgt    3  101.532 ± 10.586  us/op
BenchmarkRegexpReplace.testRegexpReplaceOld             .*a  avgt    3  153.493 ±  8.621  us/op
BenchmarkRegexpReplace.testRegexpReplaceOld             b.*  avgt    3   75.913 ±  2.909  us/op
BenchmarkRegexpReplace.testRegexpReplaceOld              .*  avgt    3   75.989 ±  4.248  us/op
BenchmarkRegexpReplace.testRegexpReplaceOld          .*ated  avgt    3  214.719 ± 91.627  us/op
BenchmarkRegexpReplace.testRegexpReplaceOld          .*ba.*  avgt    3  212.798 ±  5.929  us/op

If query processing is dominated by function call then effect on actual query performance is similar:

Benchmark                   (_numRows)                                                                                                                      (_query)  (_scenario)  Mode  Cnt   Score   Error  Units
BenchmarkQueriesMSQE.query     1500000  select * from 
(
  select RAW_STRING_COL
  from MyTable 
  limit 100000
) 
where regexp_like_const('.*a.*', RAW_STRING_COL )   EXP(0.001)  avgt    5  12.351 ± 1.591  ms/op

BenchmarkQueriesMSQE.query     1500000        select * from 
(
  select RAW_STRING_COL
  from MyTable 
  limit 100000
) 
where regexp_like('.*a.*', RAW_STRING_COL )   EXP(0.001)  avgt    5  42.298 ± 3.225  ms/op

NOTE: the reason I added _const function is that currently there is no way for engine to choose implementation based on function argument being constant or variable. If we change, e.g. regexp_replace, it will start returning wrong results if regular expression is variable, without raising an error or warning.

@codecov-commenter
Copy link

codecov-commenter commented Jan 17, 2025

Codecov Report

Attention: Patch coverage is 71.87500% with 36 lines in your changes missing coverage. Please review.

Project coverage is 63.71%. Comparing base (59551e4) to head (ef82ced).
Report is 1648 commits behind head on master.

Files with missing lines Patch % Lines
...ction/scalar/regexp/RegexpReplaceVarFunctions.java 0.00% 21 Missing ⚠️
...ion/scalar/regexp/RegexpReplaceConstFunctions.java 52.17% 9 Missing and 2 partials ⚠️
...common/function/scalar/string/StringFunctions.java 87.09% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #14833      +/-   ##
============================================
+ Coverage     61.75%   63.71%   +1.95%     
- Complexity      207     1471    +1264     
============================================
  Files          2436     2720     +284     
  Lines        133233   151998   +18765     
  Branches      20636    23471    +2835     
============================================
+ Hits          82274    96838   +14564     
- Misses        44911    47879    +2968     
- Partials       6048     7281    +1233     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) ⬆️
integration 100.00% <ø> (+99.99%) ⬆️
integration1 100.00% <ø> (+99.99%) ⬆️
integration2 0.00% <ø> (ø)
java-11 63.66% <71.87%> (+1.95%) ⬆️
java-21 63.60% <71.87%> (+1.98%) ⬆️
skip-bytebuffers-false 63.67% <71.87%> (+1.92%) ⬆️
skip-bytebuffers-true 63.59% <71.87%> (+35.86%) ⬆️
temurin 63.71% <71.87%> (+1.95%) ⬆️
unittests 63.70% <71.87%> (+1.95%) ⬆️
unittests1 56.21% <71.87%> (+9.32%) ⬆️
unittests2 34.03% <6.25%> (+6.30%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@gortiz gortiz requested a review from Jackie-Jiang January 17, 2025 18:34
@gortiz
Copy link
Contributor

gortiz commented Jan 17, 2025

I suggested that Bolek introduce new functions to compare performance differences easily. At the same time, it is not clear to us if users could run these queries with arguments that were not constant. SSE has a list of functions whose arguments must be literal, but it isn't easy to know if these are all.

@Jackie-Jiang, what do you think? Should we replace the older functions with the faster implementation suggested by Bolek?

@bziobrowski
Copy link
Contributor Author

I've checked calcite code and it seems there are at least two options:

  • keep only functions that assume patterns are static and change function signature to allow string literal only
  • keep all function variants but reimplement similar to PolymorphicBinaryArithmeticScalarFunction, with literal information being used to choose implementation at runtime

* @return trim spaces from left side of the string
*/
@ScalarFunction
public static String ltrim(String input) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is the new implementation of LTRIM better than this implementation ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new implementation shares the matcher, instead of allocating a new one per call.

@siddharthteotia
Copy link
Contributor

When we say "constant", are we referring to the (1) ability to distinguish an identifier vs literal OR (2) multiple invocations of the function with the same input which helps save the initialization cost ?

@siddharthteotia
Copy link
Contributor

I don't think there is a need to introduce separate functions for CONSTANT flavor. We should go down the polymorphism route.

@bziobrowski
Copy link
Contributor Author

By 'constant' I meant that function is called with literal as the pattern argument, and not e.g. column name. I think we could either implement it like polymorphic functions or add a init()/newInstance() method that'd allow choosing implementations based on actual arguments and get rid of conditional initialization in method body.

…ctions

# Conflicts:
#	pinot-perf/src/main/java/org/apache/pinot/perf/BenchmarkQueries.java
Copy link
Contributor

@gortiz gortiz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved, but please upgrade the PR description. Given clear performance increase and the fact that these functions must be used with literal arguments in SSE, we decided to change the implementation instead of creating the _const alternative

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants