Expression
is an executable node (in a Catalyst tree) that can evaluate a result value given input values, i.e. it can produce a JVM object per InternalRow.
Note
|
Expression is often called a Catalyst expression even though it is merely built using (not being part of) the Catalyst — Tree Manipulation Framework.
|
// evaluating an expression
// Use Literal expression to create an expression from a Scala object
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.expressions.Literal
val e: Expression = Literal("hello")
import org.apache.spark.sql.catalyst.expressions.EmptyRow
val v: Any = e.eval(EmptyRow)
// Convert to Scala's String
import org.apache.spark.unsafe.types.UTF8String
scala> val s = v.asInstanceOf[UTF8String].toString
s: String = hello
Expression
can generate Java source code that is then used in evaluation.
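The following is a minimal sketch of requesting that code through genCode; the Add and Literal expressions and the fresh CodegenContext are used purely for illustration.
// generate the Java source code that evaluates Add(1, 2)
import org.apache.spark.sql.catalyst.expressions.{Add, Literal}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext
val ctx = new CodegenContext
// genCode gives an ExprCode with the Java code and the names of the result variables
val ec = Add(Literal(1), Literal(2)).genCode(ctx)
println(ec.code)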
Expression
is deterministic when it always evaluates the same result for the same inputs. By default, an expression is deterministic if all its child expressions are (which is trivially true for leaf expressions, as they have no child expressions).
Note
|
A deterministic expression is like a pure function in functional programming languages. |
scala> spark.version
res0: String = 2.3.0
val e = $"a".expr
scala> :type e
org.apache.spark.sql.catalyst.expressions.Expression
scala> println(e.deterministic)
true
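For contrast, here is a sketch with a non-deterministic expression (the rand standard function is backed by the Rand expression, which mixes in Nondeterministic):
import org.apache.spark.sql.functions.rand
val ne = rand().expr
scala> println(ne.deterministic)
false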
verboseString
is…FIXME
Name | Scala Kind | Behaviour | Examples |
---|---|---|---|
BinaryExpression | abstract class | | |
CodegenFallback | trait | Does not support code generation and falls back to interpreted mode | |
ExpectsInputTypes | trait | | |
ExtractValue | trait | Marks… | |
LeafExpression | abstract class | Has no child expressions (and hence "terminates" the expression tree). | |
NamedExpression | trait | Can later be referenced in a dataflow graph. | |
Nondeterministic | trait | | |
NonSQLExpression | trait | Expression with no SQL representation. Gives the only custom sql method that is non-overridable (i.e. final). When requested the SQL representation, NonSQLExpression replaces Attributes with PrettyAttributes to build the text representation | |
Predicate | trait | | |
TernaryExpression | abstract class | | |
TimeZoneAwareExpression | trait | Timezone-aware expressions | |
UnaryExpression | abstract class | | |
Unevaluable | trait | Cannot be evaluated, i.e. eval and doGenCode are not supported and report an UnsupportedOperationException | |
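As a small sketch of the Unevaluable behaviour above, UnresolvedAttribute (an Unevaluable expression, chosen purely for illustration) refuses interpreted evaluation:
// UnresolvedAttribute mixes in Unevaluable, so eval is not supported
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
val ua = UnresolvedAttribute("a")
// ua.eval()  // throws java.lang.UnsupportedOperationException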
package org.apache.spark.sql.catalyst.expressions

abstract class Expression extends TreeNode[Expression] {
  // only required methods that have no implementation
  def dataType: DataType
  def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode
  def eval(input: InternalRow = EmptyRow): Any
  def nullable: Boolean
}
Method | Description |
---|---|
dataType | Data type of the result of evaluating an expression |
doGenCode | Code-generated expression evaluation that generates Java source code (that is used to evaluate the expression in a more optimized way, not directly using eval). Used when Expression is requested to genCode |
eval | Interpreted (non-code-generated) expression evaluation that evaluates an expression to a JVM object for a given internal binary row (without generating corresponding Java code) |
genCode | Generates the Java source code for code-generated (non-interpreted) expression evaluation (on an input internal row, in a more optimized way, not directly using eval). Similar to doGenCode but supports expression reuse (aka subexpression elimination) |
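A short sketch of eval against a concrete InternalRow follows; BoundReference, Add and Literal are used purely for illustration.
// evaluate `input[0] + 1` for an internal row with a single int column
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Add, BoundReference, Literal}
import org.apache.spark.sql.types.IntegerType
val expr = Add(BoundReference(0, IntegerType, nullable = false), Literal(1))
scala> expr.eval(InternalRow(41))
res1: Any = 42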
reduceCodeSize(ctx: CodegenContext, eval: ExprCode): Unit

reduceCodeSize does its work only when all of the following are met:
- Length of the generated code is above 1024
- INPUT_ROW of the input CodegenContext is defined
- currentVars of the input CodegenContext is not defined
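The two CodegenContext properties can be inspected on a freshly-created context (a sketch for illustration only):
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext
val ctx = new CodegenContext
// a fresh CodegenContext has INPUT_ROW defined (defaults to "i")
// while currentVars is not defined (defaults to null)
println(ctx.INPUT_ROW)
println(ctx.currentVars)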
Caution
|
FIXME When would the above not be met? What’s so special about such an expression? |
reduceCodeSize sets the value of the input ExprCode to the fresh term name for the value name.
In the end, reduceCodeSize sets the code of the input ExprCode to the following:
[javaType] [newValue] = [funcFullName]([INPUT_ROW]);
The funcFullName is the fresh term name for the name of the current expression node.
Tip
|
Use the expression node name to search for the function that corresponds to the expression in the generated code. |
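One way to get at the generated code to search through is the debugCodegen facility (a sketch; it assumes a SparkSession available as spark, e.g. in spark-shell):
// print the whole-stage-generated Java code of a simple query
import org.apache.spark.sql.execution.debug._
spark.range(1).selectExpr("id + 1").debugCodegen()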
Note
|
reduceCodeSize is used exclusively when Expression is requested to generate the Java source code for code-generated expression evaluation.
|
flatArguments: Iterator[Any]
flatArguments
…FIXME
Note
|
flatArguments is used when…FIXME
|
sql: String
sql gives a SQL representation.
Internally, sql gives a text representation with prettyName followed by the sql of the children in round brackets, concatenated using a comma (,).
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.expressions.Sentences
val sentences = Sentences("Hi there! Good morning.", "en", "US")
import org.apache.spark.sql.catalyst.expressions.Expression
val expr: Expression = count("*") === 5 && count(sentences) === 5
scala> expr.sql
res0: String = ((count('*') = 5) AND (count(sentences('Hi there! Good morning.', 'en', 'US')) = 5))
Note
|
sql is used when…FIXME
|