OptimizeIn is a logical optimization that transforms logical plans with In predicate expressions as follows:
- Replaces an In expression that has an empty list and a non-nullable value expression with false
- Eliminates duplicate Literal expressions in an In predicate expression that is inSetConvertible
- Replaces an In predicate expression that is inSetConvertible with an InSet expression when the number of literal expressions in the list is greater than the spark.sql.optimizer.inSetConversionThreshold internal configuration property (default: 10)
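The three transformations above can be sketched in plain Scala without Spark on the classpath. This is a simplified model under stated assumptions, not the Catalyst implementation: the types Attr, Lit, In, InSet and FalseLit, the object OptimizeInSketch and the threshold parameter are all hypothetical stand-ins that only mirror the names used in the text.

```scala
object OptimizeInSketch {
  // Hypothetical stand-ins for Catalyst expressions (not Spark classes)
  sealed trait Expr
  final case class Attr(name: String, nullable: Boolean) extends Expr
  final case class Lit(value: Any) extends Expr
  final case class In(value: Expr, list: Seq[Expr]) extends Expr {
    // Models the idea of inSetConvertible: every list element is a literal
    def inSetConvertible: Boolean = list.forall(_.isInstanceOf[Lit])
  }
  final case class InSet(value: Expr, hset: Set[Any]) extends Expr
  case object FalseLit extends Expr

  // threshold models spark.sql.optimizer.inSetConversionThreshold (default: 10)
  def optimizeIn(expr: Expr, threshold: Int = 10): Expr = expr match {
    // 1. Empty list and a non-nullable value expression: replace with false
    case In(Attr(_, false), list) if list.isEmpty => FalseLit

    case in @ In(v, list) if list.nonEmpty && in.inSetConvertible =>
      // 2. Drop duplicate literals, keeping first occurrences in order
      val distinct = list.distinct
      if (distinct.length > threshold) {
        // 3. More literals than the threshold: switch to a set-based lookup
        val values: Set[Any] = distinct.collect { case Lit(x) => x }.toSet
        InSet(v, values)
      } else if (distinct.length < list.length) {
        in.copy(list = distinct)
      } else {
        in
      }

    // Anything else is left unchanged in this model
    case other => other
  }
}
```

In this model, calling OptimizeInSketch.optimizeIn with threshold = 0 on a three-literal In yields an InSet, which corresponds to what the demo does by setting spark.sql.optimizer.inSetConversionThreshold to 0.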
OptimizeIn is part of the Operator Optimizations batch in the standard batches of the base Spark Optimizer.
Technically, OptimizeIn is a Catalyst rule for transforming logical plans, i.e. Rule[LogicalPlan].
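To make the idea of a rule as a plan-to-plan function concrete, here is a minimal sketch in plain Scala. It is an illustration only: Plan, Rule, UpperCaseRule, execute and the enclosing RuleSketch object are hypothetical, simplified names, and the real Catalyst Rule class and rule batches carry more machinery than shown here.

```scala
object RuleSketch {
  // Hypothetical, simplified stand-in for a logical plan node
  final case class Plan(description: String)

  // Simplified model of a rule: a function from a plan to a plan
  abstract class Rule[T] {
    def apply(plan: T): T
  }

  // A toy rule that only rewrites the description of a plan
  object UpperCaseRule extends Rule[Plan] {
    def apply(plan: Plan): Plan =
      plan.copy(description = plan.description.toUpperCase)
  }

  // Applies the rules one after another, feeding each result to the next rule.
  // This models a single pass over a group of rules and nothing more.
  def execute[T](plan: T, rules: Seq[Rule[T]]): T =
    rules.foldLeft(plan)((current, rule) => rule(current))
}
```

Because a rule exposes apply, it can be invoked like a function, which is why the demo can write OptimizeIn(plan) directly.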
// Use Catalyst DSL to define a logical plan
// HACK: Disable symbolToColumn implicit conversion
// It is imported automatically in spark-shell (and makes demos impossible)
// implicit def symbolToColumn(s: Symbol): org.apache.spark.sql.ColumnName
trait ThatWasABadIdea
implicit def symbolToColumn(ack: ThatWasABadIdea) = ack
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
val rel = LocalRelation('a.int, 'b.int, 'c.int)
import org.apache.spark.sql.catalyst.expressions.{In, Literal}
val plan = rel
  .where(In('a, Seq[Literal](1, 2, 3)))
  .analyze
scala> println(plan.numberedTreeString)
00 Filter a#6 IN (1,2,3)
01 +- LocalRelation <empty>, [a#6, b#7, c#8]
// In --> InSet
// Lower the threshold to 0 so that even a 3-literal list exceeds it
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", 0)
import org.apache.spark.sql.catalyst.optimizer.OptimizeIn
val optimizedPlan = OptimizeIn(plan)
scala> println(optimizedPlan.numberedTreeString)
00 Filter a#6 INSET (1,2,3)
01 +- LocalRelation <empty>, [a#6, b#7, c#8]
apply(plan: LogicalPlan): LogicalPlan
Note: apply is part of the Rule Contract to apply a rule to a TreeNode, e.g. a logical plan.
apply …FIXME