Untyped transformations are part of the Dataset API for transforming a Dataset into a DataFrame, a Column, a RelationalGroupedDataset, a DataFrameNaFunctions or a DataFrameStatFunctions (and hence untyped: the result no longer carries the record type T).
Note: Untyped transformations are the methods in the Dataset Scala class that are grouped under the untypedrel group name, i.e. @group untypedrel.
Transformation | Description |
---|---|
agg | `agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame`<br>`agg(exprs: Map[String, String]): DataFrame`<br>`agg(exprs: java.util.Map[String, String]): DataFrame`<br>`agg(expr: Column, exprs: Column*): DataFrame` |
apply | Selects a column based on the column name (i.e. maps a Dataset onto a Column)<br>`apply(colName: String): Column` |
checkpoint | Reliably checkpointing a Dataset<br>`checkpoint(): Dataset[T]`<br>`checkpoint(eager: Boolean): Dataset[T]` |
col | Selects a column based on the column name (i.e. maps a Dataset onto a Column)<br>`col(colName: String): Column` |
colRegex | (New in 2.3.0) Selects a column based on the column name specified as a regex (i.e. maps a Dataset onto a Column)<br>`colRegex(colName: String): Column` |
crossJoin | `crossJoin(right: Dataset[_]): DataFrame` |
cube | `cube(cols: Column*): RelationalGroupedDataset`<br>`cube(col1: String, cols: String*): RelationalGroupedDataset` |
drop | `drop(colName: String): DataFrame`<br>`drop(colNames: String*): DataFrame`<br>`drop(col: Column): DataFrame` |
groupBy | `groupBy(cols: Column*): RelationalGroupedDataset`<br>`groupBy(col1: String, cols: String*): RelationalGroupedDataset` |
join | `join(right: Dataset[_]): DataFrame`<br>`join(right: Dataset[_], usingColumn: String): DataFrame`<br>`join(right: Dataset[_], usingColumns: Seq[String]): DataFrame`<br>`join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame`<br>`join(right: Dataset[_], joinExprs: Column): DataFrame`<br>`join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame` |
localCheckpoint | (New in 2.3.0) Locally (and unreliably) checkpointing a Dataset<br>`localCheckpoint(): Dataset[T]`<br>`localCheckpoint(eager: Boolean): Dataset[T]` |
na | `na: DataFrameNaFunctions` |
rollup | `rollup(cols: Column*): RelationalGroupedDataset`<br>`rollup(col1: String, cols: String*): RelationalGroupedDataset` |
select | `select(cols: Column*): DataFrame`<br>`select(col: String, cols: String*): DataFrame` |
selectExpr | `selectExpr(exprs: String*): DataFrame` |
stat | `stat: DataFrameStatFunctions` |
toDF | `toDF(): DataFrame`<br>`toDF(colNames: String*): DataFrame` |
withColumn | `withColumn(colName: String, col: Column): DataFrame` |
withColumnRenamed | `withColumnRenamed(existingName: String, newName: String): DataFrame` |
agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame
agg(exprs: Map[String, String]): DataFrame
agg(exprs: java.util.Map[String, String]): DataFrame
agg(expr: Column, exprs: Column*): DataFrame
agg …FIXME
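For illustration, a minimal sketch of the agg overloads (assuming a spark-shell session, so spark and spark.implicits._ are in scope; the column and data are made up):

import org.apache.spark.sql.functions.{avg, max}
val nums = spark.range(5).toDF("id")
nums.agg("id" -> "max").show           // (column name, aggregate function name) pair
nums.agg(Map("id" -> "avg")).show      // Map of column name to aggregate function name
nums.agg(max($"id"), avg($"id")).show  // Column-based expressions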
apply(colName: String): Column
apply selects a column based on the column name (i.e. maps a Dataset onto a Column).
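For illustration (a sketch assuming a spark-shell session), apply is what Scala invokes when a Dataset is applied to a column name:

val ds = spark.range(5)
val idCol = ds("id")   // syntactic sugar for ds.apply("id")
ds.select(idCol).show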
checkpoint(): Dataset[T] // (1)
checkpoint(eager: Boolean): Dataset[T] // (2)
1. eager and reliableCheckpoint flags enabled
2. reliableCheckpoint flag enabled
Note: checkpoint is an experimental operator and the API is evolving towards becoming stable.
checkpoint simply requests the Dataset to checkpoint with the given eager flag and the reliableCheckpoint flag enabled.
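A minimal sketch (assuming a spark-shell session; the checkpoint directory below is made up, and checkpoint fails without one being set):

spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")
val nums = spark.range(5)
val checkpointed = nums.checkpoint()   // eager by default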
col(colName: String): Column
col selects a column based on the column name (i.e. maps a Dataset onto a Column).
Internally, col branches off per the input column name.

If the column name is * (a star), col simply creates a Column with a ResolvedStar expression (with the schema output attributes of the analyzed logical plan of the QueryExecution).

Otherwise, col uses the colRegex untyped transformation when the spark.sql.parser.quotedRegexColumnNames configuration property is enabled.

When the column name is not * and the spark.sql.parser.quotedRegexColumnNames configuration property is disabled, col creates a Column with the column name resolved (as a NamedExpression).
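A short sketch of the non-regex branches (assuming a spark-shell session):

val ds = spark.range(5)
ds.col("id")   // a regular column name, resolved as a NamedExpression
ds.col("*")    // a star, i.e. a Column with a ResolvedStar expression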
colRegex(colName: String): Column
(New in 2.3.0) colRegex selects a column based on the column name specified as a regex (i.e. maps a Dataset onto a Column).
Note: colRegex is used in col when the spark.sql.parser.quotedRegexColumnNames configuration property is enabled (and the column name is not *).
Internally, colRegex matches the input column name against different regular expressions (in the following order):

- For column names with quotes and no qualifier, colRegex simply creates a Column with an UnresolvedRegex (with no table)
- For column names with quotes and a qualifier, colRegex simply creates a Column with an UnresolvedRegex (with a table specified)
- For other column names, colRegex (behaves like col and) creates a Column with the column name resolved (as a NamedExpression)
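A minimal sketch of the first case (assuming a spark-shell session; the column names and the regex are made up):

import spark.implicits._
val df = spark.range(5).withColumn("id_copy", $"id")
df.select(df.colRegex("`id.*`")).show   // quoted regex without a qualifier, matches both columns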
cube(cols: Column*): RelationalGroupedDataset
cube(col1: String, cols: String*): RelationalGroupedDataset
cube …FIXME
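For illustration only, a sketch with made-up data (assuming a spark-shell session); cube computes aggregates for every combination of the grouping columns, including subtotals and the grand total:

import org.apache.spark.sql.functions.sum
import spark.implicits._
val sales = Seq(("US", "A", 10), ("US", "B", 20), ("PL", "A", 5)).toDF("country", "product", "amount")
sales.cube("country", "product").agg(sum("amount")).show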
drop(colName: String): DataFrame
drop(colNames: String*): DataFrame
drop(col: Column): DataFrame
drop …FIXME
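A minimal sketch of the three variants (assuming a spark-shell session; the column names are made up):

import org.apache.spark.sql.functions.lit
val df = spark.range(5).withColumn("extra", lit(1))
df.drop("extra").printSchema           // by column name
df.drop(df("extra")).printSchema       // by Column
df.drop("extra", "bogus").printSchema  // names that do not exist are silently ignored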
groupBy(cols: Column*): RelationalGroupedDataset
groupBy(col1: String, cols: String*): RelationalGroupedDataset
groupBy …FIXME
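A minimal sketch of both variants (assuming a spark-shell session; data is made up):

import org.apache.spark.sql.functions.sum
import spark.implicits._
val sales = Seq(("US", 10), ("US", 20), ("PL", 5)).toDF("country", "amount")
sales.groupBy("country").agg(sum("amount")).show  // String variant
sales.groupBy($"country").count().show            // Column variant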
join(right: Dataset[_]): DataFrame
join(right: Dataset[_], usingColumn: String): DataFrame
join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
join(right: Dataset[_], joinExprs: Column): DataFrame
join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
join …FIXME
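A minimal sketch of the main variants (assuming a spark-shell session; data is made up):

import spark.implicits._
val left = Seq((0, "zero"), (1, "one")).toDF("id", "name")
val right = Seq((0, "low"), (2, "high")).toDF("id", "level")
left.join(right, "id").show                        // USING column, inner join
left.join(right, Seq("id"), "left_outer").show     // USING columns with an explicit join type
left.join(right, left("id") === right("id")).show  // join expression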
localCheckpoint(): Dataset[T] // (1)
localCheckpoint(eager: Boolean): Dataset[T]
1. eager flag enabled

(New in 2.3.0) localCheckpoint simply uses the Dataset.checkpoint operator with the input eager flag and the reliableCheckpoint flag disabled (false).
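A minimal sketch (assuming a spark-shell session); unlike checkpoint, no checkpoint directory is needed since the blocks are stored with the executors:

val nums = spark.range(5)
val locallyCheckpointed = nums.localCheckpoint()   // eager by default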
na: DataFrameNaFunctions
na simply creates a DataFrameNaFunctions to work with missing data.
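A minimal sketch (assuming a spark-shell session; data is made up):

import spark.implicits._
val df = Seq(Some(1), None).toDF("n")
df.na.fill(0).show   // replace nulls with a constant
df.na.drop().show    // drop rows that contain nulls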
rollup(cols: Column*): RelationalGroupedDataset
rollup(col1: String, cols: String*): RelationalGroupedDataset
rollup …FIXME
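A minimal sketch (assuming a spark-shell session; data is made up); unlike cube, rollup only computes the hierarchical subtotals, from the first grouping column up to the grand total:

import org.apache.spark.sql.functions.sum
import spark.implicits._
val sales = Seq(("US", "A", 10), ("PL", "A", 5)).toDF("country", "product", "amount")
sales.rollup("country", "product").agg(sum("amount")).show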
select(cols: Column*): DataFrame
select(col: String, cols: String*): DataFrame
select …FIXME
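A minimal sketch of both variants (assuming a spark-shell session):

import spark.implicits._
val ds = spark.range(5)
ds.select("id").show                              // String variant
ds.select($"id", ($"id" * 2) as "doubled").show   // Column variant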
selectExpr(exprs: String*): DataFrame
selectExpr is like select, but accepts SQL expressions.
val ds = spark.range(5)
scala> ds.selectExpr("rand() as random").show
16/04/14 23:16:06 INFO HiveSqlParser: Parsing command: rand() as random
+-------------------+
| random|
+-------------------+
| 0.887675894185651|
|0.36766085091074086|
| 0.2700020856675186|
| 0.1489033635529543|
| 0.5862990791950973|
+-------------------+
Internally, it executes select with every expression in exprs mapped to a Column (using SparkSqlParser.parseExpression).
scala> ds.select(expr("rand() as random")).show
+------------------+
| random|
+------------------+
|0.5514319279894851|
|0.2876221510433741|
|0.4599999092045741|
|0.5708558868374893|
|0.6223314406247136|
+------------------+
stat: DataFrameStatFunctions
stat simply creates a DataFrameStatFunctions to work with statistic functions.
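A minimal sketch (assuming a spark-shell session; data is made up):

import spark.implicits._
val df = Seq((1, 10.0), (2, 20.0), (3, 30.0)).toDF("x", "y")
df.stat.corr("x", "y")                        // Pearson correlation coefficient
df.stat.approxQuantile("y", Array(0.5), 0.0)  // median (exact since the relative error is 0)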
toDF(): DataFrame
toDF(colNames: String*): DataFrame
Internally, the empty-argument toDF creates a Dataset[Row] using the Dataset's SparkSession and QueryExecution (with the encoder being RowEncoder).
Caution: FIXME Describe toDF(colNames: String*)
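A minimal sketch of both variants (assuming a spark-shell session):

val ds = spark.range(3)      // Dataset[Long]
val df = ds.toDF()           // DataFrame, i.e. Dataset[Row]
val renamed = ds.toDF("n")   // converts and renames the column in one step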
withColumn(colName: String, col: Column): DataFrame
withColumn …FIXME
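A minimal sketch (assuming a spark-shell session); using the name of an existing column replaces it:

import spark.implicits._
val df = spark.range(3).withColumn("doubled", $"id" * 2)
df.withColumn("doubled", $"id" * 3).show   // same name, new expression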