Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement lazyframe profiling and optimization toggles #323

Merged
merged 24 commits into from
Aug 8, 2023
Merged
Show file tree
Hide file tree
Changes from 9 commits
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -97,5 +97,5 @@ Collate:
'translation.R'
'vctrs.R'
'zzz.R'
Config/rextendr/version: 0.3.1
Config/rextendr/version: 0.3.1.9000
VignetteBuilder: knitr
3 changes: 1 addition & 2 deletions R/expr__expr.R
Original file line number Diff line number Diff line change
Expand Up @@ -1121,8 +1121,7 @@ Expr_map_alias = function(fun) {
) {
assign(".warn_map_alias", 1L, envir = runtime_state)
# it does not seem map alias is executed multi-threaded but rather immediately during building lazy query
# if ever crashing, any lazy method like select, filter, with_columns must use something like handle_thread_r_requests()
# then handle_thread_r_requests should be rewritten to handle any type.
# if ever crashing, any lazy method like select, filter, with_columns must use something like filter_with_r_func_support()
message("map_alias function is experimentally without some thread-safeguards, please report any crashes") # TODO resolve
}
if (!is.function(fun)) pstop(err = "alias_map fun must be a function")
Expand Down
10 changes: 8 additions & 2 deletions R/extendr-wrappers.R
Original file line number Diff line number Diff line change
Expand Up @@ -53,10 +53,14 @@ dtype_str_repr <- function(dtype) .Call(wrap__dtype_str_repr, dtype)

test_robj_to_usize <- function(robj) .Call(wrap__test_robj_to_usize, robj)

test_robj_to_f64 <- function(robj) .Call(wrap__test_robj_to_f64, robj)

test_robj_to_i64 <- function(robj) .Call(wrap__test_robj_to_i64, robj)

test_robj_to_u32 <- function(robj) .Call(wrap__test_robj_to_u32, robj)

test_robj_to_i32 <- function(robj) .Call(wrap__test_robj_to_i32, robj)

test_print_string <- function(s) invisible(.Call(wrap__test_print_string, s))

RPolarsErr <- new.env(parent = emptyenv())
Expand Down Expand Up @@ -881,8 +885,6 @@ LazyFrame$collect_background <- function() .Call(wrap__LazyFrame__collect_backgr

LazyFrame$collect <- function() .Call(wrap__LazyFrame__collect, self)

LazyFrame$collect_handled <- function() .Call(wrap__LazyFrame__collect_handled, self)

LazyFrame$first <- function() .Call(wrap__LazyFrame__first, self)

LazyFrame$last <- function() .Call(wrap__LazyFrame__last, self)
Expand Down Expand Up @@ -949,6 +951,10 @@ LazyFrame$rename <- function(existing, new) .Call(wrap__LazyFrame__rename, self,

LazyFrame$schema <- function() .Call(wrap__LazyFrame__schema, self)

LazyFrame$optimization_toggle <- function(type_coercion, predicate_pushdown, projection_pushdown, simplify_expr, slice_pushdown, cse, streaming) .Call(wrap__LazyFrame__optimization_toggle, self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expr, slice_pushdown, cse, streaming)

LazyFrame$profile <- function() .Call(wrap__LazyFrame__profile, self)

LazyFrame$explode <- function(columns, dotdotdot_args) .Call(wrap__LazyFrame__explode, self, columns, dotdotdot_args)

#' @export
Expand Down
95 changes: 91 additions & 4 deletions R/lazyframe__lazy.R
Original file line number Diff line number Diff line change
Expand Up @@ -265,15 +265,71 @@ LazyFrame_filter = "use_extendr_wrapper"

#' @title New DataFrame from LazyFrame_object$collect()
#' @description collect DataFrame by lazy query
#' @param type_coercion Boolean. Do type coercion optimization.
#' @param predicate_pushdown Boolean. Do predicate pushdown optimization.
#' @param projection_pushdown Boolean. Do projection pushdown optimization.
#' @param simplify_expression Boolean. Run simplify expressions optimization.
#' @param no_optimization Boolean. Turn off (certain) optimizations.
#' @param slice_pushdown Boolean. Slice pushdown optimization.
sorhawell marked this conversation as resolved.
Show resolved Hide resolved
#' @param common_subplan_elimination Boolean. Will try to cache branching subplans that occur on
#' self-joins or unions.
#' @param streaming Boolean. Run parts of the query in a streaming fashion
#' (this is in an alpha state)
#' @param collect_in_background Boolean. Detach this query from R session. Computation will start
#' in background. Get a handle which later can be converted into the resulting DataFrame. Useful
#' in interactive mode to not lock R session.
#' @details
#' Note: use `$fetch()` if you want to run your query on the first `n` rows only.
sorhawell marked this conversation as resolved.
Show resolved Hide resolved
#' This can be a huge time saver in debugging queries.
#' @keywords LazyFrame DataFrame_new
#' @return collected `DataFrame`
#' @return collected `DataFrame` or if colkect
sorhawell marked this conversation as resolved.
Show resolved Hide resolved
#' @examples pl$DataFrame(iris)$lazy()$filter(pl$col("Species") == "setosa")$collect()
sorhawell marked this conversation as resolved.
Show resolved Hide resolved
LazyFrame_collect = function() {
unwrap(.pr$LazyFrame$collect_handled(self), "in $collect():")
LazyFrame_collect = function(
type_coercion = TRUE,
predicate_pushdown = TRUE,
projection_pushdown = TRUE,
simplify_expression = TRUE,
no_optimization = FALSE,
slice_pushdown = TRUE,
common_subplan_elimination = TRUE,
streaming = FALSE,
collect_in_background = FALSE
) {

if (isTRUE(no_optimization)) {
predicate_pushdown = FALSE
projection_pushdown = FALSE
slice_pushdown = FALSE
common_subplan_elimination = FALSE
}
sorhawell marked this conversation as resolved.
Show resolved Hide resolved

if (isTRUE(streaming)) {
common_subplan_elimination = FALSE
}

collect_f = if( isTRUE(collect_in_background)) {
sorhawell marked this conversation as resolved.
Show resolved Hide resolved
.pr$LazyFrame$collect_background
} else {
.pr$LazyFrame$collect
}

self |>
.pr$LazyFrame$optimization_toggle(
type_coercion,
predicate_pushdown,
projection_pushdown,
simplify_expression,
slice_pushdown,
common_subplan_elimination,
streaming
) |>
and_then(collect_f) |>
unwrap("in $collect():")
}

#' @title New DataFrame from LazyFrame_object$collect()
#' @description collect DataFrame by lazy query
#' @description collect DataFrame by lazy query (SOFT DEPRECATED)
#' @details This function is soft deprecated. Use $collect(collect_in_background = TRUE) instead
sorhawell marked this conversation as resolved.
Show resolved Hide resolved
#' @keywords LazyFrame DataFrame_new
#' @return collected `DataFrame`
#' @examples pl$DataFrame(iris)$lazy()$filter(pl$col("Species") == "setosa")$collect()
Expand Down Expand Up @@ -926,6 +982,37 @@ LazyFrame_dtypes = method_as_property(function() {
unwrap("in $dtypes()")
})

#' @title Collect and profile a lazy query.
#' @description This will run the query and return a tuple containing the materialized DataFrame and
sorhawell marked this conversation as resolved.
Show resolved Hide resolved
#' a DataFrame that contains profiling information of each node that is executed.
#' @details The units of the timings are microseconds.
#'
#' @keywords LazyFrame
#' @return List of two DataFrames, (collected result, profile stats)
sorhawell marked this conversation as resolved.
Show resolved Hide resolved
#' @examples
#'
#' #Use $profile() to compare two queries
etiennebacher marked this conversation as resolved.
Show resolved Hide resolved
#'
#' # print one '.', take a series convert to r vector, take first value, add 5
#' r_func = \(s) {cat(".");s$to_r()[1] + 5}
sorhawell marked this conversation as resolved.
Show resolved Hide resolved
#'
#' # map each Species-group of each numeric column with an R function, takes ~7000us slow !
sorhawell marked this conversation as resolved.
Show resolved Hide resolved
#' pl$LazyFrame(iris)$
#' sort("Sepal.Length")$ #for no specific reason
sorhawell marked this conversation as resolved.
Show resolved Hide resolved
#' groupby("Species", maintain_order = TRUE)$
#' agg(pl$col(pl$Float64)$apply(r_func))$
#' profile()
#'
#' # map each Species-group with native polars, takes ~120us better
sorhawell marked this conversation as resolved.
Show resolved Hide resolved
#' pl$LazyFrame(iris)$
#' sort("Sepal.Length")$
#' groupby("Species", maintain_order = TRUE)$
#' agg(pl$col(pl$Float64)$first() + 5 )$
#' profile()
LazyFrame_profile = function() {
.pr$LazyFrame$profile(self) |> unwrap("in $profile()")
}

#' @title Explode the DataFrame to long format by exploding the given columns
#' @keywords LazyFrame
#'
Expand Down
41 changes: 39 additions & 2 deletions man/LazyFrame_collect.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

5 changes: 4 additions & 1 deletion man/LazyFrame_collect_background.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

40 changes: 40 additions & 0 deletions man/LazyFrame_profile.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

8 changes: 4 additions & 4 deletions man/nanoarrow.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions src/rust/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -80,6 +80,7 @@ features = [
"repeat_by",
"interpolate",
#"list",
"cse",
"ewma",
"rank",
"diff",
Expand Down
Loading