
selective pushdown #66

Open
bolcman opened this issue Feb 23, 2021 · 4 comments

Comments

@bolcman

bolcman commented Feb 23, 2021

Hi,

Is there a way we can do a selective pushdown?
For example, can we specify the list of query types that may be considered for the pushdown logic?
In some cases we want to push down only the filtering and do the joins, or the rank functions, on the Spark side.

Thanks,
Aleks

@carlsverre
Contributor

Hi Aleks - that's an interesting feature request. Why would you want to push down only part of a query shape? Our pushdown system is designed to return the same results with or without pushdown, and in every situation we are aware of the pushed-down query is faster. Is there a specific query shape for which this is not true?

@bolcman
Author

bolcman commented Mar 2, 2021

Hi Carl,

There are actually a few reasons for this:

  • it increases transparency for the end user,
  • it gives you manual control over which queries are pushed down, depending on the nature of the job,
  • it reduces the load on the SingleStore cluster in some cases,
  • it gives you the ability to enable full parallel-read functionality alongside particular pushdown queries in some complex scenarios,
  • if you have a specific join expression that is not really 'typical', then instead of changing indices and table schemas,
    you can push down only the filters, for example, repartition in Spark, and try to do the join there; that gives you more flexibility,
  • also, sometimes it's not only about speed. We run different types of jobs,
    and in some cases we don't really care about execution time: we want to push down the filtering, for example, and do the rest in Spark. This functionality would help us balance resources more easily in favor of the "more important" jobs, which we want to run in complete pushdown mode. Right now you can create a job with multiple sessions and, by combining them (with and without pushdown), accomplish something like this, but it adds complexity on the coding side; see the sketch below.
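
Roughly, the multi-session workaround looks like this (a minimal sketch; the table names are illustrative, and I'm assuming the connector's disablePushdown option is picked up per SparkSession from the spark.datasource.singlestore.* runtime config, as in the singlestore-spark-connector README):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("selective-pushdown").getOrCreate()

// Session with pushdown enabled (the default) for jobs that should run fully in SingleStore.
val pushdownSession = spark.newSession()
pushdownSession.conf.set("spark.datasource.singlestore.disablePushdown", "false")

// Session with pushdown disabled, so joins and rank functions run on the Spark side.
val noPushdownSession = spark.newSession()
noPushdownSession.conf.set("spark.datasource.singlestore.disablePushdown", "true")

// Same table read through each session; "db.books" is an illustrative name.
val booksPushed = pushdownSession.read.format("singlestore").load("db.books")
val booksLocal  = noPushdownSession.read.format("singlestore").load("db.books")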

Thanks,
Aleks

@AdalbertMemSQL
Collaborator

Hi Aleks,

Sorry for the late reply.
It is possible to get a selective pushdown by introducing an expression that can't be pushed down.
For example, you can use .cache() before the operation that should be executed on the Spark side:
bookDS.where(col("writer_name") === "John").cache().join(writerDS, bookDS("writer_id") === writerDS("writer_id"), "inner")
Here the final join won't be pushed down.
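
A more complete, self-contained version of the same trick (a sketch; the database and table paths are illustrative):

import org.apache.spark.sql.functions.col

val bookDS   = spark.read.format("singlestore").load("db.books")
val writerDS = spark.read.format("singlestore").load("db.writers")

// The filter can still be pushed down to SingleStore, but .cache() asks Spark to
// materialize the filtered result, so the planner stops rewriting past this point
// and the join below is executed on the Spark side.
val filtered = bookDS.where(col("writer_name") === "John").cache()
val joined   = filtered.join(writerDS, filtered("writer_id") === writerDS("writer_id"), "inner")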

Is this solution sufficient for your case?

@bolcman
Author

bolcman commented Sep 26, 2022

Hi,

Thanks for the reply!
This is a good starting point.
Is there any way I can push down a DataFrame that I have in memory as a filter, without materializing it?
For example, it would be really great if I could push down bookDS("writer_id") as a filter to SingleStore. I just want to avoid
collect() and isin(array) in this case.
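
For context, the pattern I'm trying to avoid looks roughly like this (a sketch; column names and types are illustrative):

import org.apache.spark.sql.functions.col

// Materialize the ids on the driver ...
val writerIds = bookDS.select("writer_id").distinct()
  .collect()
  .map(_.getLong(0))   // assumes writer_id is a BIGINT; illustrative

// ... then push them back to SingleStore as an IN-list filter.
// This round-trips through the driver, which is what I'd like to avoid.
val filteredWriters = writerDS.where(col("writer_id").isin(writerIds: _*))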

Thanks,
Aleks
