
selective pushdown #66

Open
bolcman opened this issue Feb 23, 2021 · 4 comments

Comments

@bolcman

bolcman commented Feb 23, 2021

Hi,

Is there a way we can do a selective pushdown?
For example, can we specify the list of query types that may be considered for the pushdown logic?
In some cases we want to push down only the filtering and do the joins, or the rank functions, on the Spark side.

Thanks,
Aleks

@carlsverre
Contributor

Hi Aleks - that's an interesting feature request. Why would you want to push down only part of a query shape? Our pushdown system is designed to return the same results with or without pushdown, and in every situation we are aware of the pushed-down query is faster. Is there a specific query shape for which this is not true?

@bolcman
Author

bolcman commented Mar 2, 2021

Hi Carl,

There are actually a few reasons for this:

  • it increases transparency for the end user,
  • it gives you manual control over which queries are pushed down, depending on the nature of the job,
  • it reduces the load on the SingleStore cluster in some cases,
  • it gives you the ability to enable full parallel-read functionality alongside particular pushdown queries in some complex scenarios,
  • if you have a specific join expression that is not really 'typical', then instead of changing indices and table schemas,
    you can push down only the filters, for example, repartition in Spark, and try to do the join there; that gives you more flexibility,
  • also, sometimes it's not only about speed. We run different types of jobs,
    and in some cases we don't really care about execution time: we want to push down the filtering, for example, and do the rest in Spark. This functionality would help us balance resources more easily in favor of the "more important" jobs, which we want to run in complete pushdown mode. Right now you can create a job with multiple sessions and, by combining them (with and without pushdown), accomplish something like this, but it adds complexity on the coding side; see the sketch below.
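
Roughly, the multi-session workaround looks like this (a minimal sketch; the table names are illustrative, and I'm assuming the connector's disablePushdown option is picked up per SparkSession from the spark.datasource.singlestore.* runtime config, as in the singlestore-spark-connector README):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("selective-pushdown").getOrCreate()

// Session with pushdown enabled (the default) for jobs that should run fully in SingleStore.
val pushdownSession = spark.newSession()
pushdownSession.conf.set("spark.datasource.singlestore.disablePushdown", "false")

// Session with pushdown disabled, so joins and rank functions run on the Spark side.
val noPushdownSession = spark.newSession()
noPushdownSession.conf.set("spark.datasource.singlestore.disablePushdown", "true")

// Same table read through each session; "db.books" is an illustrative name.
val booksPushed = pushdownSession.read.format("singlestore").load("db.books")
val booksLocal  = noPushdownSession.read.format("singlestore").load("db.books")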

Thanks,
Aleks

@AdalbertMemSQL
Collaborator

Hi Aleks,

Sorry for the late reply.
It is possible to get a selective pushdown by introducing an expression that can't be pushed down.
For example, you can use .cache() before the operation that should be executed on the Spark side:
bookDS.where(col("writer_name") === "John").cache().join(writerDS, bookDS("writer_id") === writerDS("writer_id"), "inner")
Here the final join won't be pushed down.
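
A more complete, self-contained version of the same trick (a sketch; the database and table paths are illustrative):

import org.apache.spark.sql.functions.col

val bookDS   = spark.read.format("singlestore").load("db.books")
val writerDS = spark.read.format("singlestore").load("db.writers")

// The filter can still be pushed down to SingleStore, but .cache() asks Spark to
// materialize the filtered result, so the planner stops rewriting past this point
// and the join below is executed on the Spark side.
val filtered = bookDS.where(col("writer_name") === "John").cache()
val joined   = filtered.join(writerDS, filtered("writer_id") === writerDS("writer_id"), "inner")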

Is this solution sufficient for your case?

@bolcman
Author

bolcman commented Sep 26, 2022

Hi,

Thanks for the reply!
This is a good starting point.
Is there any way I can push down a DataFrame that I have in memory as a filter, without materializing it?
For example, it would be really great if I could push down bookDS("writer_id") as a filter to SingleStore. I just want to avoid
collect() and isin(array) in this case.
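
For context, the pattern I'm trying to avoid looks roughly like this (a sketch; column names and types are illustrative):

import org.apache.spark.sql.functions.col

// Materialize the ids on the driver ...
val writerIds = bookDS.select("writer_id").distinct()
  .collect()
  .map(_.getLong(0))   // assumes writer_id is a BIGINT; illustrative

// ... then push them back to SingleStore as an IN-list filter.
// This round-trips through the driver, which is what I'd like to avoid.
val filteredWriters = writerDS.where(col("writer_id").isin(writerIds: _*))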

Thanks,
Aleks
