
Why does Spark generate Java code and not Scala code? #18

Open
igreenfield opened this issue Nov 4, 2019 · 6 comments

Comments

@igreenfield

No description provided.

@bartosz25
Owner

Thank you @igreenfield for such an amazing question! I looked for the reasons in the documentation and old PRs but found no information about it. I've just posted a question on the Spark users mailing list. You can follow the conversation at https://mail-archives.apache.org/mod_mbox/spark-user/201911.mbox/browser or, if you prefer, I'll keep you up to date in this issue.

Cheers,
Bartosz.

@igreenfield
Author

@bartosz25 I was looking into the code generation phase, and I think that if the generated code were Scala, it would be easier to reduce the number of code lines, so many of the compilation failures caused by a method growing beyond 64KB would disappear.
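For reference, here is a minimal sketch of how to dump the Java source Spark generates for a query (including the processNext method), using the debug helpers shipped in org.apache.spark.sql.execution.debug; the query itself is just a made-up example:

```scala
// Minimal sketch: print the Java source Spark generates for a plan.
// debugCodegen() is an extension on Dataset provided by the
// org.apache.spark.sql.execution.debug package object.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.debug._

object InspectCodegen {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("inspect-codegen")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
      .filter($"id" > 1)
      .select($"name")

    df.debugCodegen() // dumps the whole-stage generated code, processNext included
    spark.stop()
  }
}
```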

@bartosz25
Owner

bartosz25 commented Nov 10, 2019

Hi @igreenfield ,

I've some answers from the mailing list:

Long story short, it's all about the compilation performance :)

Regarding your point about the 64KB limitation, AFAIK Spark has protections against overly long methods. First, it can split a function that is too long into multiple methods (spark.sql.codegen.methodSplitThreshold). Second, it can also disable codegen entirely to stay under the JVM's maximum method length (spark.sql.codegen.hugeMethodLimit).
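For illustration, both guards can be set when building the session; the values below are only examples, not recommendations:

```scala
// Minimal sketch of tuning both codegen guards at session build time.
// The values are illustrative, not recommendations.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("codegen-guards")
  // threshold (in generated source size) above which a generated
  // function is split into smaller methods
  .config("spark.sql.codegen.methodSplitThreshold", "1024")
  // bytecode size above which whole-stage codegen falls back to the
  // interpreted path; 65535 matches the JVM method limit
  .config("spark.sql.codegen.hugeMethodLimit", "65535")
  .getOrCreate()
```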

Have you already had issues with a "too long" generated method making your pipeline fail? I've never experienced that, so I'm really curious to learn something new and maybe help you overcome the issue by reworking the code.

@igreenfield
Author

igreenfield commented Nov 10, 2019

Hi @bartosz25
First thanks for the help!!

  1. The compilation performance concern could be eliminated by using a compile server.
  2. Yes, I hit the 64KB limit all the time. My use case is very complex: we are migrating a SQL engine to Spark, and in most cases it's the processNext method that fails; see the sketch after this list.
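To give an idea, a synthetic query like the following (made-up columns and counts, purely an illustration) already inflates the single generated method:

```scala
// Hypothetical reproduction: a long chain of derived expressions tends to
// inflate the generated processNext method toward the 64KB bytecode limit.
// Column names and the column count are made up for illustration.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val base = Seq((1L, 2L)).toDF("a", "b")

// hundreds of derived columns, each adding code to the generated method
val wide = (1 to 500).foldLeft(base) { (df, i) =>
  df.withColumn(s"c$i", when($"a" > i, $"a" * i).otherwise($"b" + i))
}

wide.explain() // shows whether whole-stage codegen survived the plan size
```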

We can schedule a call and I can explain in more detail.

Another thing: one of the answers was:

Also for low-level code we can't use (due to perf concerns) any of the
edges Scala has over Java, e.g. we can't use the Scala collection library,
functional programming, map/flatMap. So using Scala doesn't really buy
anything even if there are no compilation speed concerns.

I think the ability to return more than one object from a function could make the difference when splitting the huge methods into smaller ones.
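A toy example of what I mean (nothing Spark-specific, just the language feature):

```scala
// Toy illustration: a Scala helper can hand back several intermediate
// values at once as a tuple, whereas split-off generated Java methods
// have to pass them through instance fields or arrays.
def evalRow(a: Long, b: Long): (Boolean, Long, Long) = {
  val keep = a > b   // filter outcome
  val sum  = a + b   // first projected value
  val diff = a - b   // second projected value
  (keep, sum, diff)
}

val (keep, sum, diff) = evalRow(5L, 3L)
if (keep) println(s"sum=$sum diff=$diff")
```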

@bartosz25
Owner

bartosz25 commented Nov 14, 2019

Re @igreenfield

At the moment I don't have much time, so I won't be able to help you. Sorry for that; it should be better in late January. In the meantime, maybe you can take a look at my series about Apache Spark customization. I cover how to alter logical and physical plans, how to add a new parser, and so forth. With that you may be able to write your own code generation that is much shorter than the code you've just shown me. The articles were published here: https://www.waitingforcode.com/tags/spark-sql-customization
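To give a taste of the extension point the series builds on, here is a minimal sketch; the rule below is a no-op placeholder, only the SparkSessionExtensions hook itself is the real API:

```scala
// Minimal sketch of injecting a custom rule via SparkSessionExtensions.
// MyNoopRule is a placeholder that returns the plan unchanged.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

case class MyNoopRule(spark: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

val spark = SparkSession.builder()
  .master("local[*]")
  .withExtensions(ext => ext.injectOptimizerRule(session => MyNoopRule(session)))
  .getOrCreate()
```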

Anyway, I doubt the Spark community will agree to switch code generation to Scala because of a single request. But you can always give it a try and ask directly on the mailing list: https://spark.apache.org/community.html

Cheers,
Bartosz.

@igreenfield
Author

Hi @bartosz25, thanks! I will be in touch with you in late January.
