Custom sink provider for structured streaming #238
base: master
Conversation
Thanks for this PR, @sutugin cc: @dongjoon-hyun
LGTM
...rc/main/scala/org/apache/spark/sql/execution/datasources/hbase/HBaseStreamSinkProvider.scala
You forget that SHC supports Avro schemas. The user should be able to pass any key in options to define them.
I propose for options to use:
Then, I get the following exception:
If I just declare the Avro schema in the alphabetical order of field names, it works. I have no idea where the problem is, but it has to be corrected.
Registered the short name "hbase" for the sink provider. All HBase-related options are checked for the "hbase." prefix (like catalog, newtable, etc.).
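For context, a Spark 2.x streaming sink provider generally has the shape below. This is a sketch, not the PR's exact code: the class names, the prefix stripping, and the delegation to the SHC batch writer are assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.StreamSinkProvider
import org.apache.spark.sql.streaming.OutputMode

// Sketch of a streaming sink provider as described above.
class HBaseStreamSinkProvider extends StreamSinkProvider {
  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink = {
    // Keep only "hbase."-prefixed options and strip the prefix so the
    // remaining keys (catalog, newtable, ...) can be handed to SHC.
    val hbaseOptions = parameters.collect {
      case (k, v) if k.startsWith("hbase.") => k.stripPrefix("hbase.") -> v
    }
    new HBaseSink(hbaseOptions)
  }
}

// Each micro-batch is written with the existing SHC batch source.
class HBaseSink(options: Map[String, String]) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    data.write
      .options(options)
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .save()
  }
}
```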
@sbarnoud, unfortunately I have never used Avro and can't comment on it. The only thing I can assume is that the columns in the incoming streaming dataset need to be ordered exactly as specified in the Avro schema.
Hi. For counters: QueryProgressEvent contains some counters (like numInputRows) that are not updated. For Avro, I found the bug, but didn't succeed in testing it (my patch is not loaded first). As you can see, the Avro schema used to serialize is f.get.schema.get, which is the dataset schema, instead of f.get.exeSchema.get, which is the user-supplied one.
Hi. Sorry, but metrics reporting doesn't work for me. I already configured my job to have them, and if I change my sink to parquet on the same stream, I get them. I don't understand why you use pattern matching (and not startsWith) here:
If you really want to, please use the right pattern: hbase\..*
I will open an issue for Avro.
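The reviewer's startsWith-versus-pattern point can be shown on a plain options map (the keys and values below are illustrative):

```scala
// Illustrative options map: two SHC keys with the "hbase." prefix
// and one unprefixed Spark option.
val parameters = Map(
  "hbase.catalog"      -> "{...}",  // catalog JSON elided
  "hbase.newtable"     -> "5",
  "checkpointLocation" -> "/tmp/cp")

// Prefix test with startsWith, as the reviewer recommends:
val viaStartsWith = parameters.filter { case (k, _) => k.startsWith("hbase.") }

// Equivalent regex, if a pattern is really wanted: the dot must be
// escaped ("hbase\\..*"), otherwise "." matches any character.
val viaRegex = parameters.filter { case (k, _) => k.matches("hbase\\..*") }

// Both keep only the two "hbase."-prefixed entries.
```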
That's the reason why metrics are not updated.
This code works, and avoids the current comment "allows us to do magic" ;-):
18/04/13 14:26:11 INFO StreamExecution: Streaming query made progress: {
@sbarnoud, Hi! Thank you, great work!
Hi, in my own version I have added support for "short names". Could you validate those short names?
core/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
org/apache/spark/sql/execution/datasources/hbase/HBaseRelation.scala
org/apache/spark/sql/execution/streaming/HBaseStreamSinkProvider.scala
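Short-name registration goes through Java's ServiceLoader mechanism: the services file lists fully qualified implementation classes, and each class overrides shortName() from DataSourceRegister. A sketch, where the streaming short name is an illustrative choice, not necessarily the one used here:

```scala
// The services file
// core/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
// lists one fully qualified implementation class per line, e.g.:
//
//   org.apache.spark.sql.execution.datasources.hbase.DefaultSource
//   org.apache.spark.sql.execution.streaming.HBaseStreamSinkProvider

import org.apache.spark.sql.sources.DataSourceRegister

// Batch side: resolved by spark.read.format("hbase")
class DefaultSource extends DataSourceRegister /* with RelationProvider ... */ {
  override def shortName(): String = "hbase"
}

// Streaming side: a different short name, as required above
// ("hbase-streaming" is an assumption for illustration)
class HBaseStreamSinkProvider extends DataSourceRegister /* with StreamSinkProvider ... */ {
  override def shortName(): String = "hbase-streaming"
}
```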
@sbarnoud, hi! |
No, I'll let you decide the names, but I sent you the correct code. Both batch and streaming short names must be defined and be different, and both classes must be listed in the resources.
I think that with the advent of Spark 2.4 this is no longer relevant; foreachBatch will solve all the problems of custom sinks (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch).
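For reference, the foreachBatch route mentioned above reuses the existing batch writer for each micro-batch, so no custom sink class is needed. In this sketch, `df`, `catalog`, and the checkpoint path are placeholders:

```scala
import org.apache.spark.sql.DataFrame

// Spark 2.4+: write each micro-batch with the existing SHC batch source.
val query = df.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    batch.write
      .options(Map("catalog" -> catalog, "newtable" -> "5"))  // SHC batch options
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .save()
  }
  .option("checkpointLocation", "/tmp/checkpoints/hbase")  // placeholder path
  .start()
```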
What changes were proposed in this pull request?
Custom sink provider for using SHC in a structured streaming job.
#205
All HBase-related options must be set with the "hbase." prefix.
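A hypothetical usage sketch under this convention; the input stream, catalog value, and checkpoint path are placeholders:

```scala
// Every SHC option carries the "hbase." prefix; Spark's own streaming
// options (e.g. checkpointLocation) are left unprefixed.
val query = input.writeStream
  .format("hbase")                   // short name registered by this PR
  .option("hbase.catalog", catalog)  // SHC table catalog (JSON), placeholder
  .option("hbase.newtable", "5")     // regions for a new table
  .option("checkpointLocation", "/tmp/checkpoints/hbase")  // placeholder
  .start()
```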
How was this patch tested?
Ran a structured streaming job and wrote to HBase. :)