add: parameters to control number and the distribution of messages in a micro-batch #63
base: master
Conversation
Force-pushed b77373d → dd8645a
Force-pushed 028d348 → b9da6dc
One thing to add here, as you might have seen in the PR: unlike the Kafka connector, we are forwarding by entries, which can contain one or more messages. The reason is that we could not find a suitable method for forwarding by an exact number of messages using a Reader.

We have a close candidate though: I think we can use https://pulsar.apache.org/docs/en/next/admin-api-topics/#examine-messages, but the problem there is that it can only look for messages relative to the first or the last message on a topic. If we could make this method work for any message ID, I think it could be used for this feature as well.

I was also thinking that we could modify the Reader interface so that it can forward itself by n messages (on the other hand, that would require 2 API calls per topic to get the forwarded message ID, where we now issue only 1, which already causes problems when we read ~100 topics using a connector due to the large number of API requests).

Please let me know what you think about this.
General Feedback:
- Can we separate the changes introducing the maxMessagesPerBatch config and the forward strategy feature into two PRs?
- If every micro-batch calls out to fetch stats for all topics, I'm afraid there will be performance issues.
- The "forward" terminology used here is not very intuitive.
- Is it possible to construct an ending messageID and then check against Pulsar whether it exists?
caseInsensitiveParams.getOrElse(EnsureEntriesPerTopic, "0").toLong

private def forwardStrategy(caseInsensitiveParams: Map[String, String]): String =
  caseInsensitiveParams.getOrElse(ForwardStrategy, "simple")
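As a minimal sketch of the reviewer's suggestion to treat "simple" as a defined strategy alongside the others: the option could be validated against a closed set instead of passed through as a raw string. The sealed-trait encoding, the lowercase option key, and the function name below are illustrative assumptions, not the connector's actual code.

```scala
// Hypothetical sketch: parse forwardStrategy into a closed set of strategies
// so that "simple" is a first-class strategy and unknown values fail fast.
sealed trait ForwardStrategy
case object Simple extends ForwardStrategy
case object Proportional extends ForwardStrategy
case object LargeFirst extends ForwardStrategy

def parseForwardStrategy(params: Map[String, String]): ForwardStrategy =
  params.getOrElse("forwardstrategy", "simple") match {
    case "simple"       => Simple
    case "proportional" => Proportional
    case "largefirst"   => LargeFirst
    case other =>
      throw new IllegalArgumentException(
        s"Unknown forwardStrategy '$other'; expected simple, proportional or largefirst")
  }
```

This keeps the default behavior (`simple`) while surfacing typos in the option value at configuration time rather than silently falling back.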
"simple" should also be a defined strategy, like "proportional" or "largefirst".
Ok.
I can do that if needed; I will open a new PR about the strategies then.
We are invoking a single API call per topic per micro-batch.
Will change the naming.
I believe we can do that, since we already construct that message ID, but would it cause additional load on the Pulsar cluster? I think this could be achieved by creating a reader/consumer for the purpose. What do you think about this? Do you see any simpler way of doing it? (I just do not want to place more load on the Pulsar cluster than necessary.)
Force-pushed a081323 → 631eb8b
I have modified the PR with the following:
I can make further changes based on the answers to the questions above; see my previous comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@atezs82 This PR contains a lot of conflicts. Are you willing to continue working on this?
Yes, I will resolve the conflicts shortly. I was waiting for your opinion on the questions raised in #63 (comment) before moving forward with this.
Force-pushed 6650f4c → 48ce512
This feature is planned to be put into a different PR.
Force-pushed 48ce512 → 4839989
Rebased on top of master; now looking for a way to fix the Codacy warnings without increasing the code complexity too much. In the meantime, please let me know if there are any answers to my questions from #63 (comment). Thanks in advance!
I have slightly reworked some imperative logic inside to
Thanks for your contribution. I'll review this PR later this week. I was still busy working on
If we have a considerably large backlog for one or more topics read by the connector, then with the current implementation we cannot really place an upper limit on the number of messages processed at once in a Spark Streaming pipeline.
This PR attempts to address this problem by adding an experimental parameter called maxEntriesPerTrigger (this behaves like the maxOffsetsPerTrigger parameter of the Kafka connector: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-for-batch-queries). If the parameter is set, only the specified number of entries is read from the source. We diverge a little from the Kafka connector here, since that one limits actual messages, while with this approach we can only limit Pulsar entries.

The feature is based on the Pulsar Admin API call https://pulsar.apache.org/docs/en/2.7.3/admin-api-topics/#get-internal-stats, which can return all ledgers and entries for a topic. This approach is used so that existing server-side Pulsar functionality can be leveraged for the task at hand.
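To illustrate the entry-based limiting described above, here is a rough sketch of advancing a cursor by at most a given number of entries, using the kind of per-ledger entry counts that the get-internal-stats call returns. The tuple-based (ledgerId, entryId) position and the helper name are illustrative assumptions; the real connector works with Pulsar MessageId instances.

```scala
// Hypothetical sketch: forward a (ledgerId, entryId) cursor by at most
// maxEntries entries, given the ordered list of ledgers for a topic.
// `start` is the position of the last consumed entry.
def forwardByEntries(
    ledgers: Seq[(Long, Long)],   // (ledgerId, entriesInLedger), in order
    start: (Long, Long),          // current position (ledgerId, entryId)
    maxEntries: Long): (Long, Long) = {
  var remaining = maxEntries
  var pos = start
  ledgers.foreach { case (ledgerId, numEntries) =>
    if (remaining > 0) {
      if (ledgerId == pos._1) {
        // Consume what is left of the current ledger, up to the budget.
        val available = numEntries - (pos._2 + 1)
        val step = math.min(available, remaining)
        pos = (ledgerId, pos._2 + step)
        remaining -= step
      } else if (ledgerId > pos._1 && numEntries > 0) {
        // Move into a later ledger; consuming k entries lands at entry k-1.
        val step = math.min(numEntries, remaining)
        pos = (ledgerId, step - 1)
        remaining -= step
      }
    }
  }
  pos
}
```

The point of the sketch is that the target position can be computed purely from the stats response, without issuing extra per-topic read calls.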
In addition, since our input data varies considerably in backlog size, I have added further parameters, forwardStrategy and ensureEntriesPerTopic, so that topics with a very low backlog are also forwarded continuously if needed and we have options for consuming backlogs in different fashions.

Please let me know what you think about this draft change. I can add further integration/unit tests and make final documentation changes if you agree with the approach outlined here.
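As a rough sketch of how such strategies could split a global entry budget across topics: "proportional" divides the budget according to backlog size, "largefirst" fills the largest backlogs first, and ensureEntriesPerTopic guarantees a per-topic minimum. The function name and the exact arithmetic below are assumptions for illustration, not the PR's implementation.

```scala
// Hypothetical sketch: distribute a global entry budget across topics
// according to a forward strategy, honoring a per-topic minimum.
def distribute(
    backlogs: Map[String, Long],   // topic -> backlog size in entries
    budget: Long,                  // global maxEntriesPerTrigger budget
    strategy: String,
    ensurePerTopic: Long = 0L): Map[String, Long] = {
  val total = backlogs.values.sum.max(1L)
  val base = strategy match {
    case "proportional" =>
      // Split the budget in proportion to each topic's backlog.
      backlogs.map { case (t, b) => t -> (budget * b / total) }
    case "largefirst" =>
      // Give the budget to the largest backlogs first.
      var remaining = budget
      backlogs.toSeq.sortBy(-_._2).map { case (t, b) =>
        val take = math.min(b, remaining)
        remaining -= take
        t -> take
      }.toMap
    case _ =>
      // "simple": equal split across topics.
      backlogs.map { case (t, _) => t -> budget / backlogs.size }
  }
  // Enforce the per-topic minimum (may slightly exceed the budget).
  base.map { case (t, n) =>
    t -> math.max(n, math.min(ensurePerTopic, backlogs(t)))
  }
}
```

With such a split, small-backlog topics keep making progress under "proportional" plus a minimum, while "largefirst" drains hot topics quickly.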
We have also seen some (somewhat troubling) news about creating a brand-new connector for Spark Streaming; could you please share additional information on that subject? Thanks in advance!