Adding maxBytesPerTrigger tag for Pulsar Admission Control #151

ericm-db · 2023-08-08T19:17:12Z

Motivation

Some users who request the Pulsar Spark connector also ask for rate limiting in the Pulsar source, so that they can control the processing rate and resource consumption of streaming queries that read from it. This can be achieved by implementing admission control in the Pulsar source.

Modifications

Added a config option, maxBytesPerTrigger, that lets users cap how many bytes are consumed in each microbatch; the byte budget is shared between topic-partitions.
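As a rough illustration of how the new option might be set on a streaming read, here is a minimal sketch. Only maxBytesPerTrigger comes from this PR; the other option keys (service.url, admin.url, topic) are assumed from the connector's README, and the URLs, topic name, and byte limit are placeholders.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object MaxBytesPerTriggerExample {
  // Builds a streaming DataFrame over a Pulsar topic with a per-microbatch byte cap.
  // Assumption: option keys other than maxBytesPerTrigger follow the connector README.
  def boundedPulsarStream(spark: SparkSession): DataFrame =
    spark.readStream
      .format("pulsar")
      .option("service.url", "pulsar://localhost:6650")        // placeholder broker URL
      .option("admin.url", "http://localhost:8080")            // placeholder admin URL
      .option("topic", "persistent://public/default/events")   // placeholder topic
      .option("maxBytesPerTrigger", "1048576")                 // about 1 MiB per microbatch
      .load()
}
```

The limit applies to the whole microbatch, so with several topic-partitions each one receives only part of the budget.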

Verifying this change

Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.
This change is already covered by existing tests, such as:
This change added tests and can be verified as follows: Run the PulsarAdmissionControlSuite

Documentation

Check the box below.

Need to update docs?

doc-required
no-need-doc
doc

ericm-db · 2023-08-09T18:31:49Z

@atezs82 We are currently trying to implement functionality similar to your PR here: #63

atezs82 · 2023-08-11T06:16:55Z

@ericm-db Thanks for picking this idea up! Since we no longer use Pulsar, my work on the other PR was seriously deprioritized. I'm glad, though, that this might end up in the connector in some form, since I personally think it is very useful for, e.g., some CDC use cases.

chaoqin-li1123 · 2023-08-16T19:54:24Z

Please also update the documentation for the admin URL and maxBytesPerTrigger in README.md.

chaoqin-li1123 · 2023-08-17T18:35:01Z

src/main/scala/org/apache/spark/sql/pulsar/PulsarHelper.scala

+    }
+    val newTopics = topicPartitions.toSet.diff(existingStartOffsets.keySet)
+    val startPartitionOffsets = existingStartOffsets ++ newTopics.map(topicPartition
+    => topicPartition -> MessageId.earliest)


I still have some concerns here.
For a newly discovered topic, admission control starts from MessageId.earliest (-1, -1, -1), which assumes that all ledgers exposed by the stats are readable. Is this assumption valid? @nlu90

this should be okay
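To make the behaviour under discussion concrete, here is a minimal, dependency-free sketch of the merge in the diff above. MsgId and Earliest are hypothetical stand-ins for Pulsar's MessageId and MessageId.earliest, and topic-partitions are modelled as plain strings; the real code uses the connector's own types.

```scala
object StartOffsetsSketch {
  // Hypothetical stand-in for org.apache.pulsar.client.api.MessageId.
  final case class MsgId(ledgerId: Long, entryId: Long, partitionIndex: Int)

  // Mirrors the (-1, -1, -1) value mentioned in the review comment.
  val Earliest: MsgId = MsgId(-1L, -1L, -1)

  // Keeps known start offsets and starts every newly discovered
  // topic-partition from the earliest position.
  def startOffsets(
      topicPartitions: Seq[String],
      existingStartOffsets: Map[String, MsgId]): Map[String, MsgId] = {
    val newTopics = topicPartitions.toSet.diff(existingStartOffsets.keySet)
    existingStartOffsets ++ newTopics.map(tp => tp -> Earliest)
  }
}
```

For example, with an existing offset for "t-0" and a newly discovered "t-1", the result keeps the "t-0" offset and maps "t-1" to Earliest.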

chaoqin-li1123 · 2023-08-28T17:34:12Z

src/main/scala/org/apache/spark/sql/pulsar/PulsarHelper.scala

@@ -122,7 +135,9 @@ private[pulsar] case class PulsarHelper(
    offset.foreach { case (tp, mid) =>
      try {
        val (subscription, _) = extractSubscription(predefinedSubscription, tp)
-        CachedConsumer.getOrCreate(tp, subscription, client).seek(mid)
+        val consumer = CachedConsumer.getOrCreate(tp, subscription, client)
+        if (!consumer.isConnected) consumer.getLastMessageId


I know this works around a bug where the Pulsar consumer does not attempt to reconnect when calling seek(). Can you leave a comment here explaining why this change is needed, plus a TODO noting that we will remove it once we upgrade to a version that has the fix?
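One way the requested comment and TODO could look is sketched below. SeekableConsumer is a hypothetical stand-in for the small part of the Pulsar consumer API that the diff touches, and the sketch assumes that seek(mid) still follows the connection check, which the truncated hunk does not show.

```scala
object SeekWorkaroundSketch {
  // Hypothetical stand-in for the consumer methods used in the diff.
  trait SeekableConsumer[Id] {
    def isConnected: Boolean
    def getLastMessageId: Id
    def seek(id: Id): Unit
  }

  def reconnectThenSeek[Id](consumer: SeekableConsumer[Id], target: Id): Unit = {
    // Workaround: per the review, seek() does not attempt to reconnect a
    // disconnected consumer. Calling getLastMessageId first is assumed to
    // force the reconnect; its result is intentionally discarded.
    // TODO: remove once the connector upgrades to a client version with the fix.
    if (!consumer.isConnected) {
      consumer.getLastMessageId
      ()
    }
    consumer.seek(target)
  }
}
```

Keeping the workaround in one small helper makes it easy to delete when the upstream fix becomes available.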

chaoqin-li1123

LGTM

ericm-db requested review from nlu90 and a team as code owners August 8, 2023 19:17

github-actions bot assigned ericm-db Aug 8, 2023

github-actions bot added the no-need-doc label Aug 8, 2023

chaoqin-li1123 reviewed Aug 15, 2023

src/main/scala/org/apache/spark/sql/pulsar/PulsarSource.scala

ericm-db force-pushed the admission-control branch from 072e45b to e7e87b6 Compare August 15, 2023 23:08

chaoqin-li1123 reviewed Aug 15, 2023

src/main/scala/org/apache/spark/sql/pulsar/PulsarHelper.scala