Facilitate asynchronous realtime ingestion on decoding & transformation #13695
base: master
Conversation
Codecov Report
Attention: Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             master   #13695      +/-  ##
============================================
+ Coverage     61.75%   61.99%    +0.24%
+ Complexity      207      198        -9
============================================
  Files          2436     2555      +119
  Lines        133233   140750     +7517
  Branches      20636    21891     +1255
============================================
+ Hits          82274    87257     +4983
- Misses        44911    46850     +1939
- Partials       6048     6643      +595
```
How do you guarantee the message order when message batches are processed asynchronously?
```java
private AtomicInteger _numRowsConsumed = new AtomicInteger(0);
// Can be different from _numRowsConsumed when metrics update is enabled.
private AtomicInteger _numRowsIndexed = new AtomicInteger(0);
private AtomicInteger _numRowsErrored = new AtomicInteger(0);
```
These can be final?
```java
BlockingQueue<Pair<List<GenericRow>, Integer>> transformedQueue = new LinkedBlockingQueue<>();
AtomicInteger submittedMsgCount = new AtomicInteger(0);
// TODO: tune the number of threads
ExecutorService decodeAndTransformExecutor = Executors.newFixedThreadPool(1);
```
Starting a new executor per message batch can create significant overhead. Consider creating one executor that is shared across batches.
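A minimal sketch of this suggestion (class and method names are hypothetical, not Pinot's actual API): the executor is created once per partition consumer and reused for every batch, instead of calling `Executors.newFixedThreadPool(1)` inside the batch loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SharedExecutorSketch {
    // Hypothetical: one single-thread executor per partition consumer,
    // created once and reused across message batches.
    private final ExecutorService decodeAndTransformExecutor =
        Executors.newSingleThreadExecutor();

    // Submit one decode/transform task per message; a single worker thread
    // executes tasks strictly in submission (offset) order.
    public List<Future<String>> processBatch(List<String> rawMessages) {
        List<Future<String>> results = new ArrayList<>();
        for (String raw : rawMessages) {
            // toUpperCase() is a stand-in for the real decode + transform work.
            results.add(decodeAndTransformExecutor.submit(() -> raw.toUpperCase()));
        }
        return results;
    }

    public void shutdown() {
        decodeAndTransformExecutor.shutdown();
    }

    public static void main(String[] args) throws Exception {
        SharedExecutorSketch sketch = new SharedExecutorSketch();
        List<Future<String>> futures = sketch.processBatch(List.of("a", "b"));
        System.out.println(futures.get(0).get() + futures.get(1).get()); // prints AB
        sketch.shutdown();
    }
}
```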
```java
  }
});

indexingThread.start();
```
Who is executing this thread?
The consumer thread for the partition is the one kicking off this "indexingThread". I don't understand why we kick off a separate thread and then, in the next line, wait for it to finish. What's the difference if we don't spin off a new thread and instead use the main (consuming) thread to do the indexing?
cc: @sajjad-moradi
Is it useful to create a sub-class of RealtimeSegmentDataManager that consumes asynchronously?
To ensure the ingestion order, we might be able to use a producer-consumer pattern, where the consumer thread creates
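A rough sketch of the producer-consumer pattern suggested here, assuming a single producer enqueueing transformed rows in offset order and a single indexing thread draining the queue (all names hypothetical; integers stand in for transformed rows):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ProducerConsumerSketch {
    public static void main(String[] args) throws Exception {
        // Stand-in for the queue of decoded + transformed rows, filled in offset order.
        BlockingQueue<Integer> transformedQueue = new LinkedBlockingQueue<>();

        // Producer: decodes/transforms messages and enqueues results in offset order.
        Thread producer = new Thread(() -> {
            for (int offset = 0; offset < 5; offset++) {
                try {
                    transformedQueue.put(offset);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        StringBuilder indexedOrder = new StringBuilder();
        // Consumer (indexing thread): drains the queue, so rows are indexed
        // in exactly the order the producer enqueued them.
        Thread consumer = new Thread(() -> {
            for (int i = 0; i < 5; i++) {
                try {
                    indexedOrder.append(transformedQueue.take());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
        System.out.println(indexedOrder); // prints 01234
    }
}
```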
How would ingestion rate limiting work?
Thanks for the review. Right now it uses only one thread to guarantee the order. For multiple threads, I am thinking about adapting the approach mentioned in Multi-topic ingestion support if it goes through the review.
```java
BlockingQueue<Pair<List<GenericRow>, Integer>> transformedQueue = new LinkedBlockingQueue<>();
AtomicInteger submittedMsgCount = new AtomicInteger(0);
// TODO: tune the number of threads
```
We can't have more than one thread here; otherwise the order of indexed rows will be different, and that's something we can't tolerate.
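The ordering constraint this comment refers to can be illustrated with a small sketch (names hypothetical): a single-thread executor runs tasks strictly in submission order, so rows come out in the same offset order they went in, whereas a multi-thread pool gives no such guarantee.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SingleThreadOrderSketch {
    public static void main(String[] args) throws Exception {
        // One worker thread: tasks execute one at a time, in submission order.
        ExecutorService executor = Executors.newSingleThreadExecutor();
        List<Integer> indexedOffsets = new ArrayList<>();
        for (int offset = 0; offset < 5; offset++) {
            final int o = offset;
            // Stand-in for indexing one row; only the single worker touches the list.
            executor.submit(() -> indexedOffsets.add(o));
        }
        executor.shutdown();
        executor.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(indexedOffsets); // prints [0, 1, 2, 3, 4]
    }
}
```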
Labels: ingestion, enhancement, feature request
Resolving issues mentioned in #13319.
Pinot ingestion currently uses strictly serial processing. It fetches a batch of messages from Kafka and then processes the messages in the batch one by one, in offset order, to:
Reusing the objects created along the way yields better memory efficiency, but this approach cannot utilize all system resources. There are multiple possible solutions, each with its own pros and cons:
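The serial flow described above can be sketched roughly as follows (method names are hypothetical stand-ins, not Pinot's actual API): each message is decoded, transformed, and indexed one at a time on the consuming thread.

```java
import java.util.List;
import java.nio.charset.StandardCharsets;

public class SerialIngestSketch {
    // Hypothetical stand-ins for the decode and transform steps.
    static String decode(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    static String transform(String decoded) {
        return decoded.trim();
    }

    // Strictly serial: process the batch one message at a time, in offset order.
    static int index(List<byte[]> batch) {
        int indexed = 0;
        for (byte[] raw : batch) {
            String row = transform(decode(raw));
            if (!row.isEmpty()) { // stand-in for indexing the row into the segment
                indexed++;
            }
        }
        return indexed;
    }

    public static void main(String[] args) {
        int n = index(List.of(" a ".getBytes(StandardCharsets.UTF_8),
                              " b ".getBytes(StandardCharsets.UTF_8)));
        System.out.println(n); // prints 2
    }
}
```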
Comparing the CPU usage and consumption speed on the same server before and after enabling "ASYNCHRONOUS" (at 10:00):


The additional CPU usage yields a ~10% increase in ingestion speed.
Notes:
The new mode performs better on computation-heavy ingestion but cannot really help lightweight use cases. With light computation, the extra memory and GC overhead offsets the gains from asynchronous processing.