Batch multiple events as a single record in `aws_kinesis_firehose` sink #1407
@a-rodin how is this not accomplished via the
@binarylogic As far as I understand the code, currently multiple records are batched in a single request, but each record contains a single event. I think this can be applied to
Ah, yes, I can confirm this does significantly improve throughput. We implemented this strategy with the Timber pipeline and also
Great! Most managed/hosted streaming services tend to charge per record, so batching multiple events into a single record helps avoid running into service limits and saves costs. I'd love to see it for the Kafka sink as well; see #560 (comment) for more detailed comments about that. Shall I open an issue for that?
@MOZGIII what do you think about using the merge transform to address cases like this? It seems like that could be a little cleaner than adding this logic to all the potentially relevant sinks.
If it's sufficient, using
Whether to extend the sink or to suggest users add a transform is, on the other hand, a question of good defaults. To me, it seems like if we want to add something as a default, we don't currently have a way other than extending a sink/source in the code. If we continue this practice, I'd prepare some better way of doing this rather than getting into the code of each implementation, just to make our lives as maintainers easier. I'm talking about more explicit extension points at the sinks and sources, in particular such that if we want to attach a transform to a source or sink we wouldn't have to read through the whole implementation source to determine where to put the transform. Having that would make it much easier to work with "built-in transforms".
This is important. We lean towards convention over configuration. This specific issue is such a fundamental one that I think it should be solved by default. To us it seems obvious, because we know about the problem and we are aware of all the options Vector provides, but to a user, it is not so obvious.
I'm interested in what you're thinking here. We've had a lot of related discussions around this, and @LucioFranco is currently working on something related (wrapping sinks to extend their behavior). #1061 and #832 also solve this in different ways.
I was thinking in particular about something like changing a
We're seeing the impact of not batching events with this sink. We noticed that the sink batches events by the partition key, so to work around this we've generated our own partition key and used the sink's batching options.

By doing the above, we've seen the throughput of the sink improve. I thought I would share the above in case someone else stumbles across this issue. All of that being said, it would be great to see both sinks batch multiple events per record natively. The one downside to the workaround is that the custom partition key ends up attached to the events.
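A rough sketch of the workaround described above, assuming a Vector version with the `remap` transform available. The component names, field name, and stream name are placeholders; the sketch relies on the `partition_key_field` and `batch.max_events` sink options:

```toml
# Hypothetical names throughout; adjust to your own topology.
[transforms.gen_partition_key]
type = "remap"
inputs = ["my_source"]
# Assign a low-cardinality partition key of our own choosing,
# so that many events share a key and batch together.
source = '''
.partition_key = "shared-key"
'''

[sinks.kinesis]
type = "aws_kinesis_streams"
inputs = ["gen_partition_key"]
stream_name = "my-stream"              # placeholder
partition_key_field = "partition_key"
# Tune batching so multiple events go out in a single request.
batch.max_events = 500
```

The key point of the design is that a constant (or low-cardinality) partition key keeps events from being split into many small batches.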
…ctordotdev#1407 Send batches to AWS Kinesis Data Streams and AWS Firehose independent of their partition keys. In both APIs, batches of events do not need to share the same partition key. This makes the protocol more efficient, as by default the partition key is a random key that is different for every event.
With the submission of #20653, I think this issue can now be closed.
Agreed, thanks @steven-aerts. Closed by #20653.
…ctordotdev#1407 (vectordotdev#20653) Send batches to AWS Kinesis Data Streams and AWS Firehose independent of their partition keys. In both APIs, batches of events do not need to share the same partition key. This makes the protocol more efficient, as by default the partition key is a random key that is different for every event.
The newly added `aws_kinesis_firehose` sink encodes each event as a separate record. However, in the case of log data, there is the following caveat in the Firehose pricing:

This means that if the size of each log event is less than 500 bytes (just for example), then streaming this data while batching as many records as possible into a single one reduces the costs at least tenfold.
So I propose to add an additional Boolean option to the sink configuration, which would enable squashing multiple records into single ones when possible using a greedy approach, and probably even enable this option by default. See the appendix for some considerations about delimiting the records.
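A minimal sketch of the greedy squashing the proposal describes, assuming a per-record size limit (Firehose's `PutRecord` data blob limit is 1,000 KiB); the function name is illustrative, not part of Vector:

```python
MAX_RECORD_BYTES = 1000 * 1024  # Firehose's per-record data limit (1,000 KiB)

def squash(events, max_bytes=MAX_RECORD_BYTES):
    """Greedily pack newline-terminated events into as few records as possible."""
    records, current = [], b""
    for event in events:
        line = event + b"\n"
        # Flush the current record if adding this event would exceed the limit.
        if current and len(current) + len(line) > max_bytes:
            records.append(current)
            current = b""
        current += line
    if current:
        records.append(current)
    return records
```

For example, `squash([b"a", b"b", b"c"], max_bytes=4)` packs the first two events into one record and starts a new record for the third.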
With this feature, Vector could provide real cost-saving value by just acting as a middleware between applications which need to write log events in realtime (without any batching implemented inside of the application) and Firehose.
Appendix
All records sent to Firehose need to have `\n` at the end because Firehose itself can batch multiple records into a single one before sending them further (as specified in the `PutRecord` docs). This means that, for example in the case of an S3 destination, sending multiple events separated by `\n` (and with `\n` at the end) as a single record is equivalent to sending each event as a separate record (with `\n` at the end as well).
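The equivalence claimed in the appendix can be checked directly: joining events with `\n` and terminating the record with `\n` yields byte-for-byte the same stream as concatenating individually newline-terminated records:

```python
events = ["log line 1", "log line 2", "log line 3"]

# One record containing all events, newline-separated and newline-terminated.
single_record = "\n".join(events) + "\n"

# Each event sent as its own newline-terminated record, then concatenated
# (which is effectively what Firehose does when it batches records).
separate_records = "".join(event + "\n" for event in events)

assert single_record == separate_records
```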