
Batch multiple events as a single record in aws_kinesis_firehose sink #1407

Closed
ghost opened this issue Dec 20, 2019 · 12 comments
Labels
domain: networking (Anything related to Vector's networking)
domain: performance (Anything related to Vector's performance)
meta: good first issue (Anything that is good for new contributors)
sink: aws_kinesis_firehose (Anything `aws_kinesis_firehose` sink related)
type: enhancement (A value-adding code change that enhances its existing functionality)

Comments

@ghost

ghost commented Dec 20, 2019

The newly added aws_kinesis_firehose sink encodes each event as a separate record. However, in the case of log data, there is the following caveat in the Firehose pricing:

Pricing is based on volume of data ingested into Amazon Kinesis Data Firehose, which is calculated as the number of data records you send to the service, times the size of each record rounded up to the nearest 5KB.

This means that if each log event is, for example, smaller than 500 bytes, then packing as many events as possible into a single record reduces the cost at least tenfold.
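For a rough sense of the arithmetic, here is a minimal sketch; the 5 KB billing unit comes from the pricing quote above, while the 500-byte event size and the 10-events-per-record packing are illustrative assumptions:

```rust
// Back-of-the-envelope illustration only: the 5 KB unit is from the pricing
// quote above; the 500-byte event size and 10-events-per-record packing are
// hypothetical numbers, not measurements.
const BILLING_UNIT: u64 = 5 * 1024; // each record is billed rounded up to 5 KB

fn billed_bytes(record_size: u64) -> u64 {
    // Round the record size up to the nearest billing unit.
    ((record_size + BILLING_UNIT - 1) / BILLING_UNIT) * BILLING_UNIT
}

fn main() {
    let event_size = 500; // bytes per log event
    let events = 10_000u64;

    // One event per record: every 500-byte event is billed as a full 5 KB.
    let unbatched = events * billed_bytes(event_size);

    // Ten events packed into one ~5 KB record: one billing unit per record.
    let batched = (events / 10) * billed_bytes(event_size * 10);

    // Prints roughly a 10x difference in billed volume.
    println!("unbatched: {unbatched} bytes, batched: {batched} bytes");
}
```

With these numbers, 10,000 events are billed as roughly 51 MB when sent one per record versus roughly 5 MB when packed, which is the tenfold difference mentioned above.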

So I propose adding an additional Boolean option to the sink configuration that would enable squashing multiple events into single records where possible using a greedy approach, and probably even enabling this option by default. See the appendix for some considerations about delimiting the records.

With this feature, Vector could provide real cost-saving value just by acting as middleware between Firehose and applications that need to write log events in real time (without implementing any batching themselves).

Appendix

All records sent to Firehose need to end with \n because Firehose itself can batch multiple records into a single one before sending them further (as specified in the PutRecord docs). This means that, for example in the case of an S3 destination, sending multiple events separated by \n (and terminated by \n) as a single record is equivalent to sending each event as a separate record (also terminated by \n).
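As a sketch of what the greedy, newline-delimited packing could look like (the function name, signature, and the 1,000,000-byte size limit are illustrative assumptions, not Vector code or the real Firehose record limit):

```rust
// Minimal sketch of the proposed greedy packing: join newline-terminated
// events into records no larger than `max_record_size`.
fn pack_events(events: &[String], max_record_size: usize) -> Vec<Vec<u8>> {
    let mut records: Vec<Vec<u8>> = Vec::new();
    let mut current: Vec<u8> = Vec::new();

    for event in events {
        let line = format!("{event}\n"); // every event keeps its trailing newline
        if !current.is_empty() && current.len() + line.len() > max_record_size {
            // The current record is full; flush it and start a new one.
            records.push(std::mem::take(&mut current));
        }
        current.extend_from_slice(line.as_bytes());
    }
    if !current.is_empty() {
        records.push(current);
    }
    records
}

fn main() {
    let events: Vec<String> = (0..10).map(|i| format!("log line {i}")).collect();
    let records = pack_events(&events, 1_000_000);
    println!("{} events packed into {} record(s)", events.len(), records.len());
}
```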

@ghost ghost added type: enhancement A value-adding code change that enhances its existing functionality. sink: aws_kinesis_firehose Anything `aws_kinesis_firehose` sink related labels Dec 20, 2019
@binarylogic
Contributor

@a-rodin how is this not accomplished via the batch_size option? And is this also relevant for the aws_kinesis_streams sink?

@ghost
Author

ghost commented Dec 21, 2019

@binarylogic As far as I understood the code, currently multiple records are batched in a single request, but each record contains a single event.

I think this could be applied to aws_kinesis_streams too, but the possible cost reductions would not be as dramatic because of the different pricing model used by Kinesis Streams.

@binarylogic
Contributor

Ah, yes, I can confirm this does significantly improve throughput. We implemented this strategy with the Timber pipeline and also lz4-compressed the data beforehand.
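For illustration, a hedged sketch of compressing a packed record before handing it to the sink, loosely in the spirit of the approach described above; the lz4_flex crate and the surrounding shape are assumptions, not the actual Timber pipeline code:

```rust
// Sketch only: compress a record built from newline-terminated events.
// Assumes the `lz4_flex` crate as a stand-in for whatever lz4 binding is used.
use lz4_flex::{compress_prepend_size, decompress_size_prepended};

fn main() {
    // A record built by joining newline-terminated events, as proposed above.
    let record: String = (0..1000).map(|i| format!("log line {i}\n")).collect();

    let compressed = compress_prepend_size(record.as_bytes());
    println!("raw: {} bytes, lz4: {} bytes", record.len(), compressed.len());

    // The consumer side would reverse this step before splitting on newlines.
    let restored = decompress_size_prepended(&compressed).expect("valid lz4 frame");
    assert_eq!(restored, record.as_bytes());
}
```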

@erikbos

erikbos commented Dec 30, 2019

Great! Most managed/hosted streaming services tend to charge per record, so batching multiple events into a single record helps avoid service limits and saves cost.

I'd love to see it for the Kafka sink as well; see #560 (comment) for more detailed comments about that. Shall I open an issue for that?

@lukesteensen
Member

@MOZGIII what do you think about using the merge transform to address cases like this? Seems like that could be a little cleaner than adding this logic to all the potentially relevant sinks.

@MOZGIII
Contributor

MOZGIII commented Jan 15, 2020

If it's sufficient, using the merge transform is definitely possible here. I'm confused by the terminology in this domain (Kinesis/Firehose) and don't quite get how events and records relate to each other, so I might not be aware enough to resolve this question properly. I'd suggest trying to come up with a sample configuration to see whether it's actually possible to implement with the merge transform. I can work on this.
In fact, I think it'd be nice to investigate what use cases the merge transform can cover. This would allow us to deduce which partial-event marking mechanisms would be useful to implement, in addition to the regex-based ones we have now. I'll elaborate more on this at #1488.

@MOZGIII
Contributor

MOZGIII commented Jan 15, 2020

Whether to extend the sink or suggest users add the merge transform manually is a separate question. We added the merge transform to the docker source, but in general I'd prefer to tell users to add merge transforms explicitly. As it is now, that would really simplify the code and reduce the number of things that can go wrong. I think we do have to continuously test various configuration setups somehow, e.g. by adding them as correctness tests to our test harness.

On the other hand, this is a question of good defaults. To me, it seems like if we want to add something as a default, we don't currently have a way other than to extend a sink/source in the code. If we continue this practice, I'd prepare some better way of doing this rather than getting into the code of each implementation, just to make our lives as maintainers easier. I'm talking about more explicit extension points at the sinks and sources, in particular such that if we want to attach a transform to a source or sink we wouldn't have to read through the whole implementation source to determine where to put the transform. Having that would make it much easier to work with "built-in transforms".

@binarylogic
Contributor

On the other hand, this is a question of good defaults.

This is important. We lean towards convention over configuration. This specific issue is such a fundamental one that I think it should be solved by default. To us it seems obvious, because we know about the problem and are aware of all the options Vector provides, but to a user it is not so obvious.

we don't currently have a way other than to extend a sink/source in the code. If we continue this practice, I'd prepare some better way of doing this rather than getting into the code of each implementation

I'm interested in what you're thinking here. We've had a lot of related discussions around this, and @LucioFranco is currently working on something related (wrapping sinks to extend their behavior). #1061 and #832 also solve this in different ways.

@MOZGIII MOZGIII mentioned this issue Jan 16, 2020
@MOZGIII
Contributor

MOZGIII commented Jan 16, 2020

I was thinking in particular about something like changing the Source trait in such a way that it exposes extension points to easily inject transforms into the source. After doing some research, I realized it's not as promising as I initially anticipated. Sources are very different from each other, and it's not worth generalizing the extension points, because they'll be different for each source. Just going into the code would be easier in general. We can experiment with building sources designed around exposing well-defined extension points, but it's easier to do that with new sources built from the ground up.

@binarylogic binarylogic added domain: performance Anything related to Vector's performance domain: networking Anything related to Vector's networking meta: good first issue Anything that is good for new contributors. labels Aug 7, 2020
@awprice

awprice commented Dec 18, 2023

We're seeing the impact of not batching events with the aws_kinesis_streams sink too. Observing the traffic between Vector and the Kinesis endpoint, we see the sink placing a single event into each request. At low rates, i.e. less than 5,000 events per second, this doesn't seem to be much of an issue performance-wise, but when exceeding 5,000 events per second, the throughput of the sink suffers seriously. We could get to about 10,000 events per second, at which point rate limiting by AWS prevents it from going any higher.

We noticed that the sink batches events by the partition_key, so having a randomly generated key, whilst good for ensuring events are spread evenly across Kinesis shards, means that each event will be in its own request.

To work around this, we generate our own partition key and use partition_key_field so the sink uses it instead of the randomly generated one. We incorporate a timestamp with the fractional seconds formatted to 3 decimal places, so that events occurring in the same millisecond are batched together:

.@partition_key = md5(format_timestamp!(.timestamp, format: "%Y-%m-%dT%H:%M:%S%.3f%:z"))

By doing the above, we've seen the throughput of the aws_kinesis_streams sink improve greatly and become sufficient for our needs.

I thought I would share the above in case someone else stumbles across this issue.


All of that being said, it would be great to see both the aws_kinesis_firehose and aws_kinesis_streams sinks batch multiple events into a single request automatically rather than having to rely on a workaround.

The one downside to the workaround is that the custom @partition_key will also be present in the event being sent off to Kinesis, which ideally we'd like to avoid.

steven-aerts added a commit to steven-aerts/vector that referenced this issue Jun 12, 2024
…ctordotdev#1407

Send batches to AWS Kinesis Data Streams and AWS Firehose independent of
their partition keys. In both APIs, batches of events do not need to
share the same partition key.
This makes the protocol more efficient, as by default the partition key
is a random key that is different for every event.
steven-aerts added a commit to steven-aerts/vector that referenced this issue Jul 2, 2024
…ctordotdev#1407

Send batches to AWS Kinesis Data Streams and AWS Firehose independent of
their partition keys. In both APIs, batches of events do not need to
share the same partition key.
This makes the protocol more efficient, as by default the partition key
is a random key that is different for every event.
github-merge-queue bot pushed a commit that referenced this issue Jul 3, 2024
Send batches to AWS Kinesis Data Streams and AWS Firehose independent of
their partition keys. In both APIs, batches of events do not need to
share the same partition key.
This makes the protocol more efficient, as by default the partition key
is a random key that is different for every event.
@steven-aerts
Contributor

With the submission of #20653 I think this issue can now be closed.

@jszwedko
Member

jszwedko commented Jul 8, 2024

Agreed, thanks @steven-aerts. Closed by #20653

@jszwedko jszwedko closed this as completed Jul 8, 2024
ym pushed a commit to ym/vector that referenced this issue Aug 18, 2024
…ctordotdev#1407 (vectordotdev#20653)

Send batches to AWS Kinesis Data Streams and AWS Firehose independent of
their partition keys. In both APIs, batches of events do not need to
share the same partition key.
This makes the protocol more efficient, as by default the partition key
is a random key that is different for every event.
AndrooTheChen pushed a commit to discord/vector that referenced this issue Sep 23, 2024
…ctordotdev#1407 (vectordotdev#20653)

Send batches to AWS Kinesis Data Streams and AWS Firehose independent of
their partition keys. In both APIs, batches of events do not need to
share the same partition key.
This makes the protocol more efficient, as by default the partition key
is a random key that is different for every event.