
Batch multiple events as a single record in aws_kinesis_firehose sink #1407

Closed
ghost opened this issue Dec 20, 2019 · 12 comments
Labels
domain: networking (Anything related to Vector's networking)
domain: performance (Anything related to Vector's performance)
meta: good first issue (Anything that is good for new contributors)
sink: aws_kinesis_firehose (Anything `aws_kinesis_firehose` sink related)
type: enhancement (A value-adding code change that enhances its existing functionality)

Comments

@ghost

ghost commented Dec 20, 2019

The newly added aws_kinesis_firehose sink encodes each event as a separate record. However, in the case of log data, there is the following caveat in the Firehose pricing:

Pricing is based on volume of data ingested into Amazon Kinesis Data Firehose, which is calculated as the number of data records you send to the service, times the size of each record rounded up to the nearest 5KB.

This means that if each log event is, for example, smaller than 500 bytes, then packing as many events as possible into a single record reduces the cost at least tenfold.
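For a rough sense of the arithmetic, here is a minimal sketch; the 5 KB billing unit comes from the pricing quote above, while the 500-byte event size and the 10-events-per-record packing are illustrative assumptions:

```rust
// Back-of-the-envelope illustration only: the 5 KB unit is from the pricing
// quote above; the 500-byte event size and 10-events-per-record packing are
// hypothetical numbers, not measurements.
const BILLING_UNIT: u64 = 5 * 1024; // each record is billed rounded up to 5 KB

fn billed_bytes(record_size: u64) -> u64 {
    // Round the record size up to the nearest billing unit.
    ((record_size + BILLING_UNIT - 1) / BILLING_UNIT) * BILLING_UNIT
}

fn main() {
    let event_size = 500; // bytes per log event
    let events = 10_000u64;

    // One event per record: every 500-byte event is billed as a full 5 KB.
    let unbatched = events * billed_bytes(event_size);

    // Ten events packed into one ~5 KB record: one billing unit per record.
    let batched = (events / 10) * billed_bytes(event_size * 10);

    // Prints roughly a 10x difference in billed volume.
    println!("unbatched: {unbatched} bytes, batched: {batched} bytes");
}
```

With these numbers, 10,000 events are billed as roughly 51 MB when sent one per record versus roughly 5 MB when packed, which is the tenfold difference mentioned above.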

So I propose adding an additional Boolean option to the sink configuration that would enable squashing multiple events into single records where possible using a greedy approach, and probably even enabling this option by default. See the appendix for some considerations about delimiting the records.

With this feature, Vector could provide real cost-saving value just by acting as middleware between Firehose and applications that need to write log events in real time (without implementing any batching themselves).

Appendix

All records sent to Firehose need to end with \n because Firehose itself can batch multiple records into a single one before sending them further (as specified in the PutRecord docs). This means that, for example in the case of an S3 destination, sending multiple events separated by \n (and terminated by \n) as a single record is equivalent to sending each event as a separate record (also terminated by \n).
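As a sketch of what the greedy, newline-delimited packing could look like (the function name, signature, and the 1,000,000-byte size limit are illustrative assumptions, not Vector code or the real Firehose record limit):

```rust
// Minimal sketch of the proposed greedy packing: join newline-terminated
// events into records no larger than `max_record_size`.
fn pack_events(events: &[String], max_record_size: usize) -> Vec<Vec<u8>> {
    let mut records: Vec<Vec<u8>> = Vec::new();
    let mut current: Vec<u8> = Vec::new();

    for event in events {
        let line = format!("{event}\n"); // every event keeps its trailing newline
        if !current.is_empty() && current.len() + line.len() > max_record_size {
            // The current record is full; flush it and start a new one.
            records.push(std::mem::take(&mut current));
        }
        current.extend_from_slice(line.as_bytes());
    }
    if !current.is_empty() {
        records.push(current);
    }
    records
}

fn main() {
    let events: Vec<String> = (0..10).map(|i| format!("log line {i}")).collect();
    let records = pack_events(&events, 1_000_000);
    println!("{} events packed into {} record(s)", events.len(), records.len());
}
```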

@ghost ghost added type: enhancement A value-adding code change that enhances its existing functionality. sink: aws_kinesis_firehose Anything `aws_kinesis_firehose` sink related labels Dec 20, 2019
@binarylogic
Contributor

@a-rodin how is this not accomplished via the batch_size option? And is this also relevant for the aws_kinesis_streams sink?

@ghost
Author

ghost commented Dec 21, 2019

@binarylogic As far as I understood the code, currently multiple records are batched in a single request, but each record contains a single event.

I think this could be applied to aws_kinesis_streams too, but the possible cost reductions would not be as dramatic because of the different pricing model used by Kinesis Streams.

@binarylogic
Contributor

Ah, yes, I can confirm this does significantly improve throughput. We implemented this strategy with the Timber pipeline and also lz4-compressed the data beforehand.
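For illustration, a hedged sketch of compressing a packed record before handing it to the sink, loosely in the spirit of the approach described above; the lz4_flex crate and the surrounding shape are assumptions, not the actual Timber pipeline code:

```rust
// Sketch only: compress a record built from newline-terminated events.
// Assumes the `lz4_flex` crate as a stand-in for whatever lz4 binding is used.
use lz4_flex::{compress_prepend_size, decompress_size_prepended};

fn main() {
    // A record built by joining newline-terminated events, as proposed above.
    let record: String = (0..1000).map(|i| format!("log line {i}\n")).collect();

    let compressed = compress_prepend_size(record.as_bytes());
    println!("raw: {} bytes, lz4: {} bytes", record.len(), compressed.len());

    // The consumer side would reverse this step before splitting on newlines.
    let restored = decompress_size_prepended(&compressed).expect("valid lz4 frame");
    assert_eq!(restored, record.as_bytes());
}
```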

@erikbos

erikbos commented Dec 30, 2019

Great! Most managed/hosted streaming services tend to charge per record, so batching multiple events into a single record helps avoid service limits and saves cost.

I'd love to see it for the Kafka sink as well; see #560 (comment) for more detailed comments about that. Shall I open an issue for that?

@lukesteensen
Member

@MOZGIII what do you think about using the merge transform to address cases like this? Seems like that could be a little cleaner than adding this logic to all the potentially relevant sinks.

@MOZGIII
Contributor

MOZGIII commented Jan 15, 2020

If it's sufficient, using the merge transform is definitely possible here. I'm confused by the terminology in this domain (Kinesis/Firehose) and don't quite get how events and records relate to each other, so I might not be aware enough to resolve this question properly. I'd suggest trying to come up with a sample configuration to see whether it's actually possible to implement with the merge transform. I can work on this.
In fact, I think it'd be nice to investigate what use cases the merge transform can cover. This would allow us to deduce which partial-event marking mechanisms would be useful to implement, in addition to the regex-based ones we have now. I'll elaborate more on this at #1488.

@MOZGIII
Contributor

MOZGIII commented Jan 15, 2020

Whether to extend the sink or suggest users add the merge transform manually is a separate question. We added the merge transform to the docker source, but in general I'd prefer to tell users to add merge transforms explicitly. As it is now, that would really simplify the code and reduce the number of things that can go wrong. I think we do have to continuously test various configuration setups somehow, e.g. by adding them as correctness tests to our test harness.

On the other hand, this is a question of good defaults. To me, it seems like if we want to add something as a default, we don't currently have a way other than to extend a sink/source in the code. If we continue this practice, I'd prepare some better way of doing this rather than getting into the code of each implementation, just to make our lives as maintainers easier. I'm talking about more explicit extension points at the sinks and sources, in particular such that if we want to attach a transform to a source or sink we wouldn't have to read through the whole implementation source to determine where to put the transform. Having that would make it much easier to work with "built-in transforms".

@binarylogic
Contributor

On the other hand, this is a question of good defaults.

This is important. We lean towards convention over configuration. This specific issue is such a fundamental one that I think it should be solved by default. To us it seems obvious, because we know about the problem and are aware of all the options Vector provides, but to a user it is not so obvious.

we don't currently have a way other than to extend a sink/source in the code. If we continue this practice, I'd prepare some better way of doing this rather than getting into the code of each implementation

I'm interested in what you're thinking here. We've had a lot of related discussions around this, and @LucioFranco is currently working on something related (wrapping sinks to extend their behavior). #1061 and #832 also solve this in different ways.

@MOZGIII MOZGIII mentioned this issue Jan 16, 2020
@MOZGIII
Contributor

MOZGIII commented Jan 16, 2020

I was thinking in particular about something like changing the Source trait in such a way that it exposes extension points to easily inject transforms into the source. After doing some research, I realized it's not as promising as I initially anticipated. Sources are very different from each other, and it's not worth generalizing the extension points, because they'll be different for each source. Just going into the code would be easier in general. We can experiment with building sources designed around exposing well-defined extension points, but it's easier to do that with new sources built from the ground up.

@binarylogic binarylogic added domain: performance Anything related to Vector's performance domain: networking Anything related to Vector's networking meta: good first issue Anything that is good for new contributors. labels Aug 7, 2020
@awprice

awprice commented Dec 18, 2023

We're seeing the impact of not batching events with the aws_kinesis_streams sink too. Observing the traffic between Vector and the Kinesis endpoint, we see the sink placing a single event into each request. At low rates, i.e. less than 5,000 events per second, this doesn't seem to be much of an issue performance-wise, but when exceeding 5,000 events per second, the throughput of the sink suffers seriously. We could get to about 10,000 events per second, at which point rate limiting by AWS prevents it from going any higher.

We noticed that the sink batches events by the partition_key, so having a randomly generated key, whilst good for ensuring events are spread evenly across Kinesis shards, means that each event will be in its own request.

To work around this, we generate our own partition key and use partition_key_field so the sink uses it instead of the randomly generated one. We incorporate a timestamp with the fractional seconds formatted to 3 decimal places, so that events occurring in the same millisecond are batched together:

.@partition_key = md5(format_timestamp!(.timestamp, format: "%Y-%m-%dT%H:%M:%S%.3f%:z"))

By doing the above, we've seen the throughput of the aws_kinesis_streams sink improve greatly and become sufficient for our needs.

I thought I would share the above in case someone else stumbles across this issue.


All of that being said, it would be great to see both the aws_kinesis_firehose and aws_kinesis_streams sinks batch multiple events into a single request automatically rather than having to rely on a workaround.

The one downside to the workaround is that the custom @partition_key will also be present in the event being sent off to Kinesis, which ideally we'd like to avoid.

steven-aerts added a commit to steven-aerts/vector that referenced this issue Jun 12, 2024
…ctordotdev#1407

Send batches to AWS Kinesis Data Streams and AWS Firehose independent of
their partition keys. In both APIs, batches of events do not need to
share the same partition key.
This makes the protocol more efficient, as by default the partition key
is a random key that is different for every event.
steven-aerts added a commit to steven-aerts/vector that referenced this issue Jul 2, 2024
…ctordotdev#1407

Send batches to AWS Kinesis Data Streams and AWS Firehose independent of
their partition keys. In both APIs, batches of events do not need to
share the same partition key.
This makes the protocol more efficient, as by default the partition key
is a random key that is different for every event.
github-merge-queue bot pushed a commit that referenced this issue Jul 3, 2024
Send batches to AWS Kinesis Data Streams and AWS Firehose independent of
their partition keys. In both APIs, batches of events do not need to
share the same partition key.
This makes the protocol more efficient, as by default the partition key
is a random key that is different for every event.
@steven-aerts
Contributor

With the submission of #20653 I think this issue can now be closed.

@jszwedko
Member

jszwedko commented Jul 8, 2024

Agreed, thanks @steven-aerts. Closed by #20653

@jszwedko jszwedko closed this as completed Jul 8, 2024
ym pushed a commit to ym/vector that referenced this issue Aug 18, 2024
…ctordotdev#1407 (vectordotdev#20653)

Send batches to AWS Kinesis Data Streams and AWS Firehose independent of
their partition keys. In both APIs, batches of events do not need to
share the same partition key.
This makes the protocol more efficient, as by default the partition key
is a random key that is different for every event.
AndrooTheChen pushed a commit to discord/vector that referenced this issue Sep 23, 2024
…ctordotdev#1407 (vectordotdev#20653)

Send batches to AWS Kinesis Data Streams and AWS Firehose independent of
their partition keys. In both APIs, batches of events do not need to
share the same partition key.
This makes the protocol more efficient, as by default the partition key
is a random key that is different for every event.