We use the S3 file name template prefix/%Y_%m_%d__%H_%M_%S_%f so that file names sort alphabetically.
Each new file is then guaranteed to receive a name that sorts after all earlier ones.
In Kafka, our topic has several partitions, and all of them must be written in order under the same prefix (prefix=topic_name).
With this template, file order can be ensured by running no more than one connector task.
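To illustrate why this template sorts correctly, here is a minimal Python sketch. It assumes the template is interpreted with strftime-style fields and that a single writer produces the names; the `file_name` helper and the sample timestamps are made up for the example.

```python
from datetime import datetime, timedelta

# Template from the description above (strftime-style fields).
TEMPLATE = "prefix/%Y_%m_%d__%H_%M_%S_%f"


def file_name(ts: datetime) -> str:
    """Format a timestamp into a file name (hypothetical helper)."""
    return ts.strftime(TEMPLATE)


# Sample timestamps 250 ms apart, crossing a day boundary.
start = datetime(2020, 3, 1, 23, 59, 59, 999_000)
stamps = [start + timedelta(milliseconds=250 * i) for i in range(8)]
names = [file_name(ts) for ts in stamps]

# Every field is zero-padded and ordered from most to least significant,
# so alphabetical order matches chronological order.
print(names == sorted(names))  # True
```

This only holds while every field has a fixed width and the names come from one monotonically increasing clock, which is why we run a single task.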
Issue:
The timestamp variable has the following parameters:
unit parameter values:
yyyy - year, e.g. 2020 (please note that YYYY is deprecated and is interpreted as yyyy)
MM - month, e.g. 03
dd - day, e.g. 01
HH - hour, e.g. 23
Consequences:
With these parameters, files written within the same hour get the same name.
Adding the partition number and offset to the file name in the template can solve this problem, but it makes working with the root prefix more difficult.
Uniqueness can be ensured by adding minutes, seconds, milliseconds to the timestamp variable.
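A small Python sketch of the collision, for illustration only. It assumes the connector's yyyy, MM, dd, HH units behave like the strftime fields %Y, %m, %d, %H; the two timestamps are invented.

```python
from datetime import datetime

# Assumed strftime equivalents of the connector's units.
HOURLY = "%Y_%m_%d__%H"           # yyyy, MM, dd, HH only
FINE = "%Y_%m_%d__%H_%M_%S_%f"    # adds minutes, seconds, fraction of a second

# Two files written 40 minutes apart within the same hour.
first = datetime(2020, 3, 1, 10, 5, 0)
second = datetime(2020, 3, 1, 10, 45, 30, 500_000)

hourly_names = {first.strftime(HOURLY), second.strftime(HOURLY)}
fine_names = {first.strftime(FINE), second.strftime(FINE)}

print(len(hourly_names))  # 1: both files get the same name
print(len(fine_names))    # 2: the finer units tell them apart
```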
Details:
It looks like it would be enough to extend the following functionality:
Uniqueness can be ensured by adding minutes, seconds, milliseconds to the timestamp variable.
I'm not sure this is the case. At least it won't be guaranteed, as there's still a chance for messages from different partitions to end up in the same file (if I'm understanding your case correctly).
By reducing the time unit we are just hoping that within a shorter time window there are only messages from one partition.
I guess adding minutes is a valid request, as some users may want to have messages rotated more frequently, but I'm not sure going down to microseconds is the right way to proceed. Moreover, this would add pressure to the connector task, as it would lead to a large number of keys to keep in memory before all messages are flushed to S3.
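To make both concerns concrete, here is a rough Python sketch. It assumes files are keyed purely by the formatted record timestamp, which is a simplification and not the connector's actual grouping code; the records and the `group_by_key` helper are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical records: (partition, offset, timestamp).
records = [
    (0, 100, datetime(2020, 3, 1, 10, 5, 0, 120_000)),
    (1, 57, datetime(2020, 3, 1, 10, 5, 0, 480_000)),
    (0, 101, datetime(2020, 3, 1, 10, 5, 1, 10_000)),
]


def group_by_key(records, pattern):
    """Group records into files keyed by the formatted timestamp."""
    files = defaultdict(list)
    for partition, offset, ts in records:
        files[ts.strftime(pattern)].append((partition, offset))
    return dict(files)


per_second = group_by_key(records, "%Y_%m_%d__%H_%M_%S")
per_micro = group_by_key(records, "%Y_%m_%d__%H_%M_%S_%f")

# Files that contain messages from more than one partition.
mixed = {k: v for k, v in per_second.items() if len({p for p, _ in v}) > 1}

print(len(per_second), len(mixed), len(per_micro))  # 2 1 3
```

With per-second keys, one file still mixes partitions 0 and 1. With microsecond keys, every record becomes its own key, which is the memory pressure I mentioned.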
Do you require ordering only between files, or also ordering of the messages within each file?
You may be solving the first kind of ordering with your proposed approach, but I don't think the second is guaranteed, as it mixes messages from different partitions in the same file.
Not sure yet how to handle this, but let's see if we can find a workaround with the existing configurations before considering changing the connector.