Allow using event fields in s3 sink object_key #3310

cameronattard · 2023-09-07T01:11:19Z

Is your feature request related to a problem? Please describe.
Currently it seems like all objects from the s3 sink are sent using the same prefix, with only date-time being configurable. This means in order to retrieve a subset of events, e.g. logs from a specific hostname, you need to query all events for the time period.

Describe the solution you'd like
We would like to send events to different s3 object prefixes based on specific event fields, for example, hostname. This makes searching events in s3 simpler and cheaper as you can directly query the relevant subset of events.

Describe alternatives you've considered (Optional)
We could potentially use separate sinks for each subset of logs but this is not really dynamic or scalable.

Additional context
N/A

dlvenable · 2023-09-13T15:06:46Z

@cameronattard , Thank you for this suggestion. I think this could be a useful feature and could allow for Hive-style partitioning which is useful with use-cases such as Amazon Athena.

https://docs.aws.amazon.com/athena/latest/ug/partitions.html

One difficulty with this solution is that we would also need to route events to the desired object and have multiple objects "in-flight". This could work quite nicely with the new multipart buffer.

Would you be interested in taking this up?

cameronattard · 2023-09-14T03:25:33Z

@dlvenable thanks for the feedback. Unfortunately I have neither the expertise nor the bandwidth to implement this.

kkondaka · 2023-10-04T16:54:15Z

@dlvenable, it looks like the ask here is that we make the pattern in https://github.com/opensearch-project/data-prepper/blob/main/data-prepper-plugins/s3-sink/src/main/java/org/opensearch/dataprepper/plugins/sink/s3/configuration/ObjectKeyOptions.java to be configurable. We can make pattern configurable and allow expressions in it. I think that would help here. Also, we can add getHostName() function available in expressions and that would address the example case mentioned.

dlvenable · 2023-10-04T19:18:44Z

@kkondaka , That is the basic ask yes. However, it is somewhat more complicated because the S3 sink will need to have multiple S3 objects and group events to go into those objects. For example, if the pattern includes the timestamp's year, month, and date, then we must group the events into different objects corresponding to the event's timestamp - not the current timestamp.

Also, we should consider how this intersects with the thresholds. Should the thresholds be applied per group? Or for the entire sink? The per-group approach is natural, but could lead to memory issues as the sink could have dozens of groups.

dlvenable · 2023-10-04T19:28:27Z

Also, Data Prepper should support Hadoop file system partitioning. For example, you can partition by a timestamp:

path_prefix: "events/year=%{yyyy}/month=%{MM}/day=%{dd}/"

The example above will partition by the current time. But, we really want to partition by the timestamp. We will need some additional capability in Data Prepper to get part of a timestamp.

Perhaps a date-time format method?

path_prefix: "events/year=${date_time_format(eventTime, "YYYY")}/month=${date_time_format(eventTime, "MM")}/day=${date_time_format(eventTime, "dd")}/"

dlvenable · 2023-10-04T19:35:55Z

I created #3434 for the timestamp formatting.

@cameronattard, If you are looking to use time formatting, please take a look and provide any feedback on that proposal. Thanks!

cameronattard · 2023-10-04T23:48:06Z

I should clarify that hostname is just a generic example. Ideally we should be able to inject any arbitrary event field into the object key.

kkondaka · 2023-10-07T00:02:05Z

@cameronattard of course. That's why I was suggesting adding a support for expression, so that any field and functions can be part of the object name

faisalabujabal · 2024-02-20T17:11:08Z

@dlvenable using expressions in the s3 sink config is a feature our project really needs. can it also be applied to the s3 bucket name to support dynamic buckets extracted or constructed from the event?

dlvenable · 2024-05-16T15:46:16Z

@graytaylor0 , Is this resolved by #4346 and #4385?

graytaylor0 · 2024-05-16T15:49:27Z

Yes those add dynamic path_prefix and dynamic bucket support. They do not add support to configure the object_key in the s3 sink, but the ask here is just about configuring path_prefix and bucket, so closing this issue

MrR0807 · 2024-07-08T07:26:59Z

Hello,

Can you please add additional documentation and pipeline examples how one could utilize this functionality? It is a very useful one, but I cannot understand the correct syntax, nor paid AWS support knows how to write one.

cameronattard · 2024-07-08T23:30:44Z

Hello,

Can you please add additional documentation and pipeline examples how one could utilize this functionality? It is a very useful one, but I cannot understand the correct syntax, nor paid AWS support knows how to write one.

I'm using it at the moment, here is an example:

        object_key:
          path_prefix: "opensearch-ingestion/${/your_field_name}/%{yyyy}/%{MM}/%{dd}/%{HH}"

cameronattard added the untriaged label Sep 7, 2023

github-project-automation bot added this to Data Prepper Tracking Board Sep 7, 2023

github-project-automation bot moved this to Unplanned in Data Prepper Tracking Board Sep 7, 2023

dlvenable added enhancement New feature or request and removed untriaged labels Sep 13, 2023

dlvenable added this to the v2.6 milestone Oct 4, 2023

dlvenable moved this from Unplanned to To do in Data Prepper Tracking Board Oct 4, 2023

dlvenable mentioned this issue Oct 4, 2023

Data Prepper function to format a date-time #3434

Open

dlvenable modified the milestones: v2.6, v2.7 Oct 23, 2023

dlvenable modified the milestones: v2.7, v2.8 Nov 1, 2023

dlvenable mentioned this issue Mar 27, 2024

Support dynamically grouping Events together in the S3 sink #4345

Closed

dlvenable assigned graytaylor0 Apr 9, 2024

graytaylor0 closed this as completed May 16, 2024

github-project-automation bot moved this from To do to Done in Data Prepper Tracking Board May 16, 2024

dlvenable mentioned this issue May 16, 2024

Release Notes for version 2.8 #4538

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow using event fields in s3 sink object_key #3310

Allow using event fields in s3 sink object_key #3310

cameronattard commented Sep 7, 2023 •

edited

Loading

dlvenable commented Sep 13, 2023

cameronattard commented Sep 14, 2023

kkondaka commented Oct 4, 2023

dlvenable commented Oct 4, 2023

dlvenable commented Oct 4, 2023

dlvenable commented Oct 4, 2023

cameronattard commented Oct 4, 2023

kkondaka commented Oct 7, 2023

faisalabujabal commented Feb 20, 2024

dlvenable commented May 16, 2024

graytaylor0 commented May 16, 2024

MrR0807 commented Jul 8, 2024 •

edited

Loading

cameronattard commented Jul 8, 2024

Allow using event fields in s3 sink object_key #3310

Allow using event fields in s3 sink object_key #3310

Comments

cameronattard commented Sep 7, 2023 • edited Loading

dlvenable commented Sep 13, 2023

cameronattard commented Sep 14, 2023

kkondaka commented Oct 4, 2023

dlvenable commented Oct 4, 2023

dlvenable commented Oct 4, 2023

dlvenable commented Oct 4, 2023

cameronattard commented Oct 4, 2023

kkondaka commented Oct 7, 2023

faisalabujabal commented Feb 20, 2024

dlvenable commented May 16, 2024

graytaylor0 commented May 16, 2024

MrR0807 commented Jul 8, 2024 • edited Loading

cameronattard commented Jul 8, 2024

cameronattard commented Sep 7, 2023 •

edited

Loading

MrR0807 commented Jul 8, 2024 •

edited

Loading