
S3 buffer using pipeline transformations #4809

Open
dlvenable opened this issue Aug 2, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@dlvenable (Member)

Is your feature request related to a problem? Please describe.

For smaller workloads that require durability, using S3 as a buffer can be a good solution.

Describe the solution you'd like

Data Prepper already has a few things that we can combine to create an S3 buffer.

  1. An S3 source
  2. An S3 sink
  3. Pipeline transformations

I propose that we add a new buffer, pipeline_s3, which is implemented entirely as a pipeline transformation.

```yaml
my-pipeline:
  source:
    http:
  buffer:
    pipeline_s3:
      bucket: mybucket
  sink:
    - opensearch:
```
This would transform into:

```yaml
my-pipeline-source:
  source:
    http:
  buffer:
    bounded_blocking:
  sink:
    - s3:
        bucket: mybucket
```

```yaml
my-pipeline-sink:
  source:
    s3:
      scan:
        buckets:
          - bucket:
              name: mybucket
  buffer:
    bounded_blocking:
  sink:
    - opensearch:
```

Describe alternatives you've considered (Optional)

We could implement an S3 buffer similar to the Kafka buffer that does not require splitting the pipeline. However, the approach proposed here would be considerably faster to build.

Also, I think we should leave room for a possible S3 buffer that is implemented directly, without pipeline splitting. My proposal is to give this buffer a name distinct from a plain S3 buffer, and also to avoid confusion with other buffers such as Kafka. Thus, I called this pipeline_s3.

One alternative to changing the name is to use a flag instead - split_pipeline: true or asynchronous_buffer: true.
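With the flag approach, the configuration might look like the following sketch. This is illustrative only: the buffer keeps the name s3 and the flag name (split_pipeline here) is an assumption, not a settled design.

```yaml
# Hypothetical flag-based alternative: the flag, rather than the
# buffer name, opts into the pipeline-splitting transformation.
my-pipeline:
  source:
    http:
  buffer:
    s3:
      bucket: mybucket
      split_pipeline: true
  sink:
    - opensearch:
```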

Additional context

N/A

@kkondaka
Copy link
Collaborator

kkondaka commented Aug 6, 2024

David, we probably need some kind of partitioning mechanism (using folders) and need to make sure items in a partition are processed in order.
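One way the generated source pipeline could support folder-based partitioning is by writing objects under per-partition key prefixes. The sketch below is only an illustration of that idea; the object_key/path_prefix option names and the partition_key field are assumptions about how this might be expressed, not a settled design.

```yaml
# Sketch only: write each event's object under a per-partition
# folder so the consuming pipeline can scan and process the items
# of one partition in order. Option names are illustrative.
my-pipeline-source:
  sink:
    - s3:
        bucket: mybucket
        object_key:
          path_prefix: partition-${/partition_key}/
```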

@kkondaka kkondaka added enhancement New feature or request and removed untriaged labels Aug 13, 2024