Allow for sync from AWS Glue Catalog to AWS OpenSearch #3405

Open
ksco92 opened this issue Sep 28, 2023 · 0 comments
Labels
enhancement New feature or request

Comments

ksco92 commented Sep 28, 2023

Is your feature request related to a problem? Please describe.

No, it is a new feature.

Describe the solution you'd like

Currently, the S3 source used to ingest data from S3 requires fixed prefixes and schemas. However, larger enterprises run data lakes with very large data sets (~800M records daily per data set in my case). In AWS, the Glue Data Catalog keeps track of all of these tables and their metadata. If we could configure the catalog as a source and keep it in sync, we could remove unnecessary infrastructure and complexity from the indexing process at large scale.

As a source, a user should be able to provide Data Prepper with the following (see the sketch after this list):

  • Glue database name
  • Glue table name
  • Primary key fields of the table
  • Shard count
  • Replica count
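
For illustration, here is a minimal sketch of how those inputs could map to the Glue API, assuming boto3; the variable names are placeholders, not an existing Data Prepper configuration:

import boto3

# Hypothetical source settings a user would provide (names are illustrative).
GLUE_DATABASE = "my_database"
GLUE_TABLE = "my_table"
PRIMARY_KEYS = ["id"]  # would become the OpenSearch document _id
SHARD_COUNT = 3
REPLICA_COUNT = 1

glue = boto3.client("glue")

# Everything else the source needs (schema, S3 location, partition keys)
# already lives in the Glue Data Catalog.
table = glue.get_table(DatabaseName=GLUE_DATABASE, Name=GLUE_TABLE)["Table"]
columns = table["StorageDescriptor"]["Columns"]       # column name/type pairs
s3_location = table["StorageDescriptor"]["Location"]  # root S3 prefix of the table
partition_keys = [k["Name"] for k in table.get("PartitionKeys", [])]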

Data Prepper should then follow a process similar to this:

(Sequence diagram screenshot, rendered from the PlantUML below.)

PlantUML code for reference:

@startuml
skinparam maxMessageSize 150
autonumber
participant "OSIS" as osi
participant "Glue API" as ga
participant "OpenSearch" as os
participant "S3" as s3

osi -> ga: Get table metadata
osi -> os: Get all indices in the alias
osi -> osi: Compare indices in alias vs table partitions

loop For every missing index
osi -> os: Create index
osi -> os: Set refresh interval to -1
osi -> os: Add index to alias
osi -> s3: Get the data from the partition
osi -> os: Index the data
osi -> os: Set refresh interval back to original value
end

loop For every partition
osi -> s3: Check row count on the partition
osi -> os: Check record count in index
osi -> os: If record mismatch, set refresh interval to -1
osi -> s3: If record mismatch, get the data from the partition
osi -> os: If record mismatch, index the data
osi -> os: If record mismatch, set refresh interval back to original value
osi -> os: Purge deleted records
end

@enduml
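
For reference, a rough Python sketch of the two loops above, assuming boto3 and opensearch-py; the endpoint, constants, restored refresh interval, and the count_rows_in_partition helper are all placeholders rather than an existing implementation:

import boto3
from opensearchpy import OpenSearch

GLUE_DATABASE = "my_database"  # placeholder values
GLUE_TABLE = "my_table"
SHARD_COUNT = 3
REPLICA_COUNT = 1

glue = boto3.client("glue")
os_client = OpenSearch(hosts=["https://my-domain.example.com:443"])  # placeholder endpoint

alias = f"{GLUE_DATABASE}_{GLUE_TABLE}"

# Steps 1-3: diff Glue partitions against the indices behind the alias.
paginator = glue.get_paginator("get_partitions")
partitions = {
    f"{alias}_{'_'.join(p['Values'])}": p["Values"]
    for page in paginator.paginate(DatabaseName=GLUE_DATABASE, TableName=GLUE_TABLE)
    for p in page["Partitions"]
}
existing = set()
if os_client.indices.exists_alias(name=alias):
    existing = set(os_client.indices.get_alias(name=alias))

# Steps 4-9: backfill every missing index with refreshes disabled.
for index in sorted(set(partitions) - existing):
    os_client.indices.create(index=index, body={
        "settings": {"number_of_shards": SHARD_COUNT,
                     "number_of_replicas": REPLICA_COUNT},
    })
    os_client.indices.put_settings(index=index,
                                   body={"index": {"refresh_interval": "-1"}})
    os_client.indices.update_aliases(
        body={"actions": [{"add": {"index": index, "alias": alias}}]})
    # ... read the partition's files from S3 and bulk-index them here ...
    os_client.indices.put_settings(index=index,
                                   body={"index": {"refresh_interval": "1s"}})

# Steps 10-17: reconcile counts partition by partition.
for index, values in partitions.items():
    s3_rows = count_rows_in_partition(values)  # hypothetical helper over S3
    os_docs = os_client.count(index=index)["count"]
    if s3_rows != os_docs:
        # Re-run the same disable-refresh / re-index / restore-refresh cycle,
        # then purge documents whose primary keys no longer exist in S3.
        ...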

As a result, the OpenSearch cluster will have:

  • An alias called <database_name>_<table_name>
  • If the table is partitioned, one index per partition pointing to the alias, named in this format (see the naming helper below):
<database_name>_<table_name>_<partition1_value>_<partition2_value>_<partitionN_value>
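
A tiny helper showing the naming convention (the example values are made up):

def index_name(database, table, partition_values):
    # The alias is just <database>_<table>; each partitioned index appends
    # the partition values in order.
    return "_".join([database, table, *partition_values])

index_name("sales", "orders", ["2023", "09", "28"])
# -> 'sales_orders_2023_09_28'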

Describe alternatives you've considered (Optional)

I currently do this:

https://github.com/aws-samples/aws-s3-to-opensearch-pipeline

It's fast and it works, but an out-of-the-box solution would be preferable.
