Is your feature request related to a problem? Please describe.
No, it is a new feature.
Describe the solution you'd like
Currently, the S3 source used to ingest data from S3 requires fixed prefixes and schemas. However, larger enterprises have data lakes for large-scale data sets (~800M records daily per data set in my case). In AWS, Glue keeps track of all the catalog tables and their metadata. If we could set up our Glue catalog as a source and keep it in sync, we could remove unnecessary infrastructure and complexity from the indexing process at large scale.
As a source, a user should be able to provide Data Prepper with:
- Glue database name
- Glue table name
- Primary key fields of the table
- Shard count
- Replica count
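As a rough illustration (not part of Data Prepper; all names are placeholders), the first two inputs map onto Glue API calls like these for table metadata and partition discovery:

```python
# Hypothetical sketch: resolving the user-supplied Glue settings into
# table metadata and per-partition S3 locations with boto3.
import boto3

glue = boto3.client("glue")

GLUE_DATABASE = "my_datalake_db"  # placeholder Glue database name
GLUE_TABLE = "events"             # placeholder Glue table name

# Table-level metadata: columns, partition keys, base S3 location.
table = glue.get_table(DatabaseName=GLUE_DATABASE, Name=GLUE_TABLE)["Table"]
columns = table["StorageDescriptor"]["Columns"]
partition_keys = [k["Name"] for k in table.get("PartitionKeys", [])]

# Partition-level metadata: one S3 location per partition (paginated).
partition_locations = {}
for page in glue.get_paginator("get_partitions").paginate(
    DatabaseName=GLUE_DATABASE, TableName=GLUE_TABLE
):
    for p in page["Partitions"]:
        partition_locations[tuple(p["Values"])] = p["StorageDescriptor"]["Location"]
```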
Data Prepper should then follow a process similar to this:
PlantUML code for reference:
@startuml
skinparam maxMessageSize 150
autonumber
participant "OSIS" as osi
participant "Glue API" as ga
participant "OpenSearch" as os
participant "S3" as s3
osi -> ga: Get table metadata
osi -> os: Get all indices in the alias
osi -> osi: Compare indices in alias vs table partitions
loop For every missing index
osi -> os: Create index
osi -> os: Set refresh interval to -1
osi -> os: Add index to alias
osi -> s3: Get the data from the partition
osi -> os: Index the data
osi -> os: Set refresh interval back to original value
end
loop For every partition
osi -> s3: Check row count on the partition
osi -> os: Check record count in index
osi -> os: If record mismatch, set refresh interval to -1
osi -> s3: If record mismatch, get the data from the partition
osi -> os: If record mismatch, index the data
osi -> os: If record mismatch, set refresh interval back to original value
osi -> os: Purge deleted records
end
@enduml
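And here is a minimal sketch of the two loops in the diagram, continuing from the boto3 snippet earlier and using the opensearch-py client. The index naming scheme, the shard/replica values, and the `load_partition_records`/`count_partition_rows` helpers are hypothetical stand-ins, not actual Data Prepper behavior:

```python
# Hypothetical sketch of the two sync loops in the diagram above (not the
# actual Data Prepper implementation). Reuses GLUE_DATABASE, GLUE_TABLE and
# partition_locations from the earlier boto3 snippet.
from opensearchpy import OpenSearch, helpers

client = OpenSearch(hosts=["https://localhost:9200"], http_auth=("admin", "admin"))

ALIAS = f"{GLUE_DATABASE}_{GLUE_TABLE}"  # <database_name>_<table_name>
ORIGINAL_REFRESH = "1s"                  # the index's normal refresh interval


def load_partition_records(location):
    """Placeholder: stream records (dicts) from the partition's S3 location."""
    raise NotImplementedError


def count_partition_rows(location):
    """Placeholder: row count for the partition, e.g. via Athena."""
    raise NotImplementedError


def index_name_for(values):
    return f"{ALIAS}_{'_'.join(values)}"  # invented per-partition naming scheme


def sync_partition(index, location):
    # Disable refresh while bulk indexing, then restore the original value.
    client.indices.put_settings(index=index, body={"index": {"refresh_interval": "-1"}})
    helpers.bulk(
        client,
        (
            {"_index": index, "_id": rec["id"], "_source": rec}  # "id" = placeholder primary key
            for rec in load_partition_records(location)
        ),
    )
    client.indices.put_settings(index=index, body={"index": {"refresh_interval": ORIGINAL_REFRESH}})


# Loop 1: create and backfill every partition index missing from the alias.
existing = set(client.indices.get_alias(name=ALIAS)) if client.indices.exists_alias(name=ALIAS) else set()
for values, location in partition_locations.items():
    index = index_name_for(values)
    if index in existing:
        continue
    client.indices.create(
        index=index,
        body={"settings": {"number_of_shards": 3, "number_of_replicas": 1}},  # user-supplied counts
    )
    client.indices.update_aliases(body={"actions": [{"add": {"index": index, "alias": ALIAS}}]})
    sync_partition(index, location)

# Loop 2: reconcile counts and re-sync any partition that drifted.
for values, location in partition_locations.items():
    index = index_name_for(values)
    if client.count(index=index)["count"] != count_partition_rows(location):
        sync_partition(index, location)
        # The "purge deleted records" step would need extra handling here,
        # e.g. a delete_by_query for ids no longer present in the partition.
```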
As a result, the OpenSearch cluster will contain:
- An alias called <database_name>_<table_name>
- If the table is partitioned, an index for each partition, pointing to the alias
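As a quick illustrative check (again with placeholder names, since the exact per-partition index format is not pinned down here), listing the alias would show the per-partition indices behind it:

```python
# Hypothetical check of the resulting layout: which indices back the alias.
print(client.indices.get_alias(name="my_datalake_db_events"))
# e.g. {"my_datalake_db_events_2024_01_01": {"aliases": {"my_datalake_db_events": {}}}, ...}
```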
Describe alternatives you've considered (Optional)
I currently do this:
https://github.com/aws-samples/aws-s3-to-opensearch-pipeline
It's fast and it works, but an out-of-the-box solution would be preferable.