Is your feature request related to a problem? Please describe.
No, it is a new feature.
Describe the solution you'd like
Currently, the S3 source used to ingest data from S3 requires fixed prefixes and schemas. However, larger enterprises have data lakes for large-scale data sets (~800M records daily per data set in my case). In AWS, Glue keeps track of all the catalog tables and their metadata. If we could set up our Glue catalog as a source and keep it in sync, we could remove unnecessary infrastructure and complexity from the indexing process at large scale.
As a source, a user should be able to provide Data Prepper with:
- Glue database name
- Glue table name
- Primary key fields of the table
- Shard count
- Replica count
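As a rough illustration (not part of Data Prepper; all names are placeholders), the first two inputs map onto Glue API calls like these for table metadata and partition discovery:

```python
# Hypothetical sketch: resolving the user-supplied Glue settings into
# table metadata and per-partition S3 locations with boto3.
import boto3

glue = boto3.client("glue")

GLUE_DATABASE = "my_datalake_db"  # placeholder Glue database name
GLUE_TABLE = "events"             # placeholder Glue table name

# Table-level metadata: columns, partition keys, base S3 location.
table = glue.get_table(DatabaseName=GLUE_DATABASE, Name=GLUE_TABLE)["Table"]
columns = table["StorageDescriptor"]["Columns"]
partition_keys = [k["Name"] for k in table.get("PartitionKeys", [])]

# Partition-level metadata: one S3 location per partition (paginated).
partition_locations = {}
for page in glue.get_paginator("get_partitions").paginate(
    DatabaseName=GLUE_DATABASE, TableName=GLUE_TABLE
):
    for p in page["Partitions"]:
        partition_locations[tuple(p["Values"])] = p["StorageDescriptor"]["Location"]
```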
Data Prepper should then follow a process similar to this:
PlantUML code for reference:
@startuml
skinparam maxMessageSize 150
autonumber
participant "OSIS" as osi
participant "Glue API" as ga
participant "OpenSearch" as os
participant "S3" as s3
osi -> ga: Get table metadata
osi -> os: Get all indices in the alias
osi -> osi: Compare indices in alias vs table partitions
loop For every missing index
osi -> os: Create index
osi -> os: Set refresh interval to -1
osi -> os: Add index to alias
osi -> s3: Get the data from the partition
osi -> os: Index the data
osi -> os: Set refresh interval back to original value
end
loop For every partition
osi -> s3: Check row count on the partition
osi -> os: Check record count in index
osi -> os: If record mismatch, set refresh interval to -1
osi -> s3: If record mismatch, get the data from the partition
osi -> os: If record mismatch, index the data
osi -> os: If record mismatch, set refresh interval back to original value
osi -> os: Purge deleted records
end
@enduml
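And here is a minimal sketch of the two loops in the diagram, continuing from the boto3 snippet earlier and using the opensearch-py client. The index naming scheme, the shard/replica values, and the `load_partition_records`/`count_partition_rows` helpers are hypothetical stand-ins, not actual Data Prepper behavior:

```python
# Hypothetical sketch of the two sync loops in the diagram above (not the
# actual Data Prepper implementation). Reuses GLUE_DATABASE, GLUE_TABLE and
# partition_locations from the earlier boto3 snippet.
from opensearchpy import OpenSearch, helpers

client = OpenSearch(hosts=["https://localhost:9200"], http_auth=("admin", "admin"))

ALIAS = f"{GLUE_DATABASE}_{GLUE_TABLE}"  # <database_name>_<table_name>
ORIGINAL_REFRESH = "1s"                  # the index's normal refresh interval


def load_partition_records(location):
    """Placeholder: stream records (dicts) from the partition's S3 location."""
    raise NotImplementedError


def count_partition_rows(location):
    """Placeholder: row count for the partition, e.g. via Athena."""
    raise NotImplementedError


def index_name_for(values):
    return f"{ALIAS}_{'_'.join(values)}"  # invented per-partition naming scheme


def sync_partition(index, location):
    # Disable refresh while bulk indexing, then restore the original value.
    client.indices.put_settings(index=index, body={"index": {"refresh_interval": "-1"}})
    helpers.bulk(
        client,
        (
            {"_index": index, "_id": rec["id"], "_source": rec}  # "id" = placeholder primary key
            for rec in load_partition_records(location)
        ),
    )
    client.indices.put_settings(index=index, body={"index": {"refresh_interval": ORIGINAL_REFRESH}})


# Loop 1: create and backfill every partition index missing from the alias.
existing = set(client.indices.get_alias(name=ALIAS)) if client.indices.exists_alias(name=ALIAS) else set()
for values, location in partition_locations.items():
    index = index_name_for(values)
    if index in existing:
        continue
    client.indices.create(
        index=index,
        body={"settings": {"number_of_shards": 3, "number_of_replicas": 1}},  # user-supplied counts
    )
    client.indices.update_aliases(body={"actions": [{"add": {"index": index, "alias": ALIAS}}]})
    sync_partition(index, location)

# Loop 2: reconcile counts and re-sync any partition that drifted.
for values, location in partition_locations.items():
    index = index_name_for(values)
    if client.count(index=index)["count"] != count_partition_rows(location):
        sync_partition(index, location)
        # The "purge deleted records" step would need extra handling here,
        # e.g. a delete_by_query for ids no longer present in the partition.
```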
As a result, the OpenSearch cluster will contain:
- An alias called <database_name>_<table_name>
- If the table is partitioned, an index for each partition, pointing to the alias
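As a quick illustrative check (again with placeholder names, since the exact per-partition index format is not pinned down here), listing the alias would show the per-partition indices behind it:

```python
# Hypothetical check of the resulting layout: which indices back the alias.
print(client.indices.get_alias(name="my_datalake_db_events"))
# e.g. {"my_datalake_db_events_2024_01_01": {"aliases": {"my_datalake_db_events": {}}}, ...}
```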
Describe alternatives you've considered (Optional)
I currently do this:
https://github.com/aws-samples/aws-s3-to-opensearch-pipeline
It's fast and it works, but an out-of-the-box solution would be preferable.