Enhanced discovery through ElasticSearch #5
Comments
Hi @paulu-aws, what would the architecture look like for this? Do you have a diagram?
@martyn-swift, originally I had in mind an S3 trigger on the data-lake bucket that would invoke a Lambda to insert records into Elasticsearch. However, the cheat code for this is Quilt (https://quiltdata.com/). Full disclosure: I don't work for them, and this is my own opinion, not my employer's. Quilt is an awesome product for this kind of use case. I'm not sure I'd try to implement my own indexing and Elasticsearch architecture if Quilt were an option.
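For anyone who does want to try the S3-trigger route, here is a minimal sketch of what the Lambda side might look like. It only shows how the S3 event payload could be turned into index documents. The document fields and the `index_document` callable are assumptions for illustration (a stand-in for whatever Elasticsearch client call you end up using), not part of this project.

```python
import urllib.parse


def s3_event_to_documents(event):
    """Turn an S3 event notification payload into simple index documents."""
    docs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(s3["object"]["key"])
        docs.append({
            "bucket": s3["bucket"]["name"],
            "key": key,
            "size": s3["object"].get("size"),
            "event_name": record.get("eventName"),
            "event_time": record.get("eventTime"),
        })
    return docs


def handler(event, context, index_document=print):
    """Lambda entry point.

    `index_document` is a hypothetical hook: swap in a function that
    writes one document to your Elasticsearch index. It defaults to
    print so the sketch runs without a cluster.
    """
    docs = s3_event_to_documents(event)
    for doc in docs:
        index_document(doc)
    return {"indexed": len(docs)}
```

Keeping the parsing separate from the indexing call makes it easy to test the handler locally with a sample event before wiring it to a real cluster.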
@paulu-aws, does the Glue job write to S3 in large batches? Would the Lambda be triggered once per object? Is EventBridge an alternative for batching up S3 put events?
@martyn-swift, part of the beauty of Glue is how DynamicFrames abstract away this kind of detail, UNLESS you really want to control it.
@martyn-swift, I'll also mention that if you are trying to plug the Glue job into an event-driven architecture, it's probably not a good idea to rely on S3 triggers as the message bus. S3 triggers characterize write behavior on an S3 bucket, not logical processing steps. Multi-file outputs, failed writes, object versions, etc. all become challenges when you rely on S3 triggers as an eventing mechanism. Better options, in order from most complex to least complex (IMO), would be Amazon EventBridge, Amazon MQ, AWS Step Functions, and Amazon SNS. All of those services can be called directly from inside your Glue job using the Python (or Java) APIs to send messages over the duration of your job. For example, directly after the
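As a rough sketch of the EventBridge option, a Glue job could emit a custom event at each logical step. The `Source` and `DetailType` values, the function name, and the payload shape below are all made up for illustration; only the `put_events` call and its `Entries` fields come from the boto3 EventBridge client.

```python
import json


def emit_job_event(events_client, job_name, step, detail=None, bus_name="default"):
    """Send one custom event to EventBridge describing a logical job step.

    `events_client` is expected to be boto3.client("events"); it is passed
    in so the function can be exercised with a stub.
    """
    payload = {"job_name": job_name, "step": step}
    payload.update(detail or {})
    entry = {
        "Source": "custom.glue-job",            # hypothetical source name
        "DetailType": "Glue Job Step Completed",  # hypothetical detail type
        "Detail": json.dumps(payload),          # Detail must be a JSON string
        "EventBusName": bus_name,
    }
    response = events_client.put_events(Entries=[entry])
    # put_events reports per-entry failures instead of raising.
    if response.get("FailedEntryCount", 0) > 0:
        raise RuntimeError(f"EventBridge rejected the event: {response.get('Entries')}")
    return entry


# Inside a Glue job this might be called as (assumed names):
#   import boto3
#   emit_job_event(boto3.client("events"), "my-glue-job", "write-complete")
```

The job's execution role would need `events:PutEvents` permission on the target bus for this to work.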
@paulu-aws, can you give me an example of the SNS call? Is it using boto3, with a call after job.commit()?
The SNS call would look like any other Boto3 call you might see in the API basics examples. You just need to make sure your Glue job's execution role has IAM permissions to publish to the SNS topic. You may eventually find yourself peppering in several Just keep in mind, doing this will start you down a path of blending business workflow state with your data engineering code, and those are probably best kept separate. As a one-off or short-term solution, SNS inside the Glue job is fine. But anything else really deserves a more sophisticated business workflow framework like AWS Step Functions. Here is a quick Step Functions state machine I mocked up in a few minutes that triggers a Glue workflow and publishes to SNS topics WITHOUT requiring any changes to the Glue job itself.
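To make the SNS option concrete, here is a minimal sketch of a publish helper that could be called after `job.commit()`. The function name, message shape, and topic ARN are assumptions for illustration; the `publish` call with `TopicArn`, `Subject`, and `Message` is the standard boto3 SNS client API.

```python
import json


def publish_job_status(sns_client, topic_arn, job_name, status):
    """Publish a small JSON status message to an SNS topic.

    `sns_client` is expected to be boto3.client("sns"); it is passed in
    so the helper can be tried out with a stub.
    """
    message = {"job_name": job_name, "status": status}
    response = sns_client.publish(
        TopicArn=topic_arn,
        Subject=f"Glue job {job_name}: {status}",
        Message=json.dumps(message),
    )
    return response["MessageId"]


# At the end of a Glue script this might look like (placeholder ARN):
#   import boto3
#   job.commit()
#   publish_job_status(
#       boto3.client("sns"),
#       "arn:aws:sns:us-east-1:123456789012:example-topic",
#       "my-glue-job",
#       "SUCCEEDED",
#   )
```

The execution role needs `sns:Publish` on that topic, as mentioned above.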