Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unnest DynamoDB JSON format #19

Open
borgoat opened this issue Nov 8, 2023 · 1 comment
Open

Unnest DynamoDB JSON format #19

borgoat opened this issue Nov 8, 2023 · 1 comment

Comments

@borgoat
Copy link

borgoat commented Nov 8, 2023

Hello DuckDB team!

I'm not sure if this repo is the best place to discuss this, since it's not something that strictly depends on the AWS SDK. AFAIK Dynamo unmarshaling is not in the C++ SDK at least1. However, it is quite related to AWS usage in general...

I'm trying to run some ETL job on a DynamoDB table previously exported to S32 as NDJSON files. DuckDB will happily read it, but then I'm stuck with this fairly impractical schema generated by DynamoDB.

D select PK, SK, __typename, isMarketOpen from main limit 5;
┌────────────────────────────────────────────────────┬──────────────────────────────────────────────┬──────────────────────────────┬──────────────────────┐
│                         PK                         │                      SK                      │          __typename          │     isMarketOpen     │
│                 struct(s varchar)                  │              struct(s varchar)               │      struct(s varchar)       │ struct(bool boolean) │
├────────────────────────────────────────────────────┼──────────────────────────────────────────────┼──────────────────────────────┼──────────────────────┤
│ {'S': OBJECT#e036c748-61bb-4431-96da-eeb340c96ec7} │ {'S': QUOTE#NA9#XETR}                        │ {'S': Quote}                 │ {'BOOL': true}       │
│ {'S': USER#Google_101534313370849411321}           │ {'S': #PROFILE#Google_101534313370849411321} │ {'S': User}                  │                      │
│ {'S': OBJECT#ao_01H9QN66R50VDVAMZ79CXRHNE9}        │ {'S': OBJECT}                                │ {'S': ActivityStreamsObject} │                      │
│ {'S': OBJECT#457acd68-8900-4083-92c7-e3891c39648d} │ {'S': OBJECT}                                │ {'S': Stock}                 │                      │
│ {'S': OBJECT#457acd68-8900-4083-92c7-e3891c39648d} │ {'S': QUOTE#DEFAULT}                         │ {'S': Quote}                 │ {'BOOL': true}       │
└────────────────────────────────────────────────────┴──────────────────────────────────────────────┴──────────────────────────────┴──────────────────────┘

This will include some maps, lists, and sets that should be recursively unnested into DuckDB structs and lists to process them effectively.

AWS Glue supposedly offers a DynamicFrame transformation to do the same3 (it doesn't actually work when it comes to nested objects, which is where it's needed the most... but that's a different story)

I think this could be an extension function in DuckDB. What do you think? I'd be happy to work on it if you agree on where it should be implemented.

Thanks!
Giorgio

Footnotes

  1. https://github.com/aws/aws-sdk-cpp/issues/1556

  2. https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/S3DataExport.Output.html

  3. https://docs.aws.amazon.com/glue/latest/dg/glue-etl-scala-apis-glue-dynamicframe-class.html#glue-etl-scala-apis-glue-dynamicframe-class-defs-unnestddbjson

@samansmink
Copy link
Collaborator

Hey @borgoat!

This seems like a useful utility function to me, that should live in the AWS extension! If you would like to add it, i can definitely review and merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants