feature(ingest/athena): introduce support for complex and nested schemas in Athena #8137

bossenti · 2023-05-26T12:17:05Z

PR summary

This PR introduces a major improvement to the Athena source. Currently, the AWS Athena source cannot handle fields with complex data types such as map, array, or struct. Instead, they are all detected and handled as string values, which is wrong. This is not directly a DataHub issue, since the behavior originates in the underlying external library (PyAthena).
This PR implements support for such complex data types by using an extension feature of PyAthena and SQLAlchemy (see more details below).

Changes Introduced

Soften PyAthena version requirement
Before this PR, PyAthena was pinned to version 2.4.1 because the use of an internal method was required. This method is now publicly exposed, so we use the public method and update the requirements in setup.py accordingly.
Reference: 6af0633
Implementation of a custom Athena dialect
PyAthena allows custom SQLAlchemy dialects to be implemented in order to modify data handling according to one's needs.
We took this approach because it gives us the desired behavior without having to contribute to PyAthena itself. We therefore introduced our own custom dialect (CustomAthenaRestDialect), which only overrides how types are detected from the DDL description returned by Athena.
To parse the DDL description, we make use of the already existing get_avro_schema_for_hive_column function.
With these changes, all data types are correctly detected and displayed in the DataHub schema field overview.
Here is one example:

Reference: cf595d1
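
To illustrate the kind of parsing this step relies on, here is a minimal, standard-library-only sketch that turns an Athena/Hive type string into a nested structure. It is not the code from this PR (which reuses get_avro_schema_for_hive_column); the names `split_top_level` and `parse_type` and the tuple-based output format are assumptions made purely for illustration.

```python
from typing import List, Tuple, Union

# A parsed type is either a primitive name (str) or a tuple describing a container.
ParsedType = Union[str, Tuple]


def split_top_level(s: str, sep: str = ",") -> List[str]:
    """Split on `sep` only where we are not nested inside <...> or (...)."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(s):
        if ch in "<(":
            depth += 1
        elif ch in ">)":
            depth -= 1
        elif ch == sep and depth == 0:
            parts.append(s[start:i])
            start = i + 1
    parts.append(s[start:])
    return [p.strip() for p in parts]


def parse_type(type_str: str) -> ParsedType:
    """Recursively parse array<...>, map<...,...> and struct<name:type,...>."""
    t = type_str.strip()
    lowered = t.lower()
    if lowered.startswith("array<") and t.endswith(">"):
        return ("array", parse_type(t[6:-1]))
    if lowered.startswith("map<") and t.endswith(">"):
        key, value = split_top_level(t[4:-1])
        return ("map", parse_type(key), parse_type(value))
    if lowered.startswith("struct<") and t.endswith(">"):
        fields = []
        for field in split_top_level(t[7:-1]):
            # The first ":" separates the field name from its (possibly nested) type.
            name, _, field_type = field.partition(":")
            fields.append((name.strip(), parse_type(field_type)))
        return ("struct", fields)
    # Anything else (int, string, decimal(10,2), ...) is treated as a primitive.
    return lowered


print(parse_type("struct<id:int,tags:array<string>,attrs:map<string,double>>"))
```

The depth counter is what keeps the comma inside `map<string,double>` or `decimal(10,2)` from being mistaken for a field separator.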
Adapt schema field generation for Athena
With the custom dialect described above, data types are correctly retrieved and displayed in the UI.
What is not covered up to this point is the generation of schema fields in the style of [version=2.0]; as a result, structs and arrays are not collapsible/expandable in the DataHub UI.
We therefore implemented SqlAlchemyColumnToAvroConverter, which is closely aligned with the already existing HiveColumnToAvroConverter and creates the schema fields accordingly.
This enables complex data types to be expanded in the UI for Athena, as shown below:

Reference: 9c0b6ab
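
As a rough illustration of why nested field paths make structs expandable, the following sketch flattens a nested type into one path per field, where each child path extends its parent's path. The function names and the tuple representation are hypothetical, and the token layout is a simplification: it reproduces the `[version=2.0].[type=str].name` shape quoted in the warning below, but the exact tokens DataHub emits for nested or array fields may differ.

```python
from typing import List, Tuple, Union

# Primitive types are plain strings; a struct is ("struct", [(name, type), ...]).
ParsedType = Union[str, Tuple]


def type_token(parsed: ParsedType) -> str:
    """Return the type name used inside a [type=...] token."""
    return parsed if isinstance(parsed, str) else parsed[0]


def field_paths(name: str, parsed: ParsedType, prefix: str = "[version=2.0]") -> List[str]:
    """Emit one path for the field itself, then one per nested struct member."""
    path = f"{prefix}.[type={type_token(parsed)}].{name}"
    paths = [path]
    if not isinstance(parsed, str) and parsed[0] == "struct":
        for child_name, child_type in parsed[1]:
            # Children reuse the parent's full path as their prefix.
            paths.extend(field_paths(child_name, child_type, prefix=path))
    return paths


address = ("struct", [("city", "string"), ("zip", "int")])
for p in field_paths("address", address):
    print(p)
```

Because every child path starts with its parent's path, a UI can group fields by shared prefix. This sketch only descends into structs; arrays of structs would need an additional case.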

Warning

~~While this is a great improvement of the AWS Athena source in our opinion, it should probably be considered a breaking change for DataHub since the field paths change completely:~~
~~Before:~~

name

~~After:~~

[version=2.0].[type=str].name

~~If you agree, we would add the change to the updating-datahub.md file accordingly, but we are curious how you assess this aspect.~~
~~One possible alternative is to hide these changes behind a feature flag.~~

☝🏼 The above warning should no longer be relevant after the release of version 0.10.5.
But maybe we can get rid of the custom generation of v2-compliant schema fields? 🤔

We would be happy to get your feedback on this PR and discuss the implementation if required.

cc: @hsheth2 @treff7es

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

bossenti · 2023-05-30T05:59:06Z

Please indicate if you're interested in this contribution; then I would focus on fixing the CI.

treff7es · 2023-05-30T06:54:37Z

@bossenti absolutely, I will take a look at it.
Sorry for the late reply and thanks for the contribution!

treff7es · 2023-05-30T14:22:45Z

@bossenti it seems like CI is failing; could you please fix it?

bossenti · 2023-05-31T11:08:46Z

@treff7es checks for Python 3.7 fail because the required version of sqlalchemy does not contain the TupleType. Is there a chance to switch to a slightly newer version?

With respect to the quick tests for Python 3.10, could you help me here, please? I don't have a clue why this step is failing.

treff7es · 2023-06-06T17:39:09Z

sorry for the delay; I need to verify if we can bump sqlalchemy and if not, then what other options do we have?

bossenti · 2023-06-07T05:41:58Z

> sorry for the delay; I need to verify if we can bump sqlalchemy and if not, then what other options do we have?

sqlalchemy introduced the TupleType in version 1.4.27, which supports Python 3.7 as well (see here).

The only idea that comes to my mind would be to port the TupleType definition into the DataHub codebase, but I wouldn't want to do that, especially since the requirement for pyathena in metadata-ingestion is even above version two, so we would only do this for the sake of the CI without impacting the resulting package. But I can imagine there is surely a reason for insisting on this version for the Python 3.7 CI builds.

hsheth2 · 2023-09-29T17:19:02Z

@bossenti looks like there's one more lint issue - we'll be able to merge once CI is green

bossenti · 2023-10-03T08:08:53Z

@hsheth2 metadata-ingestion (3.10, testIntegration) is failing with failed to register layer: write /usr/share/zoneinfo/posix/America/Anguilla: no space left on device. Maybe restarting the job helps here?

hsheth2 · 2023-10-04T02:02:08Z

@bossenti yup having an issue with our CI, which should be fixed by #8938

Once that's merged, we can update the branch here and merge once everything passes.

hsheth2

one more dependency issue here

hsheth2 · 2023-10-04T19:10:45Z

metadata-ingestion/src/datahub/ingestion/source/sql/sql_common.py

@@ -23,6 +23,7 @@
 from sqlalchemy.exc import ProgrammingError
 from sqlalchemy.sql import sqltypes as types
 from sqlalchemy.types import TypeDecorator, TypeEngine
+from sqlalchemy_bigquery import STRUCT


looks like this import is causing issues - this file is used by all ingestion sources, so it shouldn't have a hard requirement on bigquery in particular.

you can use register_custom_type in the athena.py or bigquery instead if that's needed

Thanks for pointing that out!

I'll have a look in the upcoming days.

Should be resolved now 🙂
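
For readers following this thread, the suggestion above boils down to a registry pattern: the shared module exposes a registration hook and never imports source-specific packages, while each source registers its own types. The sketch below uses stand-in classes and hypothetical names (`register_type`, `resolve`, `AthenaStruct`); it is not DataHub's actual `register_custom_type` implementation.

```python
from typing import Dict, Optional


# --- shared module: knows nothing about any specific source ---------------
class NullType: ...
class StringType: ...
class RecordType: ...


# Maps a column type class to the schema type it should be reported as.
_type_map: Dict[type, Optional[type]] = {str: StringType}


def register_type(column_type: type, output: Optional[type] = None) -> None:
    """Hook that source modules call to add their own column types."""
    _type_map[column_type] = output


def resolve(column_type: type) -> type:
    """Look up the schema type, falling back to NullType for unknown types."""
    return _type_map.get(column_type) or NullType


# --- source-specific module: owns its optional dependency ------------------
class AthenaStruct: ...  # stand-in for a struct type provided by one source


register_type(AthenaStruct, RecordType)

print(resolve(AthenaStruct).__name__)
```

With this layout, only the source module that needs a struct type has to import the package that defines it, so other ingestion sources are unaffected when that package is not installed.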

hsheth2 · 2023-10-18T16:39:50Z

Smoke test failures look unrelated - merging this in

@bossenti thanks for all the hard work on getting this in!

The types in the mapping need to be valid inputs for SchemaFieldDataType, which means they must come from the codegen'd class. This fixes a regression from datahub-project#8137.

bossenti and others added 4 commits May 26, 2023 10:51

chore(deps): remove internal method reference & soften PyAthena versi…

6af0633

…on requirements Co-authored-by: dnks23 <[email protected]>

feature(ingest/athena): implement custom SQLalchemy dialect for Athen…

cf595d1

…a to detect complex data types correctly Co-authored-by: dnks23 <[email protected]>

feature(ingest/athena): implement schema field generation supporting …

9c0b6ab

…nested fields for SQLalchemy types and enable Athena source Co-authored-by: dnks23 <[email protected]>

Merge branch 'datahub-project:master' into feature/athena-improvements

7750e1d

github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label May 26, 2023

Merge branch 'master' into feature/athena-improvements

a2100b7

vercel bot had a problem deploying to Preview May 26, 2023 12:41 Failure

Merge branch 'master' into feature/athena-improvements

d2670a4

vercel bot had a problem deploying to Preview May 30, 2023 06:07 Failure

bossenti and others added 2 commits May 30, 2023 18:48

chore: make MapType discoverable

c7f0bc7

Merge branch 'master' into feature/athena-improvements

ad2c608

vercel bot deployed to Preview May 30, 2023 17:15 View deployment

backport to python 3.7

f22aedd

vercel bot deployed to Preview May 30, 2023 17:54 View deployment

vercel bot deployed to Preview May 30, 2023 18:49 View deployment

fix formatting

5aaf348

bossenti force-pushed the feature/athena-improvements branch from 152d21e to 5aaf348 Compare May 30, 2023 18:58

vercel bot deployed to Preview May 30, 2023 19:12 View deployment

improve linting

7dfb376

vercel bot deployed to Preview May 30, 2023 19:25 View deployment

fix import order

c6f2ba0

vercel bot deployed to Preview May 31, 2023 09:42 View deployment

hsheth2 assigned treff7es May 31, 2023

laulpogan added the community-contribution PR or Issue raised by member(s) of DataHub Community label Jun 7, 2023

fix lint + tests

50e51ad

hsheth2 added merge-pending-ci A PR that has passed review and should be merged once CI is green. and removed merge-pending-ci A PR that has passed review and should be merged once CI is green. labels Sep 22, 2023

vercel bot deployed to Preview September 22, 2023 21:09 View deployment

bossenti added 3 commits October 3, 2023 08:47

Merge branch 'master-dh' into feature/athena-improvements

7884fc3

address linting issues

b58b6d5

Merge remote-tracking branch 'origin/feature/athena-improvements' int…

54f874d

…o feature/athena-improvements

vercel bot deployed to Preview October 3, 2023 07:09 View deployment

hsheth2 added the merge-pending-ci A PR that has passed review and should be merged once CI is green. label Oct 4, 2023

Merge branch 'master' into feature/athena-improvements

5dc3d87

vercel bot deployed to Preview October 4, 2023 03:58 View deployment

hsheth2 reviewed Oct 4, 2023

View reviewed changes

bossenti added 3 commits October 13, 2023 20:14

Merge remote-tracking branch 'datahub/master' into feature/athena-imp…

60b21e9

…rovements

refactor: replace central definition of STRUCT by registering custom …

4c0da22

…type

Merge remote-tracking branch 'origin/feature/athena-improvements' int…

42d0114

…o feature/athena-improvements

vercel bot deployed to Preview October 13, 2023 18:33 View deployment

bossenti added 2 commits October 15, 2023 12:31

style: apply formatting

92c8000

Merge remote-tracking branch 'datahub/master' into feature/athena-imp…

f68c9e0

…rovements

vercel bot deployed to Preview October 15, 2023 10:48 View deployment

hsheth2 approved these changes Oct 15, 2023

View reviewed changes

hsheth2 merged commit 1eaf9c8 into datahub-project:master Oct 18, 2023
53 of 54 checks passed

bossenti deleted the feature/athena-improvements branch October 18, 2023 18:49

hsheth2 mentioned this pull request Oct 20, 2023

fix(ingest): update athena type mapping #9061

Merged

5 tasks

bossenti mentioned this pull request Nov 19, 2023

fix(ingest/athena): detect decimal type correctly #9270

Merged

5 tasks
