Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature(ingest/athena): introduce support for complex and nested schemas in Athena #8137

Merged
merged 32 commits into from
Oct 18, 2023

Conversation

bossenti
Copy link
Contributor

@bossenti bossenti commented May 26, 2023

PR summary

This PR introduces a major improvement to the Athena source. Currently, the AWS Athena source can not handle fields with complex data types like map, array, or, struct accordingly. Instead they are all detected and handled as string values which is obviously wrong. This is not directly an issue of DataHub since this behavior traces down to the corresponding external library (PyAthena).
This PR implements the support for such complex data types by using an extension feature of PyAthena, resp. SQLalchemy (see more details below).

Changes Introduced

  1. Soften PyAthena version requirement
    Before this PR, PyAthena is pinned to version 2.4.1 as the usage of an internal method was required. This method is now publicly exposed, so we make use of the public method and update the requirements in setup.py accordingly.
    Reference: 6af0633

  2. Implementation of a custom Athena dialect
    PyAthena allows to implement custom SQLalchemy dialects to modify data handling according to once needs.
    We took this approach since this allows us to get the desired behavior into PyAthena without having the need to contribute there. Therefore, we introduced our own custom dialect (CustomAthenaRestDialect ), which only overwrites the behavior of how types are detected from the DDL description returned from Athena.
    To parse the DDL description we make use of the already existing get_avro_schema_for_hive_column function.
    With these changes, all data types are correctly detected and displayed in the DataHub schema field overview.
    Here is one example:
    image
    Reference: cf595d1

  3. Adapt schema field generation for Athena
    With the changes described in 2 data types are correctly retrieved and displayed in the UI.
    But what is not included to this point, is the generation of schema fields in the stye of [version=2.0] and therefore structs and arrays are not collapsible/extendable in the DataHub UI.
    Therefore, we implemented SqlAlchemyColumnToAvroConverter which is strongly aligned to the already existing HiveColumnToAvroConverter and creates the schema fields accordingly.
    This enables the extension of complex data types in the UI for Athena as shown below:
    image
    Reference: 9c0b6ab

Warning

While this is a great improvement of the AWS Athena source in our opinion, this should probably be considered as a >breaking change for DataHub since the field paths change completely:
Before:

name

After:

[version=2.0].[type=str].name

If you agree, we would add the change accordingly to the updating-datahub.md file, but we are curious how you assess >this aspect.
One possible alternative is to hide these changes behind a feature flag.

☝🏼 Above warning should be relevant anymore after the release of version 0.10.5
But maybe we can get rid of the custom generated of v2-complied schema fields? 🤔

We would be happy to get your feedback on this PR and discuss the implementation if required.

cc: @hsheth2 @treff7es

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label May 26, 2023
@bossenti
Copy link
Contributor Author

Please indicate if your interested in this contribution then I would focus on fixing the CI.

@treff7es
Copy link
Contributor

@bossenti absolutely, I will take a look at it.
Sorry for the late reply and thanks for the contribution!

@treff7es
Copy link
Contributor

@bossenti it seems like ci is failing, please, can you fix it?

@bossenti
Copy link
Contributor Author

@treff7es checks for Python 3.7 fail because the required version of sqlalchemy does not contain the TupleType. Is there a chance to switch to a slightly newer version?

With respect to the quick tests for python 3.10, could you help me here please? I don't have a clue why this step is failing

@treff7es
Copy link
Contributor

treff7es commented Jun 6, 2023

sorry for the delay; I need to verify if we can bump sqlalchemy and if not, then what other options do we have?

@bossenti
Copy link
Contributor Author

bossenti commented Jun 7, 2023

sorry for the delay; I need to verify if we can bump sqlalchemy and if not, then what other options do we have?

sqlalchemy introduced the TupleType in version 1.4.27, which supports Python 3.7 as well (see here).

The only idea that comes to me mind would be to port to the TupleType definition into the DataHub codebase, but I wouldn't want to do that. Especially, since the requirement for pyathena in metadata-ingestion is even above version two , so we would only do this for the sake of the CI without impacting the resulting package. But as I can imagine, there is surely a reason for insisting on this version for the Python 3.7 CI builds.

@laulpogan laulpogan added the community-contribution PR or Issue raised by member(s) of DataHub Community label Jun 7, 2023
@hsheth2 hsheth2 added merge-pending-ci A PR that has passed review and should be merged once CI is green. and removed merge-pending-ci A PR that has passed review and should be merged once CI is green. labels Sep 22, 2023
@hsheth2
Copy link
Collaborator

hsheth2 commented Sep 29, 2023

@bossenti looks like there's one more lint issue - we'll be able to merge once CI is green

@vercel
Copy link

vercel bot commented Oct 3, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
docs-website ✅ Ready (Inspect) Visit Preview 💬 Add feedback Oct 3, 2023 7:09am

@bossenti
Copy link
Contributor Author

bossenti commented Oct 3, 2023

@hsheth2 metadata-ingestion (3.10, testIntegration) is failing with failed to register layer: write /usr/share/zoneinfo/posix/America/Anguilla: no space left on device. Maybe restarting the job helps here?

@hsheth2
Copy link
Collaborator

hsheth2 commented Oct 4, 2023

@bossenti yup having an issue with our CI, which should be fixed by #8938

Once that's merged, we can update branch here merge once everything passes.

@hsheth2 hsheth2 added the merge-pending-ci A PR that has passed review and should be merged once CI is green. label Oct 4, 2023
Copy link
Collaborator

@hsheth2 hsheth2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one more dependency issue here

@@ -23,6 +23,7 @@
from sqlalchemy.exc import ProgrammingError
from sqlalchemy.sql import sqltypes as types
from sqlalchemy.types import TypeDecorator, TypeEngine
from sqlalchemy_bigquery import STRUCT
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like this import is causing issues - this file is used by all ingestion sources, so it shouldn't have a hard requirement on bigquery in particular.

you can use register_custom_type in the athena.py or bigquery instead if that's needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing out!

I'll have a look the upcoming days

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be resolved now 🙂

@hsheth2
Copy link
Collaborator

hsheth2 commented Oct 18, 2023

Smoke test failures look unrelated - merging this in

@bossenti thanks for all the hard work on getting this in!

@hsheth2 hsheth2 merged commit 1eaf9c8 into datahub-project:master Oct 18, 2023
53 of 54 checks passed
@bossenti bossenti deleted the feature/athena-improvements branch October 18, 2023 18:49
hsheth2 added a commit to hsheth2/datahub that referenced this pull request Oct 20, 2023
The types in the mapping need to be valid inputs for
SchemaFieldDataType, which means they must come from the codegen'd
class. This fixes a regression from datahub-project#8137.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community-contribution PR or Issue raised by member(s) of DataHub Community ingestion PR or Issue related to the ingestion of metadata merge-pending-ci A PR that has passed review and should be merged once CI is green.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants