feature(ingest/athena): introduce support for complex and nested schemas in Athena #8137
…on requirements Co-authored-by: dnks23 <[email protected]>
…a to detect complex data types correctly Co-authored-by: dnks23 <[email protected]>
…nested fields for SQLalchemy types and enable Athena source Co-authored-by: dnks23 <[email protected]>
Please indicate if you're interested in this contribution; then I would focus on fixing the CI.
@bossenti absolutely, I will take a look at it. |
@bossenti it seems like ci is failing, please, can you fix it? |
152d21e
to
5aaf348
Compare
@treff7es checks for Python 3.7 fail because the required version of sqlalchemy does not contain the With respect to the quick tests for python 3.10, could you help me here please? I don't have a clue why this step is failing |
sorry for the delay; I need to verify if we can bump sqlalchemy and if not, then what other options do we have? |
The only idea that comes to my mind would be to port to the
@bossenti looks like there's one more lint issue - we'll be able to merge once CI is green |
@hsheth2 |
one more dependency issue here
@@ -23,6 +23,7 @@
from sqlalchemy.exc import ProgrammingError
from sqlalchemy.sql import sqltypes as types
from sqlalchemy.types import TypeDecorator, TypeEngine
from sqlalchemy_bigquery import STRUCT
looks like this import is causing issues - this file is used by all ingestion sources, so it shouldn't have a hard requirement on bigquery in particular.
you can use `register_custom_type` in the athena.py or bigquery instead if that's needed
Thanks for pointing that out!
I'll have a look in the upcoming days
Should be resolved now 🙂
Smoke test failures look unrelated - merging this in @bossenti thanks for all the hard work on getting this in! |
The types in the mapping need to be valid inputs for SchemaFieldDataType, which means they must come from the codegen'd class. This fixes a regression from datahub-project#8137.
PR summary
This PR introduces a major improvement to the Athena source. Currently, the AWS Athena source cannot handle fields with complex data types like `map`, `array`, or `struct` properly. Instead, they are all detected and handled as `string` values, which is wrong. This is not directly an issue of DataHub since the behavior traces back to the corresponding external library (`PyAthena`).
This PR implements support for such complex data types by using an extension feature of `PyAthena` and `SQLAlchemy`, respectively (see more details below).
Changes Introduced
Soften `PyAthena` version requirement
Before this PR, `PyAthena` was pinned to version `2.4.1` because the usage of an internal method was required. This method is now publicly exposed, so we use the public method and update the requirements in `setup.py` accordingly.
Reference: 6af0633
Implementation of a custom Athena dialect
`PyAthena` allows implementing custom `SQLAlchemy` dialects to modify data handling according to one's needs.
We took this approach since it lets us get the desired behavior out of `PyAthena` without having to contribute there. Therefore, we introduced our own custom dialect (`CustomAthenaRestDialect`), which only overrides how types are detected from the DDL description returned by Athena.
To parse the DDL description, we make use of the already existing `get_avro_schema_for_hive_column` function.
With these changes, all data types are correctly detected and displayed in the DataHub schema field overview.
Here is one example:
Reference: cf595d1
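To make the type-detection step more concrete, the following is a rough sketch of how a Hive/Athena-style type string could be parsed into a nested structure. It is not the PR's implementation (which relies on `get_avro_schema_for_hive_column`); the function names `split_top_level` and `parse_type` and the output shape are hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Union

ParsedType = Union[str, Dict[str, object]]


def split_top_level(text: str, sep: str = ",") -> List[str]:
    """Split on `sep`, ignoring separators nested inside <...> or (...)."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch in "<(":
            depth += 1
        elif ch in ">)":
            depth -= 1
        elif ch == sep and depth == 0:
            parts.append(text[start:i])
            start = i + 1
    parts.append(text[start:])
    return [p.strip() for p in parts]


def parse_type(type_str: str) -> ParsedType:
    """Parse e.g. 'struct<a:int,b:array<string>>' into nested dicts."""
    type_str = type_str.strip()
    if "<" not in type_str:
        return type_str  # primitive type such as 'int' or 'decimal(10,2)'
    head, inner = type_str.split("<", 1)
    inner = inner[: inner.rindex(">")]  # drop the matching closing bracket
    head = head.strip().lower()
    if head == "array":
        return {"array": parse_type(inner)}
    if head == "map":
        key, value = split_top_level(inner)
        return {"map": {"key": parse_type(key), "value": parse_type(value)}}
    if head == "struct":
        fields = {}
        for part in split_top_level(inner):
            name, field_type = part.split(":", 1)
            fields[name.strip()] = parse_type(field_type)
        return {"struct": fields}
    raise ValueError(f"unsupported complex type: {head}")


parsed = parse_type("struct<id:int,tags:array<string>,attrs:map<string,struct<a:int>>>")
```

Tracking bracket depth while splitting is what keeps the comma inside `map<string,...>` from being mistaken for a struct field separator.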
Adapt schema field generation for Athena
With the changes described in 2 (the custom Athena dialect), data types are correctly retrieved and displayed in the UI. What is not included up to this point is the generation of schema fields in the style of `[version=2.0]`, and therefore structs and arrays are not collapsible/expandable in the DataHub UI.
Therefore, we implemented `SqlAlchemyColumnToAvroConverter`, which is closely aligned with the already existing `HiveColumnToAvroConverter` and creates the schema fields accordingly.
This enables the expansion of complex data types in the UI for Athena as shown below:
Reference: 9c0b6ab
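As a rough illustration of why nested fields need their own paths to become expandable, here is a small sketch that flattens a nested column description into one path per field. The token layout is only loosely modeled on the `[version=2.0]` style and is an assumption; it does not reproduce the exact paths that `SqlAlchemyColumnToAvroConverter` or DataHub emit. The input shape (dicts for structs, one-element lists for arrays) and the function names are hypothetical as well.

```python
from typing import Dict, List, Union

# Hypothetical input shape: str = primitive, dict = struct, [x] = array of x.
ColumnType = Union[str, Dict[str, "ColumnType"], List["ColumnType"]]


def _walk(prefix: str, name: str, col_type: ColumnType) -> List[str]:
    tokens = []
    while isinstance(col_type, list):  # unwrap (possibly nested) arrays
        tokens.append("[type=array]")
        (col_type,) = col_type
    if isinstance(col_type, dict):  # struct: emit the parent, then children
        tokens.append("[type=struct]")
        here = ".".join([prefix, *tokens, name])
        paths = [here]
        for child_name, child_type in col_type.items():
            paths.extend(_walk(here, child_name, child_type))
        return paths
    tokens.append(f"[type={col_type}]")
    return [".".join([prefix, *tokens, name])]


def field_paths(columns: Dict[str, ColumnType]) -> List[str]:
    """Return one illustrative path per (nested) field of a table."""
    paths: List[str] = []
    for name, col_type in columns.items():
        paths.extend(_walk("[version=2.0]", name, col_type))
    return paths


columns = {
    "name": "string",
    "address": {"street": "string", "zip": "int"},
    "tags": ["string"],
}
paths = field_paths(columns)
```

Even the flat `name` column gets a longer path in this sketch, which mirrors the point that every field path changes once type information is embedded.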
Warning
While this is a great improvement of the AWS Athena source in our opinion, it should probably be considered a breaking change for DataHub since the field paths change completely:
Before: `name`
After:
If you agree, we would add the change to the `updating-datahub.md` file accordingly, but we are curious how you assess this aspect.
One possible alternative is to hide these changes behind a feature flag.
☝🏼 The above warning should not be relevant anymore after the release of version `0.10.5`.
But maybe we can get rid of the custom generation of v2-compliant schema fields? 🤔
We would be happy to get your feedback on this PR and discuss the implementation if required.
cc: @hsheth2 @treff7es
Checklist