-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ingest/s3): Partition support #11083
feat(ingest/s3): Partition support #11083
Conversation
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the WalkthroughThe changes introduce significant enhancements to the metadata ingestion process for S3 data sources, improving flexibility and usability in handling dataset configurations. New features such as partition auto-detection, enhanced sorting capabilities, and refined metadata structures enrich the data management experience. Additionally, validation mechanisms for uploaded files ensure data integrity, while modifications to JSON schemas streamline metadata representation and tracking. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant System
participant S3
User->>System: Request data ingestion
System->>S3: Access files from specified path
S3-->>System: Return file metadata
System->>System: Apply partition auto-detection
System->>User: Confirm successful ingestion
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configuration File (
|
Fix doc generation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (10)
- metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py (10 hunks)
- metadata-ingestion/src/datahub/ingestion/source/s3/source.py (18 hunks)
- metadata-ingestion/tests/integration/s3/golden-files/s3/golden_mces_folder_partition_basic.json (15 hunks)
- metadata-ingestion/tests/integration/s3/golden-files/s3/golden_mces_folder_partition_keyval.json (15 hunks)
- metadata-ingestion/tests/integration/s3/golden-files/s3/golden_mces_folder_partition_update_schema.json (15 hunks)
- metadata-ingestion/tests/integration/s3/golden-files/s3/golden_mces_folder_partition_update_schema_with_partition_autodetect.json (1 hunks)
- metadata-ingestion/tests/integration/s3/golden-files/s3/golden_mces_folder_partition_with_partition_autodetect_traverse_all.json (1 hunks)
- metadata-ingestion/tests/integration/s3/golden-files/s3/golden_mces_folder_partition_with_partition_autodetect_traverse_min_max.json (1 hunks)
- metadata-ingestion/tests/integration/s3/sources/s3/folder_partition_with_partition_autodetect_traverse_all.json (1 hunks)
- metadata-ingestion/tests/integration/s3/sources/s3/folder_partition_with_partition_autodetect_traverse_min_max.json (1 hunks)
Files skipped from review due to trivial changes (1)
- metadata-ingestion/tests/integration/s3/sources/s3/folder_partition_with_partition_autodetect_traverse_all.json
Files skipped from review as they are similar to previous changes (5)
- metadata-ingestion/tests/integration/s3/golden-files/s3/golden_mces_folder_partition_keyval.json
- metadata-ingestion/tests/integration/s3/golden-files/s3/golden_mces_folder_partition_update_schema_with_partition_autodetect.json
- metadata-ingestion/tests/integration/s3/golden-files/s3/golden_mces_folder_partition_with_partition_autodetect_traverse_all.json
- metadata-ingestion/tests/integration/s3/golden-files/s3/golden_mces_folder_partition_with_partition_autodetect_traverse_min_max.json
- metadata-ingestion/tests/integration/s3/sources/s3/folder_partition_with_partition_autodetect_traverse_min_max.json
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py
427-427: Use
enumerate()
for index variablenum
infor
loop(SIM113)
metadata-ingestion/src/datahub/ingestion/source/s3/source.py
996-996: Use
bool(...)
instead ofTrue if ... else False
Replace with `bool(...)
(SIM210)
Additional comments not posted (31)
metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py (11)
43-52
: New Enum ClassSortKeyType
Looks GoodThe new enum class
SortKeyType
is well-defined and categorizes sorting keys appropriately.
54-77
: New ClassSortKey
Looks GoodThe new class
SortKey
is well-structured and includes a useful validator method for converting date formats.
111-116
: New Attributesort_key
Looks GoodThe new optional attribute
sort_key
is well-defined and useful for sorting partitions.
133-136
: New Attributeautodetect_partitions
Looks GoodThe new boolean attribute
autodetect_partitions
is well-defined and useful for enabling or disabling partition autodetection.
138-141
: New Attributetraversal_method
Looks GoodThe new attribute
traversal_method
is well-defined and useful for specifying the folder traversal method.
143-146
: New Attributeinclude_hidden_folders
Looks GoodThe new boolean attribute
include_hidden_folders
is well-defined and useful for enabling or disabling the inclusion of hidden folders in the traversal.
148-161
: New Methodis_path_hidden
Looks GoodThe new method
is_path_hidden
is well-implemented and useful for determining if a path or its directories are hidden.
Line range hint
163-189
:
Modified Methodallowed
Looks GoodThe
allowed
method has been appropriately modified to include theignore_ext
parameter, enhancing its flexibility.
360-430
: New Methodget_partition_from_path
Looks GoodThe new method
get_partition_from_path
is well-implemented and useful for extracting partition keys from paths based on the defined sorting and partitioning logic.Tools
Ruff
427-427: Use
enumerate()
for index variablenum
infor
loop(SIM113)
481-518
: New Methodextract_datetime_partition
Looks GoodThe new method
extract_datetime_partition
is well-implemented and useful for parsing datetime values from paths based on the defined sorting keys.
192-195
: Modified Methoddir_allowed
Looks GoodThe
dir_allowed
method has been appropriately modified to enhance its functionality.metadata-ingestion/tests/integration/s3/golden-files/s3/golden_mces_folder_partition_basic.json (5)
10-12
: New Fields incustomProperties
Look GoodThe new fields
number_of_partitions
andpartitions
enhance the metadata's descriptive capabilities.
16-21
: New Timestampscreated
andlastModified
Look GoodThe new timestamps
created
andlastModified
improve data lineage and auditing capabilities.
27-28
: New FieldlastRunId
insystemMetadata
Looks GoodThe new field
lastRunId
standardizes the tracking of metadata ingestion runs.
558-559
: Changes toentityType
Look GoodThe changes to
entityType
fromdataset
tocontainer
and vice versa reflect a reclassification of certain metadata entities.Also applies to: 579-580
582-582
: Changes toaspectName
Look GoodThe changes to
aspectName
indicate a reorganization of how metadata is structured.Also applies to: 623-623
metadata-ingestion/tests/integration/s3/golden-files/s3/golden_mces_folder_partition_update_schema.json (5)
10-12
: New Fields incustomProperties
Look GoodThe new fields
number_of_partitions
andpartitions
enhance the metadata's descriptive capabilities.
16-21
: New Timestampscreated
andlastModified
Look GoodThe new timestamps
created
andlastModified
improve data lineage and auditing capabilities.
27-28
: New FieldlastRunId
insystemMetadata
Looks GoodThe new field
lastRunId
standardizes the tracking of metadata ingestion runs.
558-559
: Changes toentityType
Look GoodThe changes to
entityType
fromdataset
tocontainer
and vice versa reflect a reclassification of certain metadata entities.Also applies to: 579-580
582-582
: Changes toaspectName
Look GoodThe changes to
aspectName
indicate a reorganization of how metadata is structured.Also applies to: 623-623
metadata-ingestion/src/datahub/ingestion/source/s3/source.py (10)
212-220
: LGTM!The
Folder
class is well-structured and encapsulates folder metadata effectively.
236-238
: LGTM!The
TableData
class structure is appropriate and the new attributes for partitions are well-integrated.
505-527
: LGTM!The method appropriately handles partition keys using the new
Folder
structure.
620-652
: LGTM!The method effectively generates summaries for partitions.
685-764
: LGTM!The method effectively handles partition data and integrates the new attributes appropriately.
860-888
: LGTM!The method appropriately handles the new
partitions
parameter.
957-1003
: LGTM!The method effectively handles partition data and returns a list of
Folder
instances.Tools
Ruff
996-996: Use
bool(...)
instead ofTrue if ... else False
Replace with `bool(...)
(SIM210)
Line range hint
1008-1129
:
LGTM!The method effectively handles partition data and yields additional partition information.
Line range hint
1134-1155
:
LGTM!The method effectively handles partition data and yields additional partition information.
1174-1178
: LGTM!The method effectively handles partition data and integrates the new attributes appropriately.
) | ||
type: SortKeyType = Field( | ||
default=SortKeyType.STRING, | ||
description="The date format to use when sorting. This is used to parse the date from the key. The format should follow the java [SimpleDateFormat](https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html) format.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like description needs to change here.
return None | ||
else: | ||
for java_format, python_format in java_to_python_mapping.items(): | ||
v = v.replace(java_format, f"%{python_format}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we accept date_format
in java format if we anyway convert it to python format here before actually using it ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the future, this should be supported across platforms (java, python, etc..) and I felt like the java one is more common
metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py
Show resolved
Hide resolved
metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py
Show resolved
Hide resolved
metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py
Show resolved
Hide resolved
# Conflicts: # metadata-ingestion/src/datahub/ingestion/source/s3/source.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good.
ps = PathSpec(include=path_spec, default_extension="csv", allow_double_stars=True) | ||
assert ps.allowed(path) | ||
partitions = ps.get_partition_from_path(path) | ||
assert partitions == expected |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for adding these.
metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py
Outdated
Show resolved
Hide resolved
metadata-ingestion/src/datahub/ingestion/source/data_lake_common/path_spec.py
Outdated
Show resolved
Hide resolved
timestamp: datetime | ||
size: int | ||
partitions: List[Folder] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! So browser would browse through file system to yield one BrowsePath for one identified table and that table would then be emitted as dataset if its allowed by patterns, right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, exactly
path_spec (PathSpec): The path specification used to determine partitioning. | ||
bucket (Any): The S3 bucket object. | ||
prefix (str): The prefix path in the S3 bucket to list objects from. | ||
partition (Optional[str]): An optional partition string to append to the prefix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can remove partition from here now that its removed from method
@@ -263,6 +509,44 @@ def _extract_table_name(self, named_vars: dict) -> str: | |||
raise ValueError("path_spec.table_name is not set") | |||
return self.table_name.format_map(named_vars) | |||
|
|||
def extract_datetime_partition( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this used anywhere ? Would be good to add unit tests for this too.
partitions.append( | ||
Folder( | ||
partition_id=id, | ||
is_partition=True if id else False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we set empty partitions []
if id
is absent ?
protocol=protocol, | ||
min=True, | ||
) | ||
dirs_to_process.append(dirs_to_process_min[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we push down traversal mode as max or min_max to get_dirs_to_process
? Without that, list_folders
would be done twice ?
Co-authored-by: Mayuri Nehate <[email protected]>
…on/path_spec.py Co-authored-by: Mayuri Nehate <[email protected]>
…on/path_spec.py Co-authored-by: Mayuri Nehate <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Mayuri Nehate <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Checklist
Summary by CodeRabbit
New Features
Folder
class to better manage folder metadata for partitioned datasets.Bug Fixes
Tests