Generate partition scope DLO strategy and persist to DLO table #284

Open: jiang95-dev wants to merge 4 commits into main from lejiang/add-dlo-partition-table

jiang95-dev
Collaborator

Summary

Generate partition scope DLO strategy and persist to DLO table.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

  • Manually tested on local Docker setup. Please include the commands run and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

Collaborator

@sumedhsakdeo left a comment

Please update the description with the testing done on the local Docker setup.

Row partition = row.getStruct(2);
if (partition != null) {
  for (int i = 0; i < partition.size(); i++) {
    partitionValues.add(Objects.toString(partition.get(i)));
Collaborator

What does the string representation look like?

Collaborator Author

It looks like "2024-02-01, CA".

Collaborator

Shouldn't it have the form partCol1=partVal1, partCol2=partVal2, ...?
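
For illustration, the partCol=partVal form suggested here could be built along the following lines. This is only a sketch: the class PartitionLabel, its format method, and the column names datepartition and state are hypothetical and do not come from the PR, and it assumes the partition column names are available alongside the values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical helper, not part of the PR: pairs each partition column name
// with its value and joins the pairs into one readable string.
class PartitionLabel {

  static String format(List<String> columns, List<?> values) {
    if (columns.size() != values.size()) {
      throw new IllegalArgumentException("columns and values must have the same length");
    }
    List<String> parts = new ArrayList<>();
    for (int i = 0; i < columns.size(); i++) {
      // Objects.toString renders a null partition value as "null".
      parts.add(columns.get(i) + "=" + Objects.toString(values.get(i)));
    }
    return String.join(", ", parts);
  }

  public static void main(String[] args) {
    // Column names are made up for the example.
    String label =
        format(Arrays.asList("datepartition", "state"), Arrays.asList("2024-02-01", "CA"));
    System.out.println(label); // prints "datepartition=2024-02-01, state=CA"
  }
}
```

Carrying the column names in the string keeps it unambiguous when two partition columns can hold similar-looking values.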


private Optional<DataLayoutStrategy> buildDataLayoutStrategy(
    Dataset<Long> fileSizes, int partitionCount, String partitionValue) {
Collaborator

Is it partitionValue or partitionValues, if there is more than one partition column?

Collaborator Author

partitionValue is the string representation of the list of partitionValues.

@jiang95-dev force-pushed the lejiang/add-dlo-partition-table branch from 619844f to 9012af8 on February 4, 2025 23:28
Collaborator

@teamurko left a comment

Thank you @jiang95-dev, overall this looks great. I have a few minor comments.

    SparkSession spark,
    String outputFqtn,
    List<DataLayoutStrategy> strategies,
    boolean isPartitionScope) {
  if (outputFqtn != null && !strategies.isEmpty()) {
    createTableIfNotExists(spark, outputFqtn);
Collaborator

The CREATE TABLE schema will have partition_id.

try (SparkSession spark = getSparkSession()) {
  spark.sql("USE openhouse");
  spark.sql(
      String.format("CREATE TABLE %s (id INT, data STRING) PARTITIONED BY (id)", testTable));
Collaborator

Could you do 2 columns, adding 1 timestamp column with day granularity?
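
One possible reading of this request, sketched as a DDL builder for the test. The class PartitionedTableDdl, the method createTable, and the column name ts are hypothetical and not taken from the PR. The days(ts) transform is Iceberg's Spark DDL syntax for day granularity; whether the openhouse catalog used in this test accepts this exact combination of partition columns is an assumption.

```java
// Hypothetical test helper, not part of the PR: builds a CREATE TABLE statement
// partitioned by a day-granularity timestamp column and the existing id column.
class PartitionedTableDdl {

  static String createTable(String tableName) {
    // days(ts) partitions the timestamp column at day granularity;
    // id is kept as an identity partition column, as in the original test.
    return String.format(
        "CREATE TABLE %s (id INT, data STRING, ts TIMESTAMP) PARTITIONED BY (days(ts), id)",
        tableName);
  }

  public static void main(String[] args) {
    // In the test this string would be passed to spark.sql(...).
    System.out.println(createTable("db.test_table"));
  }
}
```

With two partition columns, the test would also exercise the multi-value case of the partition string discussed earlier in the review.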

3 participants