
Consolidate to Single Cube Materialization Option #1304

Open

wants to merge 23 commits into main

Conversation

@shangyian (Contributor) commented Feb 3, 2025

Summary

This PR adds a new cube materialization option that chooses the most efficient way to materialize the cube for downstream use. This should be our sole cube materialization option, replacing the "measures" and "metrics" materialization options.

The core DJ API builds a materialization job with the following pieces of job metadata:

Cube Metadata

We keep track of frozen cube metadata, such as the cube's metrics (versioned) and dimensions. For each metric, we also record its required measures and the derived expression that computes the metric from those measures.
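As a rough sketch, the frozen cube metadata could be shaped like the following; the class and field names here are illustrative, not the actual DJ schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricMetadata:
    """Frozen metadata for one metric in the cube (illustrative shape)."""
    name: str
    version: str
    required_measures: tuple   # measures this metric needs, e.g. ("clicks_sum",)
    derived_expression: str    # expression over those measures, e.g. "SUM(clicks_sum)"


@dataclass(frozen=True)
class CubeMetadata:
    """Cube-level metadata captured at materialization time (illustrative shape)."""
    metrics: tuple             # tuple of MetricMetadata
    dimensions: tuple          # the cube's selected dimension names
```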

Measures Queries

We group the metrics by parent node and build a list of measures queries, one per parent node, each of which computes the pre-aggregated measures for that parent's child metrics along with the cube's selected dimensions.

For each measures query, we will additionally keep track of:

  • The node it was generated for
  • The grain it was generated at
  • A list of measures and dimensions that it provides
  • Temporal partition info (e.g., timestamp column, granularity)
  • Spark configuration (it is unclear how we would configure this at the moment)
  • Generated output table name
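The grouping step described above can be sketched roughly as follows; the function name and the `(metric, parent)` pair representation are assumptions for illustration, not the actual DJ data model:

```python
from collections import defaultdict


def group_metrics_by_parent(metrics):
    """Group metrics by their parent node; each group becomes one measures
    query that pre-aggregates the measures needed by that parent's child
    metrics at the cube's selected dimensions.

    `metrics` is an iterable of (metric_name, parent_node) pairs.
    """
    groups = defaultdict(list)
    for metric_name, parent_node in metrics:
        groups[parent_node].append(metric_name)
    return dict(groups)
```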

Combiner Queries

This stage merges the results of the above measures queries into a single dataset, with the intention of ingesting the result into Druid. At the moment we raise an error if someone tries to materialize a cube whose measures datasets are at different grains and therefore cannot be combined. The metadata for this stage provides:

  • A combiner query in Spark, if necessary
  • Druid ingestion spec for the combined dataset
  • The grain it was generated at
  • A list of measures and dimensions that it provides
  • Temporal partition info (e.g., timestamp column, granularity)
  • Generated output table name
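The grain-compatibility check described above might look roughly like this; it's a sketch with assumed names, where each measures query result is represented as a dict with `grain`, `measures`, and `dimensions` keys:

```python
def combine_measures_queries(measures_queries):
    """Merge measures query outputs into one dataset for Druid ingestion.

    Raises ValueError when the inputs sit at different grains, mirroring the
    behavior described above. The dict shape is illustrative only.
    """
    grains = {tuple(sorted(query["grain"])) for query in measures_queries}
    if len(grains) > 1:
        raise ValueError(
            f"Cannot combine measures datasets at different grains: {sorted(grains)}"
        )
    return {
        "grain": sorted(grains.pop()),
        "measures": sorted({m for q in measures_queries for m in q["measures"]}),
        "dimensions": sorted({d for q in measures_queries for d in q["dimensions"]}),
    }
```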

Test Plan

Locally

Deployment Plan


@shangyian shangyian force-pushed the smart-cube-materialization branch from 1736e0b to f1eeba3 Compare February 3, 2025 21:14
@shangyian shangyian changed the title Smart cube materialization Consolidate to Single Cube Materialization Option Feb 4, 2025
@shangyian shangyian requested a review from agorajek February 7, 2025 19:10
@agorajek (Member) left a comment:

Looks great. A few minor comments inline.

@@ -47,6 +58,95 @@ async def get_cube(
return await get_cube_revision_metadata(session, name)


@router.get("/cubes/{name}/materialization", name="Materialization Config Cube")
Reviewer (Member):

nit: "Materialization Config Cube" or "Cube Materialization Config" ?

Requirements:
- The cube must have a temporal partition column specified.
- The job strategy will always be "incremental time".
Reviewer (Member):

Why always? I think we should provide a "full replacement" as an option.

shangyian (author):

Oh yeah, I definitely agree that we should provide it as an option, but for this first cut I didn't want to account for supporting both "full" and "incremental". It was easier to just raise an error until we're ready to implement "full".

"measures_materializations":
We group the metrics by parent node. Then we try to pre-aggregate each parent node as
much as possible to prepare for metric queries on the cube's dimensions.
"combiners":
Reviewer (Member):

This is a great feature, but I wonder how often it will be used.

shangyian (author):

Hard to say, but honestly I ran into this problem almost immediately, so I suspect it will come up more often than we expect.

"The cube must have a temporal partition column set "
"in order for it to be materialized.",
)
temporal_partition = temporal_partitions[0] if temporal_partitions else None
Reviewer (Member):

nit: Add a comment that if more than 2 temporal partitions are defined we pick a random one.

shangyian (author):

I actually changed the earlier check for temporal partitions to if len(temporal_partitions) != 1 -- that way we catch and raise for cases with multiple temporal partitions. The only use case I can see for multiple temporal partitions is a "date int" partition plus an "hour" partition, but we'll need more metadata to support that anyway, so it's better to raise for now.
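The check described in this reply could be sketched as follows; the function name is hypothetical, and the error message wording is borrowed from the diff above:

```python
def get_temporal_partition(temporal_partitions):
    """Require exactly one temporal partition column before materializing.

    Raises for zero partitions (the cube cannot be materialized incrementally)
    and for multiple partitions (not yet supported, so we fail loudly rather
    than picking one arbitrarily).
    """
    if len(temporal_partitions) != 1:
        raise ValueError(
            "The cube must have a temporal partition column set "
            "in order for it to be materialized.",
        )
    return temporal_partitions[0]
```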

strategy=(
MaterializationStrategy.INCREMENTAL_TIME
if temporal_partition
else MaterializationStrategy.FULL
Reviewer (Member):

We may never reach this code path... unless we convert this to support full refresh.

shangyian (author):

Yeah, good point. We can do that in a follow-up PR I think.

current_user=current_user,
)

# Druid Cube (this job will take subsume all existing jobs)
Reviewer (Member):

s/will take subsume/will subsume

Reviewer (Member):

This logic looks solid. Nice work!

@shangyian shangyian marked this pull request as ready for review February 8, 2025 01:33
@shangyian shangyian force-pushed the smart-cube-materialization branch from b19f59e to c5900c9 Compare February 8, 2025 01:33