Consolidate to Single Cube Materialization Option #1304
base: main
Conversation
Looks great. A few minor comments inline.
@@ -47,6 +58,95 @@ async def get_cube(
    return await get_cube_revision_metadata(session, name)


@router.get("/cubes/{name}/materialization", name="Materialization Config Cube")
nit: "Materialization Config Cube" or "Cube Materialization Config" ?
Requirements:
- The cube must have a temporal partition column specified.
- The job strategy will always be "incremental time".
Why always? I think we should provide a "full replacement" as an option.
Oh yeah, I definitely agree that we should provide it as an option, but for this first cut I didn't want to account for supporting both "full" and "incremental". It was easier to just raise an error until we're ready to implement "full".
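A rough sketch of the guard being described here, where MaterializationStrategy mirrors the enum referenced later in this diff and DJInvalidInputException is a hypothetical stand-in for whatever error type the API actually raises:

from enum import Enum


class MaterializationStrategy(str, Enum):
    # Mirrors the enum referenced in the diff below; the string values are assumptions.
    FULL = "full"
    INCREMENTAL_TIME = "incremental_time"


class DJInvalidInputException(Exception):
    """Hypothetical stand-in for the API's invalid-input error type."""


def validate_strategy(strategy: MaterializationStrategy) -> None:
    # For this first cut, anything other than incremental-time is rejected.
    if strategy != MaterializationStrategy.INCREMENTAL_TIME:
        raise DJInvalidInputException(
            "Only the 'incremental time' strategy is currently supported "
            "for cube materialization; 'full' is not implemented yet.",
        )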
"measures_materializations": | ||
We group the metrics by parent node. Then we try to pre-aggregate each parent node as | ||
much as possible to prepare for metric queries on the cube's dimensions. | ||
"combiners": |
This is a great feature, but I wonder how often it will be used.
Hard to say, but to be honest I came across this problem almost immediately, so I think it will come up more often than we'd expect.
"The cube must have a temporal partition column set " | ||
"in order for it to be materialized.", | ||
) | ||
temporal_partition = temporal_partitions[0] if temporal_partitions else None |
nit: Add a comment that if more than one temporal partition is defined, we pick an arbitrary one.
I actually changed the earlier part that checks for temporal partitions to if len(temporal_partitions) != 1 -- that way we can catch and raise for cases where there are multiple temporal partitions. The only use case I can see for multiple temporal partitions is a "date int" partition plus an "hour" partition, but we'll need more metadata to support that anyway, so I think it's better to raise.
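A minimal sketch of that check, assuming temporal_partitions is the list of temporal partition columns found on the cube; the exception class is again a hypothetical stand-in, re-declared here so the sketch stands alone:

class DJInvalidInputException(Exception):
    """Hypothetical stand-in for the API's invalid-input error type."""


def get_single_temporal_partition(temporal_partitions):
    # Require exactly one temporal partition; supporting multiple (e.g. a
    # "date int" partition plus an "hour" partition) needs more metadata,
    # so we raise for now.
    if len(temporal_partitions) != 1:
        raise DJInvalidInputException(
            "The cube must have a single temporal partition column set "
            "in order for it to be materialized.",
        )
    return temporal_partitions[0]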
strategy=(
    MaterializationStrategy.INCREMENTAL_TIME
    if temporal_partition
    else MaterializationStrategy.FULL
We may never reach this code path... unless we convert this to support full refresh.
Yeah, good point. We can do that in a follow-up PR I think.
current_user=current_user,
)

# Druid Cube (this job will take subsume all existing jobs)
s/will take subsume/will subsume
This logic looks solid. Nice work!
Summary
This PR adds a new cube materialization option that chooses the most efficient way to materialize the cube for downstream use. This should be our sole cube materialization option, replacing the "measures" and "metrics" materialization options.
The core DJ API builds a materialization job with the following pieces of job metadata:
Cube Metadata
We keep track of frozen cube metadata, like the cube's metrics (versioned) and dimensions. We also maintain additional info on each metric's required measures and the derived expression that combines those measures.
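For illustration only, the frozen metadata could be modeled roughly like the dataclasses below; every field name here is an assumption rather than the actual schema:

from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class FrozenMetric:
    # Illustrative snapshot of a single versioned metric in the cube.
    name: str
    version: str
    required_measures: List[str] = field(default_factory=list)
    derived_expression: str = ""  # expression over the pre-aggregated measures


@dataclass(frozen=True)
class FrozenCubeMetadata:
    # Illustrative snapshot of the cube at materialization time.
    metrics: List[FrozenMetric] = field(default_factory=list)
    dimensions: List[str] = field(default_factory=list)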
Measures Queries
We group the metrics by parent node and build a list of measures queries, one per parent node, each of which computes the pre-aggregated measures for that parent's child metrics across the cube's selected dimensions (see the sketch below).
For each measures query, we will additionally keep track of:
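To make the grouping step concrete, here is a hedged sketch; the helper and the parent_of mapping are hypothetical, not the actual implementation:

from collections import defaultdict


def group_metrics_by_parent(metric_names, parent_of):
    # Group cube metrics by their parent node; one measures query is then
    # generated per parent, pre-aggregating its metrics' measures across
    # the cube's selected dimensions.
    grouped = defaultdict(list)
    for name in metric_names:
        grouped[parent_of[name]].append(name)
    return dict(grouped)


# Example with hypothetical metric and parent node names
parent_of = {
    "num_repair_orders": "repair_orders",
    "avg_repair_price": "repair_order_details",
}
print(group_metrics_by_parent(list(parent_of), parent_of))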
Combiner Queries
This stage merges the results of the above measures queries into a single dataset intended for ingestion into Druid. At the moment we raise an error if someone tries to materialize a cube whose measures datasets are at different grains and therefore cannot be combined (see the sketch below). The metadata for this stage provides:
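A hedged sketch of the grain check described above, assuming each measures query records the set of dimension columns (its grain) it is aggregated to; the dictionary shape and names are illustrative, and the exception class is the same hypothetical stand-in used earlier:

class DJInvalidInputException(Exception):
    """Hypothetical stand-in for the API's invalid-input error type."""


def build_combiner_metadata(measures_queries):
    # Each entry is assumed to carry a "grain" (set of dimension columns)
    # and a "query" (the pre-aggregation SQL).
    grains = {frozenset(query["grain"]) for query in measures_queries}
    if len(grains) > 1:
        # Datasets at different grains cannot be merged into a single
        # Druid-ingestible dataset, so we raise for now.
        raise DJInvalidInputException(
            "Cannot combine measures datasets that are at different grains.",
        )
    shared_grain = sorted(next(iter(grains)))
    return {
        "grain": shared_grain,
        # Illustrative: the real implementation emits a combiner query that
        # joins the measures datasets on the shared dimension columns.
        "queries": [query["query"] for query in measures_queries],
    }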
Test Plan
Locally:
- make check passes
- make test shows 100% unit test coverage

Deployment Plan