
Consolidate to Single Cube Materialization Option #1304

Open

wants to merge 23 commits into main

Conversation

@shangyian (Contributor) commented Feb 3, 2025

Summary

This PR adds a new cube materialization option that chooses the most efficient way to materialize the cube for downstream use. This should be our sole cube materialization option, replacing the "measures" and "metrics" materialization options.

The core DJ API builds a materialization job with the following pieces of job metadata:

Cube Metadata

We keep track of frozen cube metadata, such as the cube's metrics (versioned) and dimensions. For each metric, we also record its required measures and the derived expression that computes the metric from those measures.
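As a rough sketch, the frozen cube metadata could be shaped like the following; the class and field names here are illustrative, not the actual DJ schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricMetadata:
    """Frozen metadata for one metric in the cube (illustrative shape)."""
    name: str
    version: str
    required_measures: tuple   # measures this metric needs, e.g. ("clicks_sum",)
    derived_expression: str    # expression over those measures, e.g. "SUM(clicks_sum)"


@dataclass(frozen=True)
class CubeMetadata:
    """Cube-level metadata captured at materialization time (illustrative shape)."""
    metrics: tuple             # tuple of MetricMetadata
    dimensions: tuple          # the cube's selected dimension names
```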

Measures Queries

We group the metrics by parent node and build a list of measures queries, one per parent node, each of which computes the pre-aggregated measures for that parent's child metrics along with the cube's selected dimensions.

For each measures query, we will additionally keep track of:

  • The node it was generated for
  • The grain it was generated at
  • A list of measures and dimensions that it provides
  • Temporal partition info (e.g., timestamp column, granularity)
  • Spark configuration (it is unclear how we would configure this at the moment)
  • Generated output table name
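The grouping step described above can be sketched roughly as follows; the function name and the `(metric, parent)` pair representation are assumptions for illustration, not the actual DJ data model:

```python
from collections import defaultdict


def group_metrics_by_parent(metrics):
    """Group metrics by their parent node; each group becomes one measures
    query that pre-aggregates the measures needed by that parent's child
    metrics at the cube's selected dimensions.

    `metrics` is an iterable of (metric_name, parent_node) pairs.
    """
    groups = defaultdict(list)
    for metric_name, parent_node in metrics:
        groups[parent_node].append(metric_name)
    return dict(groups)
```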

Combiner Queries

This stage merges the results of the above measures queries into a single dataset, with the intention of ingesting the result into Druid. At the moment we raise an error if someone tries to materialize a cube whose measures datasets are at different grains and therefore cannot be combined. The metadata for this stage provides:

  • A combiner query in Spark, if necessary
  • Druid ingestion spec for the combined dataset
  • The grain it was generated at
  • A list of measures and dimensions that it provides
  • Temporal partition info (e.g., timestamp column, granularity)
  • Generated output table name
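The grain-compatibility check described above might look roughly like this; it's a sketch with assumed names, where each measures query result is represented as a dict with `grain`, `measures`, and `dimensions` keys:

```python
def combine_measures_queries(measures_queries):
    """Merge measures query outputs into one dataset for Druid ingestion.

    Raises ValueError when the inputs sit at different grains, mirroring the
    behavior described above. The dict shape is illustrative only.
    """
    grains = {tuple(sorted(query["grain"])) for query in measures_queries}
    if len(grains) > 1:
        raise ValueError(
            f"Cannot combine measures datasets at different grains: {sorted(grains)}"
        )
    return {
        "grain": sorted(grains.pop()),
        "measures": sorted({m for q in measures_queries for m in q["measures"]}),
        "dimensions": sorted({d for q in measures_queries for d in q["dimensions"]}),
    }
```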

Test Plan

Locally

Deployment Plan


@shangyian shangyian force-pushed the smart-cube-materialization branch from 1736e0b to f1eeba3 Compare February 3, 2025 21:14
@shangyian shangyian changed the title Smart cube materialization Consolidate to Single Cube Materialization Option Feb 4, 2025
@shangyian shangyian requested a review from agorajek February 7, 2025 19:10
@agorajek (Member) left a comment:

Looks great. A few minor comments inline.

@@ -47,6 +58,95 @@ async def get_cube(
return await get_cube_revision_metadata(session, name)


@router.get("/cubes/{name}/materialization", name="Materialization Config Cube")
Reviewer (Member):

nit: "Materialization Config Cube" or "Cube Materialization Config" ?

Requirements:
- The cube must have a temporal partition column specified.
- The job strategy will always be "incremental time".
Reviewer (Member):

Why always? I think we should provide a "full replacement" as an option.

shangyian (author):

Oh yeah, I definitely agree that we should provide it as an option, but for this first cut I didn't want to account for supporting both "full" and "incremental". It was easier to just raise an error until we're ready to implement "full".

"measures_materializations":
We group the metrics by parent node. Then we try to pre-aggregate each parent node as
much as possible to prepare for metric queries on the cube's dimensions.
"combiners":
Reviewer (Member):

This is a great feature, but I wonder how often it will be used.

shangyian (author):

Hard to say, but honestly I ran into this problem almost immediately, so I suspect it will come up more often than we expect.

"The cube must have a temporal partition column set "
"in order for it to be materialized.",
)
temporal_partition = temporal_partitions[0] if temporal_partitions else None
Reviewer (Member):

nit: Add a comment that if more than 2 temporal partitions are defined we pick a random one.

shangyian (author):

I actually changed the earlier check for temporal partitions to if len(temporal_partitions) != 1 -- that way we catch and raise for cases with multiple temporal partitions. The only use case I can see for multiple temporal partitions is a "date int" partition plus an "hour" partition, but we'll need more metadata to support that anyway, so it's better to raise for now.
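The check described in this reply could be sketched as follows; the function name is hypothetical, and the error message wording is borrowed from the diff above:

```python
def get_temporal_partition(temporal_partitions):
    """Require exactly one temporal partition column before materializing.

    Raises for zero partitions (the cube cannot be materialized incrementally)
    and for multiple partitions (not yet supported, so we fail loudly rather
    than picking one arbitrarily).
    """
    if len(temporal_partitions) != 1:
        raise ValueError(
            "The cube must have a temporal partition column set "
            "in order for it to be materialized.",
        )
    return temporal_partitions[0]
```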

strategy=(
MaterializationStrategy.INCREMENTAL_TIME
if temporal_partition
else MaterializationStrategy.FULL
Reviewer (Member):

We may never reach this code path... unless we convert this to support full refresh.

shangyian (author):

Yeah, good point. We can do that in a follow-up PR I think.

current_user=current_user,
)

# Druid Cube (this job will take subsume all existing jobs)
Reviewer (Member):

s/will take subsume/will subsume

Reviewer (Member):

This logic looks solid. Nice work!

@shangyian shangyian marked this pull request as ready for review February 8, 2025 01:33
@shangyian shangyian force-pushed the smart-cube-materialization branch from b19f59e to c5900c9 Compare February 8, 2025 01:33