Consolidate to Single Cube Materialization Option #1304
@@ -12,15 +12,26 @@
from datajunction_server.construction.dimensions import build_dimensions_from_cube_query
from datajunction_server.database.node import Node
from datajunction_server.database.user import User
from datajunction_server.errors import DJInvalidInputException
from datajunction_server.internal.access.authentication.http import SecureAPIRouter
from datajunction_server.internal.access.authorization import validate_access
from datajunction_server.internal.materializations import build_cube_materialization
from datajunction_server.internal.nodes import get_cube_revision_metadata
from datajunction_server.models import access
from datajunction_server.models.cube import (
    CubeRevisionMetadata,
    DimensionValue,
    DimensionValues,
)
from datajunction_server.models.cube_materialization import (
    DruidCubeMaterializationInput,
    UpsertCubeMaterialization,
)
from datajunction_server.models.materialization import (
    Granularity,
    MaterializationJobTypeEnum,
    MaterializationStrategy,
)
from datajunction_server.models.metric import TranslatedSQL
from datajunction_server.models.query import QueryCreate
from datajunction_server.naming import from_amenable_name
@@ -47,6 +58,95 @@ async def get_cube(
    return await get_cube_revision_metadata(session, name)

@router.get("/cubes/{name}/materialization", name="Cube Materialization Config") | ||
async def cube_materialization_info( | ||
name: str, | ||
session: AsyncSession = Depends(get_session), | ||
) -> DruidCubeMaterializationInput: | ||
""" | ||
The standard cube materialization config. DJ makes sensible materialization choices | ||
where possible. | ||
|
||
Requirements: | ||
- The cube must have a temporal partition column specified. | ||
- The job strategy will always be "incremental time". | ||
|
||
Outputs: | ||
"measures_materializations": | ||
We group the metrics by parent node. Then we try to pre-aggregate each parent node as | ||
much as possible to prepare for metric queries on the cube's dimensions. | ||
"combiners": | ||
Reviewer: This is a great feature, but I wonder how often it will be used.
Author: Hard to say, but to be honest I came across this problem almost immediately, so I think it's more often than we think.
        We combine each set of measures materializations on their shared grain. Note that we don't
        support materializing cubes with measures materializations that don't share the same grain.
        However, we keep `combiners` as a list for the eventual future where we support that.
    "metrics":
        We include a list of metrics, their required measures, and the derived expression (e.g., the
        expression used by the metric that makes use of the pre-aggregated measures).

    Once we create a scheduled materialization workflow, we freeze the metadata for that particular
    materialized dataset. This allows us to reconstruct metrics SQL from the dataset when needed.
    To request metrics from the materialized cube, use the metrics' measures metadata.
    """
    node = await Node.get_cube_by_name(session, name)
    temporal_partitions = node.current.temporal_partition_columns()  # type: ignore
    if len(temporal_partitions) != 1:
        raise DJInvalidInputException(
            "The cube must have a single temporal partition column set "
            "in order for it to be materialized.",
        )
    temporal_partition = temporal_partitions[0] if temporal_partitions else None
Reviewer: nit: Add a comment that if more than 2 temporal partitions are defined we pick a random one.
Author: I actually changed the earlier part that checks for temporal partitions to be …

    granularity_lookback_defaults = {
        Granularity.MINUTE: "1 MINUTE",
        Granularity.HOUR: "1 HOUR",
        Granularity.DAY: "1 DAY",
        Granularity.WEEK: "1 WEEK",
        Granularity.MONTH: "1 MONTH",
        Granularity.QUARTER: "1 QUARTER",
        Granularity.YEAR: "1 YEAR",
    }
    granularity_cron_defaults = {
        Granularity.MINUTE: "* * * * *",  # Runs every minute
        Granularity.HOUR: "0 * * * *",  # Runs at the start of every hour
        Granularity.DAY: "0 0 * * *",  # Runs at midnight every day
        Granularity.WEEK: "0 0 * * 0",  # Runs at midnight on Sundays
        Granularity.MONTH: "0 0 1 * *",  # Runs at midnight on the first of every month
        Granularity.QUARTER: "0 0 1 */3 *",  # Runs at midnight on the first day of each quarter
        Granularity.YEAR: "0 0 1 1 *",  # Runs at midnight on January 1st every year
    }
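    # Worked example: a cube whose temporal partition has HOUR granularity gets
    # a lookback window of "1 HOUR" and the cron schedule "0 * * * *" from the
    # maps above; any granularity missing from either map falls back to the DAY
    # defaults ("1 DAY" and "0 0 * * *").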
    upsert = UpsertCubeMaterialization(
        job=MaterializationJobTypeEnum.DRUID_CUBE,
        strategy=(
            MaterializationStrategy.INCREMENTAL_TIME
            if temporal_partition
            else MaterializationStrategy.FULL
Reviewer: We may never reach this code path... unless we convert this to support full refresh.
Author: Yeah, good point. We can do that in a follow-up PR I think.
        ),
        lookback_window=granularity_lookback_defaults.get(
            temporal_partition.partition.granularity,
            granularity_lookback_defaults[Granularity.DAY],
        ),
        schedule=granularity_cron_defaults.get(
            temporal_partition.partition.granularity,
            granularity_cron_defaults[Granularity.DAY],
        ),
    )
    cube_config = await build_cube_materialization(
        session,
        node.current,  # type: ignore
        upsert,
    )
    return DruidCubeMaterializationInput(
        name="",
        cube=cube_config.cube,
        dimensions=cube_config.dimensions,
        metrics=cube_config.metrics,
        strategy=upsert.strategy,
        schedule=upsert.schedule,
        job=upsert.job.name,
        measures_materializations=cube_config.measures_materializations,
        combiners=cube_config.combiners,
    )
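
As a usage sketch (not part of this PR's diff): a client could fetch this config over HTTP. The snippet below assumes a DJ server running at http://localhost:8000 and a hypothetical cube named default.repairs_cube; the field names come from the DruidCubeMaterializationInput returned above.

import requests

response = requests.get(
    "http://localhost:8000/cubes/default.repairs_cube/materialization",
)
response.raise_for_status()
config = response.json()

# The strategy and schedule are derived from the cube's temporal partition
# granularity, e.g. a DAY-granularity partition yields the cron schedule
# "0 0 * * *".
print(config["strategy"])
print(config["schedule"])
print(config["measures_materializations"])
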
@router.get("/cubes/{name}/dimensions/sql", name="Dimensions SQL for Cube") | ||
async def get_cube_dimension_sql( | ||
name: str, | ||
|
Reviewer: Why always? I think we should provide a "full replacement" as an option.
Author: Oh yeah, I definitely agree that we should provide it as an option, but for this first cut I didn't want to account for supporting both "full" and "incremental". It was easier to just raise an error until we're ready to implement "full".
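
To make that follow-up concrete, one possible shape for it is sketched below. This is a hypothetical helper, not code from this PR: it lets callers request a FULL refresh explicitly while keeping incremental-time as the default when a single temporal partition exists.

from datajunction_server.errors import DJInvalidInputException
from datajunction_server.models.materialization import MaterializationStrategy


def choose_strategy(temporal_partitions, requested=None):
    """Hypothetical helper: pick a materialization strategy for a cube.

    `requested` lets a caller force a FULL refresh; otherwise we default to
    incremental-time when exactly one temporal partition exists.
    """
    if requested == MaterializationStrategy.FULL:
        return MaterializationStrategy.FULL
    if len(temporal_partitions) == 1:
        return MaterializationStrategy.INCREMENTAL_TIME
    raise DJInvalidInputException(
        "Incremental materialization requires a single temporal partition "
        "column; request a FULL refresh instead.",
    )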