Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add validation for missing metric_time when querying to/through SCD #1451

Merged
merged 3 commits into from
Oct 14, 2024

Conversation

serramatutu
Copy link
Contributor

@serramatutu serramatutu commented Oct 9, 2024

Summary

This PR changes MetricQueryValidationRule to raise ScdRequiresMetricTimeIssue in case the user submits a query that uses an SCD but does not group by time.

Implementation

To make this happen, I:

  • refactored how MetricQueryValidationRule works to return a new QueryAnalysisResult and avoid creating a bunch of separate methods that return booleans
  • changed MetricQueryValidationRule to make it get the specs for all the joinable SCDs for the given metric, match the group by input against it and raise an issue if there's no metric_time in the query
  • created a new LinkableElementProperty called SCD_HOP, which indicates that you need to join to an SCD somewhere along the path to join to that element
  • changed the ValidLinkableSpecResolver to add SCD_HOP to all the linkable elements which go to/through SCDs
  • added tests for joins to an SCD and through an SCD

Copy link
Contributor

@courtneyholcomb courtneyholcomb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good!! I left a bunch of small notes inline, and one bigger question.

@@ -221,3 +222,12 @@ def get_min_queryable_time_granularity(self, metric_reference: MetricReference)
minimum_queryable_granularity = defined_time_granularity

return minimum_queryable_granularity

def get_joinable_scd_specs_for_metric(self, metric_reference: MetricReference) -> Sequence[LinkableInstanceSpec]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, my main question about this PR - why did you decide to put this method on MetricLookup instead of SemanticModelLookup? (I think we discussed putting it on SemanticModelLookup.)
Is there a reason you ended up needing to specifically get the SCDs that are joinable for the metric, instead of all SCDs that exist in the manifest? There is other logic in the group by resolver that will ensure the group bys are valid for the metric, so we shouldn't need that for this task.

Unless I'm missing some context there, I think we can simplify this PR if we do the following:

  • Remove this method and instead just store all possible SCD specs on SemanticModelLookup. You can use that for the spec pattern matching. You should be able to build that as a sequence of LinkableInstanceSpecs in the __init__ logic.
  • Then we can avoid dealing with LinkableElements at all, so you could remove the new SCD_HOP LinkableElementProperty and all the logic associated with that.

Copy link
Contributor Author

@serramatutu serramatutu Oct 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed this in DMs but I wanna document it here: the reason I'm using the LinkableElements is because a dimension being an SCD_HOP or not depends on the join path.

Let's say we have 4 models: bookings, agents, users and agencies. Each booking is booked by an user with the help of an agent which is part of an agency. Each user's bookings are always handled by the same agency but not necessarily the same agent, so there is a direct link between user and agency as well. Now suppose agents is an SCD table.

  • If we query for bookings_revenue grouped by user__agency__name, we don't need metric_time since neither user nor agency are SCDs.

  • If we now query for bookings_revenue grouped by agent__agency__name, we do need metric_time since we join to agency via agent, and agent is an SCD.

Since the same exact dimension (agency__name in the example above) might or might not be treated as an SCD depending on the join path, we need to analyze this in the path resolver based on the source metric, so it can't go in SemanticModelLookup.

@cla-bot cla-bot bot added the cla:yes label Oct 10, 2024
@serramatutu serramatutu force-pushed the serramatutu/scd-time-dimension-validation branch from d3bbf19 to 53eae93 Compare October 10, 2024 15:52
@dbt-labs dbt-labs deleted a comment from cla-bot bot Oct 10, 2024
@serramatutu serramatutu marked this pull request as ready for review October 10, 2024 16:13
@serramatutu
Copy link
Contributor Author

@plypaul I'm asking for your review since you're the one doing most of the profiling and I refactored the metric_time_requirements.py file. Is there a reason you only cached one of the methods and not the other? Do you think if I cache the whole of _get_query_analysis will that grow too much in memory or affect performance somehow?

I'll wait for this answer before I rebase from main :)

Copy link
Contributor

@courtneyholcomb courtneyholcomb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!! I'll wait until you've rebased and added the caching logic to approve.

Also, can you add one more test? Something like this test, which will give us a snapshot of what the new error looks like to users.

@serramatutu serramatutu force-pushed the serramatutu/scd-time-dimension-validation branch from 03ce571 to afee199 Compare October 11, 2024 14:07
`MetricTimeQueryValidationRule` had some methods that were only called
once and would loop multiple times over the group by items. I
consolidated that into a single `_get_query_items_analysis()`.

`PostResolutionQueryValidationRule` had a `_get_metric()` method that
had only one callsite and was just a proxy to
`_manifest_lookup.metric_lookup.get_metric()`. I removed that and added
a direct call to the `ManifestLookup`.
@serramatutu serramatutu force-pushed the serramatutu/scd-time-dimension-validation branch from afee199 to 3ccc016 Compare October 11, 2024 14:16
Copy link

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@serramatutu
Copy link
Contributor Author

Since the last review, I:

  • Changed the 2 tests to use assert_str_snapshot_equal with the error message. I added 2 .txt files with the expected validation errors to compare against.
  • Rebased the fixup! commits.
  • Rebased against main to get @plypaul's caching changes.
  • Added a new changelog entry with changie new.

scd_query_parser: MetricFlowQueryParser,
) -> None:
"""Test that queries that join through SCDs semantic models fail if no time dimensions are selected."""
with pytest.raises(InvalidQueryException) as exc_info:
Copy link
Contributor Author

@serramatutu serramatutu Oct 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@courtneyholcomb I changed these tests from using the match regex to use assert_str_snapshot_equal, and I created the 2 new .txt files with the full error.

Copy link
Contributor

@courtneyholcomb courtneyholcomb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left 2 small notes inline, but this looks great!! Awesome work!

)

def _group_by_items_include_agg_time_dimension(
@lru_cache
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking - I think Paul said it was fine to use this lru_cache becasue the number of items is small enough?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. I asked him during standup and he said that was OK!

This commit adds a new step to `MetricTimeQueryValidationRule` which
does the following:
1. Selects all the existing SCDs for the queried metric
2. Match them against the spec pattern of the group by input
3. Raise a `SCDRequiresMetricTimeIssue` if no `metric_time` was provided
  and there were matches

To accomplish step 1, I had to create a new `LinkableElementProperty`
called `SCD_HOP`. This new property indicates that the join path to the
linkable element goes through an SCD at some point.

I changed the `ValidLinkableSpecResolver` to add `SCD_HOP` to the
properties of all the elements it finds whenever that element belongs to
an SCD or if the path to it contains an SCD.
This commit adds the changelog entry about the SCD in join path
validation.
@serramatutu serramatutu force-pushed the serramatutu/scd-time-dimension-validation branch from 7d515e6 to 2c8c770 Compare October 14, 2024 12:25
@serramatutu serramatutu merged commit af045aa into main Oct 14, 2024
11 checks passed
@serramatutu serramatutu deleted the serramatutu/scd-time-dimension-validation branch October 14, 2024 12:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants