Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Derived & component metrics in same mf query results in complex/unnecessary joins #422

Open
callum-mcdata opened this issue Jan 27, 2023 · 3 comments
Labels
backlog bug Something isn't working

Comments

@callum-mcdata
Copy link
Contributor

Apologies on the bad title, not sure how exactly to describe this one?

Describe the bug
I believe that the new derived metric type does not recognize the component metrics included as metrics during the query generation. When a user queries both the derived and the component metrics, the SQL generated appears to include two different queries that are then full outer joined, similar to the pattern executed for metrics that don't share the same base table.

This assumption of what the issue is is partially derived by a comment by @QMalcolm in #416 where he mentioned our measures (which is how we treat component metrics) and then re-enforced when comparing the SQL generated by querying derived+component metrics versus just derived metrics.

Here is the SQL produced when querying both the derived & component metrics:

Derived metric + component metrics

MF Command: mf query --metrics revenue,costs,taxes,net_profit --dimensions metric_time__month,location_name

SELECT
  COALESCE(subq_4.metric_time__month, subq_10.metric_time__month) AS metric_time__month
  , COALESCE(subq_4.location_name, subq_10.location_name) AS location_name
  , MAX(subq_4.revenue) AS revenue
  , MAX(subq_4.costs) AS costs
  , MAX(subq_4.taxes) AS taxes
  , MAX(subq_10.net_profit) AS net_profit
FROM (
  ### This appears to be the subquery selecting the base metrics 
  SELECT
    DATE_TRUNC('month', ordered_at) AS metric_time__month
    , location_name
    , SUM(order_item_total) AS revenue
    , SUM(product_cost) AS costs
    , SUM(order_item_tax_paid) AS taxes
  FROM dbt_test.dbt_metrics.order_items order_items_src_0
  GROUP BY
    DATE_TRUNC('month', ordered_at)
    , location_name
) subq_4
FULL OUTER JOIN (
  ### This appears to be the subquery selecting the derived metric and ignoring the component metrics calculated in the same subquery
  SELECT
    metric_time__month
    , location_name
    , revenue - (costs + taxes) AS net_profit
  FROM (
    SELECT
      DATE_TRUNC('month', ordered_at) AS metric_time__month
      , location_name
      , SUM(order_item_total) AS revenue
      , SUM(product_cost) AS costs
      , SUM(order_item_tax_paid) AS taxes
    FROM dbt_test.dbt_metrics.order_items order_items_src_0
    GROUP BY
      DATE_TRUNC('month', ordered_at)
      , location_name
  ) subq_9
) subq_10
ON
  (
    subq_4.location_name = subq_10.location_name
  ) AND (
    subq_4.metric_time__month = subq_10.metric_time__month
  )
GROUP BY
  COALESCE(subq_4.metric_time__month, subq_10.metric_time__month)
  , COALESCE(subq_4.location_name, subq_10.location_name)

Here is the SQL produced when just querying the derived metric

Just derived metric

MF Command: mf query --metrics net_profit --dimensions metric_time__month --order metric_time__month

SELECT
  metric_time__month
  , revenue - (costs + taxes) AS net_profit
FROM (
  SELECT
    DATE_TRUNC('month', ordered_at) AS metric_time__month
    , SUM(order_item_total) AS revenue
    , SUM(product_cost) AS costs
    , SUM(order_item_tax_paid) AS taxes
  FROM dbt_test.dbt_metrics.order_items order_items_src_2
  GROUP BY
    DATE_TRUNC('month', ordered_at)
) subq_4
ORDER BY metric_time__month

Steps To Reproduce

  1. Create two metrics based on table A.
  2. Create a derived metric based on table A.
  3. Query both the metrics and the derived metric.

Expected behavior
I would suspect that the query generated in the second example would be used but the component metrics being calculated in the subquery would be used.

Environment (please complete the following information):

  • MetricFlow Version 0.14.0

Note:

The query works just fine but wanted to 🎏 because I was digging into the new implementation of derived metrics to test out the dbt integration

@callum-mcdata callum-mcdata added bug Something isn't working triage Tasks that need to be triaged labels Jan 27, 2023
@Jstein77 Jstein77 added backlog and removed triage Tasks that need to be triaged labels Aug 30, 2023
@giuseppecg
Copy link

Hi, guys!

I would like to second @callum-mcdata problem. Here we intend to use metric flow to generate a more generic datasets based on our data warehouse. However, as of right now, we can't use metricflow directly.

The problem for us is that it generates a more complicated query just to calculate the ratio and derived metrics on different subqueries, which wouldn't be a complete problem, but it locks the dimensions by which is calculated. For example, if a have mf query --metrics simple_metric, ratio_metric --group-by categorical_group, another_categorical_group,some_sk I can't get the correct values for the ratio_metric if I take some_sk afterwards outside of metricflow. So I have to always generate a new dataset 😕 and governance-wise this became a problem for us.

I get that the current for facilitates the filter option as well as other features, but it really doesn't work when you want less datasets for your business stakeholder to make sense.

Environment (please complete the following information):

  • MetricFlow Version 0.202.0

Note:

The query does indeed work fine, it's just a question of generalization.

@Jstein77
Copy link
Contributor

@giuseppecg thanks for adding more context to this issue!

I created another issue with a proposed approach to solving the complex SQL rending. Would love your eyes on that.

but it locks the dimensions by which is calculated. For example, if a have mf query --metrics simple_metric, ratio_metric --group-by categorical_group, another_categorical_group,some_sk I can't get the correct values for the ratio_metric if I take some_sk afterwards outside of metricflow. So I have to always generate a new dataset 😕 and governance-wise this became a problem for us.

I wanted to dig into this a bit more because I think it may be a separate issue. Even if we generate less complex SQL for this case, you still won't be able to re-aggregate ratio metrics across different dimensions. This is one of the drawbacks to using static data sets. What's the reason you decided to go with this approach? Is it mainly as an integration path for a BI tool we don't yet support?

@soham-dasgupta
Copy link

Generation of queries with FULL OUTER JOIN is much less efficient that reusing metrics from Subqueries. This is really an anti pattern. Do we have eyes on this ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backlog bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants