Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Splitting a query into CTEs #21 #45

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

MohamedBsh
Copy link

Related to #21

Parsing SQL queries and extraction of CTEs within queries.
Add --materialize-ctes option in the CLI to activate this feature.
Materializing CTEs so we can analyze them quickly.

I hope this helps you at carbonfact @MaxHalford 🤞🤞

pytest -s test_examples.py::test_jaffle_shop_materialize_ctes

Original query:                                                                                                                                                                                                                                                         
           with customer_orders as (                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                               
                   select                                                                                                                                                                                                                                                                      
                   customer_id,                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                               
                   min(order_date) as first_order,                                                                                                                                                                                                                                             
                   max(order_date) as most_recent_order,                                                                                                                                                                                                                                       
                   count(order_id) as number_of_orders                                                                                                                                                                                                                                         
               from jaffle_shop_ctes_max.staging.orders                                                                                                                                                                                                                                        
                                                                                                                                                                                                                                                                                               
               group by customer_id                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                               
           ),                                                                                                                                                                                                                                                                                  
                                                                                                                                                                                                                                                                                               
           customer_payments as (                                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                               
               select                                                                                                                                                                                                                                                                          
                   orders.customer_id,                                                                                                                                                                                                                                                         
                   sum(amount) as total_amount                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                               
               from jaffle_shop_ctes_max.staging.payments                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                               
               left join jaffle_shop_ctes_max.staging.orders orders using (order_id)                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                               
               group by orders.customer_id                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                               
           )                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                               
           select                                                                                                                                                                                                                                                                              
               -- #UNIQUE                                                                                                                                                                                                                                                                      
               customers.customer_id,                                                                                                                                                                                                                                                          
               customers.first_name,                                                                                                                                                                                                                                                           
               customers.last_name,                                                                                                                                                                                                                                                            
               customer_orders.first_order,                                                                                                                                                                                                                                                    
               customer_orders.most_recent_order,                                                                                                                                                                                                                                              
               customer_orders.number_of_orders,                                                                                                                                                                                                                                               
               customer_payments.total_amount as customer_lifetime_value                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                                                               
           from jaffle_shop_ctes_max.staging.customers customers                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                               
           left join customer_orders                                                                                                                                                                                                                                                           
               on customers.customer_id = customer_orders.customer_id                                                                                                                                                                                                                          
                                                                                                                                                                                                                                                                                               
           left join customer_payments                                                                                                                                                                                                                                                         
               on  customers.customer_id = customer_payments.customer_id                                                                                                                                                                                                                       
           Split query result:                                                                                                                                                                                                                                                     runner.py:76
           Number of CTEs found: 2                                                                                                                                                                                                                                                 runner.py:76
           CTE 'customer_orders': SELECT customer_id, MIN(order_date) AS first_order, MAX(order_date) AS most_recent_order, COUNT(orde...                                                                                                                                          runner.py:76
           CTE 'customer_payments': SELECT orders.customer_id, SUM(amount) AS total_amount FROM jaffle_shop_ctes_max.staging.payments LE...                                                                                                                                        runner.py:76
           Attempting to materialize CTE: core.customers__customer_orders                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
     ..
           Successfully materialized CTE: core.customers__customer_orders                                                                                                                                                                                                          runner.py:76
           Attempting to materialize CTE: core.customers__customer_payments                                                                                                                                                                                                        runner.py:76

.. 
           Successfully materialized CTE: core.customers__customer_payments                                                                                                                                                                                                        runner.py:76
…

All tables:
analytics.finance__kpis
analytics.kpis
core.customers
core.customers__customer_orders
core.customers__customer_payments
core.orders
core.orders__order_payments
staging.customers
staging.orders
staging.payments

CTE tables:
analytics.finance__kpis
core.customers__customer_orders
core.customers__customer_payments
core.orders__order_payments

core.customers__customer_orders data:
(1, '2018-01-01', '2018-02-10', 2)
(2, '2018-01-11', '2018-01-11', 1)
(3, '2018-01-02', '2018-03-11', 3)
(6, '2018-02-19', '2018-02-19', 1)
(7, '2018-01-14', '2018-01-14', 1)

core.customers__customer_payments data:
(1, 33.0)
(2, 23.0)
(3, 65.0)
(6, 8.0)
(7, 26.0)

core.customers data:
(1, 'Michael', 'P.', '2018-01-01', '2018-02-10', 2, 33.0)
(2, 'Shawn', 'M.', '2018-01-11', '2018-01-11', 1, 23.0)
(3, 'Kathleen', 'P.', '2018-01-02', '2018-03-11', 3, 65.0)
(6, 'Sarah', 'R.', '2018-02-19', '2018-02-19', 1, 8.0)
(7, 'Martin', 'M.', '2018-01-14', '2018-01-14', 1, 26.0)

All CTE materialization tests passed successfully.
PASSED

Copy link
Member

@MaxHalford MaxHalford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR. It's useful.

I would advise taking another look at the implementation. For me, the current implementation is too adhoc, and doesn't fit in well with the rest of the code. I think the code can be much more elegant by pre-splitting each query into CTEs, and making View objects with each CTE.

lea/runner.py Outdated Show resolved Hide resolved
lea/runner.py Outdated Show resolved Hide resolved
lea/runner.py Outdated Show resolved Hide resolved
lea/runner.py Outdated
Comment on lines 443 to 453
view = self.views[view_key].with_context(
table_reference_mapping=table_reference_mapping
)
if materialize_ctes and isinstance(view, SQLView):
print(self._materialize_ctes_and_view)
job = functools.partial(self._materialize_ctes_and_view, view=view)
else:
job = functools.partial(
self.client.materialize_view,
view=view,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't approach it this way. What I would suggest, is to split the query into CTEs, and create a View object for each CTE. I would then add these objects to the list of views that will be executed. This way, you benefit from the rest of lea, and don't need to handle the execution adhoc. I think you should rework your code so that it leverages what already exists, if that makes sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants