Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Incremental strategies for delete+insert and microbatch cause unnecessary cross joins #1198

Open
3 tasks done
SystemOfaDrow opened this issue Oct 8, 2024 · 0 comments
Labels
enhancement New feature or request triage

Comments

@SystemOfaDrow
Copy link

SystemOfaDrow commented Oct 8, 2024

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt-snowflake functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Currently, the delete+insert incremental strategy requires a "unique_key", but there's nothing actually enforcing this key be unique. I've done this before, but the issue is that it creates a cartesian join between all the records with matching (non-unique) unique_key values because of the USING. The new microbatch strategy does something similar, where if you don't have an additional predicate, the USING bit is unnecessary. In both cases, the join explosion can take up a significant amount of resources.

Describe alternatives you've considered

For the delete+insert strategy, I would propose possibly changing the name the "unique_key" config to "incremental_key", and modifying the query to use a correlated EXISTS subquery.

delete from {{ target }}
where exists (
    select 1
    from {{ source }}
    where
    {% if incremental_key is sequence and incremental_key is not string %}
        {% for key in incremental_key %}
            {{ source }}.{{ key }} = {{ target }}.{{ key }}
            {{ "and " if not loop.last}}
        {% endfor %}
    {% else %}
        {{ source }}.{{ incremental_key }} = {{ target}}.{{ incremental_key }}
    {% endif %}
)
{% if incremental_predicates %}
    {% for predicate in incremental_predicates %}
        {{ predicate }} {{ "and" if not loop.last }}
    {% endfor %}
{% endif %};

The using {{ source }} line in the microbatch strategy should simply be deleted.

Who will this benefit?

This would benefit every who uses these incremental strategies by significantly lowering warehouse load (and therefore costs) for these types of queries.

Are you interested in contributing this feature?

I think the code above should work. It may take me a while before I have time to do all the requisite testing before creating a PR.

Anything else?

I ran one query like DELETE FROM target USING source with no additional predicates where the target table had 10 million records and the source table a billion. You can see how relatively expensive that cartesian join is.

Most Expensive Nodes
CartesianJoin 47.3%
Delete 27.0%
TableScan 1.3%

@SystemOfaDrow SystemOfaDrow added enhancement New feature or request triage labels Oct 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request triage
Projects
None yet
Development

No branches or pull requests

1 participant