Storing outcomes of similarity functions across a comparison to avoid re-compute #2540

lamaeldo · 2024-12-06T09:45:43Z

Is your proposal related to a problem?

When building custom comparisons, it is common to use the same continuous similarity measure at several comparison levels with different thresholds (for example, Jaro-Winkler > 0.9, > 0.8, > 0.7). Currently, the similarity function has to be computed separately for each comparison level, which is inefficient.

Describe the solution you'd like

For a given comparison, it would make sense for the value of a similarity function to be stored until the end of the comparison and retrieved by subsequent levels that use the same similarity function.
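
To illustrate the idea outside of SQL, here is a minimal sketch in plain Python. It is not how Splink evaluates comparisons: `similarity`, `assign_level`, and the `calls` counter are hypothetical names, and `difflib.SequenceMatcher` is only a stand-in for Jaro-Winkler. The point is that the score is computed once per record pair and then checked against each threshold in turn.

```python
from difflib import SequenceMatcher

calls = 0  # counts how often the (expensive) similarity function runs


def similarity(left: str, right: str) -> float:
    """Stand-in similarity; SequenceMatcher.ratio() is NOT Jaro-Winkler."""
    global calls
    calls += 1
    return SequenceMatcher(None, left, right).ratio()


def assign_level(left: str, right: str, thresholds: list[float]) -> int:
    """Return the index of the first threshold that the score exceeds.

    The score is computed once and reused for every level instead of
    being recomputed for each threshold.
    """
    score = similarity(left, right)  # single evaluation per record pair
    for level, threshold in enumerate(thresholds):
        if score > threshold:
            return level
    return len(thresholds)  # the "else" level: no threshold was exceeded


level = assign_level("martha", "marhta", [0.9, 0.8, 0.7])
print(f"level={level}, similarity calls={calls}")
```

With three thresholds, the naive approach would call the similarity function up to three times for a pair that only matches the last level; here it is always called once.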

RobinL · 2024-12-14T19:37:55Z

Starting to experiment with this as follows:

import sqlglot
from sqlglot import exp

sql = """
SELECT
    CASE
        WHEN levenshtein(name_l, name_r) < 2 THEN 0
        WHEN levenshtein(name_l, name_r) < 4 THEN 1
        WHEN levenshtein(name_l, name_r) < 6 THEN 2
        WHEN jaro(name_l, name_r) < 0.9 THEN 2
        WHEN jaro(name_l, name_r) < 0.8 THEN 2
        ELSE 3
    END as similarity_bin
FROM joined_names
"""


expression = sqlglot.parse_one(sql)

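# Count how often each function call appears in the WHEN branches of each CASE,
# keyed by its rendered SQL text.
function_counts = {}
for case_expr in expression.find_all(exp.Case):
    for when_expr in case_expr.find_all(exp.If):
        for func in when_expr.find_all(exp.Func):
            func_sql = func.sql()
            function_counts[func_sql] = function_counts.get(func_sql, 0) + 1


# Function calls that occur more than once are candidates for reuse.
repeated_functions = {func for func, count in function_counts.items() if count > 1}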


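def transform(node):
    # Swap each repeated call for a placeholder named after its SQL text.
    # Note: exp.Literal.string renders as a quoted string literal, not a column reference.
    if isinstance(node, exp.Func) and node.sql() in repeated_functions:
        cleaned = "".join(c if c.isalnum() else "_" for c in node.sql().lower())
        return exp.Literal.string(cleaned)
    return node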


transformed = expression.transform(transform)
print(transformed.sql())

lamaeldo added the enhancement New feature or request label Dec 6, 2024

lamaeldo changed the title ~~[FEAT] <title>~~ Storing outcomes of similarity functions across a comparison to avoid re-compute Dec 6, 2024
