TFIDF min_similarity not applied #49

Open
philkoch opened this issue Oct 26, 2022 · 4 comments

philkoch commented Oct 26, 2022

When using the TFIDF model, the min_similarity parameter does not seem to be applied to the results.

Minimal example that reproduces the problem (polyfuzz 0.4.0):

from polyfuzz import PolyFuzz
from polyfuzz.models import TFIDF

if __name__ == "__main__":
    token_list = [
        "Stoltenbergs",
        "Ansage",
        "Putin",
        "Nato",
        "Drohungen",
        "Russlands",
        "Nato",
        "Unterstützung",
        "Ukraine",
        "Stoltenberg",
        "Putin",
        "Nato",
    ]

    matcher = TFIDF(n_gram_range=(3, 3), min_similarity=0.9)
    model = PolyFuzz(matcher)
    model.match(token_list)
    model.group()
    matches = model.get_matches()
    print(matches)

Running the code produces the output below. If I understand the documentation correctly, rows 4 and 7 should have a Similarity score of 0, since the documentation for min_similarity says:

The minimum similarity between strings, otherwise return 0 similarity

I would expect every row with a Similarity below 0.9 to have a Similarity of 0 and a To value of None.

Output:

             From             To  Similarity          Group
0    Stoltenbergs    Stoltenberg       0.932   Stoltenbergs
1          Ansage           None       0.000           None
2           Putin          Putin       1.000          Putin
3            Nato           Nato       1.000           Nato
4       Drohungen  Unterstützung       0.091  Unterstützung
5       Russlands           None       0.000           None
6            Nato           Nato       1.000           Nato
7   Unterstützung      Drohungen       0.091      Drohungen
8         Ukraine           None       0.000           None
9     Stoltenberg   Stoltenbergs       0.932   Stoltenbergs
10          Putin          Putin       1.000          Putin
11           Nato           Nato       1.000           Nato

If I'm using the library incorrectly, how can I get only results with a similarity higher than 0.9?

@MaartenGr
Owner

You are using the library correctly, but it seems that min_similarity was not implemented properly for all cosine similarity backends. I will make sure this gets fixed in an upcoming release. For now, if you want to use this feature, you can do it with:

pip install polyfuzz[fast]
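
If installing the extra is not an option, the threshold can also be enforced after matching. The following is only a sketch of that idea, not part of PolyFuzz: `apply_min_similarity` is a hypothetical helper, and it assumes the DataFrame returned by `get_matches()` has the From, To, and Similarity columns shown in the output above. The sample rows are copied from that output so the snippet runs without PolyFuzz installed. The Group column is not handled here.

```python
import pandas as pd


def apply_min_similarity(matches: pd.DataFrame, min_similarity: float) -> pd.DataFrame:
    """Hypothetical helper: blank out matches that fall below the threshold.

    Rows with Similarity < min_similarity get To=None and Similarity=0,
    which is the behaviour the original post expected. The input is not modified.
    """
    result = matches.copy()
    below = result["Similarity"] < min_similarity
    result.loc[below, "To"] = None
    result.loc[below, "Similarity"] = 0.0
    return result


# A few rows copied from the output in this issue (stand-in for model.get_matches())
matches = pd.DataFrame(
    {
        "From": ["Stoltenbergs", "Ansage", "Putin", "Drohungen"],
        "To": ["Stoltenberg", None, "Putin", "Unterstützung"],
        "Similarity": [0.932, 0.000, 1.000, 0.091],
    }
)

filtered = apply_min_similarity(matches, min_similarity=0.9)
print(filtered)
```

To keep only the rows that pass the threshold instead of blanking them, a plain boolean filter such as `matches[matches["Similarity"] >= 0.9]` should be enough.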

@philkoch
Author

I will try that, thanks for the quick response!

@nitindabadghav

Hello Maarten,
Whichever model I use with PolyFuzz, the model parameters are never applied. Is there any workaround for this?

Thanks,
Nitin

@MaartenGr
Owner

@nitindabadghav Could you provide a bit more information? What version do you use? Can you share your code? Have you tried the answer I provided above? Etc.