add fuzz transformer #25

Open
shahrukhx01 opened this issue Jul 10, 2021 · 9 comments

@shahrukhx01

Hi @MaartenGr,
I have fine-tuned a fuzzy transformer for char-level similarity to do fuzzy matching. You can read about how I did it here:
LinkedIn post explanation: https://www.linkedin.com/feed/update/urn:li:activity:6819456033992253440/
Model on hugging face hub: https://huggingface.co/shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher

Would you like me to create a pull request if it fits PolyFuzz?

Thanks,
Shahrukh

@MaartenGr
Owner

You already can! PolyFuzz supports Flair, which in turn supports sentence-transformers, on which your model is based. If you run the following code, you can use the model:

from polyfuzz import PolyFuzz
from polyfuzz.models import Embeddings
from flair.embeddings import SentenceTransformerDocumentEmbeddings

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

embedding = SentenceTransformerDocumentEmbeddings('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')
matcher = Embeddings(embedding, min_similarity=0)
model = PolyFuzz(matcher).match(from_list, to_list)

@shahrukhx01
Author

shahrukhx01 commented Jul 11, 2021

Thanks for your response. I was able to run the model, but it produces substandard results compared to the actual model. The reason is that my implementation splits the input string into characters before tokenization, which really helps the model optimize for the distance objective. For instance, "hello" would be preprocessed as "h e l l o". Please let me know how to proceed with this. Also, would you like me to document this model in the README?
Please see the results below as well:
[attached image: results]

@MaartenGr
Owner

Hmmm, in that case, would it not be a matter of preprocessing the words before passing them to PolyFuzz? Something like this:

from polyfuzz import PolyFuzz
from polyfuzz.models import Embeddings
from flair.embeddings import SentenceTransformerDocumentEmbeddings

embedding = SentenceTransformerDocumentEmbeddings('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

# Insert a space between every character, e.g. "apple" -> "a p p l e"
from_list = [" ".join(word) for word in from_list]
to_list = [" ".join(word) for word in to_list]

matcher = Embeddings(embedding, min_similarity=0)
model = PolyFuzz(matcher).match(from_list, to_list)

Then you would only need to transform them back into words. I am a bit hesitant about adding support for a specific model that I currently have no benchmark for. Do you have a paper related to this model?
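Transforming the matches back into words could be sketched as follows. This is only an illustration under stated assumptions: the helper names `to_chars`, `build_lookup`, and `restore` are hypothetical and not part of PolyFuzz, and a lookup table is used instead of stripping spaces so that inputs that already contain spaces survive the round trip.

```python
def to_chars(word: str) -> str:
    """Insert a space between every character: "hello" -> "h e l l o"."""
    return " ".join(word)


def build_lookup(words: list[str]) -> dict[str, str]:
    """Map each spaced-out form back to the original word (hypothetical helper)."""
    return {to_chars(word): word for word in words}


def restore(spaced: str, lookup: dict[str, str]) -> str:
    """Return the original word, or the input unchanged if it is not in the lookup."""
    return lookup.get(spaced, spaced)


from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

from_lookup = build_lookup(from_list)
to_lookup = build_lookup(to_list)

# These spaced-out lists are what would be passed to the matcher.
spaced_from = list(from_lookup)
spaced_to = list(to_lookup)

# After matching, every spaced-out string can be mapped back to its word.
restored_from = [restore(s, from_lookup) for s in spaced_from]
restored_to = [restore(s, to_lookup) for s in spaced_to]
```

With PolyFuzz, the same lookups could probably be applied to the "From" and "To" columns of the DataFrame returned by `model.get_matches()`, for example with `matches["From"].map(from_lookup)`. Treat those column names as an assumption to check against the installed version.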

@shahrukhx01
Author

shahrukhx01 commented Jul 15, 2021

@MaartenGr I plan to write an arXiv paper on this, but it could take some time. In the meantime, would you be okay if I did a direct comparative analysis of this model against the BERT-based embedding model in PolyFuzz? I already have a dataset for fuzzy-matching benchmarking.

@MaartenGr
Owner

The thing with the dataset that you shared is that the values generated are not ground truth, since they are computed with Levenshtein. A model that focuses on char-level embeddings is therefore likely to outperform a model that does not, regardless of its actual accuracy. It would be nice if you could test on a dataset that is often used for string-matching research.
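To make the concern concrete, here is a minimal sketch of how Levenshtein-based labels are typically computed. The exact formula used for the shared dataset is not stated in this thread, so the normalization by the longer string's length is an assumption, and the function names are hypothetical. Because the score depends only on character edits, a model trained on character-level input is effectively being scored against its own training objective.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; normalizing by the longer length is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

For example, "house" and "mouse" differ by a single substitution, so they receive a high score under this measure even though they are unrelated words.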

@shahrukhx01
Author

Could you point me to a dataset that could be used here? Also, is there any chance we could collaborate on writing something formal (an arXiv paper or something) about different neural approaches for string matching?

@MaartenGr
Owner

Apologies for the late response. I believe it would take several datasets and evaluation measures to thoroughly validate the model that you created. Although I would be interested in collaborating, I am afraid I currently do not have the time to write an extensive paper on the subject.

@shahrukhx01
Author

That won't be a problem. I'm willing to do the write-ups and experimentation, since I will be on summer break from school. It would be really great if you could help with ideas and with reviewing what I do. Please let me know if that's possible for you :)

@MaartenGr
Owner

I cannot make any promises, but perhaps I can make some time to review ideas and experiments. It would be interesting to have a nice overview of string-similarity-based algorithms.
