add fuzz transformer #25

shahrukhx01 · 2021-07-10T21:55:59Z

Hi @MaartenGr,
I have fine-tuned a transformer for character-level similarity to do fuzzy matching. You can read about how I did it here:
LinkedIn post explanation: https://www.linkedin.com/feed/update/urn:li:activity:6819456033992253440/
Model on hugging face hub: https://huggingface.co/shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher

Would you like me to create a pull request if it fits PolyFuzz?

Thanks,
Shahrukh

MaartenGr · 2021-07-11T07:30:06Z

You already can! PolyFuzz supports Flair, which in turn supports sentence-transformers, on which your model is based. If you run the following code, you can use the model:

from polyfuzz import PolyFuzz
from polyfuzz.models import Embeddings
from flair.embeddings import SentenceTransformerDocumentEmbeddings

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

embedding = SentenceTransformerDocumentEmbeddings('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')
matcher = Embeddings(embedding, min_similarity=0)
model = PolyFuzz(matcher).match(from_list, to_list)

shahrukhx01 · 2021-07-11T08:09:36Z

Thanks for your response. I was able to run the model; however, it produces substandard results compared to my original implementation. The reason is that in my implementation, I split the input string into characters before tokenization, which really helps the model optimize for the distance objective. For instance, "hello" would be preprocessed as "h e l l o". Please let me know how to proceed with this. Also, would you like me to document this model in the README?
Please see the results below as well

MaartenGr · 2021-07-15T06:24:09Z

Hmmm, in that case, would it not be a matter of preprocessing the words before passing them to PolyFuzz? Something like this:

from polyfuzz import PolyFuzz
from polyfuzz.models import Embeddings
from flair.embeddings import SentenceTransformerDocumentEmbeddings

embedding = SentenceTransformerDocumentEmbeddings('shahrukhx01/paraphrase-mpnet-base-v2-fuzzy-matcher')

from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]
to_list = ["apple", "apples", "mouse"]

# Space-separate the characters of each word, e.g. "apple" -> "a p p l e"
from_list = [" ".join(word) for word in from_list]
to_list = [" ".join(word) for word in to_list]

matcher = Embeddings(embedding, min_similarity=0)
model = PolyFuzz(matcher).match(from_list, to_list)

Then you would only need to transform them back into words. I am a bit hesitant to add support for a specific model that I currently have no benchmark for. Do you have a paper related to this model?
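For the round trip described here (space-separating characters before matching, then transforming them back into words), a minimal sketch could look like the following. The helper names `to_chars` and `build_lookup` are hypothetical and not part of PolyFuzz. The lookup dictionary is a design assumption: it restores the original strings exactly, even if a word itself contains spaces, which simply stripping spaces would not.

```python
from typing import Dict, List


def to_chars(word: str) -> str:
    """Space-separate the characters of a word, e.g. "hello" -> "h e l l o"."""
    return " ".join(word)


def build_lookup(words: List[str]) -> Dict[str, str]:
    """Map each preprocessed string back to the original word (hypothetical helper)."""
    return {to_chars(word): word for word in words}


from_list = ["apple", "apples", "appl", "recal", "house", "similarity"]

# Preprocess before handing the strings to the matcher.
lookup = build_lookup(from_list)
spaced = list(lookup.keys())

# After matching, map the spaced strings back to the original words.
restored = [lookup[item] for item in spaced]

print(spaced[0])
print(restored)
```

The same lookup could be built for `to_list`, so that both columns of the match results can be mapped back to the original words.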

shahrukhx01 · 2021-07-15T06:33:44Z

@MaartenGr I plan to write an arXiv paper on this; however, it could take some time. In the meantime, would you be okay if I did a direct comparative analysis of this model against the BERT-based embedding model in PolyFuzz? I already have a dataset for fuzzy-matching benchmarking.

MaartenGr · 2021-07-20T08:22:58Z

The problem with the dataset that you shared is that its values are not ground truth, since they are computed with Levenshtein. A model that focuses on char-level embeddings is therefore likely to outperform a model that does not, regardless of its actual accuracy. It would be nice if you could test on a dataset that is often used for string-matching research.
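To illustrate why Levenshtein-derived labels favor char-level models, here is a minimal sketch of how such a label might be computed. The thread does not state the exact formula used for the dataset, so the normalization in `similarity` (dividing by the longer string's length) is an assumption. The point is only that the score is a pure function of the characters, so it measures edit distance rather than whether two strings actually refer to the same thing.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; the normalization is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


# "house" and "mouse" differ by one substitution, so they score high
# even though they are unrelated words.
print(levenshtein("house", "mouse"), similarity("house", "mouse"))
```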

shahrukhx01 · 2021-07-20T09:09:00Z

Could you point me to a dataset that could be used here? Also, is there any chance we can collaborate on writing something formal (an arXiv paper or similar) about different neural approaches to string matching?

MaartenGr · 2021-07-27T11:37:02Z

Apologies for the late response. I believe it would take several datasets and evaluation measures to thoroughly validate the model that you created. Although I would be interested in collaborating, I am afraid I currently do not have the time to write an extensive paper on the subject.

shahrukhx01 · 2021-07-27T11:57:02Z

That won't be a problem. I'm willing to do the write-ups and experimentation, since I will be on summer break from school. It would be really great if you could help with ideas and review what I do. Please let me know if that's possible for you :)

MaartenGr · 2021-08-03T05:32:41Z

I cannot make any promises, but perhaps I can make some time to review ideas and experiments. It would be interesting to have a nice overview of string-similarity algorithms.

shahrukhx01 mentioned this issue Jul 13, 2021

add char level preprocessing for fuzz transformer #26

Closed
