Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Early stop when maximum distance reached [feature request] #1

Open
pimente opened this issue Jul 25, 2017 · 8 comments
Open

Early stop when maximum distance reached [feature request] #1

pimente opened this issue Jul 25, 2017 · 8 comments

Comments

@pimente
Copy link

pimente commented Jul 25, 2017

Hello,

Thank you for this implementation. It is indeed very fast!

I am trying to optimize the algorithm for the specific (but rather common) case where we look for only close items. Stopping early when we know that the distance will certainly be over a given maximum distance can make the algorithm much faster.

I get already a good speed improvement (x2) by stopping early at the very beginning here and/or here with something like:

if (typeof max === 'number' && lb - la >= max) return max;

But when I try to stop early in the for loops below, by keeping the minimum value of vector and stopping if min + lb - la > max, it works but gets slower ;(

Do you have any idea of how to implement this feature in an optimized way?
The best would be of course to support this option directly in your code ;) but any advice is welcome!

Thank you very much for your help.

Cheers

@gustf
Copy link
Owner

gustf commented Jul 26, 2017

Hi,

Sounds like a good feature, I have been thinking about this too :)

It will most likely have a negative impact on the general performance, but if we can find a solution that is not too much slower I think we can include this in the default function. Otherwise we can make a separate function depending on if a max argument is provided.

I will look into this as soon as I can find some time.

Do you have some sample input I can use for testing/benchmarking?

@pimente
Copy link
Author

pimente commented Jul 27, 2017

Great that you like it!

Just to be clear, the formal definition of the option I had in mind is: levenshtein(a, b, max) = min(max, levenshtein(a, b))

Here are some unit tests in your current format (for testing but not benchmarking):

test(t =>
     {
       t.is(levenshtein('a', 'b'), 1);
       t.is(levenshtein('ab', 'ac'), 1);
       t.is(levenshtein('ac', 'bc'), 1);
       t.is(levenshtein('abc', 'axc'), 1);
       t.is(levenshtein('kitten', 'sitting'), 3);
       t.is(levenshtein('xabxcdxxefxgx', '1ab2cd34ef5g6'), 6);
       t.is(levenshtein('cat', 'cow'), 2);
       t.is(levenshtein('xabxcdxxefxgx', 'abcdefg'), 6);
       t.is(levenshtein('javawasneat', 'scalaisgreat'), 7);
       t.is(levenshtein('example', 'samples'), 3);
       t.is(levenshtein('sturgeon', 'urgently'), 6);
       t.is(levenshtein('levenshtein', 'frankenstein'), 6);
       t.is(levenshtein('distance', 'difference'), 5);
       t.is(levenshtein('因為我是中國人所以我會說中文', '因為我是英國人所以我會說英文'), 2);
       // Tests with max option
        // max === 1
       t.is(levenshtein('因為我是中國人所以我會說中文', '因為我是英國人所以我會說英文', 1), 1);
       t.is(levenshtein('distance', 'difference', 1), 1);
       t.is(levenshtein('distance', 'distance', 1), 0);
       t.is(levenshtein('zéphïr', 'zéphïr', 1), 0);
       t.is(levenshtein('zéphir', 'zéphïr', 1), 1);
        // max === 2
       t.is(levenshtein('kitten', 'sitting', 2), 2);
       t.is(levenshtein('xabxcdxxefxgx', '1ab2cd34ef5g6', 2), 2);
       t.is(levenshtein('cat', 'cow', 2), 2);
       t.is(levenshtein('xabxcdxxefxgx', 'abcdefg', 2), 2);
       t.is(levenshtein('javawasneat', 'scalaisgreat', 2), 2);
       t.is(levenshtein('example', 'samples', 2), 2);
       t.is(levenshtein('sturgeon', 'urgently', 2), 2);
       t.is(levenshtein('ubiquitous', 'universal', 2), 2);
       t.is(levenshtein('abcdefghij', 'axbcdefghij', 2), 1);
       t.is(levenshtein('abcdefghij', 'abcdefghij', 2), 0);
        // max === 3
       t.is(levenshtein('abcdefghij', 'abcdefghij', 3), 0);
       t.is(levenshtein('abcdefghij', 'axbcdefghij', 3), 1);
       t.is(levenshtein('abcdefghij', 'acefghij', 3), 2);
       t.is(levenshtein('abcdefghij', 'acefhj', 3), 3);
       t.is(levenshtein('abcdefghij', 'accdffghii', 3), 3);
       t.is(levenshtein('abcdefghij', 'accdffhhii', 3), 3);
});

Best,

@gustf
Copy link
Owner

gustf commented Jul 28, 2017

Ok, thanks for the example input. I see your use case is for smaller strings.

I'm not sure that the formal definition would be the best for a library function as it will make it impossible for someone who would like to know if a string is more than max.

I think it would be more useful for more cases to return Infinity or something like that instead. You can always wrap the call in a function which returns max if the call returns Infinity.

@pimente
Copy link
Author

pimente commented Jul 29, 2017

Yes, I agree with your suggestion.
It will cover more use cases!

@nol13
Copy link

nol13 commented Aug 3, 2017

Would love (a version with) this. Still, looks like I have a new default distance calc for my fuzzball package either way! :D

(changing the 1's on lines 12 and 99 to 2's will correctly set the substitution cost to 2 ya?)

@gustf
Copy link
Owner

gustf commented Aug 14, 2017

@nol13
Sorry for late reply, vacation time here in sweden.

Nice to hear you use this lib in your fuzzball package! And you are right about the changes, that will do the trick 👍

@OoDeLally
Copy link

I would love such feature too. I'm doing word-by-word comparison, with a tolerance of 1 distance, therefore it is pretty ridiculous to continue until distance e.g. 8. please please please please? 🙏

@fonograph
Copy link

Hey all, I needed this feature so I went ahead and tried an implementation. It might not be 100% optimized, but if you're not using the max argument, then it shouldn't impact the existing performance of the algorithm. If it's not good enough feel free to close though. :P

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants