Can the processed dictionary created on-the-fly by the command line be saved for re-use? #40
Comments
You have to serialize/deserialize the delete data structure of SymSpell, saving it to a file and loading it back from that file. Those files can then also be stored in MySQL as a BLOB. You can implement your own serialization, use JSON, or use something like FlatBuffers.
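To make the save/load idea concrete, here is a minimal Python sketch. It does not use SymSpell's own code or file layout: `build_index`, `save_index` and `load_index` are hypothetical names, and the index is a toy deletes map (each delete variant points to the dictionary words that produce it), assumed here only for illustration. JSON is used because it is one of the options mentioned above.

```python
import json

def delete_variants(word, max_distance):
    """Return the word plus every string obtained by deleting up to max_distance characters."""
    variants = {word}
    frontier = {word}
    for _ in range(max_distance):
        next_frontier = set()
        for w in frontier:
            for i in range(len(w)):
                next_frontier.add(w[:i] + w[i + 1:])
        variants |= next_frontier
        frontier = next_frontier
    return variants

def build_index(frequencies, max_distance=2):
    """Toy deletes index (an assumption, not SymSpell's layout): variant -> list of source words."""
    index = {}
    for word in frequencies:
        for variant in delete_variants(word, max_distance):
            index.setdefault(variant, []).append(word)
    return index

def save_index(path, frequencies, index, max_distance):
    """Write the precomputed structures to one JSON file so they can be reused later."""
    payload = {"max_distance": max_distance, "frequencies": frequencies, "deletes": index}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f)

def load_index(path):
    """Read the structures back without recomputing the delete variants."""
    with open(path, encoding="utf-8") as f:
        payload = json.load(f)
    return payload["frequencies"], payload["deletes"], payload["max_distance"]
```

The expensive step (`build_index`) runs once; later runs only call `load_index`. The resulting file could equally be written into a BLOB column instead of the file system.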
This is certainly possible to implement as an additional command line parameter. Alternatively one could implement this as a post processing step to filter the complete result list accordingly.
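The post-processing variant could look roughly like the following Python sketch. It assumes the complete result list is already available as `(term, distance, count)` tuples; that tuple shape and the function name `top_n_per_distance` are assumptions for illustration, not the command line tool's actual output format.

```python
import heapq
from collections import defaultdict

def top_n_per_distance(suggestions, n):
    """Keep the n most frequent suggestions for each edit distance.

    suggestions: iterable of (term, distance, count) tuples (assumed shape).
    Returns a list ordered by distance, then by descending count.
    """
    by_distance = defaultdict(list)
    for term, distance, count in suggestions:
        by_distance[distance].append((term, distance, count))

    result = []
    for distance in sorted(by_distance):
        # nlargest avoids fully sorting a group that may hold millions of matches
        result.extend(heapq.nlargest(n, by_distance[distance], key=lambda s: s[2]))
    return result
```

With `n = 5` and a maximum edit distance of 3, this returns at most 15 entries for a query that has no exact match, which is the behaviour asked for in item 2.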
SymSpell.CommandLine allows pipes and redirects for input and output. So does VBA. Could this be a way for those two programs to communicate while the dictionary is already loaded? I will mark this as a future enhancement, but can't give any timeline.
Thanks for the quick reply. I'll investigate what you said about saving down the data structure. At first glance it is not obvious where to begin but I will look through the code to see what I can find that relates to what you said. The reason I had asked about number 2 (limiting the # of responses to the top x per edit distance) is that my dictionary is several million entries and when I send a list of, say, 10000 names to be processed and I allow up to an edit distance of 3 I get back millions and millions of matches. This is particularly true if my list has some long names. What I found is that in most cases the answer I want is in the top 5 matches - but sometimes that match has an edit distance of 3 which means I have to deal with the millions of matches. I'll look more closely at pipes and redirects. Thanks again for a fantastic tool.
I also forgot to ask if you know of anyone who has tried to use nodejs to create an api using your work. I found 1 very old attempt but it is unusable.
As far as I understand, Node.js is a JavaScript run-time environment that executes JavaScript code outside the browser. So it could be used to run one of the SymSpell JavaScript ports. But if you just want an embedded web server to provide a REST API to SymSpell, this can be easily implemented using the HttpListener class.
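As a rough illustration of the embedded web server idea, here is a Python sketch that uses the standard library's `http.server` in place of the .NET HttpListener class. The `/lookup` route, the `term` query parameter and the `lookup` function are all hypothetical; `lookup` is a placeholder where a real service would query the dictionary that was loaded once at startup.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

def lookup(term):
    """Hypothetical placeholder: a real service would query the in-memory dictionary here."""
    return [{"term": term, "distance": 0, "count": 1}]

class LookupHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse(self.path)
        if url.path != "/lookup":
            self.send_error(404)
            return
        # Read ?term=... from the query string; default to an empty term
        term = parse_qs(url.query).get("term", [""])[0]
        body = json.dumps(lookup(term)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress per-request logging in this sketch

def make_server(port=8080):
    """Create a server bound to localhost; call serve_forever() on it to start handling requests."""
    return HTTPServer(("127.0.0.1", port), LookupHandler)
```

Calling `make_server().serve_forever()` keeps the process (and therefore the dictionary) alive between requests, so a client such as Excel/VBA would only need to issue an HTTP GET to `/lookup?term=...`.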
First of all, thank you for developing this tool. It is amazingly useful and fast!
I have been using the command line version since my programming skills are limited (my skills being in VBA, SQL and some Java). I do have Visual Studio 2017 installed, so perhaps that could help if I need to modify the project on my end.
So my hope would be that I could find a way to do the following:
1. Process the frequency dictionary once and save it for re-use. (I want to confirm that there is no way to have the processed dictionary loaded into a DB like MySQL. I assume this won't work because MySQL cannot create the proper indexes, correct?)
2. Use a command line switch to set the number of matches returned (in frequency order of course) for edit distance 1, edit distance 2, etc. So let's say I want the "top 5" and I set my max edit distance to 3; then I would get 15 results (assuming there are >=5 matches for each edit distance). As it is now I may get a few for edit distance 1, a lot for edit distance 2 and a massive list for edit distance 3. I have been attempting to clean up names from the census which have transcription errors, and many times the correct name is the 1st or 2nd result in edit distance 2 or 3 (not edit distance 1). If I could get the top few matches from each edit distance, then I have a phonetic algorithm that narrows the results.
3. Ultimately I would really want to create an Excel function that could call upon the command line for matches, where the processed dictionary is already loaded into memory and that environment is accessible to VBA.
As a first step - items 1 and 2 are most important (saving the processed dictionary and setting the max # matches ordered by frequency)
Do you think this is possible? And could the dictionary ever be moved into a DB? Thanks for your help and for sharing this excellent tool.
As a side note, could this ever be successfully migrated to Node.js to create an API?