Question: Splitting runons #17
Comments
Yes, it should be possible to let analiticcl generate variants involving
whitespace. It simply entails having such bigrams *explicitly* in
your input lexicon (entries need not be limited to single words).
There's also a possibility if you use search mode, where you can load a
language model, though I'm not entirely sure how that would play out in such
cases. It might still need an expanded lexicon.
There may be room for improvement in this area.
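To illustrate the explicit-bigram suggestion, here is a minimal Python sketch that builds an expanded lexicon from tokenised text. It assumes the lexicon is a tab-separated file of entry and frequency; the helper names `build_lexicon_with_bigrams` and `to_tsv` and the sample text are hypothetical and not part of analiticcl.

```python
from collections import Counter

def build_lexicon_with_bigrams(tokens, min_bigram_count=1):
    """Count unigrams plus space-joined bigrams, so that an entry such as
    'holy grail' exists as a single lexicon item."""
    counts = Counter(tokens)
    bigrams = Counter(f"{a} {b}" for a, b in zip(tokens, tokens[1:]))
    for bigram, n in bigrams.items():
        if n >= min_bigram_count:
            counts[bigram] = n
    return counts

def to_tsv(counts):
    """Serialise as 'entry<TAB>frequency' lines (assumed lexicon layout)."""
    lines = [f"{entry}\t{n}" for entry, n in sorted(counts.items())]
    return "\n".join(lines) + "\n"

# Hypothetical sample text; in practice this would be a clean reference corpus.
tokens = "the holy grail is the holy grail".split()
lexicon = build_lexicon_with_bigrams(tokens)
print(to_tsv(lexicon))
```

Raising `min_bigram_count` keeps the lexicon from growing with rare word pairs. Whether the space character also has to be listed in the alphabet file is not confirmed here.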
|
Thanks!
Please don't hesitate to suggest meaningful parameter usage for my case. My primary problem is that analiticcl does not generate anagrams from the alphabet file and lexicon I sent you (historical Slavonic) so I cannot seem to use it in any mode :-( |
Short update: I also tried to run analiticcl in a Colab notebook.
And no matter what word I query with
The files are UTF-8. By now I have tested installations via both cargo and pip. |
Can you share the notebook? (along with all input files). Then I can check if I can see what's happening. |
Thanks very much! I have sent an invitation to your email address. |
Got it, something's going wrong with the anagram computation based on the alphabet file. I'm investigating... |
…tibyte characters #17 Also added a 'testinput' mode and made alphabet debugging more verbose
There was a serious bug in the multibyte handling that came to light thanks to your example. I'm doing a new analiticcl release tonight (v0.4.5) that will fix this. |
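As a rough illustration of why multibyte characters are easy to get wrong (this is not analiticcl's actual code, only a sketch of the general pitfall), the Python snippet below contrasts counting characters with counting UTF-8 bytes for a short Cyrillic token.

```python
# 'ихъ' written with explicit Cyrillic code points to avoid look-alike characters
token = "\u0438\u0445\u044a"

chars = list(token)          # Unicode characters
raw = token.encode("utf-8")  # UTF-8 bytes; each of these letters takes two bytes

print(len(chars), len(raw))

# Cutting the byte sequence at an arbitrary offset can split a character in half.
try:
    raw[:1].decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False
print(split_ok)
```

Any code that indexes such text by byte offset rather than by character will misread every letter outside the ASCII range.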
Released now! (both on crates.io and pypi) |
Example output from your test in the new situation:
|
I also added a
(the highest number in the array (37) corresponds to an unknown character, all the non-cyrillic ones in this case). |
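The index array can be pictured with the following sketch. It assumes an alphabet file in which each line lists one or more equivalent characters separated by tabs, and that any character not listed receives the index one past the last entry. The three-entry alphabet and the helper names are made up for illustration; this is not analiticcl's implementation.

```python
def load_alphabet(text):
    """Map every listed character to the index of its line.
    Returns the mapping and the index reserved for unknown characters."""
    index = {}
    lines = [line for line in text.splitlines() if line.strip()]
    for i, line in enumerate(lines):
        for symbol in line.split("\t"):
            index[symbol] = i
    return index, len(lines)

def char_indices(word, index, unknown):
    """Translate a word into alphabet indices, using `unknown` for unlisted characters."""
    return [index.get(ch, unknown) for ch in word]

# Hypothetical alphabet: Cyrillic a, v, n with their capitals as equivalents
alphabet_text = "\u0430\t\u0410\n\u0432\t\u0412\n\u043d\t\u041d\n"
index, unknown = load_alphabet(alphabet_text)

# Cyrillic 'ван' followed by a Latin 'i', which is not in the alphabet
result = char_indices("\u0432\u0430\u043di", index, unknown)
print(result)
```

With a three-entry alphabet the unknown index is 3, in the same way that 37 marks unknown characters in the output above.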
Fantastic, thank you so much! I'm excited to test it asap! |
Awesome, both the module from pypi and the CLI version now work fine! I am going to explore the different modes. |
Not sure if this is of interest, but if I run the same CLI command with UNKNOWN: ꙗванi 111299573 [4, 23, 35, 28, 3] Also, if I copy-paste some tokens from the lexicon file opened in my VS Code editor into VS Code Terminal CLI, I get surprises: ихъ But 'и' is in the alphabet file, it can be searched for and is found. |
Hi, I wonder if there is a way to have analiticcl generate variants that involve whitespace, i.e. in the case of run-on errors, suggesting the split form.
Suppose 'holygrail' is actually a run-on error after OCR; I would then like to get 'holy grail' back as a suggestion.
Is there a way to do it?
The other way round it works, i.e. for erroneous splits the concatenated forms are retrieved, e.g.