English language support #1

mthebaud · 2020-04-17T11:26:59Z

English should be the next language implemented in Predict4All.
Implementing English support is mostly a matter of data and a few small implementations, since its structure is similar to French. The one specific case that may matter in English is the apostrophe, which might need some tweaks to be handled well: most of the "may have to" items in the list below stem from this point.

A good start would be to create org.predict4all.nlp.language.english from org.predict4all.nlp.language.french.

Keep in mind that any language-specific code should sit behind interfaces: if something previously implemented for French should behave differently in English, add something related to the LanguageModel. Never use if(language instanceof FrenchLanguageModel) ;-)
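To illustrate the point about interfaces, here is a minimal sketch. All names in it (SketchLanguageModel, wordInternalChars, and so on) are hypothetical stand-ins and are not the actual Predict4All API; the assumption that French splits on the apostrophe while English keeps it inside a word is also only for illustration.

```java
import java.util.Set;

// Hypothetical stand-in for the project's LanguageModel interface;
// the method below is an assumption made for illustration only.
interface SketchLanguageModel {
    // Non-letter characters that may appear inside a word without splitting it.
    Set<Character> wordInternalChars();
}

final class SketchFrenchModel implements SketchLanguageModel {
    @Override
    public Set<Character> wordInternalChars() {
        return Set.of('-');
    }
}

final class SketchEnglishModel implements SketchLanguageModel {
    @Override
    public Set<Character> wordInternalChars() {
        // Assumption: English keeps the apostrophe inside words such as "don't".
        return Set.of('-', '\'');
    }
}

final class WordCharCheck {
    // Shared code asks the model instead of testing its concrete type.
    static boolean isWordChar(SketchLanguageModel language, char c) {
        return Character.isLetter(c) || language.wordInternalChars().contains(c);
    }
}
```

With this shape, adding a language means adding one implementation, and the shared code never needs an instanceof check.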

These are the steps to implement English prediction:

Find an open English dictionary with unigrams to replace the French Lexique.org
Create a clean English corpus (Wikipedia + subtitles or another language corpus); it should reach 20 million words
Implement specific TokenMatchers (if needed; the list has to be determined, as most of the French token matchers are directly correct for English)
Create unit tests for English
You may then have to modify (depending on whether the problems are English-specific):
- Tokenizer: if the apostrophe case should be handled differently in English
- WordPrefixDetector: again, the apostrophe could cause problems
- WordPredictor
(optional) Find an English stop-word dictionary (not used right now because it is only useful with semantics)
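The apostrophe concern for the Tokenizer and WordPrefixDetector can be made concrete with a small sketch. This is not the Predict4All Tokenizer: ApostropheTokenizerSketch and its keepInnerApostrophe flag are hypothetical, the split rule is deliberately naive, and only the ASCII apostrophe is handled.

```java
import java.util.ArrayList;
import java.util.List;

final class ApostropheTokenizerSketch {
    // Splits on anything that is not a letter. When keepInnerApostrophe is true,
    // an apostrophe standing between two letters stays inside the word.
    static List<String> tokenize(String text, boolean keepInnerApostrophe) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean innerApostrophe = keepInnerApostrophe && c == '\''
                    && i > 0 && i + 1 < text.length()
                    && Character.isLetter(text.charAt(i - 1))
                    && Character.isLetter(text.charAt(i + 1));
            if (Character.isLetter(c) || innerApostrophe) {
                current.append(c);
            } else if (current.length() > 0) {
                // Any other character closes the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

With keepInnerApostrophe set to true, "I don't know" yields the tokens I, don't, know; with false it yields I, don, t, know, which is the kind of split that would hurt English prediction and prefix detection.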

These are the steps to implement English correction rules:

Transfer the French rules that can be used directly (e.g. space, azerty, etc.)
Find other rules to implement (the link with OT/ST is essential at this step!)
Verify WordCorrectionGenerator: some specific parts of the algorithm may not be fully compatible with English
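As a rough illustration of how a keyboard-based rule might be shared between languages, the sketch below generates correction candidates by swapping each letter with its left or right neighbour on the same keyboard row, with the layout passed in as data. How the existing azerty rule is actually implemented in Predict4All is not described here, so KeyboardRuleSketch and its method are purely hypothetical, and only same-row neighbours are considered.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class KeyboardRuleSketch {
    // Letter rows of the two layouts; only same-row neighbours are modelled.
    static final String[] QWERTY_ROWS = {"qwertyuiop", "asdfghjkl", "zxcvbnm"};
    static final String[] AZERTY_ROWS = {"azertyuiop", "qsdfghjklm", "wxcvbn"};

    // Returns every word obtained by replacing one letter with a same-row neighbour.
    static Set<String> neighbourSubstitutions(String word, String[] rows) {
        Set<String> candidates = new LinkedHashSet<>();
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            for (String row : rows) {
                int pos = row.indexOf(c);
                if (pos < 0) {
                    continue; // this letter is not on this row
                }
                if (pos > 0) {
                    candidates.add(replaceAt(word, i, row.charAt(pos - 1)));
                }
                if (pos + 1 < row.length()) {
                    candidates.add(replaceAt(word, i, row.charAt(pos + 1)));
                }
            }
        }
        return candidates;
    }

    private static String replaceAt(String word, int index, char replacement) {
        return word.substring(0, index) + replacement + word.substring(index + 1);
    }
}
```

Calling neighbourSubstitutions("tge", KeyboardRuleSketch.QWERTY_ROWS) produces a set that includes "the"; a real rule would still need to score such candidates against the dictionary.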

mthebaud · 2020-05-26T09:26:18Z

As suggested by JYA: this description is correct for prediction only!
Adapting a correction model could be more complex.
A good resource for models: universaldependencies.org

mthebaud added the enhancement (New feature or request) label Apr 17, 2020

mthebaud pinned this issue Apr 17, 2020
