English should be the next language to be implemented in Predict4All.
Implementing English support is only a matter of data and small implementations, since its structure is similar to French. The one English-specific case that may matter is the apostrophe, which might need some tweaks to be handled well: most of the "may have to" items in the list below are driven by this point.
A good start would be to create org.predict4all.nlp.language.english from org.predict4all.nlp.language.french.
Keep in mind that any language-specific code should be created behind interfaces: if something previously implemented for French should behave differently in English, add something related to the LanguageModel. Never use if(language instanceof FrenchLanguageModel) ;-)
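To illustrate the point about interfaces, here is a minimal sketch of how a language-specific decision could be delegated instead of tested with instanceof. The names LanguageRules, FrenchRules, EnglishRules and ApostropheSplitter are hypothetical and are not part of Predict4All; the real LanguageModel may expose something quite different. The French behavior is also simplified (it always splits on the first apostrophe).

```java
import java.util.List;

// Hypothetical interface, for illustration only: not the Predict4All LanguageModel API.
interface LanguageRules {
    /** Returns true when the apostrophe at the given index separates two tokens. */
    boolean splitsOnApostrophe(String word, int apostropheIndex);
}

final class FrenchRules implements LanguageRules {
    @Override
    public boolean splitsOnApostrophe(String word, int apostropheIndex) {
        // Simplified French elision: "l'arbre" is read as "l'" + "arbre"
        return true;
    }
}

final class EnglishRules implements LanguageRules {
    @Override
    public boolean splitsOnApostrophe(String word, int apostropheIndex) {
        // In this sketch, English contractions such as "don't" stay a single word
        return false;
    }
}

final class ApostropheSplitter {
    /** Shared code asks the rules object; it never checks the concrete language class. */
    static List<String> split(String word, LanguageRules rules) {
        int idx = word.indexOf('\'');
        if (idx < 0 || !rules.splitsOnApostrophe(word, idx)) {
            return List.of(word);
        }
        return List.of(word.substring(0, idx + 1), word.substring(idx + 1));
    }
}
```

With this shape, adding another language means adding another implementation, and the shared code stays untouched.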
These are the steps to implement English prediction:
Find an open English dictionary with unigrams to replace the French Lexique.org
Create a clean English corpus (Wikipedia + subtitles or another language corpus); 20 million words should be reached
Implement specific TokenMatcher classes (if needed; the list has to be determined, as most of the French token matchers are directly correct for English)
Create unit tests for English
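As a rough idea of what "unigram" data means for the dictionary and corpus steps above, the sketch below counts word frequencies in a piece of text. UnigramCounter is a hypothetical helper, not the Predict4All training pipeline, and the tokenization rule (letters and apostrophes form words) is an assumption made for the example.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, for illustration only.
final class UnigramCounter {
    /** Counts lowercase word occurrences; apostrophes are kept inside words. */
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        // Split on any run of characters that are neither letters nor apostrophes
        for (String token : text.toLowerCase(Locale.ENGLISH).split("[^\\p{L}']+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty first element
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }
}
```

For example, UnigramCounter.count("The cat saw the dog. Don't run!") maps "the" to 2 and keeps "don't" as a single entry.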
You may then have to modify the following (depending on whether English-specific problems arise):
Tokenizer: if the apostrophe case should be handled differently in English
WordPrefixDetector: again, the apostrophe could cause problems
WordPredictor
(optional) Find an English stop-word dictionary (not used right now because it's only useful with semantics)
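The apostrophe concern for prefix detection can be sketched as follows: the prefix of the word being typed changes depending on whether the apostrophe counts as a word character. PrefixSketch and its apostropheIsWordChar flag are hypothetical and do not reflect how WordPrefixDetector is actually written.

```java
// Hypothetical helper, for illustration only: not the Predict4All WordPrefixDetector.
final class PrefixSketch {
    /**
     * Returns the trailing run of word characters in the typed text.
     * apostropheIsWordChar is an assumed per-language switch.
     */
    static String currentPrefix(String typed, boolean apostropheIsWordChar) {
        int start = typed.length();
        while (start > 0) {
            char c = typed.charAt(start - 1);
            boolean wordChar = Character.isLetter(c) || (apostropheIsWordChar && c == '\'');
            if (!wordChar) {
                break;
            }
            start--;
        }
        return typed.substring(start);
    }
}
```

With the flag off, currentPrefix("I don't", false) returns "t", so a predictor would try to complete "t" rather than "don't"; with the flag on, it returns "don't".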
These are the steps to implement English correction rules:
Transfer the French rules that can be used directly (e.g. space, azerty, etc.)
Find other rules to implement (the link with OT/ST is essential at this step!)
Verify WordCorrectionGenerator: some specific parts of the algorithm may not be fully compatible with English
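As an example of a rule that would need adapting rather than copying, the sketch below generates correction candidates from adjacent-key substitutions on a QWERTY layout. It assumes that the "azerty" rule mentioned above models neighboring-key confusions, which is not confirmed here; KeyboardNeighbors is a hypothetical class, and only same-row neighbors are considered.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only: not a Predict4All correction rule.
final class KeyboardNeighbors {
    private static final String[] QWERTY_ROWS = {"qwertyuiop", "asdfghjkl", "zxcvbnm"};

    /** Returns the keys directly left and right of c on its row. */
    static List<Character> neighbors(char c) {
        List<Character> result = new ArrayList<>();
        for (String row : QWERTY_ROWS) {
            int i = row.indexOf(c);
            if (i < 0) {
                continue;
            }
            if (i > 0) {
                result.add(row.charAt(i - 1));
            }
            if (i < row.length() - 1) {
                result.add(row.charAt(i + 1));
            }
        }
        return result;
    }

    /** Generates every word obtained by replacing one letter with a neighboring key. */
    static List<String> candidates(String word) {
        List<String> out = new ArrayList<>();
        char[] chars = word.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            char original = chars[i];
            for (char n : neighbors(original)) {
                chars[i] = n;
                out.add(new String(chars));
            }
            chars[i] = original; // restore before moving to the next position
        }
        return out;
    }
}
```

In a real rule set the candidates would then be filtered against the dictionary; that step is left out here.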
As suggested by JYA: this description is correct for prediction only!
Adapting a correction model could be more complex.
A good resource for models: universaldependencies.org