English language support #1

Open
mthebaud opened this issue Apr 17, 2020 · 1 comment
Labels
enhancement New feature or request

Comments

@mthebaud
Owner

English should be the next language to be implemented in Predict4All.
Implementing English support is mostly a matter of data and small implementations, as its structure is similar to French. The only English-specific case that may matter is the apostrophe, which might need some tweaks to be handled well: most of the "may have to" items in the following list are driven by this point.

A good start would be to create org.predict4all.nlp.language.english from org.predict4all.nlp.language.french.

Keep in mind that any language-specific code should live behind interfaces: if something previously implemented for French should behave differently in English, add something related to the LanguageModel. Never use if(language instanceof FrenchLanguageModel) ;-)
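
To illustrate the point about interfaces, here is a minimal sketch. All names in it (ApostrophePolicy, FrenchApostrophePolicy, EnglishApostrophePolicy, WordCounter) are hypothetical and are not part of Predict4All; in the real code the hook would be reachable from the LanguageModel. The French rule is also deliberately simplified.

```java
// Hypothetical names throughout: this is not Predict4All's actual API.
interface ApostrophePolicy {
    /** True when an apostrophe inside a chunk separates two words. */
    boolean apostropheSplitsWords();
}

final class FrenchApostrophePolicy implements ApostrophePolicy {
    @Override
    public boolean apostropheSplitsWords() {
        return true; // simplified: "l'arbre" is treated as "l'" + "arbre"
    }
}

final class EnglishApostrophePolicy implements ApostrophePolicy {
    @Override
    public boolean apostropheSplitsWords() {
        return false; // "don't" stays a single word
    }
}

final class WordCounter {
    /** Counts words without ever checking which concrete language is in use. */
    static int countWords(String text, ApostrophePolicy policy) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        int count = 0;
        for (String chunk : trimmed.split("\\s+")) {
            count++;
            int index = chunk.indexOf('\'');
            boolean innerApostrophe = index > 0 && index < chunk.length() - 1;
            if (innerApostrophe && policy.apostropheSplitsWords()) {
                count++; // the apostrophe separates two words in this language
            }
        }
        return count;
    }
}
```

The shared code only asks the policy a question, so adding a third language means adding one more implementation rather than one more instanceof branch.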

These are the steps to implement English prediction:

  • Find an open English dictionary with unigram data to replace the French Lexique.org
  • Create a clean English corpus (Wikipedia, plus a subtitle or other language corpus); it should reach 20 million words
  • Implement specific TokenMatcher instances (if needed; the list still has to be determined, as most of the French token matchers are directly correct for English)
  • Create unit tests for English
  • You may then have to modify the following (depending on whether English-specific problems arise)
    • Tokenizer: if the apostrophe case should be handled differently in English
    • WordPrefixDetector: again, the apostrophe could cause problems
    • WordPredictor
  • (optional) Find an English stop-word dictionary (not used right now because it is only useful with semantics)
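
As a rough illustration of the corpus and unigram steps, the sketch below counts unigrams in English text while keeping word-internal apostrophes ("don't", "John's") inside a single word. It is standalone: EnglishUnigramCounter is a hypothetical name, it does not use the Predict4All Tokenizer, and the word pattern is an assumption about what should count as an English word.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Predict4All.
final class EnglishUnigramCounter {
    // Letters, optionally followed by apostrophe + letters (straight or typographic apostrophe).
    private static final Pattern WORD = Pattern.compile("\\p{L}+(?:['\\u2019]\\p{L}+)*");

    /** Returns lowercase unigram counts, in order of first appearance. */
    static Map<String, Integer> count(String corpus) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher matcher = WORD.matcher(corpus);
        while (matcher.find()) {
            // Normalize the typographic apostrophe so "don’t" and "don't" share one entry.
            String word = matcher.group()
                    .replace('\u2019', '\'')
                    .toLowerCase(Locale.ENGLISH);
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }
}
```

Calling EnglishUnigramCounter.count("Don't stop. I don't know John's dog") keeps "don't" and "john's" as single entries, which is the behaviour the Tokenizer and WordPrefixDetector items above would need to be checked against.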

These are the steps to implement English correction rules:

  • Transfer the French rules that can be used directly (e.g. space, azerty, etc.)
  • Find other rules to implement (the link with OT/ST is essential at this step!)
  • Verify WordCorrectionGenerator: some specific parts of the algorithm may not be fully compatible with English
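
To show why a rule such as the "space" one can plausibly be transferred as-is, here is a minimal sketch of a missing-space rule. MissingSpaceRule is a hypothetical name and this is not how Predict4All implements its correction rules; the sketch only assumes a set of known words, with nothing language-specific in the logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration, not Predict4All's correction rule API.
final class MissingSpaceRule {
    /** Proposes "left right" candidates when both halves of the token are known words. */
    static List<String> candidates(String token, Set<String> dictionary) {
        List<String> result = new ArrayList<>();
        // Try every split position that leaves at least one character on each side.
        for (int i = 1; i < token.length(); i++) {
            String left = token.substring(0, i);
            String right = token.substring(i);
            if (dictionary.contains(left) && dictionary.contains(right)) {
                result.add(left + " " + right);
            }
        }
        return result;
    }
}
```

Only the dictionary changes between languages here, which is the kind of rule that should move over directly; rules that encode keyboard layout or spelling habits are the ones to review.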
@mthebaud mthebaud added the enhancement New feature or request label Apr 17, 2020
@mthebaud mthebaud pinned this issue Apr 17, 2020
@mthebaud
Owner Author

As suggested by JYA: this description is correct for prediction only!
Adapting a correction model could be more complex.
A good resource for models: universaldependencies.org
