Feature: "Language Packs" #23

Open
subins2000 opened this issue Sep 26, 2020 · 1 comment

Comments

@subins2000
Member

subins2000 commented Sep 26, 2020

varnamd currently has a sync feature. I tried it and it works, but:

  1. The downloaded word files are sorted by ID rather than by confidence, so a lot of unwanted words come through, most of them with 0 confidence.
  2. The files are created on the fly from the learnings on the server.

An alternative way for users to easily get words would be for varnamd to provide "language packs".

Language Pack

A statistically curated learning file is made for each language. This file is made once and placed on the server; it is called a "Language Pack". There can be multiple language packs for the same language, differing in where their words are sourced from.

These packs will be mutually exclusive, that is, words in one pack won't appear in the others. Tools to do this are here: https://gitlab.com/smc/corpus/-/tree/master/tools
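The mutual-exclusivity rule could be sketched roughly like this. This is not the code from the linked corpus tools; the function name `make_exclusive`, the idea of processing packs in priority order, and the dict representation are all assumptions made for illustration.

```python
def make_exclusive(packs):
    """Remove from each pack any word already claimed by an earlier pack.

    `packs` is a list of (pack_name, {word: confidence}) tuples in priority
    order (hypothetical convention: earlier packs keep contested words).
    """
    seen = set()
    result = []
    for name, words in packs:
        # Keep only words that no earlier pack contains.
        unique = {w: c for w, c in words.items() if w not in seen}
        seen.update(unique)
        result.append((name, unique))
    return result


packs = [
    ("ml-basic", {"a": 5, "b": 3}),
    ("ml-twitter", {"b": 9, "c": 1}),
]
exclusive = make_exclusive(packs)
```

Here the word `b` stays in the first pack and is dropped from the second, so no word appears in two packs.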

The words in the files will be sorted by confidence. Sample :

ഒരു 1623
മുഖ്യമന്ത്രി 1448
ഈ 1186
സർക്കാർ 769
പറഞ്ഞു 564
എന്ന 530
കോടി 483
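As a rough sketch of how a file in this format might be read and ordered: the helper names `parse_pack` and `sort_by_confidence`, and the choice to drop words below a minimum confidence, are assumptions for illustration and not part of varnamd.

```python
def parse_pack(text):
    """Parse lines of the form '<word> <confidence>' into (word, confidence) pairs."""
    entries = []
    for line in text.splitlines():
        parts = line.strip().rsplit(None, 1)  # split on the last run of whitespace
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, raw_confidence = parts
        try:
            entries.append((word, int(raw_confidence)))
        except ValueError:
            continue  # skip lines whose confidence is not an integer
    return entries


def sort_by_confidence(entries, min_confidence=1):
    """Drop low-confidence words, then order the rest from highest to lowest confidence."""
    kept = [entry for entry in entries if entry[1] >= min_confidence]
    return sorted(kept, key=lambda entry: entry[1], reverse=True)


sample = "ഈ 1186\nഒരു 1623\nകോടി 483\n"
ordered = sort_by_confidence(parse_pack(sample))
```

Sorting at pack-creation time means a client can stop reading early and still get the most useful words first.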

Language packs are versioned: each pack will have versions. Subsequent versions will also be mutually exclusive, containing only the newest words. A new user will have to download every version to be up to date (better still if there's a special URL that combines them into one download). This is somewhat like Windows updates.

  • ML-basic (10MB)
    • v0.1 (10MB)
    • v0.2 (some new words, 20 KB)
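The update flow described above might look something like the following sketch. The function names and the representation of versions as plain strings are assumptions; nothing here reflects an existing varnamd API.

```python
def versions_to_download(available, installed):
    """Return the versions the user is missing, in the order the server lists them."""
    return [v for v in available if v not in installed]


def combine_versions(version_words):
    """Merge the word lists of several versions into one {word: confidence} dict.

    `version_words` is a list of {word: confidence} dicts, oldest first.
    Versions are meant to be mutually exclusive, so a repeated word is
    treated as an error rather than silently overwritten.
    """
    combined = {}
    for words in version_words:
        for word, confidence in words.items():
            if word in combined:
                raise ValueError(f"word appears in more than one version: {word}")
            combined[word] = confidence
    return combined


missing = versions_to_download(["v0.1", "v0.2"], installed={"v0.1"})
merged = combine_versions([{"a": 10, "b": 7}, {"c": 2}])
```

A "combined" URL on the server could serve the result of something like `combine_versions` so that a new user needs only one download.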

Deletions of words from packs shouldn't be versioned; instead, the words will be removed from all versions of the pack.
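A minimal sketch of that deletion rule, assuming each pack is held as a mapping from version to its words (the name `apply_deletions` and this data shape are hypothetical):

```python
def apply_deletions(versions, deleted_words):
    """Return a copy of `versions` with the deleted words removed from every version.

    `versions` maps a version label to a {word: confidence} dict.
    """
    deleted = set(deleted_words)
    return {
        version: {w: c for w, c in words.items() if w not in deleted}
        for version, words in versions.items()
    }


pack = {"v0.1": {"a": 10, "typo": 1}, "v0.2": {"b": 4}}
cleaned = apply_deletions(pack, ["typo"])
```

No new version is created; the existing version files are simply rewritten without the deleted words.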

varnamd on the server will provide language packs for users to download. varnamd should also have a function to import them, just like how sync works currently. See #22

With this feature, users can easily download and import packs and stay up to date. With Varnam Desktop coming, it'll be the easiest way. Plus, when Varnam comes to Indic Keyboard, it'll also be an easy way to import words. Mockup screenshots are in #22

cc @athul @joicemjoseph

@subins2000
Member Author

subins2000 commented Sep 28, 2020

Here's a model of how it looks :

image

Array of JSON objects:

[
  {
    "identifier": "ml-basic",
    "name": "Malayalam Basic",
    "description": "Collection of basic Malayalam words",
    "lang": "ml",
    "versions": [
      {
        "identifier": "ml-basic-1",
        "version": "1",
        "description": "Most common words found across many sources",
        "size": 10
      },
      {
        "identifier": "ml-basic-2",
        "version": "2",
        "description": "Some new-gen words from 2020",
        "size": 1
      }
    ]
  },
  {
    "identifier": "ml-twitter",
    "name": "Malayalam Twitter",
    "description": "Collection of words sourced from Twitter",
    "lang": "ml",
    "versions": [
      {
        "identifier": "ml-twitter-1",
        "version": "1",
        "description": "Most common words found across many sources",
        "size": 10
      }
    ]
  },
  {
    "identifier": "ml-english",
    "name": "English Words in Malayalam",
    "description": "Collection of English words written in Malayalam. E.g. KSEB, Facebook",
    "lang": "ml",
    "versions": [
      {
        "identifier": "ml-english-1",
        "version": "1",
        "description": "Basic words like \"try\", \"last\", \"first\" and many more sourced from social media.",
        "size": 10
      }
    ]
  }
]
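A client could use a listing like this to decide what to fetch. The sketch below assumes the field names from the model above; the helper names are hypothetical, and since the unit of `size` isn't stated, the code just sums the raw values.

```python
# Trimmed copy of the listing above, keeping only the fields the helpers use.
PACKS = [
    {
        "identifier": "ml-basic",
        "lang": "ml",
        "versions": [
            {"identifier": "ml-basic-1", "version": "1", "size": 10},
            {"identifier": "ml-basic-2", "version": "2", "size": 1},
        ],
    },
    {
        "identifier": "ml-twitter",
        "lang": "ml",
        "versions": [
            {"identifier": "ml-twitter-1", "version": "1", "size": 10},
        ],
    },
]


def packs_for_language(packs, lang):
    """Return the packs whose `lang` field matches."""
    return [pack for pack in packs if pack["lang"] == lang]


def pending_versions(pack, installed_identifiers):
    """Return the versions of `pack` that the user has not imported yet."""
    return [v for v in pack["versions"] if v["identifier"] not in installed_identifiers]


def pending_size(pack, installed_identifiers):
    """Sum the `size` values of the versions still to be downloaded."""
    return sum(v["size"] for v in pending_versions(pack, installed_identifiers))


basic = packs_for_language(PACKS, "ml")[0]
remaining = pending_size(basic, {"ml-basic-1"})
```

With `ml-basic-1` already imported, only `ml-basic-2` remains, so `remaining` is the `size` of that one version.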
