Feature: "Language Packs" #23

subins2000 · 2020-09-26T22:37:10Z

varnamd currently has a sync feature. I tried it and it works, but:

- The downloaded word files are sorted by ID rather than by confidence, so a lot of unwanted words come through, most of them with 0 confidence.
- The files are created on the fly from the learnings on the server.

An alternative that lets users get words easily would be for varnamd to provide "language packs".

Language Pack

A statistically curated learning file is made for each language. The file is made once and placed on the server; it is called a "Language Pack". There can be multiple language packs for the same language, distinguished by where the words are sourced from:

ML-basic (10MB)
ML-twitter (20MB)
ML-wikipedia (20MB)
ML-science (20MB) words from luca.co.in https://github.com/joicemjoseph/luca-science-dictionary-scraper

These packs will be mutually exclusive, that is, words in one pack won't appear in the others. Tools to do this are here: https://gitlab.com/smc/corpus/-/tree/master/tools
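To illustrate what "mutually exclusive" means here, a minimal sketch follows. It is not the tooling from the SMC corpus repository linked above; the function name `make_exclusive` and the rule that an earlier pack keeps a contested word are assumptions made for the example.

```python
def make_exclusive(packs):
    """Drop from each pack any word already claimed by an earlier pack.

    `packs` maps a pack name to a {word: confidence} dict. Dict order
    decides priority (an assumption): the first pack keeps a contested word.
    """
    seen = set()
    result = {}
    for name, words in packs.items():
        result[name] = {w: c for w, c in words.items() if w not in seen}
        seen.update(words)
    return result


# Hypothetical data: "ഈ" appears in both packs.
packs = {
    "ml-basic": {"ഒരു": 1623, "ഈ": 1186},
    "ml-twitter": {"ഈ": 40, "കോടി": 483},
}
exclusive = make_exclusive(packs)
```

After this step `ml-twitter` no longer contains "ഈ", because `ml-basic` was listed first.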

The words in the files will be sorted by confidence. Sample :

ഒരു 1623
മുഖ്യമന്ത്രി 1448
ഈ 1186
സർക്കാർ 769
പറഞ്ഞു 564
എന്ന 530
കോടി 483
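A pack file in this shape could be read and ordered as sketched below. The whitespace-separated `word confidence` layout is inferred from the sample only, and `parse_pack` and `sort_by_confidence` are hypothetical helper names, not varnamd functions.

```python
def parse_pack(text):
    """Parse 'word confidence' lines into (word, confidence) pairs."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last run of whitespace so the number is always last.
        word, confidence = line.rsplit(None, 1)
        entries.append((word, int(confidence)))
    return entries


def sort_by_confidence(entries):
    """Return entries ordered from highest to lowest confidence."""
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


# Deliberately unsorted input built from words in the sample.
sample = "എന്ന 530\nഒരു 1623\nസർക്കാർ 769\n"
ordered = sort_by_confidence(parse_pack(sample))
```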

Language packs are versioned: each pack will have versions. Subsequent versions will also be mutually exclusive, containing only the newest words. A new user will have to download every version to be up to date (better if there's a special URL that combines them and serves the result). This will be somewhat like Windows updates.

ML-basic (10MB)
- v0.1 (10MB)
- v0.2 (some new words, 20KB)
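The "combine them and provide" idea could look roughly like the sketch below, whether it runs on the server behind a special URL or on the client after downloading each version. The name `combine_versions` is hypothetical, and letting a later version win on a duplicate word is an assumption; with mutually exclusive versions no duplicates should occur.

```python
def combine_versions(versions):
    """Merge version word maps, oldest first, into one {word: confidence} map.

    If a word unexpectedly appears in more than one version, the later
    version's confidence wins (an assumption, not specified in the proposal).
    """
    combined = {}
    for words in versions:
        combined.update(words)
    return combined


# Hypothetical contents for two versions of ML-basic.
v0_1 = {"ഒരു": 1623, "ഈ": 1186}
v0_2 = {"കോടി": 483}
full_pack = combine_versions([v0_1, v0_2])
```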

Deletions of words from packs shouldn't be versioned; instead, the words will be removed from all versions of the pack.
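As a rough sketch of that deletion rule, assuming each version is held as a `{word: confidence}` map keyed by version name (the data layout and the name `remove_words` are both assumptions):

```python
def remove_words(versions, deleted):
    """Return copies of every version with the deleted words stripped out.

    `versions` maps a version name to a {word: confidence} dict.
    No new version is created; every existing version is rewritten.
    """
    deleted = set(deleted)
    return {
        version: {w: c for w, c in words.items() if w not in deleted}
        for version, words in versions.items()
    }


# Hypothetical pack with two versions.
ml_basic = {
    "v0.1": {"ഒരു": 1623, "ഈ": 1186},
    "v0.2": {"കോടി": 483},
}
cleaned = remove_words(ml_basic, deleted=["ഈ"])
```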

varnamd on the server will provide language packs for users to download. varnamd should also have a function to import them, just like how sync works currently. See #22

With this feature, users can easily download, import and stay up to date. With Varnam Desktop coming, this will be the easiest way. Plus, when Varnam comes to Indic Keyboard, it'll also be an easy way to import words. Mockup screenshots are in #22.

cc @athul @joicemjoseph


subins2000 · 2020-09-28T11:28:48Z

Here's a model of how it looks, as an array of JSON objects:

[
  {
    "identifier": "ml-basic",
    "name": "Malayalam Basic",
    "description": "Collection of basic Malayalam words",
    "lang": "ml",
    "versions": [
      {
        "identifier": "ml-basic-1",
        "version": "1",
        "description": "Most common words found across many sources",
        "size": 10
      },
      {
        "identifier": "ml-basic-2",
        "version": "2",
        "description": "Some new-gen words from 2020",
        "size": 1
      }
    ]
  },
  {
    "identifier": "ml-twitter",
    "name": "Malayalam Twitter",
    "description": "Collection of words sourced from Twitter",
    "lang": "ml",
    "versions": [
      {
        "identifier": "ml-twitter-1",
        "version": "1",
        "description": "Most common words found across many sources",
        "size": 10
      }
    ]
  },
  {
    "identifier": "ml-english",
    "name": "English Words in Malayalam",
    "description": "Collection of English words written in Malayalam. Eg: KSEB, Facebook",
    "lang": "ml",
    "versions": [
      {
        "identifier": "ml-english-1",
        "version": "1",
        "description": "Basic words like \"try\", \"last\", \"first\" and many more sourced from social media.",
        "size": 10
      }
    ]
  }
]
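A client could use an index shaped like the model above to work out which versions of a pack it still needs. The sketch below is only an illustration: `pending_versions` is a hypothetical helper, the embedded index is a trimmed copy of the model, and treating `size` as a plain number without a stated unit follows the model as written.

```python
import json

# Trimmed, strictly valid JSON copy of the model (one pack, two versions).
INDEX_JSON = """
[
  {
    "identifier": "ml-basic",
    "lang": "ml",
    "versions": [
      {"identifier": "ml-basic-1", "version": "1", "size": 10},
      {"identifier": "ml-basic-2", "version": "2", "size": 1}
    ]
  }
]
"""


def pending_versions(index, pack_id, installed):
    """Return the version entries of `pack_id` that are not installed yet."""
    for pack in index:
        if pack["identifier"] == pack_id:
            return [v for v in pack["versions"] if v["identifier"] not in installed]
    raise KeyError(pack_id)


index = json.loads(INDEX_JSON)
# Hypothetical user who already imported the first version.
todo = pending_versions(index, "ml-basic", installed={"ml-basic-1"})
download_size = sum(v["size"] for v in todo)
```

Giving each version its own identifier, as the model does, keeps this check a simple set lookup on the client.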
