varnamd currently has a sync feature. I tried it and it works, but:
The downloaded word files are sorted by ID rather than by confidence, so a lot of unwanted words come through, most of them with 0 confidence.
The files are created on the fly from the learnings on the server.
An alternative that would make it easy for users to get words is for varnamd to provide "language packs".
Language Pack
A statistically curated learnings file is made for each language. The file is made once and placed on the server. This file is called the "Language Pack". There can be multiple language packs for the same language, distinguished by where their words are sourced from.
The words in each file will be sorted by confidence. Sample:
ഒരു 1623
മുഖ്യമന്ത്രി 1448
ഈ 1186
സർക്കാർ 769
പറഞ്ഞു 564
എന്ന 530
കോടി 483
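A minimal sketch of reading a pack file in this form, assuming each line holds a word and an integer confidence separated by whitespace (the real learnings file format may differ). The name `parse_pack` is hypothetical and not part of varnamd; dropping zero-confidence words is likewise an assumption, motivated by the sync problem described above.

```python
def parse_pack(text):
    """Parse 'word confidence' lines into (word, confidence) pairs,
    sorted by confidence, highest first. Zero-confidence words are dropped."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last run of whitespace: word on the left, number on the right
        word, confidence = line.rsplit(None, 1)
        confidence = int(confidence)
        if confidence > 0:  # assumption: words with 0 confidence are unwanted
            entries.append((word, confidence))
    # list.sort is stable, so words with equal confidence keep their file order
    entries.sort(key=lambda entry: entry[1], reverse=True)
    return entries


sample = "എന്ന 530\nഒരു 1623\nകോടി 0\nഈ 1186\n"
for word, confidence in parse_pack(sample):
    print(word, confidence)
```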
Language packs are versioned: each pack will have multiple versions. Subsequent versions will also be mutually exclusive, containing only the new words. A new user will have to download every version to be up to date (better still if there's a special URL that combines them into one download). This is similar to how Windows updates work.
ML-basic (10 MB)
    v0.1 (10 MB)
    v0.2 (some new words, 20 KB)
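The catch-up step for a new user can be sketched as follows, assuming each downloaded version has already been parsed into a word-to-confidence mapping. `combine_versions` is a hypothetical helper, not a varnamd function. Since versions are meant to be mutually exclusive, no word should appear twice; if one does, this sketch lets the newer version win.

```python
def combine_versions(versions):
    """Merge per-version word mappings, given oldest first, into one mapping."""
    combined = {}
    for words in versions:
        combined.update(words)  # later (newer) versions overwrite earlier ones
    return combined


v1 = {"ഒരു": 1623, "ഈ": 1186}
v2 = {"കോടി": 483}
print(combine_versions([v1, v2]))
```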
Deletions of words from packs shouldn't be versioned; instead, the words will be removed from all versions of the pack.
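A sketch of that deletion rule, assuming the versions of a pack are held as word-to-confidence mappings keyed by a version label. `delete_word` is a hypothetical name used only for illustration.

```python
def delete_word(pack_versions, word):
    """Remove a word from every version of a pack, in place.
    Returns the labels of the versions that contained it."""
    touched = []
    for label, words in pack_versions.items():
        # pop() with a default does nothing if the word is absent
        if words.pop(word, None) is not None:
            touched.append(label)
    return touched


ml_basic = {"v0.1": {"ഒരു": 1623, "ഈ": 1186}, "v0.2": {"കോടി": 483}}
print(delete_word(ml_basic, "ഈ"))
```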
varnamd on the server will provide language packs for users to download. varnamd should also have a function to import them, just like sync works currently. See #22
With this feature, users can easily download and import packs and stay up to date. With Varnam Desktop coming, this will be the easiest way to get words. And when Varnam comes to Indic Keyboard, it will also be an easy way to import words. Mockup screenshots are in #22.
[
  {
    identifier: 'ml-basic',
    name: 'Malayalam Basic',
    description: 'Collection of basic Malayalam words',
    lang: 'ml',
    versions: [
      {
        identifier: 'ml-basic-1',
        version: '1',
        description: 'Most common words found across many sources',
        size: 10
      },
      {
        identifier: 'ml-basic-2',
        version: '2',
        description: 'Some new-gen words from 2020',
        size: 1
      }
    ]
  },
  {
    identifier: 'ml-twitter',
    name: 'Malayalam Twitter',
    description: 'Collection of words sourced from Twitter',
    lang: 'ml',
    versions: [
      {
        identifier: 'ml-twitter-1',
        version: '1',
        description: 'Most common words found across many sources',
        size: 10
      }
    ]
  },
  {
    identifier: 'ml-english',
    name: 'English Words in Malayalam',
    description: 'Collection of english words written in Malayalam. Eg: KSEB, Facebook',
    lang: 'ml',
    versions: [
      {
        identifier: 'ml-english-1',
        version: '1',
        description: 'Basic words like "try", "last", "first" and many more sourced from social media.',
        size: 10
      }
    ]
  }
]
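Given pack metadata shaped like the listing above, a client could work out which versions it still needs to download. This is a sketch that assumes the fields shown (`identifier`, `versions`, `size`) and that the client keeps a set of the version identifiers it has already imported. `pending_versions` is a hypothetical helper, and the unit of `size` is not stated in the listing.

```python
def pending_versions(pack, installed):
    """Return the versions of `pack` not yet imported, in listed order,
    together with their combined size."""
    pending = [v for v in pack["versions"] if v["identifier"] not in installed]
    total_size = sum(v["size"] for v in pending)
    return pending, total_size


ml_basic = {
    "identifier": "ml-basic",
    "versions": [
        {"identifier": "ml-basic-1", "version": "1", "size": 10},
        {"identifier": "ml-basic-2", "version": "2", "size": 1},
    ],
}
pending, size = pending_versions(ml_basic, installed={"ml-basic-1"})
print([v["identifier"] for v in pending], size)
```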
These packs will be mutually exclusive, that is, words in one pack won't be in the others. Tools to do this are here: https://gitlab.com/smc/corpus/-/tree/master/tools
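The exclusivity rule can be illustrated with a small sketch. This is not the code in the linked tools; `make_exclusive` is a hypothetical helper that assumes packs are word-to-confidence mappings and that earlier packs in the list take priority over later ones.

```python
def make_exclusive(packs):
    """Drop from each pack any word already present in an earlier pack."""
    seen = set()
    result = []
    for words in packs:
        unique = {w: c for w, c in words.items() if w not in seen}
        seen.update(unique)  # adds this pack's remaining words to the seen set
        result.append(unique)
    return result


basic = {"ഒരു": 1623, "ഈ": 1186}
twitter = {"ഈ": 40, "കോടി": 483}
print(make_exclusive([basic, twitter]))
```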
cc @athul @joicemjoseph