How to rebuild existing language models? #51

KryxoLV · 2022-03-21T07:58:20Z

Hello, maybe someone can explain to me how to update the existing language models? I am writing a bachelor's thesis on botanical terms in Latvian, and I have already added a couple of thousand of these botanical terms to the .txt file, but what should I do next to teach the AI to tell them apart? I have added 1500 lines of text to "lv.txt" as well as "la.txt". How can I rebuild the language models?

pemistahl · 2022-03-21T09:35:13Z

Maintainer

Hi @KryxoLV, thanks for your request.
Can you please explain in more detail what you are trying to achieve? Perhaps I can help you then. What kind of document do you want to detect the language of? From what I understand so far, I don't think your approach will work.

If you want to detect the language of separate botanical terms, then it is not the right approach to add all the terms to be classified to some kind of dictionary. My library does not work like that. The library determines the language mainly by calculating statistics for the distribution of letter combinations (ngrams) in a text.
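To make the idea of letter-combination statistics concrete, here is a minimal sketch in Python. It is only a toy illustration of counting character n-grams, not the library's actual implementation; the function names `char_ngrams` and `ngram_frequencies` are hypothetical, and the two sample words are picked for illustration and do not come from this thread.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> list[str]:
    """Return all overlapping character n-grams of the lowercased text."""
    cleaned = text.lower()
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


def ngram_frequencies(text: str, n: int) -> dict[str, float]:
    """Return the relative frequency of each character n-gram in the text."""
    counts = Counter(char_ngrams(text, n))
    total = sum(counts.values())
    if total == 0:
        # Text shorter than n characters yields no n-grams.
        return {}
    return {gram: count / total for gram, count in counts.items()}


if __name__ == "__main__":
    # Hypothetical sample inputs, used only to show the shape of the output.
    for word in ("pienene", "taraxacum"):
        print(word, ngram_frequencies(word, 2))
```

A single short term produces only a handful of n-grams, which is one reason why isolated words tend to be harder to classify than full sentences.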

2 replies

KryxoLV Mar 23, 2022
Author

Hello @pemistahl.
Okay, I will try to go into more depth.
I have read that, in order to retrain the language model, there is a txt file which gets analyzed and from which the training data is created. My question is: if I add more botanically oriented terms to this .txt file, will that help the overall accuracy? I am doing research on Latvian botanical terms. The problem at the moment is that the Latin and Latvian terms are getting mixed up: sometimes Latvian terms are detected as Latin, and Latin terms as Latvian.
If the additions to the lv.txt file do help, then my question is how exactly I can force the algorithm to relearn the ngrams. Thanks for the earlier answer. Best regards, Kristers from Ventspils University of Applied Sciences.

pemistahl Mar 28, 2022
Maintainer

Hi Kristers, the bundled txt files are not training data but test data to measure language detection accuracy. So those won't help you. You can try to recreate the language models by using your botanical terms only and nothing else. As mentioned in the docs, perform the steps described in the file CONTRIBUTING.md to add new language models.
