CLTK models for Old Norse

Trained taggers for the CLTK.

POS tagging

Choice of the corpora

The training texts were already annotated by researchers. They were selected because they are written in Old Norse and date from the 12th to the 14th century, the Golden Age of Old Norse texts. They come from the Icelandic Parsed Historical Corpus (version 0.9, license: LGPL). The paths to the annotated texts are built as follows:

    >>> selected_files = ["1150.firstgrammar.sci-lin.tagged", "1150.homiliubok.rel-ser.tagged",
    ...                   "1210.jartein.rel-sag.tagged", "1210.thorlakur.rel-sag.tagged",
    ...                   "1250.sturlunga.nar-sag.tagged", "1250.thetubrot.nar-sag.tagged",
    ...                   "1260.jomsvikingar.nar-sag.tagged", "1270.gragas.law-law.tagged",
    ...                   "1275.morkin.nar-his.tagged", "1300.alexander.nar-sag.tagged",
    ...                   "1310.grettir.nar-sag.tagged", "1325.arni.nar-sag.tagged",
    ...                   "1350.bandamennM.nar-sag.tagged", "1350.finnbogi.nar-sag.tagged",
    ...                   "1350.marta.rel-sag.tagged"]
    >>> selected_data = ["icepahc-v0.9/tagged/" + selected_file for selected_file in selected_files]

Extraction of words and tags

    >>> words_tags = []
    >>> for filename in selected_data:
    ...     words_tags.extend(extract_word_and_tags(filename))

The function extract_word_and_tags takes a filename as input and returns the list of (word, tag) pairs for the whole text. The sentences were not segmented, so the POS tagger is trained without sentence boundaries, which is not fully correct. It still gives usable results.
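As an illustration only, a function of this kind could be sketched as below. The sketch assumes that each .tagged file holds whitespace-separated tokens of the form word/TAG. That format is an assumption and should be checked against the corpus files. The names parse_tagged_text and read_word_tag_pairs are hypothetical and are not the helper that was used to train the models.

```python
from typing import List, Tuple


def parse_tagged_text(text: str, sep: str = "/") -> List[Tuple[str, str]]:
    """Split text made of word/TAG tokens into (word, tag) pairs.

    Assumption: tokens are whitespace-separated and the tag follows the
    last separator. Tokens without a word, a separator or a tag are skipped.
    """
    pairs = []
    for token in text.split():
        # rpartition splits on the last separator, so a word that itself
        # contains the separator keeps everything before the final one.
        word, found, tag = token.rpartition(sep)
        if found and word and tag:
            pairs.append((word, tag))
    return pairs


def read_word_tag_pairs(filename: str) -> List[Tuple[str, str]]:
    """Read one tagged file and return its (word, tag) pairs (hypothetical helper)."""
    with open(filename, encoding="utf-8") as f:
        return parse_tagged_text(f.read())
```

Like the function described above, this sketch returns one flat list per file and does not try to recover sentence boundaries.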

Taggers trained with TnT

    >>> import os
    >>> import pickle
    >>> from nltk.tag import tnt
    >>> tagger = tnt.TnT()
    >>> tagger.train(words_tags)
    >>> with open(os.path.join("taggers", "pos", "tnt.pickle"), "wb") as f:
    ...     pickle.dump(tagger, f)

The trained TnT model can be loaded back with the pickle module.
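A minimal sketch of loading a pickled tagger is given below. The helper name load_pickled_tagger is hypothetical, and the path used with it should match wherever the pickle file is stored locally.

```python
import pickle


def load_pickled_tagger(path):
    """Load a pickled tagger object from the given path.

    Unpickling requires the class of the stored object to be importable,
    so NLTK must be installed to load a pickled TnT tagger.
    Only unpickle files that come from a trusted source.
    """
    with open(path, "rb") as f:
        return pickle.load(f)
```

Assuming the file sits at taggers/pos/tnt.pickle, the loaded object is an NLTK TnT tagger, and calling its tag method on a list of word strings returns a list of (word, tag) pairs.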

Tagset

http://nlp.cs.ru.is/pdf/Tagset.pdf

Complete description of the used corpus

http://www.linguist.is/icelandic_treebank/Icelandic_Parsed_Historical_Corpus_(IcePaHC)
