Experiments with periodicals published on the territory of modern Latvia in various languages (Latvian, Russian, Latgalian, German)

Nofenigma/LatvianPeriodicals_OCR

LatvianPeriodicals_OCR

Periodicals in Latgalian

Largely based on (http://data.lnb.lv/digitala_biblioteka/laikraksti/index3.htm):

Russian Empire

  • Gaisma, 1905-1906

Available issues

  • Sākla, 1906

Available issues

  • Auseklis, 1906-1907

Available issues

  • Drywa, 1908-1917

Available issues

  • Drywa : Gorīgajs pīlykums, 1912-1913

Available issues

  • Drywa : Ziniskajs pīlykums, 1912-1913

Available issues

Alternatively

  • Drywa : Pielikums Gorīga Maize, 1914-1917

Available issues

  • Drywa : Pielikums Orōjs, 1914-1915

Available issues

Alternatively (some 1915 issues)

  • Jaunas Zinias, 1912-1914

Available issues

  • Liaužu Bolss, 1917

Available issues

Independent Latvia (interwar)

  • Jaunō Drywa, 1918

Available issues

  • Jaunō Letgola, 1918

Available issues (1 issue)

  • Latgolas Wōrds, 1919-1940

Available issues

  • Latgalīts, 1920-1926

Available issues

Alternatively

  • Latgolas Lauksaimnīks, 1921-1925

Available issues

  • Jaunō Straume, 1921-1934

Available issues

  • Zīdūnis, 1921-1940

Available issues

  • Latgolas Škola, 1921-1938

Available issues

  • Reits, 1923-1924

Available issues

  • Zemnīka Bolss, 1924-1926

Available issues

Alternatively

  • Latgolas Dorbs, 1924-1934

Available issues

  • Latgolas Zemkūpis, 1925-1935

Available issues

  • Zemnīka Ziņas, 1926-1931

Available issues

  • Katoļu Dzeive, 1926-1940

Available issues

  • Sauleite, 1927-1940

Available issues

  • Latgolas Socialdemokrats, 1928

  • Jaunais Vōrds, 1929-1930

Available issues

  • Jaunais Vōrds, 1931-1940

Available issues

  • Dryva, 1932

Available issues

  • Lauku sāta, 1938-1940

Available issues

Soviet Union (interwar)

  • Taisneiba, 1926-1941

Available issues

Soviet Latvia

  • Ludzas Taisneiba, 1946-1961

Available issues

  • Latgolas Taisneiba (Daugavpils), 1945-1952

Available issues

Trimda (exile)

  • Latgola, 1946-1954

Available issues

  • Dzeive, 1948-2000

Available issues

  • Latgolas Bolss, 1955-1985

Available issues

  • Mōras Zeme, 1989-1994

Available issues

Periodicals in Russian

Russian Empire

  • Курляндские губернские ведомости = Kurländische Gouvernements Zeitung, 1852-1915

Available issues

NB: the OCR for later issues used the wrong language model (German instead of Russian)

  • Rīgas Pilsētas Policijas Avīze = Ведомости Рижской Городской полиции, 1889-1902 (not all years available)

Trilingual; good Latvian and German OCR, poor Russian OCR (in modern orthography, many errors)

Available issues

  • Лифляндские губернские ведомости = Livländische Gouvernements-Zeitung = Vidzemes Guberņas Avīze

Poor-quality OCR; multilingual texts

Available issues

Issues published in 1905

  • Рижское Обозрѣніе, 1915-1917

OCR layer with old orthography symbols available

Available issues (the main corpus)

Available issues (8 more)

  • Рижский Вестник, 1869-1917

OCR layer with old orthography symbols available

Available issues

Available issues

  • Рижское утро, 1915-1917

OCR layer with old orthography symbols available

Available issues

Independent Latvia (interwar)

  • Вера и жизнь, 1923-1940

Available issues

  • Вечернее время

OCR layer with old orthography symbols available

Available issues

  • Для вас

NB: poor Russian OCR (in modern orthography, many errors)

Available issues

  • Голос народа : вестник Русской крестьянской фракции Сейма

Available issues

  • Сегодня, 1919-1940

Available issues
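Several entries above are flagged as having poor OCR. Once ground-truth transcriptions exist for a few pages, one common way to put a number on "poor" is the character error rate (CER): the edit distance between the OCR output and the ground truth, divided by the length of the ground truth. The sketch below is only an illustration under that assumption; the function names are hypothetical and it is not taken from this repository or from any of the tools listed below.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion and substitution each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth: str, ocr_output: str) -> float:
    """CER = edit distance / number of characters in the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)
```

Note that an OCR layer in modern orthography is penalised by this measure when the ground truth keeps the old spelling, so both sides should be normalised the same way before comparing.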

Useful tools/libraries

Tesseract, OCRopus (add links)

https://github.com/KBNLresearch/ochre - postprocessing (NB: ground-truth (GT) data must be prepared first)

https://github.com/ktodorov/eval-historical-texts - postprocessing (using BERT embeddings)

https://github.com/TurkuNLP/ocr-correction (poorly documented)

https://github.com/kak-to-tak/Google_rusngram_spellcheck - a spellchecker for old Russian orthography (based on n-grams)

https://github.com/dhhse/prereform2modern - a converter for old Russian orthography

https://github.com/tberg12/ocular - Ocular, a system for historical OCR (formerly state-of-the-art)

https://github.com/cisocrgroup/OCR-Workshop - materials from the workshop "OCR and postcorrection of early printings for digital humanities"

http://cistern.cis.lmu.de/ocrocis/ - a wrapper for OCRopus
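To illustrate what converting old Russian orthography involves, here is a deliberately naive sketch. It only replaces the four letters dropped by the 1918 reform and removes the word-final hard sign. It is not the algorithm used by prereform2modern or any other tool listed above; the function name `modernize_naive` is hypothetical, and real converters also handle grammatical endings and other spelling rules that this sketch ignores.

```python
import re

# Letters removed by the 1918 spelling reform, with their usual replacements.
_LETTER_MAP = str.maketrans({
    "\u0463": "\u0435", "\u0462": "\u0415",  # yat -> е
    "\u0456": "\u0438", "\u0406": "\u0418",  # decimal i -> и
    "\u0473": "\u0444", "\u0472": "\u0424",  # fita -> ф
    "\u0475": "\u0438", "\u0474": "\u0418",  # izhitsa -> и
})

# Hard sign at the end of a word (a word-internal hard sign is kept).
_FINAL_HARD_SIGN = re.compile("[\u044a\u042a]\\b")


def modernize_naive(text: str) -> str:
    """Apply letter-level substitutions only; no morphology is handled."""
    text = _FINAL_HARD_SIGN.sub("", text)
    return text.translate(_LETTER_MAP)


if __name__ == "__main__":
    # Title of one of the periodicals listed above.
    print(modernize_naive("Рижское Обозр\u0463н\u0456е"))
```

Applying the same normalisation to both the OCR output and any reference text keeps comparisons between them consistent.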
