Experiments with periodicals published on the territory of modern Latvia in various languages (Latvian, Russian, Latgalian, German)
Largely based on (http://data.lnb.lv/digitala_biblioteka/laikraksti/index3.htm):
- Gaisma, 1905-1906
- Sākla, 1906
- Auseklis, 1906-1907
- Drywa, 1908-1917
- Drywa : Gorīgajs pīlykums, 1912-1913
- Drywa : Ziniskajs pīlykums, 1912-1913
- Drywa : Pielikums Gorīga Maize, 1914-1917
- Drywa : Pielikums Orōjs, 1914-1915
Alternatively (some 1915 issues
- Jaunas Zinias, 1912-1914
- Liaužu Bolss, 1917
- Jaunō Drywa, 1918
- Jaunō Letgola, 1918
- Latgolas Wōrds, 1919-1940
- Latgalīts, 1920-1926
- Latgolas Lauksaimnīks, 1921-1925
- Jaunō Straume, 1921-1934
- Zīdūnis, 1921-1940
- Latgolas Škola, 1921-1938
- Reits, 1923-1924
- Zemnīka Bolss, 1924-1926
- Latgolas Dorbs, 1924-1934
- Latgolas Zemkūpis, 1925-1935
- Zemnīka Ziņas, 1926-1931
- Katoļu Dzeive, 1926-1940
- Sauleite, 1927-1940
- Latgolas Socialdemokrats, 1928
- Jaunais Vōrds, 1929-1930
- Jaunais Vōrds, 1931-1940
- Dryva, 1932
- Lauku sāta, 1938-1940
- Taisneiba, 1926-1941
- Ludzas Taisneiba, 1946-1961
- Latgolas Taisneiba (Daugavpils), 1945-1952
- Latgola, 1946-1954
- Dzeive, 1948-2000
- Latgolas Bolss, 1955-1985
- Mōras Zeme, 1989-1994
- Курляндские губернские ведомости = Kurländische Gouvernements Zeitung, 1852-1915
NB: incorrectly chosen models for later issues (German instead of Russian)
- Rīgas Pilsētas Policijas Avīze = Ведомости Рижской Городской полиции, 1889-1902 (not all years available)
Trilingual; good Latvian and German OCR, poor Russian OCR (in modern orthography, many errors)
- Лифляндские губернские ведомости = Livländische Gouvernements-Zeitung = Vidzemes Guberņas Avīze
Poor-quality OCR; multilingual texts
- Рижское Обозрѣніе, 1915-1917
OCR layer with old orthography symbols available
Available issues (the main corpus)
- Рижский Вестник, 1869-1917
OCR layer with old orthography symbols available
- Рижское утро, 1915-1917
OCR layer with old orthography symbols available
- Вера и жизнь, 1923-1940
- Вечернее время
OCR layer with old orthography symbols available
- Для вас
NB: poor Russian OCR (in modern orthography, many errors)
- Голос народа : вестник Русской крестьянской фракции Сейма
- Сегодня, 1919-1940
Tesseract, OCRopus (add links)
https://github.com/KBNLresearch/ochre - postprocessing (NB: prepare GT data first)
https://github.com/ktodorov/eval-historical-texts - postprocessing (using BERT embeddings)
https://github.com/TurkuNLP/ocr-correction (poorly documented)
https://github.com/kak-to-tak/Google_rusngram_spellcheck - a spellchecker for old Russian orthography (based on n-grams)
https://github.com/dhhse/prereform2modern - a converter for old Russian orthography
https://github.com/tberg12/ocular - Ocular, a once state-of-the-art system for historical OCR
https://github.com/cisocrgroup/OCR-Workshop - materials from the workshop "OCR and postcorrection of early printings for digital humanities"
http://cistern.cis.lmu.de/ocrocis/ - a wrapper for OCRopus
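
Several titles above come with an OCR layer in old Russian orthography, so a rough normalization step can help when comparing them with modern-orthography text. The sketch below is a deliberately simplified illustration and is not the API of prereform2modern or any other tool listed here. The function name `modernize` is hypothetical. It assumes only the purely mechanical rules (letter replacement and dropping the word-final hard sign) and ignores the ending and prefix changes of the 1918 reform.

```python
import re

# Obsolete letters and their modern equivalents (simplified assumption:
# every occurrence can be replaced one-to-one).
_LETTER_MAP = str.maketrans({
    "\u0463": "\u0435",  # yat (lowercase) -> е
    "\u0462": "\u0415",  # yat (uppercase) -> Е
    "\u0456": "\u0438",  # decimal i (lowercase) -> и
    "\u0406": "\u0418",  # decimal i (uppercase) -> И
    "\u0473": "\u0444",  # fita (lowercase) -> ф
    "\u0472": "\u0424",  # fita (uppercase) -> Ф
    "\u0475": "\u0438",  # izhitsa (lowercase) -> и
    "\u0474": "\u0418",  # izhitsa (uppercase) -> И
})

# Hard sign (ъ / Ъ) at the very end of a word; a medial hard sign is kept.
_FINAL_HARD_SIGN = re.compile(r"(?<=\w)[\u044a\u042a]\b")


def modernize(text: str) -> str:
    """Apply the simplified old-to-modern orthography rules to text."""
    text = text.translate(_LETTER_MAP)
    return _FINAL_HARD_SIGN.sub("", text)


if __name__ == "__main__":
    print(modernize("Рижское Обозрѣніе"))
    print(modernize("Рижскій Вѣстникъ"))
```

A rule-based pass like this is only a baseline; OCR errors in the old-orthography layer would still need the postprocessing tools listed above.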