Updated from 2019-02 release to 2019-10 release.

Adjusted extraction to handle tarballs that do not contain a top-level subdirectory.
Ondkloss · Apr 25, 2020 · 45d2c04
1 parent eea5599

Showing 6 changed files with 10,212 additions and 8,110 deletions.
diff --git a/20190123_norsk_ordbank_nno_2012.tar.gz → 20191010_norsk_ordbank_nno_2012.tar.gz b/20190123_norsk_ordbank_nno_2012.tar.gz → 20191010_norsk_ordbank_nno_2012.tar.gz
diff --git a/20190123_norsk_ordbank_nob_2005.tar.gz → 20191010_norsk_ordbank_nob_2005.tar.gz b/20190123_norsk_ordbank_nob_2005.tar.gz → 20191010_norsk_ordbank_nob_2005.tar.gz
diff --git a/README.md b/README.md
@@ -4,7 +4,7 @@ Simple project to create a list of Norwegian words. To run:
 
     python word_parser.py
 
-Example output (and working wordlist) is `wordlist_20190123_norsk_ordbank_nob_2005.txt` and `wordlist_20190123_norsk_ordbank_nno_2012.txt`.
+Example output (and working wordlist) is `wordlist_20191010_norsk_ordbank_nob_2005.txt` and `wordlist_20191010_norsk_ordbank_nno_2012.txt`.
 
 You can also provide `--length` (`-l`) or `--pattern` (`-p`) to filter the wordlist. Some examples:
 
@@ -19,15 +19,15 @@ To handle sorting of ÆØÅ you can include the PyICU module. This also fixes so
 
 ## Source
 
-The bokmål source material is from [Norsk Ordbank in Norwegian Bokmål 2005](https://www.nb.no/sprakbanken/show?serial=oai%3Anb.no%3Asbr-5&lang=en), the 2019-02-20 update. It is released under the [CC-BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
+The bokmål source material is from [Norsk ordbank – bokmål 2005](https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-5/), the 2019-10-10 update. It is released under the [CC-BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
 
-The nynorsk source material is from [Norsk Ordbank in Norwegian Nynorsk 2012](https://www.nb.no/sprakbanken/show?serial=oai%3Anb.no%3Asbr-41&lang=en), the 2019-02-20 update. It is released under the [CC-BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
+The nynorsk source material is from [Norsk ordbank – nynorsk 2012](https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-41/), the 2019-10-10 update. It is released under the [CC-BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
 
 ## Software license
 
 The software in this repo is licensed under WTFPL and can be read in `LICENSE`.
 
 ## Known issues
 
-* The regex to remove proper nouns also removes several valid words as well.
-* One might evaluate also removing some additional special characters, for example `1`, `2`, `3`, `4` and `/`.
+- The regex to remove proper nouns also removes several valid words as well.
+- One might evaluate also removing some additional special characters, for example `1`, `2`, `3`, `4` and `/`.
diff --git a/word_parser.py b/word_parser.py
@@ -18,9 +18,9 @@ def find_tar_filename(filename_pattern):
     raise ValueError('Found none or multiple tarballs for pattern {}.'.format(filename_pattern))
 
 
-def extract_tar(filename):
+def extract_tar(filename, destination):
     tar = tarfile.open(filename, "r:gz")
-    tar.extractall()
+    tar.extractall(path=destination)
     tar.close()
 
 
@@ -108,7 +108,7 @@ def sort_locale(lines):
         collator = PyICU.Collator.createInstance(PyICU.Locale('nb_NO'))
         return sorted(lines, key=collator.getSortKey)
     else:
-        print("To get locale specific sorting (æøå) the PyICO module is required. Doing basic sort.")
+        print("To get locale specific sorting (æøå) the PyICU module is required. Doing basic sort.")
         return sorted(lines)
 
 
@@ -130,7 +130,7 @@ def in_interval(item, minimum, maximum):
 def parse_into_wordlist(filename_pattern, minmax=(None, None), pattern=None):
     # prepare content
     filename = find_tar_filename(filename_pattern)
-    extract_tar('{}.tar.gz'.format(filename))
+    extract_tar('{}.tar.gz'.format(filename), filename)
     lemma = find_lemma_file(filename)
     content = get_file_contents('{}/{}'.format(filename, lemma))
     lines = content.split('\n')
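
The `extract_tar` change above unpacks into an explicit directory, so the later `'{}/{}'.format(filename, lemma)` lookup still works when the archive has no top-level folder of its own. A minimal standalone sketch of that idea follows. It is not the code from this commit: the `with` block, the `mkdir` call and the return value are assumptions added for illustration.

```python
import tarfile
from pathlib import Path


def extract_tar(filename, destination):
    """Extract a gzipped tarball into `destination` and return that path.

    Illustrative variant of the function in the diff; creating the
    directory up front and returning it are assumptions, not part of
    the original change.
    """
    destination = Path(destination)
    # Make sure the target directory exists before extracting into it.
    destination.mkdir(parents=True, exist_ok=True)
    # The context manager closes the archive even if extraction raises.
    with tarfile.open(filename, "r:gz") as tar:
        tar.extractall(path=destination)
    return destination
```

Called as `extract_tar('20191010_norsk_ordbank_nob_2005.tar.gz', '20191010_norsk_ordbank_nob_2005')`, this would place the archive members under a directory named after the tarball, which is the layout the rest of `parse_into_wordlist` appears to expect.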