Updated from 2019-02 release to 2019-10 release.
Adjusted to tarball not containing subdirectory.
Ondkloss committed Apr 25, 2020
1 parent eea5599 commit 45d2c04
Showing 6 changed files with 10,212 additions and 8,110 deletions.
Binary file not shown.
Binary file not shown.
10 changes: 5 additions & 5 deletions README.md
@@ -4,7 +4,7 @@ Simple project to create a list of Norwegian words. To run:

python word_parser.py

-Example output (and working wordlist) is `wordlist_20190123_norsk_ordbank_nob_2005.txt` and `wordlist_20190123_norsk_ordbank_nno_2012.txt`.
+Example output (and working wordlist) is `wordlist_20191010_norsk_ordbank_nob_2005.txt` and `wordlist_20191010_norsk_ordbank_nno_2012.txt`.

You can also provide `--length` (`-l`) or `--pattern` (`-p`) to filter the wordlist. Some examples:

@@ -19,15 +19,15 @@ To handle sorting of ÆØÅ you can include the PyICU module. This also fixes so

## Source

-The bokmål source material is from [Norsk Ordbank in Norwegian Bokmål 2005](https://www.nb.no/sprakbanken/show?serial=oai%3Anb.no%3Asbr-5&lang=en), the 2019-02-20 update. It is released under the [CC-BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
+The bokmål source material is from [Norsk ordbank – bokmål 2005](https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-5/), the 2019-10-10 update. It is released under the [CC-BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).

-The nynorsk source material is from [Norsk Ordbank in Norwegian Nynorsk 2012](https://www.nb.no/sprakbanken/show?serial=oai%3Anb.no%3Asbr-41&lang=en), the 2019-02-20 update. It is released under the [CC-BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
+The nynorsk source material is from [Norsk ordbank – nynorsk 2012](https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-41/), the 2019-10-10 update. It is released under the [CC-BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).

## Software license

The software in this repo is licensed under WTFPL and can be read in `LICENSE`.

## Known issues

-* The regex to remove proper nouns also removes several valid words as well.
-* One might evaluate also removing some additional special characters, for example `1`, `2`, `3`, `4` and `/`.
+- The regex to remove proper nouns also removes several valid words as well.
+- One might evaluate also removing some additional special characters, for example `1`, `2`, `3`, `4` and `/`.
8 changes: 4 additions & 4 deletions word_parser.py
@@ -18,9 +18,9 @@ def find_tar_filename(filename_pattern):
raise ValueError('Found none or multiple tarballs for pattern {}.'.format(filename_pattern))


-def extract_tar(filename):
+def extract_tar(filename, destination):
     tar = tarfile.open(filename, "r:gz")
-    tar.extractall()
+    tar.extractall(path=destination)
     tar.close()
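
Review note on `extract_tar`: the change above extracts into an explicit destination because the newer tarball no longer carries its own top-level directory. A minimal sketch of the same idea, assuming a context manager is acceptable in place of the explicit `tar.close()`; the file names (`lemma.txt`, `ordbank.tar.gz`) and the temporary directory are hypothetical and only serve the demo:

```python
import os
import tarfile
import tempfile


def extract_tar(filename, destination):
    # The with-block closes the archive even if extraction raises.
    with tarfile.open(filename, "r:gz") as tar:
        tar.extractall(path=destination)


# Demo with a hypothetical flat tarball (no top-level directory inside).
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "lemma.txt")
with open(source, "w", encoding="utf-8") as handle:
    handle.write("eksempel\n")

archive = os.path.join(workdir, "ordbank.tar.gz")
with tarfile.open(archive, "w:gz") as tar:
    # arcname drops the directory part, so the member sits at the archive root.
    tar.add(source, arcname="lemma.txt")

destination = os.path.join(workdir, "ordbank")
extract_tar(archive, destination)
extracted = os.listdir(destination)
```

With a flat archive, the files end up directly under `destination`, which is what lets the caller keep building paths as `'{}/{}'.format(filename, lemma)`. On Python 3.12 and later, `extractall` also accepts a `filter` argument that restricts what an archive may write; whether to use it here is left open.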


@@ -108,7 +108,7 @@ def sort_locale(lines):
         collator = PyICU.Collator.createInstance(PyICU.Locale('nb_NO'))
         return sorted(lines, key=collator.getSortKey)
     else:
-        print("To get locale specific sorting (æøå) the PyICO module is required. Doing basic sort.")
+        print("To get locale specific sorting (æøå) the PyICU module is required. Doing basic sort.")
         return sorted(lines)
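
Review note on the fallback branch: the basic `sorted(lines)` compares code points, which places `å` before `æ` and `ø`, while the Norwegian alphabet ends in æ, ø, å. The sketch below only illustrates that difference. It is not part of this commit and not a substitute for ICU collation (it ignores accented letters and other locale rules); `NORWEGIAN_ALPHABET`, `RANK` and `norwegian_key` are hypothetical names:

```python
NORWEGIAN_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {letter: index for index, letter in enumerate(NORWEGIAN_ALPHABET)}


def norwegian_key(word):
    # Characters outside the alphabet sort after it, ordered by code point.
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word.lower()]


words = ["år", "ære", "øl", "zebra", "and"]
basic = sorted(words)                          # code point order
by_alphabet = sorted(words, key=norwegian_key)  # æ, ø, å after z
```

Here `basic` puts `år` ahead of `ære` and `øl`, whereas `by_alphabet` ends with `ære`, `øl`, `år`.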


@@ -130,7 +130,7 @@ def in_interval(item, minimum, maximum):
 def parse_into_wordlist(filename_pattern, minmax=(None, None), pattern=None):
     # prepare content
     filename = find_tar_filename(filename_pattern)
-    extract_tar('{}.tar.gz'.format(filename))
+    extract_tar('{}.tar.gz'.format(filename), filename)
     lemma = find_lemma_file(filename)
     content = get_file_contents('{}/{}'.format(filename, lemma))
     lines = content.split('\n')