TICCL-unk: we need a better acronym detection #8
Comments
There is a major problem here: In some cases a |
I do not see what the problem is, at least not if in the future we always work with ngrams also > 1. The main idea here is to identify acronyms in a hopefully more fail-safe way by extracting them from likely ngrams and compounds they form. So given 'de CDA-minister' and its frequency, you conclude 'CDA' is an acronym and you store it in the acronym list. Later these acronyms will mainly serve to prevent e.g. 'VVD-ministerraadbijeenkomst' from being turned into 'PVV-ministerraadbijeenkomst'. |
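A minimal sketch of this idea, not the TICCL-unk implementation: it assumes the bigram frequencies are available as a plain mapping and that a determiner ("de", "het", "een") followed by an UPPERCASE-hyphen-lowercase compound is enough evidence. The names `acronyms_from_bigrams`, `blocks_correction` and the threshold `min_freq` are hypothetical.

```python
import re
from collections import Counter

# Assumption: determiner, then an all-uppercase head, a hyphen, and a lowercase tail.
BIGRAM = re.compile(r"^(?:de|het|een) ([A-Z]+)-[a-z]+$")

def acronyms_from_bigrams(bigram_freqs, min_freq=2):
    """Collect acronym heads whose summed bigram frequency reaches min_freq."""
    support = Counter()
    for bigram, freq in bigram_freqs.items():
        match = BIGRAM.match(bigram)
        if match:
            support[match.group(1)] += freq
    return {acro for acro, total in support.items() if total >= min_freq}

def blocks_correction(variant, candidate, acronyms):
    """True if the proposed correction would alter a known acronym head."""
    v_head, _, v_tail = variant.partition("-")
    c_head, _, _ = candidate.partition("-")
    return bool(v_tail) and v_head in acronyms and v_head != c_head

freqs = {
    "de CDA-minister": 12,
    "een VVD-ministerraadbijeenkomst": 3,
    "de minister": 400,
}
acros = acronyms_from_bigrams(freqs)
```

With this list, a proposed correction of 'VVD-ministerraadbijeenkomst' into 'PVV-ministerraadbijeenkomst' would be rejected by `blocks_correction`, while a change confined to the tail of the compound would still be allowed.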
Ok, so I will incorporate this in TICCL-unk, extracting acronyms from 2-grams as well. |
What we need to keep from the old implementation is that the cases "alternation of UPPERCASE letters and punctuation, like A.N.W.B." are also detected. I have no idea why we had category 2... It will also be necessary to handle lowercased versions of detected acronyms analogously. But that will probably be work for LDcalc. |
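As a rough sketch of the "alternation of UPPERCASE letters and punctuation" case, the pattern below accepts two or more letter-plus-dot pairs. This is an assumption about what the old rule covered, not a copy of it; `is_punctuated_acronym` and `normalise` are hypothetical helpers.

```python
import re

# Assumption: one uppercase letter followed by a dot, at least twice in a row.
PUNCTUATED = re.compile(r"^(?:[A-Z]\.){2,}$")

def is_punctuated_acronym(token):
    """True for tokens such as 'A.N.W.B.'; a missing final dot is not accepted."""
    return PUNCTUATED.match(token) is not None

def normalise(token):
    """Strip dots and lowercase, so 'A.N.W.B.', 'ANWB' and 'anwb' share one key."""
    return token.replace(".", "").lower()
```

Looking tokens up by their normalised key is one possible way to treat lowercased versions of detected acronyms analogously.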
Test report for acronym detection in TICCL-unk ACRONYMS by TICCL-unk
Command line: reynaert@red:/reddata/TICCLAT/UNK$ /exp/sloot/usr/local/bin/TICCL-unk -o TESTacro1950 --alph /reddata/TICCLAT/UNK/nld.aspell.dict.lc.chars --acro KBkrantenartikels.1950.nuTICCLlexNamesAspell.wordfreqlist.tsv
reynaert@red:/reddata/TICCLAT/UNK$ cat TESTacro1950.acro |sort -k 2 -gr |head
Note the missing final dot.
Note that the form with trailing dot (i.e. 'V.S.') is two orders of magnitude more frequent. This form is notably absent from the acronym list extracted!
acronym ~ acronym frequency in unigram list ~ number of hyphenated compounds the acronym appears in ~ sum of frequencies of the hyphenated compounds the acronym appears in
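To make the four fields concrete, here is a small sketch that derives them from a unigram frequency mapping and a set of known acronyms. The input shapes and the function name `acronym_rows` are assumptions for illustration; how TICCL-unk computes them internally is not shown in this thread.

```python
from collections import defaultdict

def acronym_rows(unigram_freqs, acronyms):
    """Return (acronym, unigram frequency, number of compounds, summed compound frequency)."""
    compound_freqs = defaultdict(list)
    for word, freq in unigram_freqs.items():
        head, sep, tail = word.partition("-")
        # Only count real compounds: a known acronym, a hyphen, and a non-empty tail.
        if sep and tail and head in acronyms:
            compound_freqs[head].append(freq)
    rows = []
    for acro in sorted(acronyms):
        freqs = compound_freqs.get(acro, [])
        rows.append((acro, unigram_freqs.get(acro, 0), len(freqs), sum(freqs)))
    return rows

unigrams = {"CDA": 50, "CDA-minister": 12, "CDA-fractie": 8, "KNIL-": 4, "minister": 400}
rows = acronym_rows(unigrams, {"CDA", "KNIL"})
```

In this toy input, 'CDA' gets two compounds with a summed frequency of 20, while 'KNIL-' with its dangling hyphen is not counted as a compound.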
It seems to us now that a specific TICCL-run over only these acronyms may be called for in order to decide which ones to validate (+artifrq) and which ones to let TICCL correct. |
Another issue we need to keep in mind is that possible pairs with acronyms such as 'K.N.I.L.-' versus 'KNIL-' will not pass the Levenshtein filter (typically set at 2) in LDcalc, even though they should. We should find a way of disregarding the dots for determining the LD. |
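One possible way of disregarding the dots, sketched here with a plain dynamic-programming Levenshtein distance rather than LDcalc's own code: strip the dots from both strings before measuring. `dotless_distance` is a hypothetical name.

```python
def levenshtein(a, b):
    """Plain edit distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def dotless_distance(a, b):
    """Edit distance after removing every '.' from both strings."""
    return levenshtein(a.replace(".", ""), b.replace(".", ""))
```

For 'K.N.I.L.-' versus 'KNIL-' the plain distance is 4, which fails a limit of 2, while the dotless distance is 0. Stripping dots everywhere is a blunt choice; restricting it to tokens already recognised as punctuated acronyms may be safer.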
I have now also produced bi- and trigram frequency files for KBkranten 1950. I have merged these three files, successfully, with the new program: TICCL-mergelex. The result is here: reynaert@red:/reddata/Nederlab/KBkranten/FOLIAnottarred/FOLIA$ ls -l /reddata/TICCLAT/UNK/TSVmerged/ |
Great. I assume this is a confirmation of a successful test of issue #7 |
I ran this test:
reynaert@red:/reddata/TICCLAT/UNK$ nohup /exp/sloot/usr/local/bin/TICCL-unk -o /reddata/TICCLAT/UNK/TESTacro1950NGRAMS --alph /reddata/TICCLAT/UNK/nld.aspell.dict.lc.chars --acro /reddata/TICCLAT/UNK/TSVmerged/MergedKBkranten1950.wordfreqlist.tsv >/reddata/TICCLAT/UNK/TESTacro1950NGRAMS.20180219.stdout 2>/reddata/TICCLAT/UNK/TESTacro1950NGRAMS.20180219.stderr &
Results are in:
At first sight the non-punctuated acronyms are more reliable as acronyms. The punctuated ones probably incorporate a lot of abbreviated first names. I need to study this list more closely, but using ngrams seems to work best. |
Could not help noticing that the acronym list still contains Roman numerals, e.g. 'XVII 1'. I am reconsidering what will happen if these are part of compounds but are otherwise absent from the frequency list: they are then nevertheless likely to be 'corrected'. We must make sure this does not happen, somehow. |
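A sketch of how such entries could be flagged, assuming a standard-form Roman numeral pattern up to 3999; `is_roman_numeral` is a hypothetical helper, not something TICCL-unk provides.

```python
import re

# Assumption: standard subtractive notation; the lookahead rejects the empty string.
ROMAN = re.compile(
    r"^(?=[MDCLXVI])M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})$"
)

def is_roman_numeral(token):
    """True if the token is a well-formed uppercase Roman numeral."""
    return ROMAN.match(token) is not None

flagged = [t for t in ["XVII", "CDA", "VVD", "CD"] if is_roman_numeral(t)]
```

Note that the pattern cannot tell a numeral from a genuine acronym that happens to be well-formed, such as 'CD', and that single letters like 'X' also match, so the result is better used to mark entries for exemption or review than to delete them.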
We have about ten times as many punctuated as unpunctuated acronyms. I am thinking that it is probably good to keep these and to also exempt them from 'correction'. However, a regexp in LDcalc could probably fulfill the same function.
reynaert@red:/reddata/TICCLAT/UNK$ grep -v '\.' /reddata/TICCLAT/UNK/TESTacro1950NGRAMS.acro |wc
Also, the list contains single-character acronyms. We should most probably keep these in: 'X-benen' should not be corrected into 'O-benen', nor 'X-rays' into 'Z-rays'. |
I'm sorry but the flow of thoughts above is not really helpful :{
assume this to be done |
At the moment acronyms are detected in TICCL-unk rather clumsily (and maybe wrongly?):
A word is an acronym if
We miss out on CDA-minister etc.
Using regular expressions like "[A-Z]+[-*a-z]" might help, probably limiting ourselves to 'noun phrases' that start with "de ", "een " or "het ".