TICCL-unk: we need a better acronym detection #8

kosloot · 2018-02-01T13:05:04Z

At the moment, acronym detection in TICCL-unk is rather clumsy (and maybe wrong?).

A word is an acronym if:

1. it is an UPPERCASE word of fewer than 6 characters (ANWB, KLM)
2. it starts with exactly 1 punctuation character, and rule 1 applies to the rest (WHY???)
3. it is an alternation of UPPERCASE letters and punctuation, like A.N.W.B.
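
For reference, a minimal Python sketch of the first and third rules as described above. This is not the TICCL-unk code: the names `PLAIN`, `PUNCTUATED` and `is_acronym` are made up for illustration, only the dot is treated as punctuation (an assumption), and the second rule is left out because its purpose is unclear.

```python
import re

# Rule 1: an all-uppercase word of fewer than 6 characters, e.g. ANWB, KLM.
PLAIN = re.compile(r"[A-Z]{1,5}")

# Rule 3: uppercase letters alternating with punctuation, e.g. A.N.W.B.
# Assumption: only '.' counts as punctuation and the final dot is optional.
PUNCTUATED = re.compile(r"[A-Z](?:\.[A-Z])+\.?")


def is_acronym(word: str) -> bool:
    """Return True if the whole word satisfies rule 1 or rule 3."""
    return bool(PLAIN.fullmatch(word) or PUNCTUATED.fullmatch(word))
```

With these patterns both `A.N.W.B.` and a form without the final dot are accepted, while a hyphenated compound is rejected as a whole.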

We miss out on CDA-minister etc.
Using regular expressions like "[A-Z]+[-*a-z]" might help, probably limiting ourselves to 'noun phrases' starting with
"de ", "een " and "het ".

kosloot · 2018-02-12T10:53:25Z

There is a major problem here:
TICCL-unk processes frequency files of the format
word frequency

In some cases a word may be a bigram, like de CDA-minister, but unigrams like CDA-minister are the majority.
Therefore, rules based on bigrams like de WORD will only work for multi-gram entries.
Alternatively, acronym detection could be done in another module, where sentences are available.
But even then we might miss sentences like CDA-minister Balkenende beweerde ...
@martinreynaert please comment

martinreynaert · 2018-02-12T13:26:22Z

I do not see what the problem is, at least not if in the future we always also work with n-grams where n > 1.

The main idea here is to identify acronyms in a hopefully more fail-safe way by extracting them from likely ngrams and compounds they form. So given 'de CDA-minister' and its frequency, you conclude 'CDA' is an acronym and you store it in the acronym list. Later these acronyms will mainly serve to prevent e.g. 'VVD-ministerraadbijeenkomst' from being turned into 'PVV-ministerraadbijeenkomst'.
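
A rough Python sketch of this idea, under stated assumptions: the input is a sequence of (n-gram, frequency) pairs, only bigrams that start with "de", "een" or "het" are trusted, and the names `harvest_acronyms`, `is_protected` and `min_freq` are hypothetical. The frequencies in the example are made up.

```python
import re

# Hypothetical pattern: an uppercase prefix of a hyphenated compound,
# e.g. 'CDA' in 'CDA-minister'.
COMPOUND = re.compile(r"([A-Z]+)-\w+")
DETERMINERS = {"de", "een", "het"}


def harvest_acronyms(entries, min_freq=1):
    """Collect acronyms from (ngram, frequency) pairs like ('de CDA-minister', 12)."""
    acronyms = set()
    for ngram, freq in entries:
        tokens = ngram.split()
        # Only trust bigrams of the form '<determiner> <ACRONYM>-<word>'.
        if len(tokens) == 2 and tokens[0].lower() in DETERMINERS and freq >= min_freq:
            match = COMPOUND.fullmatch(tokens[1])
            if match:
                acronyms.add(match.group(1))
    return acronyms


def is_protected(word, acronyms):
    """True if the word is a hyphenated compound whose prefix is a known acronym."""
    prefix, sep, _rest = word.partition("-")
    return bool(sep) and prefix in acronyms


# Made-up frequencies, for illustration only.
entries = [("de CDA-minister", 12), ("de VVD-fractie", 7), ("een mooie", 40)]
known = harvest_acronyms(entries)
```

Here `is_protected("VVD-ministerraadbijeenkomst", known)` is true, so a later correction step could skip that word.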

kosloot · 2018-02-12T14:15:19Z

OK, so I will incorporate this in TICCL-unk, extracting acronyms from 2-grams as well.
Do we still keep the 'old' implementation (as described above) too?

martinreynaert · 2018-02-12T15:18:51Z

What we need to keep from the old implementation is that the cases described as "alternation of UPPERCASE letters and punctuation, like A.N.W.B." are also detected.

I have no idea why we had category 2...

What will also be necessary is that lowercased versions of detected acronyms will be handled analogously. But that will be work for LDcalc, probably.

martinreynaert · 2018-02-19T08:50:26Z

Test report for acronym detection in TICCL-unk

ACRONYMS by TICCL-unk

According to our specifications, Ko has modified the acronym detection in TICCL-unk.
We have run a test on KB newspapers 1950, a collection that contains a great deal of acronyms, both in the older 'punctuated' format (e.g. K.N.I.L.) and the more modern unpunctuated format (e.g. KNIL).

Command line: reynaert@red:/reddata/TICCLAT/UNK$ /exp/sloot/usr/local/bin/TICCL-unk -o TESTacro1950 --alph /reddata/TICCLAT/UNK/nld.aspell.dict.lc.chars --acro KBkrantenartikels.1950.nuTICCLlexNamesAspell.wordfreqlist.tsv

As a result, the program reports:

reynaert@red:/reddata/TICCLAT/UNK$ /exp/sloot/usr/local/bin/TICCL-unk -o TESTacro1950 --alph /reddata/TICCLAT/UNK/nld.aspell.dict.lc.chars --acro KBkrantenartikels.1950.nuTICCLlexNamesAspell.wordfreqlist.tsv
generating output files
using artifrq=0
created TESTacro1950.clean
created TESTacro1950.unk
created TESTacro1950.punct
created TESTacro1950.acro
done!
reynaert@red:/reddata/TICCLAT/UNK$

The top of the numerically descending sorted .acro list reports:

reynaert@red:/reddata/TICCLAT/UNK$ cat TESTacro1950.acro |sort -k 2 -gr |head
V.S 63
U.P 63
N.V 60
V.N 58
S.S 38

Note the missing final dot.

We are not sure what the reported 'count' is. However, when we look up the topmost item in the input corpus frequency list, we see that it is more frequent there, by 2:
reynaert@red:/reddata/TICCLAT/UNK$ grep '^V.S' KBkrantenartikels.1950.nuTICCLlexNamesAspell.wordfreqlist.tsv |head
V.S. 5172 58062760 60.094
V.S.I. 659 73282362 75.846
V.S., 497 75061171 77.6871
V.S.V. 221 79448503 82.2279
V.S 65 84532021 87.4892

Note that the form with trailing dot (i.e. 'V.S.') is two orders of magnitude more frequent. This form is notably absent from the acronym list extracted!

The count reported cannot be the sum of the times the acronym was seen on its own or as part of a hyphenated compound:
reynaert@red:/reddata/TICCLAT/UNK$ cat KBkrantenartikels.1950.nuTICCLlexNamesAspell.wordfreqlist.tsv |sort -k 2 -gr | tr '\t' '#' | grep '^V.S-'
V.S-#14#88852522#91.9609
V.S-I.#3#91879353#95.0936
V.S-V.#1#94640362#97.9512
V.S-V-#1#94640361#97.9512
V.S-meer#1#94640365#97.9512
V.S-L#1#94640360#97.9512
V.S-is#1#94640364#97.9512
V.S-gewerkt;#1#94640363#97.9512
V.S-;#1#94640359#97.9512
V.S--'.-'':#1#94640356#97.9512
V.S-,#1#94640355#97.9512
V.S-0-~compeii.ie.#1#94640358#97.9512
V.S-.0.#1#94640357#97.9512
We wonder why the list of acronyms retrieved does not contain any of the non-punctuated acronyms, or for that matter, the fully punctuated and far more frequent forms.
What we would really like to see collected for each acronym:

acronym ~ acronym frequency in unigram list ~ number of hyphenated compounds the acronym appears in ~ sum of frequencies of the hyphenated compounds the acronym appears in
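
A possible way to compute such a record, as a Python sketch only. It assumes a unigram list given as (word, frequency) pairs in which each word occurs once, plus an already known set of acronyms; `acronym_report` and `format_report` are hypothetical names, and apart from the pair ANWB-auto 23 the example numbers are made up.

```python
def acronym_report(entries, acronyms):
    """Map each acronym to [unigram frequency, number of compounds, summed compound frequency]."""
    stats = {acronym: [0, 0, 0] for acronym in acronyms}
    for word, freq in entries:
        if word in stats:
            # The acronym on its own in the unigram list.
            stats[word][0] += freq
            continue
        prefix, sep, _rest = word.partition("-")
        if sep and prefix in stats:
            stats[prefix][1] += 1     # one more hyphenated compound
            stats[prefix][2] += freq  # add that compound's frequency
    return stats


def format_report(stats):
    """Render one 'acronym ~ freq ~ compounds ~ compound freq' line per acronym."""
    return [
        " ~ ".join([acronym, str(uni), str(count), str(total)])
        for acronym, (uni, count, total) in sorted(stats.items())
    ]


# Illustrative input only.
example = [("ANWB", 50), ("ANWB-auto", 23), ("ANWB-lid", 4), ("fiets", 9)]
lines = format_report(acronym_report(example, {"ANWB"}))
```

For the illustrative input this yields the single line `ANWB ~ 50 ~ 2 ~ 27`.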

What we now have is definitely a very usable partial list of the acronyms in this corpus. However:

We require the list of non-punctuated acronyms also
Please explain how the current counts for the extracted acronyms were obtained

It seems to us now that a specific TICCL-run over only these acronyms may be called for in order to decide which ones to validate (+artifrq) and which ones to let TICCL correct.

martinreynaert · 2018-02-19T09:04:47Z

Another issue we need to keep in mind is that possible pairs with acronyms such as 'K.N.I.L.-' versus 'KNIL-' will not pass the Levenshtein filter (typically set at 2) in LDcalc, even though they should. We should find a way of disregarding the dots for determining the LD.
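
To make the problem concrete, a small Python sketch (not LDcalc code; `levenshtein` and `dotless_distance` are illustrative names). The plain distance between 'K.N.I.L.-' and 'KNIL-' is 4, one deletion per dot, which exceeds a limit of 2. Stripping the dots before comparing brings it down to 0.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion from a
                current[j - 1] + 1,            # insertion into a
                previous[j - 1] + (ca != cb),  # substitution, or free match
            ))
        previous = current
    return previous[-1]


def dotless_distance(a: str, b: str) -> int:
    """Edit distance with all dots disregarded (one possible approach)."""
    return levenshtein(a.replace(".", ""), b.replace(".", ""))
```

Whether dots should be dropped for every pair or only when one side is a known acronym is a design choice this sketch leaves open.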

martinreynaert · 2018-02-19T15:39:32Z

I have now also produced bi- and trigram frequency files for KBkranten 1950.
They are here:
reynaert@red:/reddata/Nederlab/KBkranten/FOLIAnottarred/FOLIA$ ls -l /reddata/TICCLAT/UNK/TSVnew/
total 1745532
-rw-r--r-- 1 reynaert reynaert 471015851 Feb 19 15:18 KBkranten1950.wordfreqlist.2-gram.tsv
-rw-r--r-- 1 reynaert reynaert 1255614280 Feb 19 15:18 KBkranten1950.wordfreqlist.3-gram.tsv
-rw-r--r-- 1 reynaert reynaert 60766929 Feb 19 15:18 KBkranten1950.wordfreqlist.tsv

I have merged these three files, successfully, with the new program: TICCL-mergelex. The result is here:

reynaert@red:/reddata/Nederlab/KBkranten/FOLIAnottarred/FOLIA$ ls -l /reddata/TICCLAT/UNK/TSVmerged/
total 1745464
-rw-r--r-- 1 reynaert reynaert 1787341492 Feb 19 15:38 MergedKBkranten1950.wordfreqlist.tsv

kosloot · 2018-02-19T16:25:16Z

Great. I assume this is a confirmation of a successful test of issue #7.
Looking forward to hearing what effect this had on acronym detection.

martinreynaert · 2018-02-19T21:12:44Z

I ran this test:

reynaert@red:/reddata/TICCLAT/UNK$ nohup /exp/sloot/usr/local/bin/TICCL-unk -o /reddata/TICCLAT/UNK/TESTacro1950NGRAMS --alph /reddata/TICCLAT/UNK/nld.aspell.dict.lc.chars --acro /reddata/TICCLAT/UNK/TSVmerged/MergedKBkranten1950.wordfreqlist.tsv >/reddata/TICCLAT/UNK/TESTacro1950NGRAMS.20180219.stdout 2>/reddata/TICCLAT/UNK/TESTacro1950NGRAMS.20180219.stderr &

Results are in:
reynaert@red:/reddata/TICCLAT/UNK$ grep -v '\.' /reddata/TICCLAT/UNK/TESTacro1950NGRAMS.acro |more

At first sight the non-punctuated acronyms are more reliable as acronyms. The punctuated ones probably incorporate a lot of abbreviated first names.

I need to study this list more closely, but using ngrams seems to work best.

martinreynaert · 2018-02-19T21:17:45Z

Could not help noticing that the acronym list still contains Roman numerals:

XVII 1
XXLII 1

I am reconsidering what will happen if these are part of compounds but do not otherwise occur in the frequency list: they are then nevertheless likely to be 'corrected'. We must somehow make sure this does not happen.
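
As a sketch of how such items could be recognised, here are two Python checks. Neither is what TICCL-unk does, and the names are hypothetical. The strict pattern accepts only well-formed numerals from 1 to 3999, so it matches XVII but not a malformed form like XXLII. The loose pattern accepts any string of Roman numeral letters and therefore catches both.

```python
import re

# Well-formed Roman numerals from 1 to 3999.
STRICT_ROMAN = re.compile(
    r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
)
# Any non-empty string made up of Roman numeral letters only.
LOOSE_ROMAN = re.compile(r"[IVXLCDM]+")


def is_roman_strict(word: str) -> bool:
    # The strict pattern also matches the empty string, so exclude it explicitly.
    return bool(word) and STRICT_ROMAN.fullmatch(word) is not None


def is_roman_loose(word: str) -> bool:
    return LOOSE_ROMAN.fullmatch(word) is not None
```

Both checks also fire on strings such as `CD`, which could just as well be a genuine acronym, so either one would need a decision on how to treat such ambiguous cases.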

martinreynaert · 2018-02-19T21:42:47Z

We have about ten times as many punctuated as unpunctuated acronyms. Am thinking that it is probably good to keep these and to also exempt these from 'correction'. However, a regexp in LDcalc might fulfill the same function, probably.

reynaert@red:/reddata/TICCLAT/UNK$ grep -v '\.' /reddata/TICCLAT/UNK/TESTacro1950NGRAMS.acro |wc
1144 2288 7360
reynaert@red:/reddata/TICCLAT/UNK$ grep '\.' /reddata/TICCLAT/UNK/TESTacro1950NGRAMS.acro |wc
9608 19216 86005

The list also contains single-character acronyms. We should most probably keep these in: 'X-benen' should not be corrected into 'O-benen', nor 'X-rays' into 'Z-rays'.

kosloot · 2018-02-20T08:44:28Z

I'm sorry, but the flow of thoughts above is not really helpful :{
Please be clearer about what is discussion and what is a (presumed) bug.
E.g.:

The counts are odd: OK, that may be because I count every occurrence of an acronym, which is probably not a good number. I think we should count the real frequency of the acronym.
If ANWB is found in ANWB-auto 23, we should increment by 23, not 1.
Roman numbers are another issue (TICCL-unk filter out all Roman Numbers #9), not addressed yet.
Punctuated acronyms: V.S. versus V.S: is this a bug or just an observation?
Both V.S and V.S. are detected as a 'punctuated' acronym, as far as I can see.
I think you are asking for a different output format for the acronym list? Please make this a separate issue,
also remembering that punctuated acronyms DON'T have a hyphenated 'parent'.

kosloot · 2018-02-21T09:21:28Z

Counts have been fixed (I hope).
Roman numbers (TICCL-unk filter out all Roman Numbers #9) are fixed.
Two different output schemes for acronyms have been suggested:
I made this into a new issue: Better output for TICCL-unk acronym list #11

kosloot · 2018-12-19T10:49:15Z

assume this to be done

kosloot added the enhancement label Feb 1, 2018

kosloot assigned martinreynaert and kosloot Feb 1, 2018

kosloot changed the title ~~TICCL-unk: we need aetter acronym detection~~ TICCL-unk: we need a better acronym detection Feb 6, 2018

kosloot mentioned this issue Feb 19, 2018

create a new module to aggregate frequency lists into one list. #7

Closed

kosloot closed this as completed Dec 19, 2018
