TICCL-unk: we need a better acronym detection #8

Closed
kosloot opened this issue Feb 1, 2018 · 14 comments

@kosloot
Collaborator

kosloot commented Feb 1, 2018

At the moment, acronyms are detected in TICCL-unk in a rather clumsy (and maybe wrong?) way:

A word is an acronym if

  1. it is an UPPERCASE word < 6 characters (ANWB, KLM)
  2. it starts with exactly 1 punctuation character, and rule 1 applies to the rest (WHY???)
  3. it is an alternation of UPPERCASE letters and punctuation, like A.N.W.B.
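
As a rough illustration of the three rules above, here is a minimal Python sketch. It is not the TICCL-unk code (which is not shown in this thread); the function names and the exact notion of "punctuation" (Python's `string.punctuation` for rule 2, any non-word, non-space character for rule 3) are assumptions made for the example.

```python
import re
import string

PUNCT = set(string.punctuation)  # assumption: ASCII punctuation only

# Rule 3 (assumed reading): one or more "LETTER + punctuation" pairs,
# optionally followed by a final letter, so both A.N.W.B. and V.S match.
ALTERNATING = re.compile(r"(?:[A-Z][^\w\s])+[A-Z]?")


def is_upper_word(word: str) -> bool:
    """Rule 1: an all-uppercase alphabetic word of fewer than 6 characters."""
    return 0 < len(word) < 6 and word.isalpha() and word.isupper()


def is_acronym_old(word: str) -> bool:
    """Apply the three rules from the list above, in order."""
    if is_upper_word(word):
        return True
    # Rule 2: exactly one leading punctuation character, then rule 1.
    if len(word) > 1 and word[0] in PUNCT and is_upper_word(word[1:]):
        return True
    # Rule 3: alternation of uppercase letters and punctuation.
    return ALTERNATING.fullmatch(word) is not None
```

With these definitions, `ANWB`, `KLM` and `A.N.W.B.` are accepted, while a compound such as `CDA-minister` is rejected, which is the gap this issue is about.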

We miss out on CDA-minister etc.
Using regular expressions like "[A-Z]+[-*a-z]" might help, probably limiting ourselves to 'noun phrases' starting with
"de ", "een " and "het ".

@kosloot kosloot changed the title from "TICCL-unk: we need aetter acronym detection" to "TICCL-unk: we need a better acronym detection" Feb 6, 2018
@kosloot
Collaborator Author

kosloot commented Feb 12, 2018

There is a major problem here:
TICCL-unk processes frequency files of the format
word frequency

In some cases a word may be a bigram, like de CDA-minister, but unigrams like CDA-minister are the majority.
Therefore applying rules based on bi-grams like de WORD will only work for multi-grams.
OR acronym detection should be done in another module, where sentences are available.
But even then we might miss out on sentences like CDA-minister Balkenende beweerde ...
@martinreynaert please comment

@martinreynaert
Collaborator

I do not see what the problem is, at least not if in the future we always work with ngrams also > 1.

The main idea here is to identify acronyms in a hopefully more fail-safe way by extracting them from likely ngrams and compounds they form. So given 'de CDA-minister' and its frequency, you conclude 'CDA' is an acronym and you store it in the acronym list. Later these acronyms will mainly serve to prevent e.g. 'VVD-ministerraadbijeenkomst' from being turned into 'PVV-ministerraadbijeenkomst'.
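
A minimal sketch of this idea in Python, assuming the frequency list is available as (n-gram, frequency) pairs. The regular expression, the determiner list and the function name are assumptions for illustration, not what TICCL-unk does internally, and the frequencies in the usage example are made up.

```python
import re

# Assumed pattern: an optional Dutch determiner, then an uppercase run,
# a hyphen, and a lowercase remainder (e.g. "de CDA-minister").
COMPOUND = re.compile(r"(?:(?:de|een|het) )?([A-Z]+)-[a-z]+")


def extract_acronyms(entries):
    """Collect the uppercase heads of hyphenated compounds.

    `entries` is an iterable of (ngram, frequency) pairs; the frequency
    is ignored here and only the form of the n-gram is inspected.
    """
    found = set()
    for ngram, _freq in entries:
        match = COMPOUND.fullmatch(ngram)
        if match:
            found.add(match.group(1))
    return found


# Illustrative input only; the counts are invented.
sample = [
    ("de CDA-minister", 12),
    ("VVD-ministerraadbijeenkomst", 3),
    ("de minister", 100),
    ("X-benen", 4),
]
acronyms = extract_acronyms(sample)
```

A later correction step could then consult `acronyms` before proposing a variant, so that a compound starting with a known acronym such as `VVD-` is left alone.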

@kosloot
Collaborator Author

kosloot commented Feb 12, 2018

Ok, so I will incorporate this in TICCL-unk, extracting acronyms from 2-grams as well.
Do we still keep the 'old' implementation (as described above) too?

@martinreynaert
Collaborator

What we need to keep from the old implementation is that the cases "alternation of UPPERCASE letters and punctuation, like A.N.W.B." are also detected.

I have no idea why we had category 2...

What will also be necessary is that lowercased versions of detected acronyms will be handled analogously. But that will be work for LDcalc, probably.

@martinreynaert
Collaborator

Test report for acronym detection in TICCL-unk

ACRONYMS by TICCL-unk

  • According to our specifications, Ko has modified the acronym detection in TICCL-unk.

  • We have run a test on KB newspapers 1950, a collection that contains a great deal of acronyms, both in the older 'punctuated' format (e.g. K.N.I.L.) and the more modern unpunctuated format (e.g. KNIL).

Command line: reynaert@red:/reddata/TICCLAT/UNK$ /exp/sloot/usr/local/bin/TICCL-unk -o TESTacro1950 --alph /reddata/TICCLAT/UNK/nld.aspell.dict.lc.chars --acro KBkrantenartikels.1950.nuTICCLlexNamesAspell.wordfreqlist.tsv

  • As a result, the program reports:

reynaert@red:/reddata/TICCLAT/UNK$ /exp/sloot/usr/local/bin/TICCL-unk -o TESTacro1950 --alph /reddata/TICCLAT/UNK/nld.aspell.dict.lc.chars --acro KBkrantenartikels.1950.nuTICCLlexNamesAspell.wordfreqlist.tsv
generating output files
using artifrq=0
created TESTacro1950.clean
created TESTacro1950.unk
created TESTacro1950.punct
created TESTacro1950.acro
done!
reynaert@red:/reddata/TICCLAT/UNK$

  • The top of the numerically descending sorted .acro list reports:

reynaert@red:/reddata/TICCLAT/UNK$ cat TESTacro1950.acro |sort -k 2 -gr |head
V.S 63
U.P 63
N.V 60
V.N 58
S.S 38

Note the missing final dot.

  • We are not sure what the 'count' reported is. However, when we look up the topmost item in the input corpus frequency list, we see it is more frequent there, by '2'
    reynaert@red:/reddata/TICCLAT/UNK$ grep '^V.S' KBkrantenartikels.1950.nuTICCLlexNamesAspell.wordfreqlist.tsv |head
    V.S. 5172 58062760 60.094
    V.S.I. 659 73282362 75.846
    V.S., 497 75061171 77.6871
    V.S.V. 221 79448503 82.2279
    V.S 65 84532021 87.4892

Note that the form with trailing dot (i.e. 'V.S.') is two orders of magnitude more frequent. This form is notably absent from the acronym list extracted!

acronym ~ acronym frequency in unigram list ~ number of hyphenated compounds the acronym appears in ~ sum of frequencies of the hyphenated compounds the acronym appears in

  • What we now have is definitely a very usable partial list of the acronyms in this corpus. However:
  • We require the list of non-punctuated acronyms also
  • Please explain how the current counts for the extracted acronyms were obtained

It seems to us now that a specific TICCL-run over only these acronyms may be called for in order to decide which ones to validate (+artifrq) and which ones to let TICCL correct.

@martinreynaert
Collaborator

Another issue we need to keep in mind is that possible pairs with acronyms such as 'K.N.I.L.-' versus 'KNIL-' will not pass the Levenshtein filter (typically set at 2) in LDcalc, even though they should. We should find a way of disregarding the dots for determining the LD.
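
To make the problem concrete, here is a small Python sketch that compares a plain Levenshtein distance with one computed after stripping dots. It only illustrates the suggestion above; it says nothing about how LDcalc computes distances internally, and both function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def ld_ignoring_dots(a: str, b: str) -> int:
    """Edit distance after removing every '.' from both strings."""
    return levenshtein(a.replace(".", ""), b.replace(".", ""))


plain = levenshtein("K.N.I.L.-", "KNIL-")         # 4: one deletion per dot
dotless = ld_ignoring_dots("K.N.I.L.-", "KNIL-")  # 0: identical once dots are dropped
```

With a limit of 2, the plain distance of 4 rejects the pair, while the dot-insensitive variant accepts it.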

@martinreynaert
Collaborator

I have now also produced bi- and trigram frequency files for KBkranten 1950.
They are here:
reynaert@red:/reddata/Nederlab/KBkranten/FOLIAnottarred/FOLIA$ ls -l /reddata/TICCLAT/UNK/TSVnew/
total 1745532
-rw-r--r-- 1 reynaert reynaert 471015851 Feb 19 15:18 KBkranten1950.wordfreqlist.2-gram.tsv
-rw-r--r-- 1 reynaert reynaert 1255614280 Feb 19 15:18 KBkranten1950.wordfreqlist.3-gram.tsv
-rw-r--r-- 1 reynaert reynaert 60766929 Feb 19 15:18 KBkranten1950.wordfreqlist.tsv

I have merged these three files, successfully, with the new program: TICCL-mergelex. The result is here:

reynaert@red:/reddata/Nederlab/KBkranten/FOLIAnottarred/FOLIA$ ls -l /reddata/TICCLAT/UNK/TSVmerged/
total 1745464
-rw-r--r-- 1 reynaert reynaert 1787341492 Feb 19 15:38 MergedKBkranten1950.wordfreqlist.tsv

@kosloot
Collaborator Author

kosloot commented Feb 19, 2018

Great. I assume this is a confirmation of a successful test of issue #7
Looking forward to hearing what effect this had on acronym detection.

@martinreynaert
Collaborator

I ran this test:

reynaert@red:/reddata/TICCLAT/UNK$ nohup /exp/sloot/usr/local/bin/TICCL-unk -o /reddata/TICCLAT/UNK/TESTacro1950NGRAMS --alph /reddata/TICCLAT/UNK/nld.aspell.dict.lc.chars --acro /reddata/TICCLAT/UNK/TSVmerged/MergedKBkranten1950.wordfreqlist.tsv >/reddata/TICCLAT/UNK/TESTacro1950NGRAMS.20180219.stdout 2>/reddata/TICCLAT/UNK/TESTacro1950NGRAMS.20180219.stderr &

Results are in:
reynaert@red:/reddata/TICCLAT/UNK$ grep -v '\.' /reddata/TICCLAT/UNK/TESTacro1950NGRAMS.acro |more

At first sight the non-punctuated acronyms are more reliable as acronyms. The punctuated ones probably incorporate a lot of abbreviated first names.

I need to study this list more closely, but using ngrams seems to work best.

@martinreynaert
Collaborator

Could not help noticing that the acronym list still contains Roman numerals:

XVII 1
XXLII 1

Am reconsidering what will happen if these are part of compounds and not further part of the frequency list: these are then nevertheless likely to be 'corrected'. We must make sure this does not happen, somehow.
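
One possible way to flag such entries, sketched in Python. This is an assumption about how a filter could look, not a description of TICCL-unk. Note that the two examples above behave differently: `XVII` is a well-formed Roman numeral, whereas `XXLII` is not, so a strict validator alone would miss it and a looser "only Roman digits" test is shown alongside.

```python
import re

# Strict form: standard subtractive notation, values 1..3999.
STRICT_ROMAN = re.compile(
    r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
)
# Loose form: any non-empty run of Roman digit letters.
LOOSE_ROMAN = re.compile(r"[IVXLCDM]+")


def is_strict_roman(word: str) -> bool:
    """True for well-formed Roman numerals such as XVII (empty string excluded)."""
    return bool(word) and STRICT_ROMAN.fullmatch(word) is not None


def looks_roman(word: str) -> bool:
    """True for any word made up solely of the letters I, V, X, L, C, D, M."""
    return LOOSE_ROMAN.fullmatch(word) is not None
```

Either test also fires on strings that may be genuine acronyms: `CD`, for instance, is a valid Roman numeral (400). Whatever filter is chosen therefore needs a policy for such overlaps.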

@martinreynaert
Collaborator

martinreynaert commented Feb 19, 2018

We have about ten times as many punctuated as unpunctuated acronyms. Am thinking that it is probably good to keep these and to also exempt these from 'correction'. However, a regexp in LDcalc might fulfill the same function, probably.

reynaert@red:/reddata/TICCLAT/UNK$ grep -v '\.' /reddata/TICCLAT/UNK/TESTacro1950NGRAMS.acro |wc
1144 2288 7360
reynaert@red:/reddata/TICCLAT/UNK$ grep '\.' /reddata/TICCLAT/UNK/TESTacro1950NGRAMS.acro |wc
9608 19216 86005

Also the list contains single character acronyms. We should most probably keep these in: 'X-benen' should not be corrected into 'O-benen', nor 'X-rays' into 'Z-rays'.

@kosloot
Collaborator Author

kosloot commented Feb 20, 2018

I'm sorry but the flow of thoughts above is not really helpful :{
Please be more clear on what is a discussion and what is a (presumed) bug.
e.g.:

  • The counts are odd: OK, that may be because I count every occurrence of an acronym in the list, which is probably not a good number. I think we should count the real frequency of the acronym.
    If ANWB is found in 'ANWB-auto 23', we should increment by 23, not 1.
  • Roman numerals are another issue (TICCL-unk filter out all Roman Numbers #9), not addressed yet.
  • Punctuated acronyms: V.S. versus V.S: is this a bug or just an observation?
    Both V.S and V.S. are detected as a 'punctuated' acronym, as far as I can see.
  • I think you demand another output format for the acronym list? Please make this a separate issue,
    also remembering that punctuated acronyms DON'T have a hyphenated 'parent'
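
The difference between counting occurrences and counting real frequency, as raised in the first bullet, can be sketched as follows. This Python example is an assumption-laden illustration, not the TICCL-unk code: it looks only at unigram entries, uses a hypothetical compound pattern, and returns the three numbers named in the format line earlier in the thread (unigram frequency, number of hyphenated compounds, summed compound frequency). Apart from the `ANWB-auto 23` case quoted above, the sample frequencies are invented.

```python
import re
from collections import defaultdict

# Assumed pattern: uppercase head, hyphen, then any non-space remainder.
COMPOUND = re.compile(r"([A-Z]+)-\S+")


def acronym_stats(entries):
    """Return {acronym: (unigram_freq, n_compounds, summed_compound_freq)}.

    Only unigram entries are inspected, on the assumption that their
    frequencies already cover occurrences inside longer n-grams.
    """
    entries = list(entries)
    unigram_freq = dict(entries)
    n_compounds = defaultdict(int)
    freq_sum = defaultdict(int)
    for word, freq in entries:
        if " " in word:
            continue  # skip bigrams and trigrams
        match = COMPOUND.fullmatch(word)
        if match:
            acro = match.group(1)
            n_compounds[acro] += 1   # counting occurrences: +1 per compound
            freq_sum[acro] += freq   # counting real frequency: +freq
    return {a: (unigram_freq.get(a, 0), n_compounds[a], freq_sum[a])
            for a in n_compounds}


sample = [("ANWB", 50), ("ANWB-auto", 23), ("ANWB-lid", 7),
          ("de ANWB-auto", 10), ("fiets", 99)]
stats = acronym_stats(sample)
```

For `ANWB` this yields 2 compounds with a summed frequency of 30, which shows how far the two ways of counting can drift apart.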

@kosloot
Collaborator Author

kosloot commented Feb 21, 2018

@kosloot
Collaborator Author

kosloot commented Dec 19, 2018

assume this to be done

@kosloot kosloot closed this as completed Dec 19, 2018