Main two things currently wrong with TICCL-LDcalc and TICCL-rank (and two more gripes...) #29

Open
martinreynaert opened this issue Aug 29, 2018 · 5 comments

@martinreynaert
Collaborator

The following is a long explanation of things going wrong currently. It offers no possible solutions yet. These will follow asap. I am trying to figure out the 'easiest fix'.


A/ We have recently adapted TICCL-rank to the needs of the new TICCL-chain by making it sort its best-first ranked (parameter --clip=1) output file numerically descending on the frequency of the Correction Candidate (CC). This has broken the correct working of TICCL-rank.

B/ We have also quite recently made TICCL-LDcalc output 'short' correction pairs to a new output file *short.ldcalc and the ngrams from which the short correction pairs were derived to a new file with extension 'ambi'. This creates further problems for TICCL-rank, as we shall explain later.

C/ Furthermore, we do not know if the new ranking feature based on the number of observed ngrams in which a particular word form appears is in fact operational in TICCL-LDcalc yet.

D/ We remain handicapped by the fact that we do not have an exhaustive description of the full ranking system as currently implemented in TICCL-LDcalc and TICCL-rank.

Addressing A/ : We have for a while been under the impression that TICCL 'just' misses the most obvious Correction Candidate. We think we now have found the cause for this.

We present output from TICCL-rank run with respectively --clip=1, --clip=5 and --clip=10 on TICCL-LDcalc output on the English book by Morse.

In CLIP5 we see clearly that the CCs are ranked according to their frequency and no longer according to the confidence score. In fact the highest confidence score is with the fifth ranked CC. In CLIP10 we see that the highest confidence score in CLIP5 is outranked by the even higher confidence score of CC 'Niles'.

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^nuiles' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.ranked
nuiles#1#Naples#4000030272#2#0.998194

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^nuiles' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.CLIP5.ranked
nuiles#1#Naples#4000030272#2#0.998194
nuiles#1#Miles#3000031696#2#0.998486
nuiles#1#Giles#2000014531#2#0.998273
nuiles#1#Jules#2000014280#2#0.99853
nuiles#1#suites#2000007207#2#0.9993

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^nuiles' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.CLIP10.ranked
nuiles#1#Naples#4000030272#2#0.998194
nuiles#1#Miles#3000031696#2#0.998486
nuiles#1#Giles#2000014531#2#0.998273
nuiles#1#Jules#2000014280#2#0.99853
nuiles#1#suites#2000007207#2#0.9993
nuiles#1#Dulles#2000005022#2#0.998645
nuiles#1#Niles#2000004196#1#0.999699
nuiles#1#Wiles#2000001183#2#0.998052
nuiles#1#Nines#2000000883#2#0.998734
nuiles#1#Ailes#2000000578#2#0.999088

When we look at the appropriately sorted output of CLIP1000 we see that 'Niles' in fact has the highest confidence score. The now 'best' ranked top 10 CCs have swamped the actual desired correction 'miles'; its capitalized version 'Miles', which was present in CLIP5, is now out of sight, too.

Current TICCL output (incorrectly sorted by CC frequency) for non-word word form 'nuiles':

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^nuiles' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.CLIP1000.ranked | sort -gr -t '#' -k 4 |head -n 10
nuiles#1#Naples#4000030272#2#0.998194
nuiles#1#Miles#3000031696#2#0.998486
nuiles#1#Giles#2000014531#2#0.998273
nuiles#1#Jules#2000014280#2#0.99853
nuiles#1#suites#2000007207#2#0.9993
nuiles#1#Dulles#2000005022#2#0.998645
nuiles#1#Niles#2000004196#1#0.999699
nuiles#1#Wiles#2000001183#2#0.998052
nuiles#1#Nines#2000000883#2#0.998734
nuiles#1#Ailes#2000000578#2#0.999088

Output as should be sorted by highest confidence:

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^nuiles' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.CLIP1000.ranked | sort -gr -t '#' -k 6 |head -n 10
nuiles#1#Niles#2000004196#1#0.999699
nuiles#1#Tules#2000000029#2#0.999486
nuiles#1#nuclei#1000008297#2#0.999478
nuiles#1#rules#1000152878#2#0.99946
nuiles#1#Rules#1000021220#2#0.999433
nuiles#1#suites#2000007207#2#0.9993
nuiles#1#nails#1000009554#2#0.999203
nuiles#1#Suites#1705034559#2#0.999194
nuiles#1#Nilus#1000000335#2#0.999176
nuiles#1#Yules#2000000019#2#0.999097

Anyway, the main thing is that currently even the best-first ranked CC offered with CLIP1 is not the one with the highest confidence score, but the one with the highest frequency, which is plainly wrong. This is an undesired artefact of the resorting implemented for TICCL-chain.
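To make the discrepancy concrete, here is a minimal Python sketch over the ten CLIP10 lines for 'nuiles' shown above. It assumes only what the `sort` commands in this issue already use: fields are separated by `#`, field 4 is the CC frequency and field 6 the confidence score. Fields 2 and 5 are not explained in this thread, so the sketch ignores them. It is an illustration, not TICCL-rank's code.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    variant: str
    cc: str
    cc_freq: int
    confidence: float


def parse_ranked(line: str) -> Candidate:
    """Parse one '#'-separated line of a *.ranked file."""
    fields = line.strip().split("#")
    # Field 4 = CC frequency, field 6 = confidence (1-based, as in `sort -k`).
    return Candidate(fields[0], fields[2], int(fields[3]), float(fields[5]))


# The ten CLIP10 lines for 'nuiles', copied from the output above.
CLIP10 = [
    "nuiles#1#Naples#4000030272#2#0.998194",
    "nuiles#1#Miles#3000031696#2#0.998486",
    "nuiles#1#Giles#2000014531#2#0.998273",
    "nuiles#1#Jules#2000014280#2#0.99853",
    "nuiles#1#suites#2000007207#2#0.9993",
    "nuiles#1#Dulles#2000005022#2#0.998645",
    "nuiles#1#Niles#2000004196#1#0.999699",
    "nuiles#1#Wiles#2000001183#2#0.998052",
    "nuiles#1#Nines#2000000883#2#0.998734",
    "nuiles#1#Ailes#2000000578#2#0.999088",
]

candidates = [parse_ranked(line) for line in CLIP10]

# What --clip=1 currently delivers: the most frequent CC.
top_by_frequency = max(candidates, key=lambda c: c.cc_freq)
# What it should deliver: the CC with the highest confidence.
top_by_confidence = max(candidates, key=lambda c: c.confidence)

print("by frequency: ", top_by_frequency.cc)
print("by confidence:", top_by_confidence.cc)
```

On these ten lines the frequency criterion picks 'Naples' while the confidence criterion picks 'Niles', which is exactly the mismatch described above.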

We see much the same for 'Amarican', though the result is less wrong: here the highest confidence score is at least given to the right correction.

TICCL sorted output:

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^Amarican' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.CLIP5.ranked |more
Amarican#1#America#4000475833#2#0.996842
Amarican#1#American#3001522167#1#0.998421
Amarican#1#Americas#3000025187#2#0.995474
Amarican#1#Américas#3000000831#2#0.991158
Amarican#1#African#2000256933#2#0.993263

Output re-sorted in descending order of confidence:

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^Amarican' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.CLIP5.ranked |sort -gr -t '#' -k 6 |more
Amarican#1#American#3001522167#1#0.998421
Amarican#1#America#4000475833#2#0.996842
Amarican#1#Americas#3000025187#2#0.995474
Amarican#1#African#2000256933#2#0.993263
Amarican#1#Américas#3000000831#2#0.991158

Nevertheless: the 'best-first ranked' candidate without parameter --clip is still the one obtained by highest frequency sorting:

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^Amarican' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.ranked |more
Amarican#1#America#4000475833#2#0.996842

Addressing B/ : In prior runs without the foci file curtailed to the foreground corpus only we found that 'tire' is often a confusable for 'the'. We are rather surprised that that is still the case, although many more pairs representing this pair seem now to have been properly filtered out on the basis of their frequencies, i.e. these being validated word form pairs. We now see that in some cases this still happens, which is in itself another issue to be addressed. (This may be because capitalized word forms did not get the artifrq, at least in some of these cases).

Example:

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep '^tire~the' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.ldcalc.ambi
tire~the#first_tire~First_the#first_tire~first_the#tire_Great_Kanhaway~the_Great_Kanhaway#tire_Great_Kanhaway~the_great_Kanhaway#tire_Guisos_Mexico~the_Guisos_Mexico#tire_Guisos_Mexico~the_guisos_Mexico#tire_Guisos~the_Guisos#tire_Guisos~the_guisos#tire_Milliiippi~the_Milliiippi#tire_life~the_LIFE#tire_life~the_Life#tire_life~the_life#

As stated before, we are not currently attempting to solve confusables. But this example allows us to explain the issue currently at hand.

The short forms have duly been added to the *short.ldcalc file, as we have recently decided to do. The pair 'tire~the' is the first of the last nine of the 52 such 'confusable' pairs in *short.ldcalc:

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep '^tire' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.short.ldcalc |tail -n 9
tire00the00022010012
tire
00tides1000007728100000988102311001
tire00ties0002201005
tire
00tin0002201001
tire00tis0002201001
tire
00toe0002201001
tire00tone0002201001
tire
00wine0002200002
tire00wise000220000~1

[Another new issue, which seems to have popped up in the last week or so (as a consequence of one of the latest adjustments to the work flow), is apparent here: for lots of these pairs the usual information such as frequencies etc. is now missing.]

The issue we are inching towards is this: short word forms may well be 'properly' handled by *short.ldcalc and *ambi, but other pairs based on the actual bigram (mostly, if not exclusively, we suspect) are still incorporated in the regular 'long' *ldcalc file. (We no longer see the actual 'tire_land' and 'tire_bay' examples we had a couple of weeks ago; the first delivered e.g. CCs 'Ireland' and 'fireland' in the long ldcalc file.) But these examples are clear enough (granted: they should not be there by virtue of the frequencies of their composing words alone):

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep '^tire_' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.ldcalc
tire_as4455Tijeras10000001091000000109233189303362511100
tire_as
4455Treas10000000981000000124238036236572511100
tire_as4455treas10000000261000000124238036236572511100
tire_on
266266Ireson10000000921000000092148343068382510100
tire_on266266Tiron10000000841000000084232073370562511100
tire_on
266266Treon10000000411000000041238036236572511100
tire_or6565TREVOR105200001830255126269672511100
tire_or
6565Trevor2000018197200001830255126269672511100
tire_to170187Tirito1000000000100000000010444521431251110~0

A non-word example concerns 'ifle':

We have 596 pairs containing this non-word in short.ldcalc.

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ cat /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.short.ldcalc |grep '^ifle~' |wc
596 596 21098

For the probably correct resolution 'rifle' we have the following evidence:

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep '^ifle~rifle' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.ldcalc.ambi
ifle~rifle#The_ifle_is~the_rifle_is#The_ifle~The_rifle#The_ifle~the_rifle#and_the_ifle~and_the_rifle#ifle_is~rifle_is#ifle_of~rifle_of#ifle_on_the~rifle_on_the#ifle_on~rifle_on#ifle_or~rifle_or#small_ifle~small_rifle#the_ifle_of~the_rifle_of#the_ifle~The_rifle#the_ifle~the_rifle#

'Long' LDcalc nevertheless still retains a number of 'ifle' bigrams.

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep '^ifle_' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.ldcalc
ifle_is11Ifles14238036236572501000
ifle_is
11ifles34238036236572501000
ifle_on11Flemon1000000002100000000228002070812510100
ifle_on
11Fleron111197781063502500100
ifle_on11Flexon444492347457502500100
ifle_on
11Isleton10000000521000000052110889093722511100
ifle_or11Flexor118112892347457502500100
ifle_or
11flexor1010112892347457502500100

The problem with these is that TICCL-rank misses the possibly likeliest resolution, which is in short.ldcalc, and will rank the rest, probably delivering a False Positive.

I am not sure what would be best to do about this. I think for now we should keep both the short.ldcalc and ambi output. And still add the 'short' bigrams to 'long' ldcalc so that TICCL-rank has the data necessary to do its job well.

Given the inordinate amount of possible pairs for 'ifle' in short.ldcalc, I am not sure the very large background corpus containing also ngrams helps rather than obfuscates the situation. It seems that we should boost the evidence of validated ngrams present in the foreground corpus where and how possible.

Yet one more 'new' issue that bothers me is the fact that capitalized word forms seem to have gained prominence in the corrections. This is due to the fact that TICCL-anahash sorts the anagrams collected alphabetically, it seems. If at all possible, these should rather be sorted by frequency.
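A small Python sketch of why alphabetical ordering would favour capitalized forms, and what ordering by frequency would give instead. The four frequencies are the CC frequencies from the 'nuiles' output above. That TICCL-anahash sorts by plain byte order is only my suspicion, so treat the `sorted()` call as a stand-in for whatever it really does.

```python
# CC frequencies copied from the ranked output for 'nuiles' above.
FREQS = {
    "rules": 1000152878,
    "Rules": 1000021220,
    "suites": 2000007207,
    "Suites": 1705034559,
}

# Plain code-point ordering: every uppercase letter sorts before every
# lowercase letter, so capitalized forms always come first.
alphabetical = sorted(FREQS)

# Ordering by descending frequency instead.
by_frequency = sorted(FREQS, key=lambda word: FREQS[word], reverse=True)

print(alphabetical)
print(by_frequency)
```

With alphabetical ordering 'Rules' and 'Suites' precede their lowercase counterparts; with frequency ordering the more frequent lowercase forms come first within each pair.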

Another thing... This run had --low=4. Yet we find the couple 'ifles~riffles', word lengths 5 and 7 respectively, in short.ldcalc.

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep '^ifles~' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.ldcalc.ambi |grep 'ifles~riffles'
ifles~riffles#ifles_of~riffles_of#

How does that happen?
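To hunt for more such cases, a sketch like the following could scan the ambi file. It assumes the ambi layout is `variant~correction#ngram~ngram#...`, as the example line suggests, and it assumes that the surprise here is that both word forms are longer than the `--low` value. Neither assumption is confirmed by the TICCL documentation as far as I know, and `parse_ambi_line` is a hypothetical helper, not part of TICCL.

```python
LOW = 4  # the --low value used in this run


def parse_ambi_line(line: str) -> list[tuple[str, str]]:
    """Split one ambi line into (variant, correction) pairs.

    Assumed layout: items separated by '#', each item 'variant~correction'.
    The first item is the unigram pair, the rest are the supporting ngrams.
    """
    pairs = []
    for item in line.strip().split("#"):
        if not item:
            continue  # the line ends in '#', which yields an empty item
        variant, _, correction = item.partition("~")
        pairs.append((variant, correction))
    return pairs


line = "ifles~riffles#ifles_of~riffles_of#"
pairs = parse_ambi_line(line)

variant, correction = pairs[0]
lengths = (len(variant), len(correction))
# Assumption: a pair is unexpected here when both forms exceed --low.
both_longer_than_low = all(n > LOW for n in lengths)

print(pairs)
print(lengths, both_longer_than_low)
```

Running the same check over every line of the ambi file would show whether 'ifles~riffles' is an isolated case.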

Addressing C/ : I need to know.

Addressing D/ : I need to know, too.

Further to the ranking features: now we have the foreground foci file: we should use this as another, strong ranking feature: if the CC is present: boost.

Following up on mainly A/ and B/: I will post recommendations for remedial work asap.

MRE

@martinreynaert
Collaborator Author

OK. All of the above probably constitutes an intertwined set of problems too complicated to be solved all at once.

There seem to be a few problems that viewed on their own should be quite easily solved. I suggest we solve these first and then proceed from there.

First, the wrong ordering of the CCs by TICCL-rank. Before we implemented the descending sort by frequency of the CCs, all was well. This should only have been implemented for best-first ranked (--clip=1) output lists anyway. This sorting is easily done by hand apart from TICCL-rank, on its output.

So: we should either disable this now or correct it so it is done on best-first ranked lists only, respecting the actual best-first ranking according to the confidence.
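What I mean by the second option, as a Python sketch under stated assumptions: rank each variant's CCs by confidence first, clip, and only then (for --clip=1 only) order the resulting best-first list by CC frequency. The `rank` function is hypothetical and only illustrates the order of operations; the input lines are a subset of the 'nuiles' and 'Amarican' output quoted earlier in this issue.

```python
from collections import defaultdict

# A subset of the ranked lines quoted earlier in this issue.
RANKED_LINES = [
    "nuiles#1#Naples#4000030272#2#0.998194",
    "nuiles#1#Miles#3000031696#2#0.998486",
    "nuiles#1#suites#2000007207#2#0.9993",
    "nuiles#1#Niles#2000004196#1#0.999699",
    "Amarican#1#America#4000475833#2#0.996842",
    "Amarican#1#American#3001522167#1#0.998421",
    "Amarican#1#Americas#3000025187#2#0.995474",
    "Amarican#1#Américas#3000000831#2#0.991158",
    "Amarican#1#African#2000256933#2#0.993263",
]


def parse(line):
    """Return (variant, cc, cc_freq, confidence); fields 2 and 5 are ignored."""
    variant, _, cc, cc_freq, _, confidence = line.split("#")
    return variant, cc, int(cc_freq), float(confidence)


def rank(lines, clip):
    """Hypothetical ranking step: confidence decides, frequency only orders."""
    by_variant = defaultdict(list)
    for line in lines:
        variant, cc, cc_freq, confidence = parse(line)
        by_variant[variant].append((variant, cc, cc_freq, confidence))

    output = []
    for rows in by_variant.values():
        # Step 1: best-first by confidence within each variant, then clip.
        rows.sort(key=lambda row: row[3], reverse=True)
        output.extend(rows[:clip])

    # Step 2: only a best-first (clip=1) list is re-ordered by CC frequency,
    # which no longer changes which CC was chosen for a variant.
    if clip == 1:
        output.sort(key=lambda row: row[2], reverse=True)
    return output


best_first = rank(RANKED_LINES, clip=1)
for row in best_first:
    print("#".join(str(field) for field in row))
```

On this input the best-first list contains 'American' for 'Amarican' and 'Niles' for 'nuiles', listed in descending CC frequency, so the sort wanted by TICCL-chain is kept without overriding the confidence ranking.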

Second, we do need to figure out why and how bigrams such as tire_as, tire_on, being composed of validated words only, still end up in the 'long' ldcalc file. And prevent this from happening.

Third, if the ngram ranking feature is not yet operational, it should be made so in order that we can see what effect it has.

I think these are to be addressed first, if and when you have the time to do so, Ko.

MRE

@martinreynaert
Collaborator Author

I was mistaken before: the correct resolution for 'ifle' (taking into account the long s to f confusion) is: 'isle'. Cf. the contexts:

reynaert@red:/reddata/PILOTS/MORSE/FOLIA/AONG$ grep --color 'ifle of' Morse.archiveorg_nietgetraind.xml.folia.xml
Sound, lies E. of the Great Bahama Bank, between it and the ifle of Guanahani. N. lat. 24, W. long. 75.
Noix, Ijle au, or Nut I/le, a small ifle of 50 acres, near the north end of Lake Champlain, and within the province of Lower Canada. Here the British have a garrison containing 100 men. It is about 5 miles N. N. E. of the mouth of La Cole river, 20 north of lile La Motte, and i2 or 15 southward of St. John’s.

And for the plural:

reynaert@red:/reddata/PILOTS/MORSE/FOLIA/AONG$ grep --color 'ifles of' Morse.archiveorg_nietgetraind.xml.folia.xml
Islas, ifles of the Bay of Honda, on the coast of Honduras, or the Spanish Main.
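For reference, the long s to f confusion can be sketched in Python as a simple candidate generator: replace any non-empty subset of the 'f' characters by 's'. This is only an illustration of the confusion itself; `long_s_variants` is a hypothetical helper, and TICCL does not work this way (it uses anagram hashing).

```python
from itertools import product


def long_s_variants(word: str) -> set[str]:
    """All forms obtained by turning one or more 'f' characters into 's'."""
    positions = [i for i, ch in enumerate(word) if ch == "f"]
    variants = set()
    # Each 'f' is either kept (False) or replaced by 's' (True).
    for choice in product((False, True), repeat=len(positions)):
        if not any(choice):
            continue  # skip the unchanged word
        chars = list(word)
        for pos, replace in zip(positions, choice):
            if replace:
                chars[pos] = "s"
        variants.add("".join(chars))
    return variants


print(long_s_variants("ifle"))
print(long_s_variants("ifles"))
```

For 'ifle' and 'ifles', each with a single 'f', this yields exactly 'isle' and 'isles', the readings the contexts above support.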

@kosloot
Collaborator

kosloot commented Dec 19, 2018

I wonder if this is still an issue, or was solved somewhere along the line. (it may be...)

@kosloot
Collaborator

kosloot commented Nov 18, 2019

I wonder if this is still an issue, or was solved somewhere along the line. (it may be...)

@martinreynaert
Collaborator Author

martinreynaert commented Dec 15, 2021

At least two things seem to have gone wrong here:
1/ At the time I was often handed a 'new' version of one of the tools, for testing purposes. Feedback was certainly given on whether this or that issue or part of an issue was then solved, but this was often done informally and did not necessarily occasion a new release where something would be documented in the logs.
2/ I made the mistake at the time of piling issue on issue in a single one. This makes it next to impossible to ever declare the issue solved.

On the basis of my own logs, I now conclude that A/ in this issue was definitely solved. It must have been: it was very clear what happened and what had made it happen. Output from not too long after this issue was posted also corroborates that this was solved. Note that the filename explicitly mentions a 'new' TICCL-LDcalc and a fix by Ko in TICCL-rank.

`(LMdev) reynaert@violet:MORSE$ ls -l /reddata/PILOTS/MORSE/RUNAMALGAM3NEWLDCALC/zzz/TICCL/RUNAMALGAM3NEWLDCALC.wordfreqlist.1to3.tsv.tsv.clean.ldcalc.RANKED.FIXKORANK.SKIP9-10-11-13.ranked
-rw-rwxr-- 1 reynaert reynaert 545415 Oct 17 2018 /reddata/PILOTS/MORSE/RUNAMALGAM3NEWLDCALC/zzz/TICCL/RUNAMALGAM3NEWLDCALC.wordfreqlist.1to3.tsv.tsv.clean.ldcalc.RANKED.FIXKORANK.SKIP9-10-11-13.ranked

(LMdev) reynaert@violet:MORSE$ grep --color '^nuiles' /reddata/PILOTS/MORSE/RUNAMALGAM3NEWLDCALC/zzz/TICCL/RUNAMALGAM3NEWLDCALC.wordfreqlist.1to3.tsv.tsv.clean.ldcalc.RANKED.FIXKORANK.SKIP9-10-11-13.ranked
nuiles#1#miles#1000231218#2#0.966667

(LMdev) reynaert@violet:MORSE$ grep --color '^Amarican' /reddata/PILOTS/MORSE/RUNAMALGAM3NEWLDCALC/zzz/TICCL/RUNAMALGAM3NEWLDCALC.wordfreqlist.1to3.tsv.tsv.clean.ldcalc.RANKED.FIXKORANK.SKIP9-10-11-13.ranked
Amarican#1#American#1001522167#1#0.991584`
