Main two things currently wrong with TICCL-LDcalc and TICCL-rank (and two more gripes...) #29

martinreynaert · 2018-08-29T19:05:34Z

The following is a long explanation of things going wrong currently. It offers no possible solutions yet. These will follow asap. I am trying to figure out the 'easiest fix'.

A/ We have recently adapted TICCL-rank to the needs of the new TICCL-chain by making it sort its best-first ranked (parameter --clip=1) output file numerically descending on the frequency of the Correction Candidate (CC). This has broken the correct working of TICCL-rank.

B/ We have also quite recently made TICCL-LDcalc output 'short' correction pairs to a new output file *short.ldcalc and the ngrams from which the short correction pairs were derived to a new file with extension 'ambi'. This creates further problems for TICCL-rank, as we shall explain later.

C/ Furthermore, we do not know if the new ranking feature based on the number of observed ngrams in which a particular word form appears is in fact operational in TICCL-LDcalc yet.

D/ We remain handicapped by the fact that we do not have an exhaustive description of the full ranking system as currently implemented in TICCL-LDcalc and TICCL-rank.

Addressing A/ : We have for a while been under the impression that TICCL 'just' misses the most obvious Correction Candidate. We think we now have found the cause for this.

We present output from TICCL-rank run with respectively --clip=1, --clip=5 and --clip=10 on TICCL-LDcalc output on the English book by Morse.

In CLIP5 we see clearly that the CCs are ranked according to their frequency and no longer according to the confidence score. In fact the highest confidence score is with the fifth ranked CC. In CLIP10 we see that the highest confidence score in CLIP5 is outranked by the even higher confidence score of CC 'Niles'.

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^nuiles' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.ranked
nuiles#1#Naples#4000030272#2#0.998194

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^nuiles' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.CLIP5.ranked
nuiles#1#Naples#4000030272#2#0.998194
nuiles#1#Miles#3000031696#2#0.998486
nuiles#1#Giles#2000014531#2#0.998273
nuiles#1#Jules#2000014280#2#0.99853
nuiles#1#suites#2000007207#2#0.9993

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^nuiles' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.CLIP10.ranked
nuiles#1#Naples#4000030272#2#0.998194
nuiles#1#Miles#3000031696#2#0.998486
nuiles#1#Giles#2000014531#2#0.998273
nuiles#1#Jules#2000014280#2#0.99853
nuiles#1#suites#2000007207#2#0.9993
nuiles#1#Dulles#2000005022#2#0.998645
nuiles#1#Niles#2000004196#1#0.999699
nuiles#1#Wiles#2000001183#2#0.998052
nuiles#1#Nines#2000000883#2#0.998734
nuiles#1#Ailes#2000000578#2#0.999088

When we look at the appropriately sorted output of CLIP1000 we see that 'Niles' in fact has the highest confidence score. The top 10 CCs now ranked 'best' have swamped the actual desired correction 'miles'. Its capitalized version 'Miles', which was present in CLIP5, is now out of sight, too.

Current TICCL output (incorrectly sorted by CC frequency) for non-word word form 'nuiles':

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^nuiles' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.CLIP1000.ranked | sort -gr -t '#' -k 4 |head -n 10
nuiles#1#Naples#4000030272#2#0.998194
nuiles#1#Miles#3000031696#2#0.998486
nuiles#1#Giles#2000014531#2#0.998273
nuiles#1#Jules#2000014280#2#0.99853
nuiles#1#suites#2000007207#2#0.9993
nuiles#1#Dulles#2000005022#2#0.998645
nuiles#1#Niles#2000004196#1#0.999699
nuiles#1#Wiles#2000001183#2#0.998052
nuiles#1#Nines#2000000883#2#0.998734
nuiles#1#Ailes#2000000578#2#0.999088

Output as should be sorted by highest confidence:

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^nuiles' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.CLIP1000.ranked | sort -gr -t '#' -k 6 |head -n 10
nuiles#1#Niles#2000004196#1#0.999699
nuiles#1#Tules#2000000029#2#0.999486
nuiles#1#nuclei#1000008297#2#0.999478
nuiles#1#rules#1000152878#2#0.99946
nuiles#1#Rules#1000021220#2#0.999433
nuiles#1#suites#2000007207#2#0.9993
nuiles#1#nails#1000009554#2#0.999203
nuiles#1#Suites#1705034559#2#0.999194
nuiles#1#Nilus#1000000335#2#0.999176
nuiles#1#Yules#2000000019#2#0.999097

Anyway, the main thing is that currently even the best-first ranked CC offered with CLIP1 is not the one with the highest confidence score, but the one with the highest frequency, which is plainly wrong. This is an undesired artefact of the resorting implemented for TICCL-chain.
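To make the difference between the two sort orders concrete, here is a minimal Python sketch that parses the '#'-separated ranked lines shown above and sorts them both ways. The field names are assumptions read off the examples: field 4 is taken to be the CC frequency and field 6 the confidence (as in the `sort -k 4` and `sort -k 6` calls), while naming field 2 the variant frequency and field 5 the LD is a guess. This is an illustration only, not TICCL-rank's actual code.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    variant: str       # the word form to be corrected
    variant_freq: int  # assumed meaning of field 2
    cc: str            # Correction Candidate
    cc_freq: int       # field 4, the key of `sort -k 4`
    ld: int            # assumed meaning of field 5
    confidence: float  # field 6, the key of `sort -k 6`


def parse_ranked_line(line: str) -> Candidate:
    """Split one '#'-separated line of a *.ranked file into typed fields."""
    variant, vfreq, cc, cc_freq, ld, conf = line.strip().split("#")
    return Candidate(variant, int(vfreq), cc, int(cc_freq), int(ld), float(conf))


# The ten CLIP10 lines for 'nuiles', copied from the output above.
LINES = [
    "nuiles#1#Naples#4000030272#2#0.998194",
    "nuiles#1#Miles#3000031696#2#0.998486",
    "nuiles#1#Giles#2000014531#2#0.998273",
    "nuiles#1#Jules#2000014280#2#0.99853",
    "nuiles#1#suites#2000007207#2#0.9993",
    "nuiles#1#Dulles#2000005022#2#0.998645",
    "nuiles#1#Niles#2000004196#1#0.999699",
    "nuiles#1#Wiles#2000001183#2#0.998052",
    "nuiles#1#Nines#2000000883#2#0.998734",
    "nuiles#1#Ailes#2000000578#2#0.999088",
]

candidates = [parse_ranked_line(line) for line in LINES]

# Current (unwanted) behaviour: descending CC frequency.
by_frequency = sorted(candidates, key=lambda c: c.cc_freq, reverse=True)
# Desired behaviour: descending confidence.
by_confidence = sorted(candidates, key=lambda c: c.confidence, reverse=True)

print(by_frequency[0].cc)   # Naples
print(by_confidence[0].cc)  # Niles
```

With --clip=1 only the first element of the list survives, so the choice of sort key alone decides whether 'Naples' or 'Niles' is returned.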

We see much the same for 'Amarican', though the result is less wrong: here the highest confidence score is given to the right correction.

TICCL sorted output:

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^Amarican' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.CLIP5.ranked |more
Amarican#1#America#4000475833#2#0.996842
Amarican#1#American#3001522167#1#0.998421
Amarican#1#Americas#3000025187#2#0.995474
Amarican#1#Américas#3000000831#2#0.991158
Amarican#1#African#2000256933#2#0.993263

Output re-sorted in descending order of confidence:

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^Amarican' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.CLIP5.ranked |sort -gr -t '#' -k 6 |more
Amarican#1#American#3001522167#1#0.998421
Amarican#1#America#4000475833#2#0.996842
Amarican#1#Americas#3000025187#2#0.995474
Amarican#1#African#2000256933#2#0.993263
Amarican#1#Américas#3000000831#2#0.991158

Nevertheless: the 'best-first ranked' candidate without parameter --clip is still the one obtained by highest frequency sorting:

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep --color '^Amarican' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.RANK.ranked |more
Amarican#1#America#4000475833#2#0.996842

Addressing B/ : In prior runs, without the foci file curtailed to the foreground corpus only, we found that 'tire' is often a confusable for 'the'. We are rather surprised that this is still the case, although many more pairs representing this confusion now seem to have been properly filtered out on the basis of their frequencies, i.e. because they are validated word form pairs. In some cases it still happens, which is in itself another issue to be addressed. (This may be because capitalized word forms did not get the artifrq, at least in some of these cases.)

Example:

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep '^tirethe' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.ldcalc.ambi
tirethe#first_tire~~First_the#first_tire~~first_the#tire_Great_Kanhaway~~the_Great_Kanhaway#tire_Great_Kanhaway~~the_great_Kanhaway#tire_Guisos_Mexico~~the_Guisos_Mexico#tire_Guisos_Mexico~~the_guisos_Mexico#tire_Guisos~~the_Guisos#tire_Guisos~~the_guisos#tire_Milliiippi~~the_Milliiippi#tire_life~~the_LIFE#tire_life~~the_Life#tire_life~~the_life#

As stated before, we are not currently attempting to solve confusables. But this example allows us to explain the issue currently at hand.

The short forms have duly been added to the *short.ldcalc file, as we have recently decided to do. Below, the pair 'tire'/'the' is the first of the last nine of the 52 such 'confusable' pairs in *short.ldcalc.

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep '^tire' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.short.ldcalc |tail -n 9
tire00~~the~~00022010012
tire00tides~~1000007728~~100000988102311001
tire00~~ties~~0002201005
tire00tin0002201001
tire00~~tis~~0002201001
tire00toe0002201001
tire00~~tone~~0002201001
tire00wine0002200002
tire00~~wise~~000220000~1

[Another new issue, which seems to have popped up in the last week or so (as a consequence of one of the latest adjustments to the workflow), is apparent here: for lots of these pairs the usual information such as frequencies etc. is now missing.]

The issue we are inching towards is this: short word forms may well be 'properly' handled by *short.ldcalc and *ambi, but other pairs based on the actual bigram (mostly, if not exclusively, we suspect) are still incorporated in the regular 'long' *ldcalc file. (We no longer see the actual 'tire_land' and 'tire_bay' examples we had a couple of weeks ago. The first delivered e.g. CCs 'Ireland' and 'fireland' in the long ldcalc file.) But these examples are clear enough (granted: they should not be there by virtue of the frequencies of their composing words alone):

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep '^tire_' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.ldcalc
tire_as4455~~Tijeras~~1000000109~~1000000109~~233189303362511100
tire_as4455Treas~~1000000098~~1000000124~~23803623657~~2511100
tire_as4455~~treas~~1000000026~~1000000124~~238036236572511100
tire_on266~~266~~Ireson~~1000000092~~1000000092~~14834306838~~2510100
tire_on~~266~~266~~Tiron~~1000000084~~1000000084~~232073370562511100
tire_on266~~266~~Treon~~1000000041~~1000000041~~23803623657~~2511100
tire_or6565~~TREVOR~~105~~2000018302~~55126269672511100
tire_or6565Trevor~~2000018197~~2000018302~~5512626967~~2511100
tire_to~~170~~187~~Tirito~~1000000000~~1000000000~~10444521431251110~0

A non-word example concerns 'ifle':

We have 596 pairs containing this non-word in short.ldcalc.

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ cat /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.short.ldcalc |grep '^ifle~' |wc
596 596 21098

For the probably correct resolution 'rifle' we have the following evidence:

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep '^iflerifle' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.ldcalc.ambi
iflerifle#The_ifle_is~~the_rifle_is#The_ifle~~The_rifle#The_ifle~~the_rifle#and_the_ifle~~and_the_rifle#ifle_is~~rifle_is#ifle_of~~rifle_of#ifle_on_the~~rifle_on_the#ifle_on~~rifle_on#ifle_or~~rifle_or#small_ifle~~small_rifle#the_ifle_of~~the_rifle_of#the_ifle~~The_rifle#the_ifle~the_rifle#

'Long' LDcalc nevertheless still retains a number of 'ifle' bigrams.

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep '^ifle_' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.ldcalc
ifle_is11~~Ifles~~14238036236572501000
ifle_is11ifles34~~23803623657~~2501000
ifle_on11~~Flemon~~1000000002~~1000000002~~28002070812510100
ifle_on11Fleron1111~~9778106350~~2500100
ifle_on11~~Flexon~~444492347457502500100
ifle_on11Isleton~~1000000052~~1000000052~~11088909372~~2511100
ifle_or11~~Flexor~~118~~1128~~92347457502500100
ifle_or11flexor~~1010~~1128~~9234745750~~2500100

The problem with these is that TICCL-rank misses what is possibly the likeliest resolution, which is in short.ldcalc, and will rank the rest, probably delivering a False Positive.

I am not sure what would be best to do about this. I think for now we should keep both the short.ldcalc and ambi output. And still add the 'short' bigrams to 'long' ldcalc so that TICCL-rank has the data necessary to do its job well.

Given the inordinate amount of possible pairs for 'ifle' in short.ldcalc, I am not sure the very large background corpus containing also ngrams helps rather than obfuscates the situation. It seems that we should boost the evidence of validated ngrams present in the foreground corpus where and how possible.

Yet one more 'new' issue that bothers me is the fact that capitalized word forms seem to have gained prominence in the corrections. This is due to the fact that TICCL-anahash sorts the anagrams collected alphabetically, it seems. If at all possible, these should rather be sorted by frequency.

Another thing... This run had --low=4. Yet we find the couple 'ifles~riffles', word lengths 5 and 7 respectively, in short.ldcalc.

reynaert@red:/reddata/PILOTS/MORSE/TICCL$ grep '^ifles~' /reddata/PILOTS/MORSE/RUNAUTO5FORE/zzz/TICCL/RUNAUTO5FORE.wordfreqlist.1to3.tsv.clean.anahash.INDEXERFORE.LDCALC.ldcalc.ambi |grep 'iflesriffles'
iflesriffles#ifles_of~riffles_of#

How does that happen?

Addressing C/ : I need to know.

Addressing D/ : I need to know, too.

Further to the ranking features: now that we have the foreground foci file, we should use it as another, strong ranking feature: if the CC is present in it, boost.
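As a rough illustration of what such a boost could look like, the Python sketch below adds a fixed bonus to the confidence of any CC that occurs in the foci file. Everything about the mechanism is an assumption: the additive form, the value of `FOCI_BOOST` and the content of `foci_words` are hypothetical, and only the (CC, confidence) values are taken from the 'nuiles' output in this issue.

```python
# Hypothetical bonus added to the confidence of a CC found in the foci file.
FOCI_BOOST = 0.01


def boosted_confidence(cc: str, confidence: float, foci_words: set[str]) -> float:
    """Return the confidence, raised by FOCI_BOOST if the CC is a foci word."""
    return confidence + FOCI_BOOST if cc in foci_words else confidence


# (CC, confidence) values copied from the 'nuiles' output in this issue.
candidates = [("Niles", 0.999699), ("Miles", 0.998486), ("Naples", 0.998194)]

# Assumed content of the foreground foci file, for illustration only.
foci_words = {"Miles"}

ranked = sorted(
    candidates,
    key=lambda c: boosted_confidence(c[0], c[1], foci_words),
    reverse=True,
)
print(ranked[0][0])  # Miles
```

Whether the boost should be additive, multiplicative or a separate ranking feature alongside the others is exactly the kind of design choice that needs the full description of the ranking system asked for under D/.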

Following up on mainly A/ and B/: I will post recommendations for remedial work asap.

MRE

martinreynaert · 2018-08-29T22:19:53Z

OK. All of the above probably constitutes an intertwined set of problems too complicated to be solved all at once.

There seem to be a few problems that viewed on their own should be quite easily solved. I suggest we solve these first and then proceed from there.

First, the wrong ordering of the CCs by TICCL-rank. Before we implemented the descending sort by frequency of the CCs, all was well. This sort should only have been implemented for best-first ranked (--clip=1) output lists anyway. It is easily done by hand, outside TICCL-rank, on its output.

So: we should either disable this now or correct it so it is done on best-first ranked lists only, respecting the actual best-first ranking according to the confidence.
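The second option can be sketched in Python as a two-step procedure: first pick, per variant, the CC with the highest confidence, and only then sort the resulting one-line-per-variant list by descending CC frequency. The field layout is assumed from the examples in this issue (field 4 = CC frequency, field 6 = confidence), and breaking confidence ties by frequency is my own assumption, not documented behaviour.

```python
def best_first(lines):
    """Keep the most confident CC per variant, then sort by CC frequency."""
    best = {}
    for line in lines:
        variant, _vfreq, cc, cc_freq, _ld, conf = line.strip().split("#")
        # Tuples compare element-wise: confidence first, then frequency.
        entry = (float(conf), int(cc_freq), cc)
        if variant not in best or entry > best[variant]:
            best[variant] = entry
    rows = [(variant, cc, cc_freq, conf)
            for variant, (conf, cc_freq, cc) in best.items()]
    # The frequency sort is applied only to the already selected best CCs.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Lines copied from the 'nuiles' and 'Amarican' outputs in this issue.
SAMPLE = [
    "nuiles#1#Naples#4000030272#2#0.998194",
    "nuiles#1#Miles#3000031696#2#0.998486",
    "nuiles#1#Niles#2000004196#1#0.999699",
    "Amarican#1#America#4000475833#2#0.996842",
    "Amarican#1#American#3001522167#1#0.998421",
    "Amarican#1#African#2000256933#2#0.993263",
]

for variant, cc, cc_freq, conf in best_first(SAMPLE):
    print(f"{variant}#{cc}#{cc_freq}#{conf}")
```

On this sample the list contains 'American' for 'Amarican' and 'Niles' for 'nuiles', in that order: the frequency sort only reorders lines, it no longer decides which CC is chosen.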

Second, we do need to figure out why and how bigrams such as tire_as, tire_on, being composed of validated words only, still end up in the 'long' ldcalc file. And prevent this from happening.
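The check we would expect for this second point can be sketched in Python as follows: an ngram is dropped as correction material when every one of its parts is a validated word. Treating "validated" as "frequency at or above the artifrq" is an assumption, the `ARTIFRQ` value is hypothetical, and the frequencies in `freqs` are made up for illustration. This is not how TICCL-LDcalc is known to implement it.

```python
# Assumed artifrq; the real value is a run parameter and is not stated here.
ARTIFRQ = 1_000_000_000


def is_validated(word, freqs, artifrq=ARTIFRQ):
    """A word counts as validated if its frequency includes the artifrq."""
    return freqs.get(word, 0) >= artifrq


def keep_ngram(ngram, freqs, artifrq=ARTIFRQ):
    """Keep an ngram only if at least one of its parts is not validated."""
    return not all(is_validated(part, freqs, artifrq)
                   for part in ngram.split("_"))


# Made-up frequencies, for illustration only.
freqs = {
    "tire": 1_000_000_500,
    "as": 1_000_900_000,
    "on": 1_000_800_000,
    "is": 1_000_700_000,
    "ifle": 11,
}

ngrams = ["tire_as", "tire_on", "ifle_is", "ifle_on"]
kept = [ngram for ngram in ngrams if keep_ngram(ngram, freqs)]
print(kept)  # ['ifle_is', 'ifle_on']
```

If capitalized word forms do not get the artifrq, a lookup like this would wrongly keep e.g. 'Tire_as', which may be one way such bigrams slip through.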

Third, if the ngram ranking feature is not yet operational, it should be made so in order that we can see what effect it has.

I think these are to be addressed first, if and when you have the time to do so, Ko.

MRE

martinreynaert · 2018-08-30T00:37:23Z

I was mistaken before: the correct resolution for 'ifle' (taking into account the long s to f confusion) is: 'isle'. Cf. the contexts:

reynaert@red:/reddata/PILOTS/MORSE/FOLIA/AONG$ grep --color 'ifle of' Morse.archiveorg_nietgetraind.xml.folia.xml
Sound, lies E. of the Great Bahama Bank, between it and the ifle of Guanahani. N. lat. 24, W. long. 75.
Noix, Ijle au, or Nut I/le, a small ifle of 50 acres, near the north end of Lake Champlain, and within the province of Lower Canada. Here the British have a garrison containing 100 men. It is about 5 miles N. N. E. of the mouth of La Cole river, 20 north of lile La Motte, and i2 or 15 southward of St. John’s.

And for the plural:

reynaert@red:/reddata/PILOTS/MORSE/FOLIA/AONG$ grep --color 'ifles of' Morse.archiveorg_nietgetraind.xml.folia.xml
Islas, ifles of the Bay of Honda, on the coast of Honduras, or the Spanish Main.
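The long s to f confusion can be illustrated with a small Python sketch that generates every variant of a word in which one or more 'f' characters are read as 's', and keeps those found in a lexicon. The function name and the tiny lexicon are hypothetical, and this is not the way TICCL itself retrieves candidates (it works via anagram hashing), only a way to show why 'isle' is a plausible CC for 'ifle'.

```python
from itertools import product


def long_s_variants(word: str) -> set[str]:
    """All forms of `word` with one or more 'f' replaced by 's'."""
    options = [("f", "s") if ch == "f" else (ch,) for ch in word]
    return {"".join(combo) for combo in product(*options)} - {word}


# Hypothetical mini-lexicon, for illustration only.
lexicon = {"isle", "isles", "rifle", "riffles"}

for variant in ("ifle", "ifles"):
    hits = sorted(long_s_variants(variant) & lexicon)
    print(variant, "->", hits)
# ifle -> ['isle']
# ifles -> ['isles']
```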

kosloot · 2018-12-19T11:02:32Z

I wonder if this is still an issue, or was solved somewhere along the line. (it may be...)

kosloot · 2019-11-18T16:03:05Z

I wonder if this is still an issue, or was solved somewhere along the line. (it may be...)

martinreynaert · 2021-12-15T13:22:06Z

At least two things seem to have gone wrong here:
1/ At the time I was often handed a 'new' version of one of the tools, for testing purposes. Feedback was certainly given on whether this or that issue or part of an issue was then solved, but this was often done informally and did not necessarily occasion a new release where something would be documented in the logs.
2/ I made the mistake at the time of piling issue on issue in a single one. This makes it next to impossible to ever declare the issue solved.

On the basis of my own logs, I now conclude that A/ in this issue was definitely solved. It must have been: it was very clear what happened and what had made it happen. Also, output from not too long after this issue was posted corroborates that this was solved. Note that the filename explicitly mentions a 'new' TICCL-LDcalc and a fix by Ko in TICCL-rank.

(LMdev) reynaert@violet:MORSE$ ls -l /reddata/PILOTS/MORSE/RUNAMALGAM3NEWLDCALC/zzz/TICCL/RUNAMALGAM3NEWLDCALC.wordfreqlist.1to3.tsv.tsv.clean.ldcalc.RANKED.FIXKORANK.SKIP9-10-11-13.ranked
-rw-rwxr-- 1 reynaert reynaert 545415 Oct 17 2018 /reddata/PILOTS/MORSE/RUNAMALGAM3NEWLDCALC/zzz/TICCL/RUNAMALGAM3NEWLDCALC.wordfreqlist.1to3.tsv.tsv.clean.ldcalc.RANKED.FIXKORANK.SKIP9-10-11-13.ranked

(LMdev) reynaert@violet:MORSE$ grep --color '^nuiles' /reddata/PILOTS/MORSE/RUNAMALGAM3NEWLDCALC/zzz/TICCL/RUNAMALGAM3NEWLDCALC.wordfreqlist.1to3.tsv.tsv.clean.ldcalc.RANKED.FIXKORANK.SKIP9-10-11-13.ranked
nuiles#1#miles#1000231218#2#0.966667

(LMdev) reynaert@violet:MORSE$ grep --color '^Amarican' /reddata/PILOTS/MORSE/RUNAMALGAM3NEWLDCALC/zzz/TICCL/RUNAMALGAM3NEWLDCALC.wordfreqlist.1to3.tsv.tsv.clean.ldcalc.RANKED.FIXKORANK.SKIP9-10-11-13.ranked
Amarican#1#American#1001522167#1#0.991584

martinreynaert assigned martinreynaert and kosloot Aug 29, 2018
