
TICCL-LDcalc: request to remove existing filter on underscore/hyphen bigram corrections #44

Open
martinreynaert opened this issue Jan 5, 2021 · 9 comments

@martinreynaert (Collaborator) commented Jan 5, 2021

  • In short:

It seems to me we have misguidedly imposed a restriction on TICCL-LDcalc so that it does not return higher-order ngram pairs in which the variant and the Correction Candidate (CC) differ only in a single (?) underscore (= space) or hyphen. I suppose I at some point expected this restriction to lighten TICCL's overall workload. The result is that the later modules cannot converge on the best-fitting resolution of the split word, due to the contradiction between the unigram solution and those offered by the bi- or possibly trigrams. Ultimately, FoLiA-correct fails to find the right bi- and trigrams to correct.

Example LD-calc output:

is_hon_derd~1~1~ir_honderd~1~1~13664231956~2~9~0~1~1~0~0
is_hon_derd~1~1~isa_Honderd~98765433~98765433~2984709275~2~9~1~1~1~0~0
is_hon_derd~1~1~ishonderd~1~1~22081616064~2~9~0~1~1~0~0

We do not get the CC: 'is_honderd'.

This means the bi/trigram correction step never receives the most plausible resolution for split words, while still receiving hundreds of less plausible Correction Candidates (CCs). The consequence is suboptimal ranking of the CCs and chaos further on in the pipeline, especially in TICCL-chainclean, which on the current very large test on about 2.3 million pages of HTRed text now fails to make progress even after days.

We observe the same to be true for hyphens in ngram corrections. See the section 'Hyphens:' below.

This restriction is possibly implemented as simply as: for the underscore or hyphen confusion values, do not return word pairs where the CC would be a bi- or trigram, i.e. only unigrams are allowed as CC. (This will probably not fully cover it...)

However implemented, I would now like to see the restriction removed.
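My guess at the filter's effect, as a minimal Python sketch. This is emphatically not TICCL's actual C++ code; `differs_only_in_separator` and `keep_pair` are made-up names for illustration only:

```python
def differs_only_in_separator(variant: str, cc: str) -> bool:
    """True when variant and CC are identical once '_' and '-' are removed."""
    strip = str.maketrans("", "", "_-")
    return variant.translate(strip) == cc.translate(strip)

def keep_pair(variant: str, cc: str, restricted: bool = True) -> bool:
    """With the suspected restriction on, separator-only pairs survive
    only when the CC is a unigram (contains no '_')."""
    if restricted and differs_only_in_separator(variant, cc):
        return "_" not in cc
    return True

# The pair missing from the LDcalc output above: a bigram CC.
print(keep_pair("is_hon_derd", "is_honderd"))                    # False
print(keep_pair("is_hon_derd", "is_honderd", restricted=False))  # True
# The unigram case still passes, matching what TICCL-rank returns:
print(keep_pair("Hon_derd", "Honderd"))                          # True
```

Note the same logic would also suppress the hyphen case discussed below ('ge-arresteerd_en' vs. 'gearresteerd_en'), since the CC there is a bigram too.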

  • The story, more in full, for both underscores and hyphens:
  • Underscores:

TICCL-rank currently correctly returns e.g. the unigram pair:

Hon_derd~1~23~Honderd~110023864~121380612~11040808032~1~7~1~1~1~0~76

'Grep' on the ranked list:

reynaert@violet:/reddata/NATAR/RANK$ cat NA.Ysberg.MergeLexWithT-Lex.wordfreqlist.1to3.UNKMERGE.ANAHASH.INDEXER.CONCATALL.ANCHORED.LDCALC.AlfSort.ldcalc.RANK.ranked |grep 'Hon_derd'

Hon_derd#1#honderd#110122179#11040808032#1#0.997116
Te_Hon_derd#1#ten_honderd#98766330#1125720992#2#0.998336
Hon_derd_halve#1#honderd_halven#98765438#1125720992#2#0.917197

The bigram, i.e. the split unigram, is correctly resolved. We also get two trigrams containing the bigram.

The CC for the first trigram 'Te_Hon_derd' is 'nice' in that it prefers what we now regard as the archaic Dutch form with 'ten'. However, the more plausible form for these diachronic texts would have 'te', which has higher corpus frequencies (you need to subtract the artifrq '98765432' to get at the actual corpus frequencies):

reynaert@violet:/reddata/NATAR/RANK$ grep -i '^Te_Honderd' /reddata/NATAR/UNKMERGE/NA.Ysberg.MergeLexWithT-Lex.wordfreqlist.1to3.UNKMERGE.clean |head -n 5

te_honderd	98767463
te_Honderd	98765777
Te_honderd	98765740
te_honderd_en	98765505
Te_Honderd	98765497

reynaert@violet:/reddata/NATAR/RANK$ grep -i '^Ten_Honderd' /reddata/NATAR/UNKMERGE/NA.Ysberg.MergeLexWithT-Lex.wordfreqlist.1to3.UNKMERGE.clean |head -n 5

ten_honderd	98766330
ten_Honderd	98765600
Ten_honderd	98765523
Ten_Honderd	98765485
ten_honderd_en	98765472
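The artifrq subtraction mentioned above can be made concrete with a small sketch (frequency values copied from the grep output, artifrq as stated):

```python
# The listed frequencies include the artificial lexicon frequency
# ('artifrq' = 98765432) added to validated lexicon entries; subtracting
# it recovers the actual corpus counts.

ARTIFRQ = 98765432

listed = {                       # values copied from the grep output above
    "te_honderd": 98767463,
    "Te_Honderd": 98765497,
    "ten_honderd": 98766330,
    "ten_honderd_en": 98765472,
}

actual = {ngram: freq - ARTIFRQ for ngram, freq in listed.items()}
for ngram, freq in actual.items():
    print(f"{ngram}\t{freq}")
# te_honderd      2031
# Te_Honderd      65
# ten_honderd     898
# ten_honderd_en  40
```

So 'te_honderd' (2031 occurrences) is indeed attested more than twice as often as 'ten_honderd' (898).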

For the second trigram 'Hon_derd_halve' we see that the bigram CC containing just 'halve' is not returned by TICCL-LDcalc:

(LMdev) reynaert@violet:RANK$ grep '^Hon_derd_halve' NA.Ysberg.MergeLexWithT-Lex.wordfreqlist.1to3.UNKMERGE.ANAHASH.INDEXER.CONCATALL.ANCHORED.LDCALC.AlfSort.ldcalc

Hon_derd_halve~1~1~Honderd_haive~1~1~13198108815~2~12~0~1~1~0~0
Hon_derd_halve~1~1~Honderd_halven~98765433~98765439~1125720992~2~12~1~1~0~0~0
Hon_derd_halve~1~1~honderd_halven~98765438~98765439~1125720992~2~12~1~1~0~0~0
Hon_derd_halve~1~1~honderd_halver~98765433~98765433~1722007593~2~12~1~1~0~0~0
Hon_derd_halve~1~1~honderd_halv~1~1~22633548775~2~12~0~1~0~0~0
Hon_derd_halve~1~1~honderd_hatve~1~1~14509133807~2~12~0~1~1~0~0
Hon_derd_halve~1~1~honderd_helve~98765433~98765433~13473584596~2~12~1~1~1~0~0
Hon_derd_halve~1~1~honderd_zalve~98765433~98765433~3400807475~2~12~1~1~1~0~0

After TICCL-rank this results in:

(LMdev) reynaert@violet:RANK$ grep '11040808032' NA.Ysberg.MergeLexWithT-Lex.wordfreqlist.1to3.UNKMERGE.ANAHASH.INDEXER.CONCATALL.ANCHORED.LDCALC.AlfSort.ldcalc.RANK.ranked |grep '#honderd#'

hon_derd#22#honderd#110122179#11040808032#1#0.997135
Hon_derd#1#honderd#110122179#11040808032#1#0.997116

But on the higher ngram level, and allowing for more character confusion than only an extra space (represented here as an underscore):

(LMdev) reynaert@violet:RANK$ grep '_' NA.Ysberg.MergeLexWithT-Lex.wordfreqlist.1to3.UNKMERGE.ANAHASH.INDEXER.CONCATALL.ANCHORED.LDCALC.AlfSort.ldcalc.RANK.ranked |grep '^Hon_derd'

Hon_derd#1#honderd#110122179#11040808032#1#0.997116
Hon_derd_halve#1#honderd_halven#98765438#1125720992#2#0.917197

This results in chaos down the line: TICCL-chain and especially TICCL-chainclean fail to further resolve these contradictory results.
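The contradiction can be made concrete with a small sketch (field meanings are inferred from the '#'-separated records above: variant, frequency, CC, CC frequency, ...):

```python
ranked = [  # records quoted from the *ranked output above
    "Hon_derd#1#honderd#110122179#11040808032#1#0.997116",
    "Hon_derd_halve#1#honderd_halven#98765438#1125720992#2#0.917197",
]

resolutions = {}
for line in ranked:
    variant, _freq, cc, *_ = line.split("#")
    resolutions[variant] = cc

solved = resolutions["Hon_derd"]   # the unigram-level fix: 'honderd'

# If the longer ngrams honoured that fix, we would expect each longer
# variant's CC to simply be the variant with the split repaired:
for variant, cc in resolutions.items():
    if variant != "Hon_derd" and "Hon_derd" in variant:
        expected = variant.replace("Hon_derd", solved)
        print(f"{variant}: expected {expected}, got {cc}")
# Hon_derd_halve: expected honderd_halve, got honderd_halven
```

The expected CC 'honderd_halve' is never offered, because the filter kept it out of the ldcalc list in the first place.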

  • Hyphens:

We see the same happening with hyphens.

Our current corpus frequency list has the following bigrams:

reynaert@violet:/reddata/NATAR/RANK$ grep -i 'ge-arresteerd_en' /reddata/NATAR/UNKMERGE/NA.Ysberg.MergeLexWithT-Lex.wordfreqlist.1to3.UNKMERGE.clean           

ge-arresteerd_En	2
ge-arresteerd_en	2

versus:

reynaert@violet:/reddata/NATAR/RANK$ grep -i 'gearresteerd_en' /reddata/NATAR/UNKMERGE/NA.Ysberg.MergeLexWithT-Lex.wordfreqlist.1to3.UNKMERGE.clean

gearresteerd_en	98765472
gearresteerd_En	98765449
Gearresteerd_En	98765436

Here too, TICCL-LDcalc does not return the most plausible CC:

reynaert@violet:/reddata/NATAR/RANK$ grep 'ge-arresteerd_en' NA.Ysberg.MergeLexWithT-Lex.wordfreqlist.1to3.UNKMERGE.ANAHASH.INDEXER.CONCATALL.ANCHORED.LDCALC.AlfSort.ldcalc

ge-arresteerd_en~2~4~Gearresteerden~98765447~110000088~46763859681~2~14~1~1~1~0~2
ge-arresteerd_en~2~4~ge-arresteerde~5~5~23207337056~2~14~0~1~0~0~2
ge-arresteerd_en~2~4~gearresteerden~110000073~110000088~46763859681~2~14~1~1~1~0~2
ge-arresteerdens~1~1~ge-arresteerd_en~2~4~4345431517~2~14~0~1~0~0~0

We hope this can be remedied shortly!
Thanks!
MRE

@kosloot (Collaborator) commented Jan 5, 2021

You got me confused: the issue mentions TICCL-rank, but the text seems to suggest the problem is already in TICCL-LDcalc??

Anyway: to analyse this, I need a MINIMAL working example of the input files for LDcalc:

so a SMALL index file, hash file and clean file, preferably with just about 10 words or so demonstrating the problem.
Could you please provide me with those?

@martinreynaert martinreynaert changed the title TICCL-rank: request to remove existing filter on underscore/hyphen bigram corrections TICCL-LDcalc: request to remove existing filter on underscore/hyphen bigram corrections Jan 5, 2021
@martinreynaert (Collaborator Author)

OK. I attach a tar.gz containing the *clean, *anahash, *corpusfoci and *ldcalc files, plus a file TICCL.commandlinesTESTSAMPLE.20210105.txt which contains the command lines used.

TICCL.TestSample.LDcalcRestrictionUnderscoreHyphen.20210105.tar.gz

Note I did not use the *corpusfoci file here. It is meant to reduce the workload; without it, TICCL-indexer works exhaustively, gathering all the possible character confusion word pairs present. But seeing there is so little here, all these modules run in just seconds.
These files should amply illustrate the problem; the *ldcalc file in fact unavoidably gives some more examples of the same filtering as I listed above.
Thanks! Looking forward to the result!

@martinreynaert (Collaborator Author)

I have now also run TICCL-rank on this.

Command line: reynaert@violet:/reddata/NATAR/TESTSAMPLE$ /exp/sloot/usr/local/bin/TICCL-rank -t 1 --alph /reddata/POLMASH/TRI/ALPH/nld.aspell.dict.clip20.lc.chars --charconf /reddata/POLMASH/TRI/ALPH/nld.aspell.dict.clip20.ld2.charconfus -o /reddata/NATAR/TESTSAMPLE/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.RANK --debugfile /reddata/NATAR/TESTSAMPLE/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.RANKDEBUG --artifrq 0 --clip 1 --skipcols=1,10,11,13 /reddata/NATAR/TESTSAMPLE/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.ldcalc >/reddata/NATAR/TESTSAMPLE/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.RANK.20210105.stdout 2>/reddata/NATAR/TESTSAMPLE/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.RANK.20210105.stderr

Output:

reynaert@violet:/reddata/NATAR/TESTSAMPLE$ cat NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.RANK.ranked
Hon_derd#1#honderd#110122179#11040808032#1#0.697674
hon_derd#22#honderd#110122179#11040808032#1#0.697674
Ge-arresteerd#1#gearresteerd#110002708#35723051649#1#0.860759
ge-arresteerd#173#gearresteerd#110002708#35723051649#1#0.738095
ge-arresteerde#5#gearresteerde#110000650#35723051649#1#0.932961
hon_derd_twaalf#1#honderdtwaalf#108765437#22081616064#2#1
Aan_hon_derd#2#van_honderd#98768224#871099262#2#0.52
der_hon_derd#1#de_honderd#98767370#23803623657#2#0.918367
en_hon_derd#1#Een_honderd#98766569#551932711#2#0.945833
even_hon_derd#1#Een_honderd#98766569#36978232633#2#0.694444
Te_Hon_derd#1#ten_honderd#98766330#1125720992#2#0.938776
is_hon_derd#1#in_honderd#98766169#14260518557#2#1
Van_hon_derd#1#Aan_honderd#98766151#22952715326#2#0.52
van_hon_derd#1#Aan_honderd#98766151#22952715326#2#0.52
Het_hon_derd#2#met_honderd#98766097#11993905243#2#1
voor_hon_derd#1#door_honderd#98765638#19354815801#2#1

It should be obvious that 'voor' to 'door' and 'van' to 'Aan' confusions are counterproductive.

MRE

@martinreynaert (Collaborator Author)

This is not immediately pertinent to the actual issue involved here, but it kind of illustrates the consequences of what goes wrong due to the current filtering in TICCL-LDcalc.

I have now also run TICCL-chainclean (also with -v and -v -v), which was interesting, although I do not really understand what happens.

Command line:
reynaert@violet:/reddata/NATAR/TESTSAMPLE$ /exp/sloot/usr/local/bin/TICCL-chainclean -v --lexicon /reddata/NATAR/UNKMERGE/NA.Ysberg.MergeLexWithT-Lex.wordfreqlist.1to3.UNKMERGE.clean --artifrq 100000000 --low=6 -o /reddata/NATAR/TESTSAMPLE/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.RANK.CHAINCLEAN /reddata/NATAR/TESTSAMPLE/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.RANK.CHAIN.chained >/reddata/NATAR/TESTSAMPLE/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.RANK.CHAINCLEAN.20210105.stdout 2>/reddata/NATAR/TESTSAMPLE/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.RANK.CHAINCLEAN.20210105.stderr

The result is that it retains only 5 of the 16 lines in *chained. (Actually, TICCL-chain could not 'chain' any of the 16 lines in *ranked.) The other 11 lines are written to a file *deleted.

Output:
reynaert@violet:/reddata/NATAR/TESTSAMPLE$ cat /reddata/NATAR/TESTSAMPLE/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.RANK.CHAINCLEAN
Hon_derd#1#honderd#110122179#11040808032#1#C
Ge-arresteerd#1#gearresteerd#110002708#35723051649#1#C
ge-arresteerd#173#gearresteerd#110002708#35723051649#1#C
ge-arresteerde#5#gearresteerde#110000650#35723051649#1#C
hon_derd_twaalf#1#honderdtwaalf#108765437#22081616064#2#C
reynaert@violet:/reddata/NATAR/TESTSAMPLE$
reynaert@violet:/reddata/NATAR/TESTSAMPLE$ cat /reddata/NATAR/TESTSAMPLE/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.RANK.CHAINCLEAN.deleted
hon_derd#22#honderd#110122179#11040808032#1#D
Aan_hon_derd#2#van_honderd#98768224#871099262#2#D
der_hon_derd#1#de_honderd#98767370#23803623657#2#D
en_hon_derd#1#Een_honderd#98766569#551932711#2#D
even_hon_derd#1#Een_honderd#98766569#36978232633#2#D
Te_Hon_derd#1#ten_honderd#98766330#1125720992#2#D
is_hon_derd#1#in_honderd#98766169#14260518557#2#D
Van_hon_derd#1#Aan_honderd#98766151#22952715326#2#D
van_hon_derd#1#Aan_honderd#98766151#22952715326#2#D
Het_hon_derd#2#met_honderd#98766097#11993905243#2#D
voor_hon_derd#1#door_honderd#98765638#19354815801#2#D

I am still trying to figure out what it actually tries to do on the basis of the *stderr. I definitely do not agree that the pair
hon_derd#22#honderd#110122179#11040808032#1#D
should be deleted.

I attach the *stderr file for the sake of completeness. I added the extension *txt to be able to actually upload it here...
NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.RANK.CHAINCLEAN.20210105.vv.stderr.txt

@martinreynaert (Collaborator Author) commented Jan 7, 2021

So, in hopes of seeing a causal relation between unigram and bi/trigram retrieval of a pair differing only in an underscore or a hyphen, I extracted from the test sample index the line for single underscore confusion and the line for single hyphen confusion into separate new index files. I ran these through TICCL-LDcalc with and without the value for the unigram, i.e. with and without the anahash lines '110751596624Honderd#honderd' and '163501191104Gearresteerd#gearresteerd'.
When these are present, the pair is retrieved. The corresponding bi/trigrams are not.
When these are not present, the pair is necessarily not retrieved. The corresponding bi/trigrams are not either.
I conclude there is no causal relation between the two (e.g. a possible filter whereby bi/trigrams are not retrieved once the corresponding unigram has been validated and retrieved).

@martinreynaert (Collaborator Author)

I here attach the input and output files involved in the above.

HYPHEN.zip

UNDERSCORE.zip

@martinreynaert (Collaborator Author)

I have tried with 'follow='. Definitely interesting! But I do not understand the last lines marked 'ignoring'. I paste the lot here.

reynaert@maize:/reddata/NATAR/TESTSAMPLE/ZIP$
[1]+ Done nohup /exp/sloot/usr/local/bin/TICCL-LDcalc -v -v --follow=Te_Hon_derd --threads 1 --LD 2 --low=6 --high=50 --index /reddata/NATAR/TESTSAMPLE/ZIP/Underscore.11040808032.testsample.No_Honderd.index --hash /reddata/NATAR/TESTSAMPLE/ZIP/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.anahash --clean /reddata/NATAR/TESTSAMPLE/ZIP/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.NumSortDes.clean --alph /reddata/POLMASH/TRI/ALPH/nld.aspell.dict.clip20.lc.chars --artifrq 98765432 -o /reddata/NATAR/TESTSAMPLE/ZIP/Underscore.11040808032.testsample.No_Honderd.index.LDCALC > /reddata/NATAR/TESTSAMPLE/ZIP/Underscore.11040808032.testsample.No_Honderd.index.LDCALC.20210107.stdout 2> /reddata/NATAR/TESTSAMPLE/ZIP/Underscore.11040808032.testsample.No_Honderd.index.LDCALC.20210107.stderr
reynaert@maize:/reddata/NATAR/TESTSAMPLE/ZIP$
reynaert@maize:/reddata/NATAR/TESTSAMPLE/ZIP$ cat /reddata/NATAR/TESTSAMPLE/ZIP/Underscore.11040808032.testsample.No_Honderd.index.LDCALC.20210107.stderr
nohup: ignoring input
skip hash for te (not in lexicon)
skip hash for Ten (not in lexicon)
skip hash for ten (not in lexicon)
skip hash for Ter (not in lexicon)
skip hash for ter (not in lexicon)
skip hash for Tes (not in lexicon)
skip hash for Tel (not in lexicon)
skip hash for Tent (not in lexicon)
skip hash for tent (not in lexicon)
skip hash for Test (not in lexicon)
skip hash for teng (not in lexicon)
skip hash for Tepe (not in lexicon)
skip hash for Teun (not in lexicon)
skip hash for tenen (not in lexicon)
skip hash for tenne (not in lexicon)
skip hash for Terne (not in lexicon)
skip hash for Terre (not in lexicon)
skip hash for Tente (not in lexicon)
skip hash for tente (not in lexicon)
skip hash for tense (not in lexicon)
skip hash for Terra (not in lexicon)
skip hash for Teers (not in lexicon)
skip hash for tende (not in lexicon)
skip hash for Tegen (not in lexicon)
skip hash for Tetro (not in lexicon)
skip hash for Testa (not in lexicon)
skip hash for Teris (not in lexicon)
skip hash for Teken (not in lexicon)
skip hash for tenke (not in lexicon)
skip hash for tenue (not in lexicon)
skip hash for Telle (not in lexicon)
skip hash for Tegel (not in lexicon)
skip hash for Teems (not in lexicon)
skip hash for Teije (not in lexicon)
skip hash for Teijn (not in lexicon)
skip hash for Tevel (not in lexicon)
skip hash for Temme (not in lexicon)
skip hash for Tewes (not in lexicon)
skip hash for Temps (not in lexicon)
skip hash for Teijl (not in lexicon)
skip hash for Tewis (not in lexicon)
skip hash for Texel (not in lexicon)
examine 11040808032#134510866391,145551674423,146103607134,146767401175,151008562231,151112832692,151871924973,162009968294,162586402935,163279361716,163771377856,167423850592,169644978743,172160207307,173627210967,173921875588,175737858361,176635584510,183289486567,187260460871,192702844882,206562440468,232544001280
extract parts from 134510866391,145551674423,146103607134,146767401175,151008562231,151112832692,151871924973,162009968294,162586402935,163279361716,163771377856,167423850592,169644978743,172160207307,173627210967,173921875588,175737858361,176635584510,183289486567,187260460871,192702844882,206562440468,232544001280
analyze ngram candidates: Te_Hon_derd AND Te_Honderd
after reduction, candidates: [Hon,derd] AND [Honderd]
FOUND 1-2-3 Hon_derd Honderd
ngram candidate: 'Hon_derdHonderd' in n-grams pair: Te_Hon_derd # Te_Honderd
stored: Hon_derd
Honderd and forget about Te_Hon_derdTe_Honderd
analyze ngram candidates: Te_Hon_derd AND Te_honderd
after reduction, candidates: [Hon,derd] AND [honderd]
FOUND 1-2-3 Hon_derd honderd
ngram candidate: 'Hon_derd
honderd' in n-grams pair: Te_Hon_derd # Te_honderd
stored: Hon_derdhonderd and forget about Te_Hon_derdTe_honderd
analyze ngram candidates: Te_Hon_derd AND te_Honderd
after reduction, candidates: [Hon,derd] AND [Honderd]
FOUND 1-2-3 Hon_derd Honderd
ngram candidate: 'Hon_derdHonderd' in n-grams pair: Te_Hon_derd # te_Honderd
stored: Hon_derd
Honderd and forget about Te_Hon_derdte_Honderd
analyze ngram candidates: Te_Hon_derd AND te_honderd
after reduction, candidates: [Hon,derd] AND [honderd]
FOUND 1-2-3 Hon_derd honderd
ngram candidate: 'Hon_derd
honderd' in n-grams pair: Te_Hon_derd # te_honderd
stored: Hon_derdhonderd and forget about Te_Hon_derdte_honderd
ignoring Hon_derdHonderd
ignoring Hon_derd
honderd
ignoring hon_derdHonderd
ignoring hon_derd
honderd
reynaert@maize:/reddata/NATAR/TESTSAMPLE/ZIP$

@martinreynaert (Collaborator Author) commented Jan 8, 2021

Basically what I see, when comparing the *stderr files of the runs with and without the unigram 'honderd' listed in the index, is that the minimal solution ('hon_derd' is to be corrected as 'honderd') appears in the *ldcalc output when the unigram is present, and does not appear when it is not.

However, both runs have basically done the same work and come to that same conclusion. That is like 'saying A', and it should be said. [An aside for now: the actual count of bigrams saying 'hon_derd' should be 'honderd' would be a valuable ranking feature. I am not clear at this very point in time whether we use this or not.]

What we do not say so far is 'B', namely: if we say 'hon_derd' should be corrected as 'honderd' on the basis of the evidence provided by so many bigrams containing these word forms, we should also take the next step and say that those bigrams containing 'hon_derd' cannot also have to be corrected as something else. This something else is identified on the basis of other character confusions and almost without fail represents a more complicated 'solution', e.g. often entailing differences in not just one part of the bigram pair but both parts.

Saying 'B' would then mean not 'forgetting' the actual bigrams evaluated and resolved, but writing them to a list of 'solved' bigrams. For reasons I will explain later, this list would be produced by TICCL-LDcalc for later use (most likely by TICCL-rank), not used within TICCL-LDcalc itself.

So, this modifies the request I made earlier, which was to not forget the 'validated' bigrams but to output them. That would be an option, but the right way now seems to me to make proper use of the good work already done by TICCL-LDcalc and to further cash in on it.

That would be to filter away, further down the line (i.e. in the next step), all the spurious 'solutions' for each bi/trigram brought forward by the system. In the full run on the 2.5 million pages of National Archives 'Ysberg' data, for the single trigram 'te_hon_derd' alone (regardless of capitalization) this already amounts to 121 spurious solutions.

In so doing, I am confident we thoroughly narrow the search space and impose a highly valuable restriction on the total amount of work still to be done. I also think this will largely remove the need for our current module TICCL-chainclean, which in fact is meant to try and solve the very many problems created by not saying 'B' earlier on in the pipeline.

@martinreynaert (Collaborator Author)

I will now detail why I ask for a separate list of 'solved' underscore/hyphen bi/trigrams.

The huge NA Ysberg corpus results in TICCL-indexer producing an index amounting to 371GB. TICCL-LDcalc needs to keep everything in memory, and my servers are limited to 256GB of RAM.

I found a good solution to be to split the index file on the basis of its lines, each of which represents a single character confusion. In fact, TICCL-indexer now also produces a *ConfStats file which, for each character confusion, details how many ngram pairs in the corpus were found to display that particular confusion. These pair counts follow a power law, and I found it viable to split the index file according to each power, resulting in e.g. one list containing the character confusions having hundreds of thousands of pairs, the next tens of thousands, the following just thousands, etc.
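A minimal sketch of that per-power split, assuming a *ConfStats-style mapping from character confusion value to its number of ngram pairs (the counts below are made up for illustration):

```python
from collections import defaultdict
from math import floor, log10

conf_pairs = {                 # confusion value -> number of ngram pairs
    11040808032: 834_210,      # hundreds of thousands of pairs
    1125720992: 56_007,        # tens of thousands
    22081616064: 4_312,        # thousands
    3400807475: 87,
}

# Bucket each confusion by the power of ten of its pair count; each
# bucket's index lines can then be processed as a separate run.
buckets = defaultdict(list)
for confusion, n_pairs in conf_pairs.items():
    buckets[floor(log10(n_pairs))].append(confusion)

for power in sorted(buckets, reverse=True):
    print(f"~10^{power} pairs: {buckets[power]}")
```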

TICCL-LDcalc can then be run that many times, each run on a manageable subset of the index file. This is viable because each character confusion proposes just its own CCs, i.e. they are all independent of each other. After running on all the subsets, the different output files can be concatenated and fed to TICCL-rank as one single large file.

It is prior to this, by means of a simple filtering script, that this list could be rid of all the bi/trigrams already solved by LDcalc, or, if TICCL-rank could be modified towards this end, at the time of inputting the ldcalc list to TICCL-rank.
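The 'simple filtering script' could look like the sketch below, assuming LDcalc would emit a separate list of solved underscore/hyphen bi/trigram variants (the `solved` set here is hypothetical; the records are quoted from earlier in this issue):

```python
def filter_solved(ldcalc_lines, solved_variants):
    """Drop every ldcalc record whose variant (the first '~' field) is an
    already-solved bi/trigram; keep the still-open correction pairs."""
    return [line for line in ldcalc_lines
            if line.split("~", 1)[0] not in solved_variants]

records = [
    "Hon_derd_halve~1~1~Honderd_halven~98765433~98765439~1125720992~2~12~1~1~0~0~0",
    "is_hon_derd~1~1~ishonderd~1~1~22081616064~2~9~0~1~1~0~0",
]
solved = {"Hon_derd_halve"}   # hypothetical 'solved' list from LDcalc

print(filter_solved(records, solved))
# only the 'is_hon_derd' record remains
```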
