TICCL-LDcalc: request to remove existing filter on underscore/hyphen bigram corrections #44
Comments
You got me confused: the issue mentions TICCL-rank, but the text seems to suggest the problem is already in TICCL-LDcalc? Anyway: to analyse this, I need a MINIMAL working example of the input files for LDcalc: so a SMALL index file, hash file and clean file, preferably with just about 10 words or so demonstrating the problem. |
OK. I attach a tar.gz containing the *clean, *anahash, *corpusfoci and *ldcalc files, plus a file TICCL.commandlinesTESTSAMPLE.20210105.txt which contains the command lines used. TICCL.TestSample.LDcalcRestrictionUnderscoreHyphen.20210105.tar.gz Note that I did not use the *corpusfoci file here. That file is meant to reduce the workload; without it, TICCL-indexer works exhaustively, gathering all the possible character confusion word pairs present. But seeing there is so little here, all these modules run in just seconds. |
I have now also run TICCL-rank on this.
Command line:
reynaert@violet:/reddata/NATAR/TESTSAMPLE$ /exp/sloot/usr/local/bin/TICCL-rank -t 1 --alph /reddata/POLMASH/TRI/ALPH/nld.aspell.dict.clip20.lc.chars --charconf /reddata/POLMASH/TRI/ALPH/nld.aspell.dict.clip20.ld2.charconfus -o /reddata/NATAR/TESTSAMPLE/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.RANK --debugfile /reddata/NATAR/TESTSAMPLE/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.RANKDEBUG --artifrq 0 --clip 1 --skipcols=1,10,11,13 /reddata/NATAR/TESTSAMPLE/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.ldcalc >/reddata/NATAR/TESTSAMPLE/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.RANK.20210105.stdout 2>/reddata/NATAR/TESTSAMPLE/NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.RANK.20210105.stderr
Output:
reynaert@violet:/reddata/NATAR/TESTSAMPLE$ cat NA.Ysberg.SampleLDcalcRestriction.gearresteerd_honderd.clean.NumSortDes.ANAHASH.INDEXER.LDCALC.RANK.ranked
It should be obvious that the 'voor' to 'door' and 'van' to 'Aan' confusions are counterproductive.
MRE |
This is not immediately pertinent to the actual issue involved here, but it does illustrate the consequences of what goes wrong due to the current filtering in TICCL-LDcalc. I have now also run TICCL-chainclean (also with -v and -v -v), which was interesting, although I do not really understand what happens.
Command line:
The result is that it retains only 5 lines of the 16 in *chained. (Actually TICCL-chain could not 'chain' any of the 16 lines in *ranked.) The other 11 lines are written to a file *deleted.
Output:
I am still trying to figure out what it actually tries to do on the basis of the *stderr. I definitely do not agree that the pair
I attach the *stderr file for the sake of completeness. I added the extension *txt to be able to actually upload it here... |
So, in the hope of seeing a causal relation between unigram and bi/trigram retrieval of a pair differing only in an underscore or a hyphen, I extracted from the test sample index the line for the single underscore confusion and the line for the single hyphen confusion into separate new index files. I ran TICCL-LDcalc on these with and without the value for the unigram, i.e. with and without the values (in the anahash lines: '110751596624 |
I here attach the input and output files involved in the above. |
I have tried with 'follow='. Definitely interesting! But I do not get the last lines: 'ignoring'. I paste the lot here. reynaert@maize:/reddata/NATAR/TESTSAMPLE/ZIP$ |
Basically, what I see when comparing the *stderr files of the runs with and without the unigram 'honderd' listed in the index is this: the minimal solution, "'hon_derd' is to be corrected as 'honderd'", is listed in the *ldcalc output when the unigram is present and is not listed when it is absent. However, both runs have basically done the same work and come to that same conclusion. That is like 'saying A'. And that should be said. [An aside for now: the actual count of bigrams saying 'hon_derd' should be 'honderd' would be a valuable ranking feature. I am not clear at this point whether we use this or not.]
What we do not say so far is 'B', namely: if we say 'hon_derd' should be corrected as 'honderd' on the basis of the evidence provided by so many bigrams containing these word forms, we should also take the next step and say that those bigrams that contain 'hon_derd' cannot also need to be corrected as something else. This something else is identified on the basis of other character confusions and probably without fail represents a more complicated 'solution', e.g. often entailing differences in not just one part of the bigram but in both parts. Saying 'B' would then mean not 'forgetting' the actual bigrams evaluated and resolved, but writing these to a list of 'solved' bigrams. For reasons I will explain later, this list would be produced by TICCL-LDcalc for later use (most likely by TICCL-rank), not used within TICCL-LDcalc itself.
So, this modifies the request I made earlier, which was not to forget about the 'validated' bigrams but to output them. That would be an option, but the right way now seems to me to be to make proper use of the good work already done by TICCL-LDcalc and to cash in on it further. That would mean filtering away, further down the line (i.e. in the next step), all the spurious 'solutions' the system brings forward for each such bi/trigram.
In the full run on the 2.5 million pages of National Archives 'Ysberg' data, this already amounts to 121 spurious solutions for the single trigram 'te_hon_derd', regardless of capitalization. In so doing, I am confident we thoroughly narrow the search space and impose a highly valuable restriction on the total amount of work still to be done. I also think this will largely remove the need for our current module TICCL-chainclean, which in fact is meant to try and solve the very many problems created by not saying 'B' earlier on in the pipeline. |
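To make 'saying B' concrete, here is a minimal Python sketch of the downstream filter, under loud assumptions: the pairs are plain (variant, candidate) tuples and not the real *ldcalc line format, the example pairs are toy data, and the function names are hypothetical. A variant counts as 'solved' here as soon as one of its candidates differs from it only in underscores or hyphens (ignoring capitalization); every other candidate for that variant is then dropped.

```python
SEPARATORS = "_-"  # underscore stands for a space, hyphen as-is


def strip_separators(ngram: str) -> str:
    """Remove underscores and hyphens from an n-gram."""
    return "".join(ch for ch in ngram if ch not in SEPARATORS)


def is_separator_only_pair(variant: str, candidate: str) -> bool:
    """True if the two forms differ, but only in underscores/hyphens.

    Capitalization is ignored (an assumption of this sketch).
    """
    v, c = variant.lower(), candidate.lower()
    return v != c and strip_separators(v) == strip_separators(c)


def split_solved(pairs):
    """Return (solved, kept).

    solved: variant -> its separator-only candidate (first one seen).
    kept:   the input pairs minus every other candidate of a solved variant.
    """
    solved = {}
    for variant, candidate in pairs:
        if is_separator_only_pair(variant, candidate):
            solved.setdefault(variant, candidate)
    kept = [(v, c) for v, c in pairs if v not in solved or solved[v] == c]
    return solved, kept


# Toy pairs for illustration only, not taken from the real output files.
toy_pairs = [
    ("hon_derd", "honderd"),
    ("hon_derd", "honden"),
    ("te_hon_derd", "te_honderd"),
    ("te_hon_derd", "ten_honderd"),
    ("voor", "door"),
]
solved, kept = split_solved(toy_pairs)
for variant, candidate in kept:
    print(variant, "->", candidate)
```

Pairs whose variant has no separator-only candidate (such as the last toy pair) pass through untouched; the sketch only narrows the candidate lists of the split words.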
I will now detail why I ask for a separate list of 'solved' underscore/hyphen bi/trigrams. The huge NA Ysberg corpus results in TICCL-indexer producing a huge index amounting to 371G. TICCL-LDcalc needs to keep everything in memory and my servers are limited to 256GB of RAM. I found a good solution to be to split the index file on the basis of its lines, each one of which represents a single character confusion. In fact, TICCL-indexer now also produces a *ConfStats file which details, for each character confusion, how many ngram pairs in the corpus were found to display that particular confusion. These numbers of pairs follow a power law. And I found it to be viable to split the index file by order of magnitude, resulting in e.g. a list containing the character confusions having hundreds of thousands of pairs, a next one for the tens of thousands, the following for just the thousands, etc. TICCL-LDcalc can then be run as many times as needed, each time on a manageable subset of the index file. This is viable because each character confusion proposes just its own CCs, i.e. they are all independent of each other. After running on all the subsets, the different output files can then be concatenated to be fed to TICCL-rank as one single large file. It is prior to this, by means of a simple filtering script, that this list could be rid of all the bi/trigrams already solved by LDcalc or, if TICCL-rank could be modified towards this end, at the time of inputting the ldcalc list to TICCL-rank. |
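The split by order of magnitude can be sketched as follows. This is only an illustration under assumptions: the statistics are given as an in-memory mapping from confusion value to pair count (the real *ConfStats layout is not reproduced here), and the confusion values and counts below are invented.

```python
from collections import defaultdict


def bucket_by_magnitude(conf_stats):
    """Group confusion values by the order of magnitude of their pair count.

    conf_stats: dict mapping confusion value -> number of ngram pairs.
    Returns a dict mapping magnitude (5 = hundreds of thousands,
    4 = tens of thousands, ...) -> list of confusion values.
    """
    buckets = defaultdict(list)
    for confusion_value, pair_count in conf_stats.items():
        if pair_count <= 0:
            continue  # nothing to process for this confusion
        # Number of decimal digits minus one equals floor(log10(count)),
        # computed on integers to avoid floating-point edge cases.
        magnitude = len(str(pair_count)) - 1
        buckets[magnitude].append(confusion_value)
    return dict(buckets)


# Invented numbers, for illustration only.
toy_stats = {101: 350000, 202: 42000, 303: 41000, 404: 7000, 505: 12}
buckets = bucket_by_magnitude(toy_stats)
for magnitude in sorted(buckets, reverse=True):
    print(magnitude, buckets[magnitude])
```

Each bucket then names the index lines that go into one subset file. Since every confusion proposes only its own CCs, the subsets can be processed independently and their outputs concatenated afterwards.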
It seems to me we have misguidedly imposed a restriction on TICCL-LDcalc that stops it from returning higher-order ngram pairs where the variant and the Correction Candidate (CC) differ only in a single (?) underscore (= space) or hyphen. I suppose I at some point expected this restriction to lighten TICCL's overall workload. The result is that the later modules cannot converge on the best fitting resolution of the split word, due to the contradiction between the unigram solution and those offered by the bi- or possibly trigrams. Ultimately, FoLiA-correct fails to find the right bi- and trigrams to correct.
Example LD-calc output:
We do not get the CC: 'is_honderd'.
This results in the bi/trigram correction never getting the most plausible resolution for split words, but still getting hundreds of less plausible Correction Candidates (CCs). This results in suboptimal ranking of the CCs and chaos further on in the pipeline, especially in TICCL-chainclean which on the current very large test on about 2.3 million pages of HTRed text now fails to make progress even after days.
We observe the same to be true for hyphens in ngram corrections. See section 'Hyphens:' below.
This restriction is possibly implemented as simply as: for the confusion values for underscore or hyphen: do not return word pairs where the CC would be a bi- or trigram, i.e. only unigrams are allowed as CC. (This will probably not fully cover it...).
However implemented, I would now like to see the restriction removed.
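To illustrate what such a restriction would do, here is a small Python sketch of the guess above. It is not the actual TICCL-LDcalc code: the function names are hypothetical, the pairs are assumed to come from an underscore or hyphen confusion only, and the bigram variant 'is_hon_derd' is assumed as the counterpart of the missing CC 'is_honderd'.

```python
def ngram_order(ngram: str) -> int:
    """1 for a unigram, 2 for a bigram, ...; underscore marks a space."""
    return ngram.count("_") + 1


def filter_pairs(pairs, restrict_to_unigram_cc: bool):
    """Apply (or skip) the guessed restriction on separator confusions.

    With the restriction on, only pairs whose CC is a unigram survive.
    """
    if not restrict_to_unigram_cc:
        return list(pairs)
    return [(v, c) for v, c in pairs if ngram_order(c) == 1]


# Pairs assumed to stem from a single underscore confusion.
separator_pairs = [
    ("hon_derd", "honderd"),
    ("is_hon_derd", "is_honderd"),
]
with_restriction = filter_pairs(separator_pairs, restrict_to_unigram_cc=True)
without_restriction = filter_pairs(separator_pairs, restrict_to_unigram_cc=False)
print(with_restriction)
print(without_restriction)
```

With the restriction on, the bigram CC never reaches the later modules, which matches the behaviour reported here; with it off, both pairs are returned.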
TICCL-rank currently correctly returns e.g. the unigram pair:
'Grep' on the ranked list:
The bigram, i.e. the split unigram, is correctly resolved. We also get two trigrams containing the bigram.
The CC for the first trigram 'Te_Hon_derd ' is 'nice' in light of the fact that we currently prefer what we now regard as the archaic form with 'ten' in Dutch. However, the more plausible form for these diachronic texts would have 'te', which has higher corpus frequencies (you need to subtract the artifrq '98765432' to get at the actual corpus frequencies):
For the second trigram ' Hon_derd_halve ' we see the actual bigram containing just 'halve' is here not returned by TICCL-LDcalc:
After TICCL-rank this results in:
But on the higher ngram level, and allowing for more character confusion than only an extra space (represented here as an underscore):
This results in chaos down the line: TICCL-chain and especially TICCL-chainclean fail to further resolve these contradictory results.
We see the same happening with hyphens.
Our current corpus frequency list has the following bigrams:
versus:
Here too, TICCL-LDcalc does not return the most plausible CC:
We hope this can be remedied shortly!
Thanks!
MRE