Request for new ranking feature based on pairs1-rank, possibly to replace pairs_combined_rank: MedianPairsCCFrequencies #33
I am a bit confused about your remark. My first impression was that it is a 'local' calculation, for 1 variant with its N CCs, and that this has the frequencies. Could you clarify a bit?
No, the local calculation is not what I want. I suggest calculating the median of all frequencies belonging to a character confusion, over all variants it appears in.

OK, for my tests I have used the following information:

```
reynaert@red:/reddata/PILOTS/MORSE/RUNAMALGAM5/zzz/TICCL$ grep '#9496960451#' .RUNAMALGAM5.clean.ldcalc.debug.ranked | cut -d '#' -f 1,3,4,6,16 >bla3
```

I have imported these output files into Excel and calculated the average/mean and the median over column 3 of this output, i.e. the base frequency of the CCs. So I based this on the information in the debug file output by TICCL-rank.

If you do this on the output of LDcalc, you get larger subsets per confusion value. So the extra filtering in TICCL-rank seems to discard a number of pairs, so we lose some (I hope we do not actually lose any). It would probably be easier to calculate the mean over these from LDcalc. Who knows, the net result might be the same, but I do not know this. Let us say this is an option if it proves too hard to implement this on the subsets actually output to the debug file of rank.

Hope this sufficiently clarifies matters.
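To make the Excel comparison concrete, here is a small sketch of computing the mean and median of the CC base frequencies per confusion value. It is an illustration only: it assumes a simplified two-field `#`-separated layout (confusion value, base frequency), not the exact field positions of the real debug file selected by the `cut` command above.

```python
import statistics
from collections import defaultdict

def summarise_by_confusion(lines):
    """Group '#'-separated records by confusion value and report the
    mean and median of the CC base frequencies in each group.

    Assumed toy layout: field 0 = confusion value, field 1 = base frequency.
    """
    freqs = defaultdict(list)
    for line in lines:
        fields = line.strip().split('#')
        confusion, base_freq = fields[0], int(fields[1])
        freqs[confusion].append(base_freq)
    return {conf: (statistics.mean(v), statistics.median(v))
            for conf, v in freqs.items()}

# Toy records for one confusion value:
records = ["9496960451#10", "9496960451#1000", "9496960451#40"]
print(summarise_by_confusion(records))
# the mean (350) is pulled up by the single very frequent CC,
# while the median (40) stays close to the typical frequency
```

This is exactly why the two statistics can rank confusions differently: one highly frequent CC dominates the mean but not the median.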
Ok,
First test on server Black running with command line:

```
reynaert@black:/reddata/PILOTS/MORSE/RUNAMALGAM5/zzz/TICCL$ nohup /exp/sloot/usr/local/bin//TICCL-rank -t max --alph /reddata/PILOTS/MORSE/Aspell/eng.aspell.hyphen.dict.clip0.lc.chars --charconf /reddata/PILOTS/MORSE/Aspell/eng.aspell.hyphen.dict.clip0.ld2.charconfus -o /reddata/PILOTS/MORSE/RUNAMALGAM5/zzz/TICCL/RUNAMALGAM5.wordfreqlist.1to3.tsv.tsv.clean.ldcalc.subtractartifrqfeature1.MEDIAN.ranked --debugfile /reddata/PILOTS/MORSE/RUNAMALGAM5/zzz/TICCL/.RUNAMALGAM5.tsv.clean.ldcalc.subtractartifrqfeature1.MEDIAN.debug.ranked --subtractartifrqfeature1 1000000000 --clip 1 --skipcols=9,10,13 --charconfreq /reddata/PILOTS/MORSE/RUNAMALGAM5/zzz/TICCL/RUNAMALGAM5.wordfreqlist.1to3.tsv.tsv.clean.ldcalc.subtractartifrqfeature1.ranked.chrconfreq /reddata/PILOTS/MORSE/RUNAMALGAM5/zzz/TICCL/RUNAMALGAM5.wordfreqlist.1to3.tsv.clean.ldcalc 2>/reddata/PILOTS/MORSE/RUNAMALGAM5/zzz/TICCL/RUNAMALGAM5.RANK.subtractartifrqfeature1.charconfreq.MEDIAN.20181204.stderr &
```
@martinreynaert Small addition: I see (minimal) differences. I would like to hear which approach we are going to choose.
Hi,
This concerns ranking features:
This is a request for a more informed ranking feature. It may be a new one, or it may replace the existing pairs_combined_rank (preferred).
Ranking feature pairs1 currently takes the count of each anagram confusion value of the pairs transferred from LDcalc to rank. The highest number of pairs transferred ranks highest in rank, given a particular set of Correction Candidates (CCs) for a particular variant.
In the current situation this does not always put the most likely CC at the highest rank. Quite spurious confusions, particularly over shorter words, may be ranked higher than confusions that ostensibly recur often in the particular corpus being corrected.
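As I understand it, pairs1 amounts to something like the following count per confusion value. This is a sketch under that assumption, with made-up data, not the actual TICCL-rank code:

```python
from collections import Counter

def pairs1_counts(pairs):
    """Count how many (variant, CC) pairs share each anagram confusion
    value; in pairs1, a higher count means a better rank."""
    return Counter(confusion for confusion, _variant, _cc in pairs)

# Hypothetical pairs: (confusion value, variant, CC)
pairs = [(9496960451, "teh", "the"),
         (9496960451, "hte", "the"),
         (123, "wrod", "word")]
print(pairs1_counts(pairs))  # confusion 9496960451 is counted twice
```

Note that the count is blind to how frequent the CCs themselves are, which is what lets spurious confusions over short, frequent word pairs inflate their rank.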
After some experimentation it seems that weighing the frequencies of the CCs proposed for a particular confusion might help. We have tried the mean of the frequencies, but this results in pretty much the same ranking as we currently get from pairs1.
The median of the CC frequencies, however, appears more likely to deliver the better ranking.
This will probably have to be implemented at the end of rank.
So, given the overall set of pairs in rank that share a particular character confusion value, this new feature needs to calculate the median of the CC frequencies (their own frequencies, not the summed frequencies of their capitalised versions). Here too, the highest median wins, i.e. is accorded rank 1.
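A minimal sketch of my reading of the spec above. The function name and the flat (confusion value, CC frequency) input are hypothetical simplifications; per the spec, each CC contributes its own frequency, not the summed frequency of its capitalised versions:

```python
import statistics
from collections import defaultdict

def median_pairs_cc_frequencies(pairs):
    """pairs: iterable of (confusion_value, cc_frequency) tuples.
    Returns {confusion_value: rank}, where rank 1 goes to the
    confusion value with the highest median CC frequency."""
    by_confusion = defaultdict(list)
    for confusion, cc_freq in pairs:
        by_confusion[confusion].append(cc_freq)
    medians = {c: statistics.median(f) for c, f in by_confusion.items()}
    ordered = sorted(medians, key=medians.get, reverse=True)
    return {c: rank for rank, c in enumerate(ordered, start=1)}

pairs = [(111, 5), (111, 500), (111, 7),   # median 7
         (222, 40), (222, 60)]             # median 50, so rank 1
print(median_pairs_cc_frequencies(pairs))  # → {222: 1, 111: 2}
```

Note that a pairs1-style count would rank confusion 111 first here (3 pairs vs. 2), while the median puts 222 first; that difference is the point of the request.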
I would very much like to be able to experiment with this soon.
Thanks!
M.