-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME
186 lines (145 loc) · 6.19 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
# Ouiche
**Ouiche** is a fast spell-checker in C++, designed to be efficient by using a
compact Radix Trie with a dynamic Damerau-Levenshtein distance to compute
the word suggestions.
> Peter — Et moi pour les gonzesses je suis super d’accord avec toi. Mais pour
> la bouffe je vois pas ce que tu veux dire. Tu aurais envie de manger quoi
> exactement ?
>
> Steven — Ben je sais pas, par exemple une quiche lorraine.
>
> Peter — Une ouiche.
>
> Steven — Quoi ?
>
> Peter — On dit « une ouiche lorraine ».
>
> Steven — Tu es sûr ?
— La Classe Américaine - http://ouich.es/#ouiche
# Build instructions
To compile **Ouiche**, use:
$ ./configure
$ make
This will create two binaries in the main folder, `TextMiningCompiler` and
`TextMiningApp`.
**Ouiche** is released under the Beerware License. See the LICENSE file for
more information.
# Usage
* `TextMiningCompiler` is used to compile the Radix Trie from a dictionary
* `TextMiningApp` takes the compiled dictionary and correct the words given in
the standard input.
## Example
./TextMiningCompiler words.txt dict.bin
./TextMiningApp dict.bin
> approx 0 cats
[{"word":"cats","freq":815877,"distance":0}]
> approx 1 bachibousouk
[{"word":"bachibouzouk","freq":8998,"distance":1}]
> approx 2 analphabaite
[{"word":"analphabete","freq":84674,"distance":2}]
> approx 3 aoviaiateur
[{"word":"aviateur","freq":194553,"distance":3},
{"word":"naviagateur","freq":530,"distance":3},
{"word":"affiliateur","freq":382,"distance":3},
{"word":"avigateur","freq":336,"distance":3}]
# FAQ
## What are the main design choices of **ouiche**?
The compiler reads the words from the input and build a *dynamic* Radix Trie.
This Trie is stored on the heap, which allows us to add words during the
runtime. Once this dynamic tree is built, we compact it so that it is
represented contiguously in memory. This reduces its memory usage since we're
only using offsets instead of raw pointers. It also allows for a faster
deserialization since we only mmap the file in memory and use the structure
as-is.
The app uses `mmap(2)` to fetch the trie in memory. Then, for every word asked,
it constructs a Damerau-Levenshtein table with this words. This dynamic table
is able to be "fed" with a character, and "rollbacked" to a previous state.
This allows us to feed the characters of the trie as they come, until the
DL table has no way to output a distance with a smaller size than the maximum
distance, in which case we try another branch of the trie by resetting the DL
table to the previous distance.
For instance, with a radix trie like this one:
+---+ chien +-------+
| 5 | <------- | 1 |
+---+ +-------+
|
| nich
v
+---+ ien +-------+
| 4 | <------- | 2 |
+---+ +-------+
|
| e
v
+-------+
| 3 |
+-------+
We would get this table for the first word:
c h i e n
0 1 2 3 4 5
n 1 1 2 3 4 oo
i 2 2 2 2 3 4
c 3 2 3 3 3 4
h 4 3 2 3 4 4
e 5 oo 3 3 3 4
Then we rollback to size 4:
c h i e n
0 1 2 3 4 5
n 1 1 2 3 4 oo
i 2 2 2 2 3 4
c 3 2 3 3 3 4
h 4 3 2 3 4 4
Then we feed the other characters:
c h i e n
0 1 2 3 4 5
n 1 1 2 3 4 oo
i 2 2 2 2 3 4
c 3 2 3 3 3 4
h 4 3 2 3 4 4
i 5 oo 3 2 3 4
e 6 oo oo 3 2 3
n 7 oo oo oo 3 2
Then we rollback to 0, and we feed the other branch:
c h i e n
0 1 2 3 4 5
c 1 0 1 2 3 oo
h 2 1 0 1 2 3
i 3 2 1 0 1 2
e 4 3 2 1 0 1
n 5 oo 3 2 1 0
Note that we don't have to compute the values of all the cases, the ones marked
with a `oo` are not necessary since we do have a maximum distance, hence we
only need to compute the "diagonal window" of size `2 * max_dist + 1`.
With this table, we are able to efficiently get the distances without having to
compute useless data.
## How was **ouiche** tested?
**Ouiche** was tested using both unit testing for the Damerau-Levenshtein
distance and manual testing using some tool binaries.
We performed some diffs of our output with the reference program to ensure the
results were correct.
## Have you found cases where the Damerau-Levenshtein distance is not accurate?
There are a lot of cases where this distance is not adapted. For instance, it
does not take into account the position of letters on the keyboard, therefore
it could not detect that "piocje" is like "ouiche" with the right hand shifted
to the right.
There are also cases where using the default weights for all the editions is
not really accurate: a transposition of two adjacent characters is more common
than a deletion: "golbal" probably means "global" and not "gobal".
## Why did you implement a Radix Trie in your project?
A Radix Trie is a memory efficient trie, which allowed us to reduce the amount
of indirections and access to the characters in an efficient manner. This also
reduced the memory consumption of the program, when compared to a regular trie.
## Given only a string, could we automatically find an accurate max distance?
We could guess a good distance according to the length of the input string. A
value of `length / 5` would give good results since it allows for a mistake
every 5 characters.
## Do you have any ideas for performance improvements?
We could probably have a more compact representation of the radix trie,
since most of the bits are unused in the alphabet we're working on.
We also could parallelize the search, since the queries are completely
independant and never change the same structures, and have no shared state.
## What is missing to your spell-checker for it to be state-of-the-art?
State-of-the art spell checkers are able to take the keyboard position of the
letters, the grammar of the sentence, the position in the sentence, the
punctuation and the odds of doing a specific mistake into account. We are
really far from that in this basic implementation.