README

# Ouiche

**Ouiche** is a fast spell-checker in C++, designed to be efficient by using a
compact Radix Trie with a dynamic Damerau-Levenshtein distance to compute
the word suggestions.

> Peter — Et moi pour les gonzesses je suis super d’accord avec toi. Mais pour
> la bouffe je vois pas ce que tu veux dire. Tu aurais envie de manger quoi
> exactement ?
>
> Steven — Ben je sais pas, par exemple une quiche lorraine.
>
> Peter — Une ouiche.
>
> Steven — Quoi ?
>
> Peter — On dit « une ouiche lorraine ».
>
> Steven — Tu es sûr ?

— La Classe Américaine - http://ouich.es/#ouiche

# Build instructions

To compile **Ouiche**, use:

    $ ./configure
    $ make

This will create two binaries in the main folder, `TextMiningCompiler` and
`TextMiningApp`.

**Ouiche** is released under the Beerware License. See the LICENSE file for
more information.

# Usage

* `TextMiningCompiler` is used to compile the Radix Trie from a dictionary
* `TextMiningApp` takes the compiled dictionary and correct the words given in
  the standard input.

## Example

    ./TextMiningCompiler words.txt dict.bin
    ./TextMiningApp dict.bin
    > approx 0 cats
    [{"word":"cats","freq":815877,"distance":0}]
    > approx 1 bachibousouk
    [{"word":"bachibouzouk","freq":8998,"distance":1}]
    > approx 2 analphabaite
    [{"word":"analphabete","freq":84674,"distance":2}]
    > approx 3 aoviaiateur
    [{"word":"aviateur","freq":194553,"distance":3},
     {"word":"naviagateur","freq":530,"distance":3},
     {"word":"affiliateur","freq":382,"distance":3},
     {"word":"avigateur","freq":336,"distance":3}]

# FAQ

## What are the main design choices of **ouiche**?

The compiler reads the words from the input and build a *dynamic* Radix Trie.
This Trie is stored on the heap, which allows us to add words during the
runtime. Once this dynamic tree is built, we compact it so that it is
represented contiguously in memory. This reduces its memory usage since we're
only using offsets instead of raw pointers. It also allows for a faster
deserialization since we only mmap the file in memory and use the structure
as-is.

The app uses `mmap(2)` to fetch the trie in memory. Then, for every word asked,
it constructs a Damerau-Levenshtein table with this words. This dynamic table
is able to be "fed" with a character, and "rollbacked" to a previous state.
This allows us to feed the characters of the trie as they come, until the
DL table has no way to output a distance with a smaller size than the maximum
distance, in which case we try another branch of the trie by resetting the DL
table to the previous distance.

For instance, with a radix trie like this one:

    +---+  chien   +-------+
    | 5 | <------- |   1   |
    +---+          +-------+
                     |
                     | nich
                     v
    +---+  ien     +-------+
    | 4 | <------- |   2   |
    +---+          +-------+
                     |
                     | e
                     v
                   +-------+
                   |   3   |
                   +-------+

We would get this table for the first word:

          c  h  i  e  n
       0  1  2  3  4  5
    n  1  1  2  3  4 oo
    i  2  2  2  2  3  4
    c  3  2  3  3  3  4
    h  4  3  2  3  4  4
    e  5 oo  3  3  3  4

Then we rollback to size 4:

          c  h  i  e  n
       0  1  2  3  4  5
    n  1  1  2  3  4 oo
    i  2  2  2  2  3  4
    c  3  2  3  3  3  4
    h  4  3  2  3  4  4

Then we feed the other characters:

          c  h  i  e  n
       0  1  2  3  4  5
    n  1  1  2  3  4 oo
    i  2  2  2  2  3  4
    c  3  2  3  3  3  4
    h  4  3  2  3  4  4
    i  5 oo  3  2  3  4
    e  6 oo oo  3  2  3
    n  7 oo oo oo  3  2

Then we rollback to 0, and we feed the other branch:

          c  h  i  e  n
       0  1  2  3  4  5
    c  1  0  1  2  3 oo
    h  2  1  0  1  2  3
    i  3  2  1  0  1  2
    e  4  3  2  1  0  1
    n  5 oo  3  2  1  0

Note that we don't have to compute the values of all the cases, the ones marked
with a `oo` are not necessary since we do have a maximum distance, hence we
only need to compute the "diagonal window" of size `2 * max_dist + 1`.

With this table, we are able to efficiently get the distances without having to
compute useless data.

## How was **ouiche** tested?

**Ouiche** was tested using both unit testing for the Damerau-Levenshtein
distance and manual testing using some tool binaries.

We performed some diffs of our output with the reference program to ensure the
results were correct.

## Have you found cases where the Damerau-Levenshtein distance is not accurate?

There are a lot of cases where this distance is not adapted. For instance, it
does not take into account the position of letters on the keyboard, therefore
it could not detect that "piocje" is like "ouiche" with the right hand shifted
to the right.
There are also cases where using the default weights for all the editions is
not really accurate: a transposition of two adjacent characters is more common
than a deletion: "golbal" probably means "global" and not "gobal".

## Why did you implement a Radix Trie in your project?

A Radix Trie is a memory efficient trie, which allowed us to reduce the amount
of indirections and access to the characters in an efficient manner. This also
reduced the memory consumption of the program, when compared to a regular trie.

## Given only a string, could we automatically find an accurate max distance?

We could guess a good distance according to the length of the input string. A
value of `length / 5` would give good results since it allows for a mistake
every 5 characters.

## Do you have any ideas for performance improvements?

We could probably have a more compact representation of the radix trie,
since most of the bits are unused in the alphabet we're working on.
We also could parallelize the search, since the queries are completely
independant and never change the same structures, and have no shared state.

## What is missing to your spell-checker for it to be state-of-the-art?

State-of-the art spell checkers are able to take the keyboard position of the
letters, the grammar of the sentence, the position in the sentence, the
punctuation and the odds of doing a specific mistake into account. We are
really far from that in this basic implementation.