-
Notifications
You must be signed in to change notification settings - Fork 226
Bayesian Ranking
Fasd has a mathematical model for its ranking algorithm. Specifically, it utilizes Bayesian inference for ranking.
In this article, all capitalized and boldface variables are considered sets, and all corresponding lower case variables are considered elements of those sets. The set of all observed paths is denoted as (\mathbf{Path}), and the set of given patterns is denoted as (\mathbf{Pattern}).
When users provide a set of patterns, (\mathbf{Pattern}). fasd tries to find (P(path|\mathbf{Pattern})) for all ( path \in \mathbf{Path}), or in words, the probability of each path given the set of search patterns. Apply Bayes rule, we have:
[ P( path | \mathbf{Pattern} ) = \frac{ P(\mathbf{Pattern}|path)P(path) }{ \sum \limits_{pattern \in \mathbf{Pattern}} P(path|pattern)P(pattern) } ]
Since fasd only needs find the path with highest probability, and notice that the denominator is the same for all (path). So we define:
[ rank(path) = P(\mathbf{Pattern}|path)P(path) ]
(P(path)) is the prior, (P(\mathbf{Pattern}|path)) is the likelihood, and their product yields (rank(path)), the "posterior."
Fasd keeps a database of accessed paths each with the number of times it has been accessed and the most recent timestamp.
[ \forall (path \in \mathbf{Path}) : path.times \wedge path.timestamp ]
Fasd uses the "frecency" algorithm to estimate the prior, (P(Path)).
[ P(path) \approx frencent(path) = path.times \times weight(currentTime - path.timestamp) ]
[ weight(seconds) = \left{ \begin{array}{lr} 6 : seconds \le 3600 \ 4 : 3600 \leq seconds \le 86400 \ 2 : 86400 \leq seconds \le 604800 \ 1 : otherwise \end{array} \right ]
Function (weight) yields a weight according to the time elapsed.
Fasd uses common sense and the principle of least surprise to find the likelihood.
We assum that all elements of (\mathbf{Pattern}) are mutually independent. Thus:
[ P(\mathbf{Pattern}|path) = \prod \limits_{pattern \in \mathbf{Pattern}} P(pattern|path) ]
And fasd uses a function to estimate (P(pattern|path)):
[ P(pattern|path) \approx likelihood(pattern, path) ]
Here is where a trick comes in. Since (path) consists of segments which looks like "/abc/def/ghi/.../xyz", assuming the user is using a pattern to refer to a path, it is more likely that the pattern matches the filename "xyz" instaead of other segments. Consider a function
[ n(pattern, path) = \frac{ \textrm{number of segments before and including pattern in path} }{ \textrm{number of segments in path} } ]
(For instance n("def", "/abc/def/ghi") = 2/3). Then we define the estimate function (likelihood) to be
[ likelihood(pattern, path) = \left{
\begin{array}{lr}
F : n(pattern. path) = 1 \
n(pattern, path) : otherwise
\end{array}
\right ]
Where (F) is relatively big constant, usually at least 10. This way, we can prioritize a path if the pattern matches the last segment. This algorithm resembles the common sense when users are forming the search patterns: search patterns are more likely to match the filennames.
The rank, or "posterior", is just the product of the prior and the likelihood, as we have previously discussed.
[ rank(path) = frecent(path) \times \prod \limits_{pattern \in \mathbf{Pattern}} likelihood(pattern, path) ]
The optimal path is just the path with the highest rank.