Bayesian Ranking

Introduction

Fasd has a mathematical model for its ranking algorithm. Specifically, it utilizes Bayesian inference for ranking.

In this article, all capitalized and boldface variables are considered sets, and all corresponding lower case variables are considered elements of those sets. The set of all observed paths is denoted as (\mathbf{Path}), and the set of given patterns is denoted as (\mathbf{Pattern}).

Bayes Rule

When users provide a set of patterns, (\mathbf{Pattern}). fasd tries to find (P(path|\mathbf{Pattern})) for all ( path \in \mathbf{Path}), or in words, the probability of each path given the set of search patterns. Apply Bayes rule, we have:

[ P( path | \mathbf{Pattern} ) = \frac{ P(\mathbf{Pattern}|path)P(path) }{ \sum \limits_{pattern \in \mathbf{Pattern}} P(path|pattern)P(pattern) } ]

Since fasd only needs find the path with highest probability, and notice that the denominator is the same for all (path). So we define:

[ rank(path) = P(\mathbf{Pattern}|path)P(path) ]

(P(path)) is the prior, (P(\mathbf{Pattern}|path)) is the likelihood, and their product yields (rank(path)), the "posterior."

Calculating the Prior

Fasd keeps a database of accessed paths each with the number of times it has been accessed and the most recent timestamp.

[ \forall (path \in \mathbf{Path}) : path.times \wedge path.timestamp ]

Fasd uses the "frecency" algorithm to estimate the prior, (P(Path)).

[ P(path) \approx frencent(path) = path.times \times weight(currentTime - path.timestamp) ]

[ weight(seconds) = \left{ \begin{array}{lr} 6 : seconds \le 3600 \ 4 : 3600 \leq seconds \le 86400 \ 2 : 86400 \leq seconds \le 604800 \ 1 : otherwise \end{array} \right ]

Function (weight) yields a weight according to the time elapsed.

Calculating the Likelihood

Fasd uses common sense and the principle of least surprise to find the likelihood.

We assum that all elements of (\mathbf{Pattern}) are mutually independent. Thus:

[ P(\mathbf{Pattern}|path) = \prod \limits_{pattern \in \mathbf{Pattern}} P(pattern|path) ]

And fasd uses a function to estimate (P(pattern|path)):

[ P(pattern|path) \approx likelihood(pattern, path) ]

Here is where a trick comes in. Since (path) consists of segments which looks like "/abc/def/ghi/.../xyz", assuming the user is using a pattern to refer to a path, it is more likely that the pattern matches the filename "xyz" instaead of other segments. Consider a function

[ n(pattern, path) = \frac{ \textrm{number of segments before and including pattern in path} }{ \textrm{number of segments in path} } ]

(For instance n("def", "/abc/def/ghi") = 2/3). Then we define the estimate function (likelihood) to be

[ likelihood(pattern, path) = \left{ \begin{array}{lr} F : n(pattern. path) = 1 \ n(pattern, path) : otherwise
\end{array} \right ]

Where (F) is relatively big constant, usually at least 10. This way, we can prioritize a path if the pattern matches the last segment. This algorithm resembles the common sense when users are forming the search patterns: search patterns are more likely to match the filennames.

Calculating the Rank

The rank, or "posterior", is just the product of the prior and the likelihood, as we have previously discussed.

[ rank(path) = frecent(path) \times \prod \limits_{pattern \in \mathbf{Pattern}} likelihood(pattern, path) ]

Finding the Optimal Path

The optimal path is just the path with the highest rank.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bayesian Ranking

Bayesian Ranking

Introduction

Bayes Rule

Calculating the Prior

Calculating the Likelihood

Calculating the Rank

Finding the Optimal Path

Clone this wiki locally