\chapter{Chapter 4\newline Extending the hierarchical Pitman-Yor language models with skipgrams}\label{chap:shpyplm}\marginnote{This section is based on: \cite{onrust2016Improving}}
%\section{Introduction}
\newthought{Since the seminal paper} on hierarchical Bayesian language models based on Pitman-Yor processes\autocite{teh2006hierarchical}, Bayesian language modelling has regained interest. Although Bayesian language models are not new\autocite{mackay1995hierarchical}, previously proposed models were reported to be inferior to other smoothing methods. Teh's work was the first to report improvements over interpolated Kneser-Ney smoothing\autocite{teh2006hierarchical}.
To overcome the traditional problems of overestimating the probabilities of rare occurrences and underestimating the probabilities of unseen events, a range of smoothing algorithms have been proposed in the literature \autocite{goodman2001bit}. Most methods take a heuristic-frequentist approach combining $n$-gram probabilities for various values of $n$, using back-off schemes or interpolation.
Teh\autocite{teh2006hierarchical} showed that MacKay and Peto's \autocite{mackay1995hierarchical} research on parametric Bayesian language models with a Dirichlet prior could be extended to give better results, but also that one of the best smoothing methods, interpolated Kneser-Ney \autocite{kneser1995improved}, can be derived as an approximation of the Hierarchical Pitman-Yor process language model (HPYLM).
% Eventueel eruit
The success of the Bayesian approach to language modelling is due to the use of statistical distributions such as the Dirichlet distribution, and distributions over distributions, such as the Dirichlet process and its two-parameter generalisation, the Pitman-Yor process. Both are widely studied in the statistics and probability theory communities. Interestingly, language modelling has acquired the status of a ``fruit fly'' problem in these communities, to benchmark the performance of statistical models. In this thesis we approach language modelling from a computational linguistics point of view, and consider the statistical methods to be the tool with the future goal of improving language models for extrinsic tasks such as speech recognition.
We derive our model from \textcite{teh2006hierarchical}, and propose an extension with skipgrams. A frequentist approach to language modelling with skipgrams is described by \textcite{pickhardt2014generalized}, who introduce an approach using skip-$n$-grams which are interpolated using modified Kneser-Ney smoothing. In this thesis we show that a Bayesian skip-$n$-gram approach outperforms a frequentist skip-$n$-gram model.%,
\section{Method}\marginnote{This section is a recap of \cref{chap:intoblm}, and in particular the Bayesian $n$-gram language model.}
Pitman-Yor Processes (PYP) belong to the family of non-parametric Bayesian models. Let $W$ be a fixed and finite vocabulary of $V$ words. For each word $w\in W$ let $G(w)$ be the probability of $w$, and $G = [G(w)]_{w\in W}$ be the vector of word probabilities. Since word frequencies generally follow a power-law distribution, we use a Pitman-Yor process, which is a distribution over partitions with power-law distributions.
In the context of a language model this means that for a space $P(\mathbf{u})$, with $c(\mathbf{u}\cdot)$ elements (tokens), we want to partition $P(\mathbf{u})$ into $V$ subsets such that the partition is a good approximation of the underlying data, in which $c(\mathbf{u}w)$ is the size of subset $w$ of $P(\mathbf{u})$. We assume that the training data is a sample of the underlying data, and for this reason we seek an approximation, rather than using the partitions precisely as found in the training data.
Since we also assume a power-law distribution on the words in the underlying data, we place a PYP prior on G:
\begin{equation*}
G \sim \operatorname{PY}(d,\theta,G_0),
\end{equation*}
with discount parameter $0\leq d<1$, a strength parameter $\theta > -d$ and a mean vector $G_0 = [G_0(w)]_{w\in W}$. $G_0(w)$ is the a-priori probability of word $w$, which we set uniformly: $G_0(w) = 1/V$ for all $w\in W$. In general, there is no known analytic form for the density of $\operatorname{PY}(d,\theta,G_0)$ when the vocabulary is finite. However, we are interested in the distribution over word sequences induced by the PYP, which has a tractable form, and is sufficient for the purpose of language modelling.
Let $G$ and $G_0$ be distributions over $W$, and $x_1,x_2,\ldots$ be a sequence of words drawn i.i.d.\ from $G$. The PYP is then described in terms of a generative procedure that takes $x_1,x_2,\ldots$ to produce a separate sequence of i.i.d.\ draws $y_1,y_2,\ldots$ from the mean distribution $G_0$ as follows. The first word $x_1$ is assigned the value of the first draw $y_1$ from $G_0$. Let $t$ be the current number of draws from $G_0$, $c_k$ the number of words assigned the value of draw $y_k$ and $c_\cdot = \sum^t_{k=1}c_k$ the number of draws from $G$.\footnote{In the analogy of the Chinese restaurant process, $t$ is the number of tables, $y_k$ the dish served at table $k$, and $c_k$ the number of customers seated at table $k$.} For each subsequent word $x_{c_\cdot+1}$, we either assign it the value of a previous draw $y_k$, with probability $\frac{c_k-d}{\theta+c_\cdot}$, or assign it the value of a new draw from $G_0$ with probability $\frac{\theta+dt}{\theta+c_\cdot}$.
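To make this generative procedure concrete, the following Python sketch draws a sequence of words from a single Pitman-Yor process via the scheme just described. The function name and the \texttt{draw\_from\_G0} argument (a draw from the uniform base distribution $G_0$) are illustrative choices of ours, not part of the implementation used later in this chapter.
\begin{verbatim}
import random

def draw_from_pyp(d, theta, draw_from_G0, num_words):
    """Draw num_words words from G ~ PY(d, theta, G0) via the
    generalised Chinese restaurant construction described above."""
    draws = []    # y_1, y_2, ...: the draws ("tables") taken from G_0 so far
    counts = []   # c_k: number of words assigned to draw y_k
    words = []
    for _ in range(num_words):
        c_total = sum(counts)   # c_. : number of draws from G so far
        t = len(draws)          # t   : number of draws from G_0 so far
        # reuse draw y_k with probability (c_k - d) / (theta + c_.),
        # take a new draw from G_0 with probability (theta + d*t) / (theta + c_.)
        r = random.uniform(0.0, theta + c_total)
        for k, c_k in enumerate(counts):
            r -= c_k - d
            if r <= 0.0:
                counts[k] += 1
                words.append(draws[k])
                break
        else:
            y = draw_from_G0()
            draws.append(y)
            counts.append(1)
            words.append(y)
    return words

# Example usage with a uniform base distribution over a toy vocabulary:
# vocabulary = ["bananas", "are", "my", "favourite", "fruit"]
# sample = draw_from_pyp(0.8, 1.0, lambda: random.choice(vocabulary), 100)
\end{verbatim}
With $d=0$ the procedure reduces to the Chinese restaurant process of the Dirichlet process; the $-d$ discount on every occupied table and the $+dt$ bonus for new tables are what produce the power-law behaviour over word types.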
For an $n$-gram language model we use a hierarchical extension of the PYP. The hierarchical extension describes the probabilities over the current word given various contexts consisting of up to $n-1$ words. Given a context $\mathbf{u}$, let $G_\mathbf{u}(w)$ be the probability of the current word taking on value $w$. A PYP is used as the prior for $G_\mathbf{u}=[G_\mathbf{u}(w)]_{w\in W}$:
\begin{equation*}
G_\mathbf{u}\sim\operatorname{PY}(d_{|\mathbf{u}|}, \theta_{|\mathbf{u}|},G_{\pi(\mathbf{u})}),
\end{equation*}
where $\pi(\mathbf{u})$ is the suffix of $\mathbf{u}$ consisting of all but the first word, and $|\mathbf{u}|$ is the length of $\mathbf{u}$. The priors are placed recursively with parameters $\theta_{|\pi(\mathbf{u})|}$, $d_{|\pi(\mathbf{u})|}$ and mean vector $G_{\pi(\pi(\mathbf{u}))}$, until we reach $G_\emptyset$: \begin{equation*}
G_\emptyset \sim\operatorname{PY}(d_0,\theta_0,G_0),
\end{equation*}\vspace{-0.05cm}
with $G_\emptyset$ being the vector of word probabilities for the empty context $\emptyset$, whose own prior has the uniform distribution $G_0$ as its mean.
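As a concrete illustration, for a trigram context such as $\mathbf{u} = (\text{my},\text{favourite})$ the hierarchy places the following chain of priors, ending in the uniform base distribution:
\begin{align*}
G_{(\text{my},\,\text{favourite})} &\sim \operatorname{PY}(d_2,\theta_2,G_{(\text{favourite})}),\\
G_{(\text{favourite})} &\sim \operatorname{PY}(d_1,\theta_1,G_{\emptyset}),\\
G_{\emptyset} &\sim \operatorname{PY}(d_0,\theta_0,G_0).
\end{align*}
In this way the predictive distribution for a long context is smoothed towards the distributions of ever shorter suffixes of that context.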
\section{Backoff strategies}
Although numerous backoff strategies for $n$-grams have been proposed\autocite{chen1996empirical}, skipgrams by their nature allow for an even greater number of backoff strategies. In this section we limit ourselves to three backoff strategies that we will investigate: \BON, \BOL, and \BOF. We first introduce a graphical representation, followed by a formal definition.
The \BON backoff method uses the backoff structure as prescribed by the HPYPLM. Even when trained on skipgrams, \BON only backs off to $n$-grams. \Cref{fig:bon} illustrates the \BON hierarchy.
\input{figbon}
\BOF is similar to \BON, in that it always performs full recursive backoff to the unigram probabilities. But instead of ignoring the skipgrams, \BOF uses both pattern types. In contrast to the straight hierarchical backoff structure of \BON, \BOF yields a graph-like structure, starting with the root, $\mathbf{u}w$, and all steps ending in $\emptyset$. The structure is depicted in \cref{fig:bof}.
\input{figbof}
At each step we allow only one skip to be introduced, and neither the first position nor the last\footnote{The focus word $w$.} can be a skip. The intuition is that if a pattern occurs in the data, we do not have to back off, since we can already estimate a probability.
The \BOL backoff strategy is an extension of the \BOF backoff strategy, in that it also uses both $n$-grams and skipgrams and follows the same backoff rules, but stops the recursion if a test pattern $\mathbf{u}w$ has already occurred in the training data. This means that the count is not zero, and hence at training time a probability has been assigned to that pattern.
Suppose we have collected the frequencies of patterns from \obw:\footnote{With the HPYPLM one would in fact look at the frequencies after sampling, since they might not be the same as those in the original corpus, especially in cases where the sampled frequencies are now zero, or vice versa.}
\begin{multicols}{2}\begin{enumerate}
\item[0] bananas are my favourite fruit
\item[1] are my favourite fruit
\item[2] my favourite fruit
\item[7] favourite fruit
\item[15391] fruit
\item[1] are \{1\} favourite fruit
\item[1] are my \{1\} fruit
\item[90] are \{2\} fruit
\item[13] my \{1\} fruit
\item[0] bananas \{1\} my \{1\} fruit
\item[0] bananas \{1\} my favourite fruit
\item[0] bananas \{2\} favourite fruit
\item[4] bananas \{3\} fruit
\item[0] bananas are \{1\} favourite fruit
\item[0] bananas are \{2\} fruit
\item[0] bananas are my \{1\} fruit
\end{enumerate}\label{enum:minicorpus}\end{multicols}%
%
%
\noindent Then for all patterns in \cref{fig:bol} that occur in this corpus, we remove their backoff steps.
\input{figbol}
In a setup with only $n$-gram features, \BON and \BOF are functionally the same in terms of backoff strategy. The nuance lies in the way they do the backoff.
%\BOF tries to backoff to skipgrams,\footnote[][5em]{It tries, does not find the skipgram, since it is not trained on skipgrams, and continues with the backoff procedure.} independent of whether it was trained with skipgrams.
\BOF backs off to skipgrams, independently of whether they occurred in the training data.
The \BON strategy has only one way to get to the a-priori word probability.
For \BOF, each skipgram that is generated provides a path to the unigram word probabilities, hence it puts more emphasis on the word probabilities.\footnote{Compare the number of paths to e in \cref{fig:bon} and \cref{fig:bof}.}
Now for the formal definition of the strategies. For all strategies we have that $p(w|\mathbf{u})=G_0(w)$ if $\mathbf{u} = \emptyset$. For \BON, the other case is defined as:
\begin{equation}\begin{split}
p(w|\mathbf{u})= &
\frac{c_{\mathbf{u}w\cdot}-d_{|\mathbf{u}|}t_{\mathbf{u}w\cdot}}{\theta_{|\mathbf{u}|}+c_{\mathbf{u}\cdot\cdot}} +
\frac{\theta_{|\mathbf{u}|}+d_{|\mathbf{u}|}t_{\mathbf{u}\cdot\cdot}}{\theta_{|\mathbf{u}|}+c_{\mathbf{u}\cdot\cdot}}
p(w|\pi(\mathbf{u}))
\end{split}\end{equation}
%with $c_{\mathbf{u}wk}$ being the count of $w$ having the value $k$ after context $\mathbf{u}$;
with $c_{\mathbf{u}w\cdot}$ being the number of $\mathbf{u}w$ tokens, and $c_{\mathbf{u}\cdot\cdot}$ the number of patterns starting with context $\mathbf{u}$. Similarly, $t_{\mathbf{u}wk}$ is 1 if the $k$th draw from $G_{\mathbf{u}}$ was $w$, and $0$ otherwise; $t_{\mathbf{u}w\cdot}$ then denotes whether there is a pattern $\mathbf{u}w$, and $t_{\mathbf{u}\cdot\cdot}$ is the number of types following context $\mathbf{u}$.
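As a reading aid, the \BON recursion can be transcribed directly into code. The following Python sketch assumes the restaurant statistics have been collected into two dictionaries, \texttt{c} and \texttt{t}, keyed by (context, word) tuples, with the special key \texttt{(u, None)} holding the marginal counts $c_{\mathbf{u}\cdot\cdot}$ and $t_{\mathbf{u}\cdot\cdot}$; these names and the dictionary layout are illustrative only.
\begin{verbatim}
def p_bon(w, u, c, t, d, theta, G0):
    """BON predictive probability p(w | u), following the equation above.
    c[(u, w)] = c_{uw.}, c[(u, None)] = c_{u..},
    t[(u, w)] = t_{uw.}, t[(u, None)] = t_{u..};
    d[n] and theta[n] are the parameters for contexts of length n."""
    if len(u) == 0:
        return G0(w)                    # p(w | empty context) = G_0(w)
    n = len(u)
    c_uw = c.get((u, w), 0)
    c_u = c.get((u, None), 0)
    t_uw = t.get((u, w), 0)
    t_u = t.get((u, None), 0)
    backoff = p_bon(w, u[1:], c, t, d, theta, G0)   # pi(u): drop first word
    return ((c_uw - d[n] * t_uw)
            + (theta[n] + d[n] * t_u) * backoff) / (theta[n] + c_u)
\end{verbatim}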
Recall\footnote{See \cref{ch:languagemodels}.} that $\sigma_n$ is the operator that adds a skip to a pattern $\mathbf{u}$ on the $n$th position if there is not already a skip. Then $\boldsymbol\sigma(\mathbf{u}) = \left[\sigma_n(\mathbf{u})\right]_{n=2}^{|\mathbf{u}|}$ is the set of patterns with one skip more than the number of skips currently in $\mathbf{u}$. The number of generated patterns is $\boldsymbol\varsigma=|\boldsymbol\sigma(\mathbf{u})|$.
\begin{equation}
\begin{split}
p(w|\mathbf{u}) =& \sum_{m=1}^{\boldsymbol\varsigma}\left\{ \frac{1}{\mathbf{\boldsymbol\varsigma}+1}\left[
\frac{c_{\mathbf{u}_mw\cdot}-\delta_{\mathbf{u}_m}d_{|\mathbf{u}_m|}t_{\mathbf{u}_mw\cdot}}{\delta_{\mathbf{u}_m}\theta_{|\mathbf{u}_m|}+c_{\mathbf{u}_m\cdot\cdot}} + S_{\mathbf{u}_mw}\left(
\frac{\theta_{|\mathbf{u}_m|}+d_{|\mathbf{u}_m|}t_{\mathbf{u}_m\cdot\cdot}}{\delta_{\mathbf{u}_m}\theta_{|\mathbf{u}_m|}+c_{\mathbf{u}_m\cdot\cdot}}
p(w|\pi(\mathbf{u}_m))\right)\right] \right\} \\
+
%%
& \frac{1}{\mathbf{\boldsymbol\varsigma}+1}\left[
\frac{c_{\mathbf{u}w\cdot}-\delta_{\mathbf{u}}d_{|\mathbf{u}|}t_{\mathbf{u}w\cdot}}{\delta_{\mathbf{u}}\theta_{|\mathbf{u}|}+c_{\mathbf{u}\cdot\cdot}}+ S_{\mathbf{u}w}\left(\frac{\theta_{|\mathbf{u}|}+d_{|\mathbf{u}|}t_{\mathbf{u}\cdot\cdot}}{\delta_{\mathbf{u}}\theta_{|\mathbf{u}|}+c_{\mathbf{u}\cdot\cdot}}
p(w|\pi(\mathbf{u}))\right)\right]
\end{split}\label{eq:interpolform}\end{equation}
\Cref{eq:interpolform} defines the backoff for both skipgram strategies, with $\mathbf{u}_m = \sigma_m(\mathbf{u})$. It makes use of an indicator function $S$ and a discount frequency $\delta_{\mathbf{u}}$. For the {\sf full} backoff strategy, $S$ always returns its argument, $S_{\mathbf{u}w}(y) = y$, and $\delta_{\mathbf{u}} = 1$.
The \BOL backoff strategy is an extension of the \BOF backoff strategy that stops the recursion if a test pattern $\mathbf{u}w$ has already occurred in the training data. This means that its count is not zero, and hence at training time a probability has been assigned to that pattern. For \BOL, $S$ is the indicator function which tells whether a pattern has been seen during training: $S_{\mathbf{u}w}(\cdot) = 0$ if $\mathrm{count}(\mathbf{u}w) > 0$, and $1$ otherwise; furthermore $\delta_{\mathbf{u}} = V-\sum_{w\in W} S_{\mathbf{u}w}(\cdot)$. Setting $S_{\mathbf{u}w}(\cdot) = 0$ stops the recursion.\footnote{Although it may feel unnecessary to add an expensive term $\delta_{\mathbf{u}}$ to the formula, in \cref{app:der} we show that this extra term is needed to keep $p(w|\mathbf{u})$ a proper distribution.}
The backoff strategies \BON and \BOF always use the same number of backoff probabilities: for $5$-grams 5 and 53 respectively.\footnote{For $4$-grams, 5 and 22 respectively.} For \BOL the number depends on the training material: if a top-level pattern matches the training material, only one probability is used; in the worst case, \BOL equals \BOF.
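To summarise the strategies operationally, the Python sketch below first generates the skipgram variants $\boldsymbol\sigma(\mathbf{u})$ and then interpolates over them as in \cref{eq:interpolform}. For readability the sketch fixes $\delta_{\mathbf{u}}=1$ (the \BOF case) and only mimics \BOL by switching off the backoff term for patterns seen in training; the full $\delta_{\mathbf{u}}$ correction of \cref{app:der} is omitted. The skip marker, helper names and dictionary layout (the same as in the \BON sketch above) are illustrative.
\begin{verbatim}
SKIP = "{*}"   # illustrative marker for a skipped position

def sigma(u):
    """All variants of the context tuple u with exactly one extra skip.
    The first position may not become a skip (and the focus word w,
    which is not part of u, never can)."""
    return [u[:i] + (SKIP,) + u[i + 1:]
            for i in range(1, len(u)) if u[i] != SKIP]

def p_skip(w, u, c, t, d, theta, G0, stop_on_seen=False):
    """Interpolated skipgram backoff p(w | u) with delta_u fixed to 1.
    stop_on_seen=False gives the BOF behaviour (always back off);
    stop_on_seen=True mimics BOL by setting S_{uw} = 0 for patterns
    seen in training, which stops the recursion there."""
    if len(u) == 0:
        return G0(w)
    contexts = sigma(u) + [u]        # the varsigma skip variants plus u itself
    weight = 1.0 / len(contexts)     # 1 / (varsigma + 1)
    total = 0.0
    for um in contexts:
        n = len(um)
        c_uw = c.get((um, w), 0)
        c_u = c.get((um, None), 0)   # c_{u..}: tokens with context um
        t_uw = t.get((um, w), 0)
        t_u = t.get((um, None), 0)   # t_{u..}: types following um
        term = (c_uw - d[n] * t_uw) / (theta[n] + c_u)
        if not (stop_on_seen and c_uw > 0):          # S_{um, w}
            term += ((theta[n] + d[n] * t_u) / (theta[n] + c_u)) * \
                    p_skip(w, um[1:], c, t, d, theta, G0, stop_on_seen)
        total += weight * term
    return total
\end{verbatim}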
\section{Sampling the parameters}
During training we need to infer the seating arrangements (the table counts $t$) as well as values for the discount and strength parameters $d_{|\mathbf{u}|}$ and $\theta_{|\mathbf{u}|}$. The seating arrangements are inferred with Gibbs sampling over the customers, for which we rely on the histogram-based sampler of \textcite{blunsom2009note}; the hyperparameters are resampled with slice sampling\autocite{walker2007sampling}. The experimental setup below specifies how often each of these is resampled.
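As an illustration of what one sweep of the Gibbs sampler does, consider the following sketch of a single restaurant and of a sweep over the training tokens. The class and function names are ours, and the bookkeeping that propagates table openings and closings to the parent restaurant (the backoff context) is only indicated in the comments; the actual implementation relies on the histogram-based sampler of \texttt{cpyp}\autocite{blunsom2009note}.
\begin{verbatim}
import random

class Restaurant:
    """CRP state for one context u: for every word w, a list with the
    number of customers at each table serving w."""
    def __init__(self, d, theta):
        self.d, self.theta = d, theta
        self.tables = {}               # word -> [customers per table]

    def add_customer(self, w, base_p):
        tabs = self.tables.setdefault(w, [])
        t_total = sum(len(v) for v in self.tables.values())
        # join an existing table serving w with weight (c_k - d); open a
        # new table with weight (theta + d * t_total) * base_p
        weights = [c - self.d for c in tabs]
        weights.append((self.theta + self.d * t_total) * base_p)
        k = random.choices(range(len(weights)), weights=weights)[0]
        if k == len(tabs):
            tabs.append(1)
            return True                # new table: also add a customer upstream
        tabs[k] += 1
        return False

    def remove_customer(self, w):
        tabs = self.tables[w]
        k = random.choices(range(len(tabs)), weights=tabs)[0]
        tabs[k] -= 1
        if tabs[k] == 0:
            del tabs[k]
            return True                # table closed: also remove upstream
        return False

def gibbs_sweep(tokens, restaurants, p_backoff):
    """One Gibbs sweep: unseat and reseat every (context, word) token,
    which resamples the table counts t and thereby the smoothing."""
    random.shuffle(tokens)
    for u, w in tokens:
        restaurants[u].remove_customer(w)
        restaurants[u].add_customer(w, base_p=p_backoff(w, u))
\end{verbatim}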
\section{Experimental Setup}
We train $4$-gram language models with the HPYPLM as described in this chapter, on three corpora as described in \cref{chap:data}.
%In this chapter we do not use sentences markers to denote the begin or end of a sentence.
%Since we do not compute the perplexity on sentence-level, but rather on individual patterns that comprise the test corpus.
At the core of our experimental framework we use \texttt{cpyp}\footnote{\url{https://github.com/redpony/cpyp}}, an existing library for non-parametric Bayesian modelling with Pitman-Yor priors and histogram-based sampling\autocite{blunsom2009note}. This library comes with an example application that showcases its performance on $n$-gram-based language modelling. Limitations of the library, such as the lack of native support for skip- and flexgrams and for functionality such as thresholding and discarding of certain patterns, led us to combine it with \texttt{Colibri Core}\footnote{\url{https://proycon.github.io/colibri-core/}}, a pattern modelling library that does provide these facilities\autocite{gompel2016efficient}. Together the libraries form a complete Bayesian language model that handles skipgrams.\footnote{The code and documentation are available at \url{https://github.com/naiaden/SLM}.}
Each model is run for 50 iterations\footnote{Without an explicit burn-in phase, as we are not using statistics or samples from any but the last iteration.}, with initial hyperparameters $d = 0.8$ for the discount and $\theta = 1.0$ for the strength. The hyperparameters are resampled every 30 iterations\footnote{Which in practice results in one resampling of the hyperparameters. The value is chosen empirically.} with slice sampling\autocite{walker2007sampling}.
We test each model on different test sets, and measure intrinsic performance by means of perplexity. Words in the test set that were unseen in the training data are discarded from the perplexity computation.
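For clarity, the perplexity computation over the test patterns can be summarised as in the following sketch; the \texttt{model.prob} interface is a placeholder for the predictive probability $p(w|\mathbf{u})$ of whichever backoff strategy is used.
\begin{verbatim}
import math

def perplexity(test_patterns, model, vocabulary):
    """Perplexity over (context, word) test patterns; focus words unseen
    in the training data are discarded from the computation."""
    log_sum, n = 0.0, 0
    for context, word in test_patterns:
        if word not in vocabulary:       # OOV: probability discarded
            continue
        log_sum += math.log(model.prob(word, context))
        n += 1
    return math.exp(-log_sum / n) if n else float("inf")
\end{verbatim}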
\section{Results}
In this section we describe the results of multiple experiments. First we compare the HPYPLM approach to a traditional $n$-gram language model. Next, we show the influence of using skipgrams in addition to $n$-grams, and investigate whether this influence holds in a cross-domain evaluation. Finally, we compare the backoff strategies, and investigate which domains fit which backoff strategy best.
\clearpage
Pickhardt and colleagues\autocite{pickhardt2014generalized} report perplexity values on two subsets of the test material, jrc-p and wp-p.\footnote{Both consist of 100 thousand $4$-gram patterns sampled from their respective test set, as described in \cref{chap:data}.} Of particular interest are the values for $n=4$, where for training on \wp and testing on wp-p a perplexity of 404 is reported with MKN smoothing, and 378 with their generalised language model (GLM). Similarly, for \jrc and jrc-p they report a reduction in perplexity from 83 with MKN to 81 with their GLM. In our attempts to reproduce these values with SRILM\autocite{stolcke2002srilm}, we found substantially lower perplexity values for MKN,\footnote{Lower perplexity values indicate a higher intrinsic performance, which is better.} as can be seen in \cref{tab:mknvsglm}.\footnote{We used the following program call: \texttt{ngram-count -kndiscount -gt3min 1 -gt4min 1 -order 4 -interpolate -kndiscount -unk}.}
The rows represent the train corpora; columns represent the test corpora. Comparing our results with the perplexity values reported for the GLM with modified Kneser-Ney thus yields an unfair comparison. For this reason we choose the baseline to be the perplexity values from SRILM with modified Kneser-Ney.
\begin{table*}
\begin{tabular}{l*{6}{l}l*{6}{l}}
& \multicolumn{6}{c}{HPYLM} & & \multicolumn{6}{c}{modified Kneser-Ney} \\
& \jrc & \obw & \emea & \wp & jrc-p & wp-p & & \jrc & \obw & \emea & \wp& jrc-p & wp-p\\ \cline{2-7}\cline{9-14}
\jrc & 13 & 1510 & 1081 & 1293 & 13 & 1269 & & 22 & 1664 & 1310 & 1837 & 122 & 1736 \\
\obw & 1232 & 171 & 1749 & 724 & 1156 & 692 & & 1460 & 211 & 1516 & 996 & 1383 & 1046 \\
\emea & 769 & 1552 & 4 & 1097 & 774 & 1093 & & 1115 & 1745 & 10 &1669 & 993 & 1444 \\
\wp & 555 & 455 & 1005 & 217 & 537 & 212 & & 842 & 598 & 1590 & 449 & 1088 & 646
\end{tabular}
\caption[][1.5em]{A side-by-side comparison of perplexities of two $n$-gram language models with $n=4$: the Bayesian hierarchical Pitman-Yor language model (HPYLM) and the frequentist modified Kneser-Ney, as implemented by SRILM. The rows represent the training corpora, and the columns the test sets. }\label{tab:mknvsglm}
%The two additional data sets jrc-p and wp-p are around 100k randomly selected 4-grams from jrc and wp respectively, as selected by \newcite{pickhardt2014generalized}.
\end{table*}
Teh\sidecite<12em>[]{teh2006hierarchical} shows that the HPYPLM can outperform MKN. In our experiments we reach the same conclusion, on the basis of the results listed in \cref{tab:mknvsglm}. On the right-hand side of the table we show the results of SRILM's implementation of MKN, now without sentence boundaries, matching the settings of our HPYPLM. The perplexities obtained by the HPYPLM are consistently lower by a substantial margin.
Bayesian language models are more complex than their frequentist counterparts such as MKN. Not only is the model itself more complex, but so is the computational cost, both in terms of time and memory. On an Intel Xeon E5-2660 with 512\,GB of working memory it takes 3 days to compute the \obw model and serialise it to file. However, if we use only one iteration of sampling,\footnote{With zero sampling iterations, we have effectively reduced the HPYPLM to IKN.} we already achieve perplexities approaching those reported in \cref{tab:mknvsglm} with 50 iterations, in all setups. This one-iteration learning process does not alleviate the burden of memory complexity\footnote{The number of parameters only changes when a restaurant has no visitors, or when one is being founded. The number of customers stays constant over all sampling iterations.}, but training is now only a factor of two slower than traditional MKN, and the performance is already notably better.
Yet, in the remainder of this chapter we use 50 iterations for Gibbs sampling. For testing, once the model has been loaded into memory, there is no time penalty involved when choosing HPYPLM over MKN, which is an important property for extrinsic and time-dependent evaluation.
\subsection{Influence of adding skipgrams to an $n$-gram language model}
The results for English language models with skipgrams are reported in \cref{ta:ngramvsskipgram} in terms of perplexity. We chose to report the values obtained with the best-performing backoff strategy. The rows represent the train corpora, and the columns the test corpora. On the diagonal the perplexity always has the lowest value of its column, since for all domains the best performance on a test set is achieved by training on the same domain.
We also observe an effect of domain-specific versus generic corpora. Within-domain evaluation on domain-specific corpora yields very low perplexities (13 on \jrc, 4 on \emea), while the perplexities are higher for within-domain evaluation on generic corpora (158 for \obw, 216 for \wp). There is no significant difference between training with only $n$-grams and training with skipgrams on within-domain test sets.
\begin{table*}
\begin{tabular}{lllllllllllllll}
& \multicolumn{4}{c}{$n$-grams} & & \multicolumn{4}{c}{skipgrams}&& \multicolumn{4}{c}{Relative difference (in \%)}\\
& \jrc & \obw & \emea & \wp & & \jrc & \obw & \emea & \wp & & \jrc & \obw & \emea & \wp \\ \cline{2-5}\cline{7-10} \cline{12-15}
\jrc & 13 & 1195 & 961 & 1011 & & 13 & 1162 & 939 & 1008 & & 0 & -3 & -2 & 0 \\
\obw & 1232 & 171 & 1749 & 724 & & 728 & 141 & 1069 & 542 & & -41 & -6 & -39 & -25 \\
\emea & 600 & 1143 & 4 & 843 & & 581 & 1155 & 4 & 842 & & -3 & 1 & 1 & 0 \\
\wp & 555 & 455 & 1005 & 217 & & 565 & 470 & 990 & 227 & & 2 & 3 & -1 & 4
\end{tabular}
\caption[][1.5em]{An overview of the results to compare the influence and contribution of skipgrams for English.
The perplexities can only be compared row-wise, as the vocabulary depends on the training set. For these values we chose the best-performing backoff strategy. The relative difference is the percentage change, with a negative value indicating an improvement with skipgrams.}\label{ta:ngramvsskipgram}
\end{table*}
However, we find that adding skipgrams does tend to contribute to a lower perplexity in a cross-domain setting, with many non-diagonal cells showing lower perplexities. We observe absolute perplexity reductions up to 33, and relative reductions up to 3.2\%. We are mostly interested in the effects of training on a generic corpus, and testing on a domain-specific corpus: it is generally easier to obtain generic texts, and even if there is a corpus for that specific domain, there will always be more generic data. If it turns out that the generic corpora can be used to model domain-specific corpora without much deterioration, this paves the way to fully use the billion-word data sets that are currently available for many languages.\footnote{Corpora such as News on the Web (NOW), Wikipedia, Corpora from the Web (COW), inter alia.} If we focus on the rows for \obw and \wp, we see that for the cross-domain setups with \jrc and \emea, the performance improves when we add skipgrams to the input features, except when we test \jrc trained on \wp.
In particular we focus on \obw for the remainder of this thesis: compared to \wp, which contains articles in Wikipedia's encyclopaedic style, it contains a wider variety of text and is less restricted in its language use. Supported by the perplexity results, we consider \wp to be less generic.\footnote{This is independent of the number of words.} This is reflected in the perplexity scores as well, where a within-domain experiment on \wp (227 ppl) appears to be more difficult than one on \obw (141 ppl). In the \obw-\wp and \wp-\obw experiments the perplexities are closer together, at 542 and 470 respectively.
\subsection{Effect of different backoff methods}
In the previous subsection we compared pure $n$-gram models with skipgram models, where we used the best performing backoff methods. In this subsection we study the effect of different backoff methods in more detail.
Although a skipgram model contains all $n$-grams that an $n$-gram model trained on the same data contains, the $n$-gram probabilities differ between the two models. If we train with skipgram features but use only the $n$-grams during testing (\BON), one might expect the same perplexity. However, a side-effect of training with skipgrams is that the lower-order patterns contained in the skipgrams are seen more often than when training with only $n$-grams.\footnote{In a setting without thresholding, the relative frequencies of words stay the same whether we train with only $n$-grams or with skipgrams; the absolute frequencies differ because an $n$-gram spawns multiple skipgrams. In a setting with thresholding, however, only the frequently occurring skipgrams make the threshold. In particular this is the case for $n$-grams in which words are OOV due to thresholding, where the skipgram count equals the $n$-gram count.}
In our analyses we included this strategy to gauge the possible negative effects of using skipgrams: even in scenarios where skipgrams do not contribute, \BON may still yield a perplexity similar to that of an HPYPLM trained on only $n$-grams. If that is the case, we can opt to use skipgrams whenever possible, without risking a deterioration in performance.
The \BOF backoff strategy is effective especially in cross-domain situations, where backing off to the word probabilities is beneficial. We see this especially in \jrc and \emea; for \wp this effect is not present. If we compare the OOV rate for \wp on cross-domain sets, it is consistently higher than the OOV rate for \obw on other sets. This explains why a backoff procedure which emphasises the word level more does not help in such cases.
For a within-domain experiment on \obw, we notice that up to 30M training words \BOF slightly outperforms \BOL. With more training data, however, \BOL has seen enough $n$-grams to overtake \BOF.
\subsection{Observations from \obw learning curves}
The learning curves for \obw are shown in \cref{fig:1bwlc}. The pure $n$-gram models are outperformed by the skipgram models, for all four test sets. We are mostly interested in comparing \BON with the best performing skipgram backoff method.
The first observation is that the skipgrams are especially useful in the beginning, when the perplexity is very low. However, these values are discarded,\footnote{By definition of the way perplexity is normally computed.} since at this point there are still too many OOV words and the model therefore covers only a few words.\footnote{Note that this experiment tests only words in the vocabulary. This is in line with all other experimental results in this thesis.} Only after the learning curve drops do we consider the values to be of use. Here the perplexity of the skipgram models is consistently lower than that of \BON. Since the \BON models have a higher peak in the learning curve, it is also easier for them to gain a reduction in perplexity with an increasing number of training words. With the amount of training data we have, the models do not converge for cross-domain testing. For none of the eight models plotted in \cref{fig:1bwlc} do we see any plateauing behaviour, indicating that all eight models would gain from having even more training data.
In the same plot we can also see that there is a clear distinction between the test sets. The behaviour when testing on \obw differs from that when testing on \jrc, \emea, or \wp. The latter three share the same progression, with \wp's curve shifted downwards,\footnote{Since it shares the property of being more generic than \jrc and \emea.} unlike the domain-specific sets. For these models \BON and \BOF also retain a gap, whereas \BON and \BOL of \obw are converging at the end. This seems to confirm that skipgrams are especially useful in cross-domain settings, where even with an order of magnitude more data the lines do not seem to converge.\footnote{Although we do not in fact have more data; we extrapolated this behaviour on the basis of the plots.}
For the within-domain experiment (\cref{afiga}) we have reached a point where we have seen enough $n$-grams to estimate the probabilities, and the patterns modelled by skipgrams are also covered sufficiently by $n$-grams. We verified this by checking to which level the models back off, and find that with an increasing amount of training material the average backoff depth becomes smaller, and that for out-of-domain experiments the average backoff depth is larger.
\begin{figure*}
%\captionsetup[subfigure]{labelformat=empty}
\centering
\subfloat[Learning curve for training on \obw, and testing on \obw and \wp. \BOL was chosen because it is the best-performing backoff strategy for \obw.]{%
\label{afiga}
\begin{tikzpicture}
\begin{axis}[
name={afiga},
height=6cm, width=8cm, legend cell align=left, legend style={legend pos=north west, fill=none, font=\small},
title=Testing on general texts,
xmode = log,
xmin=800,xmax=2048576000,
%ymin=-100,ymax=40,
xlabel=Train corpus size (\obw; in words),
ylabel=Perplexity]
\addplot[brown,thick] table [x=1bw, y=ngram]{data/1bw-1bw.dat};
\addlegendentry{\obw \BON}
\addplot[brown,thick,dotted] table [x=1bw, y=limited]{data/1bw-1bw.dat};
\addlegendentry{\obw \BOL}
%\addplot[brown,thick,dashed] table [x=1bw, y=full]{1bw-1bw.dat};
%\addlegendentry{1bw full}
\addplot[black,thick] table [x=wp, y=ngram]{data/1bw-wp.dat};
\addlegendentry{\wp \BON}
\addplot[black,thick,dashed] table [x=wp, y=full,mark=*]{data/1bw-wp.dat};
\addlegendentry{\wp \BOF}
\end{axis}
\end{tikzpicture}%
}
\subfloat[Learning curve for training on \obw, and testing on \jrc and \emea.]{%
\label{afigb}
\begin{tikzpicture}
\begin{axis}[
name={afigb},
height=6cm, width=8cm, legend cell align=left, legend style={legend pos=north west, fill=none, font=\small},
title=Testing on domain-specific texts,
xmode = log,
xmin=800,xmax=2048576000,
%ymin=-100,ymax=40,
xlabel=Train corpus size (\obw; in words),
xshift=8cm]
\addplot[blue,thick] table [x=jrc, y=ngram]{data/1bw-jrc.dat};
\addlegendentry{\jrc \BON}
\addplot[blue,thick,dashed] table [x=jrc, y=full]{data/1bw-jrc.dat};
\addlegendentry{\jrc \BOF}
\addplot[red,thick] table [x=emea, y=ngram]{data/1bw-emea.dat};
\addlegendentry{\emea \BON}
\addplot[red,thick, dashed] table [x=emea, y=full,mark=*]{data/1bw-emea.dat};
\addlegendentry{\emea \BOF}
\end{axis}
\end{tikzpicture}%
}
\caption{The solid lines indicate the perplexity of the \BON model, the dotted line \BOL for testing on \obw, and the dashed lines \BOF for \jrc, \emea, and \wp.}\label{fig:1bwlc}
\end{figure*}
\section{Behaviour of the Language Models}
Aside from a quantitative analysis in terms of perplexity, we are also interested in the qualitative differences between the compared approaches. In this section we analyse the skipgrams that contribute the most, and which are most prominent. We finish the section with a look at the models from a Zipfian point of view, and compare the frequentist and Bayesian rank-ordered frequencies.
\subsection{When do skipgrams work?}
Earlier we stated that skipgrams can be part of the solution for modelling long-range dependencies and shorter-range non-contiguous patterns. In the experiments described in this thesis we have limited ourselves to modelling each skip with one token, where skips must be surrounded by word tokens. This limits the skipgrams of length $4$ as used in this thesis to skips spanning at most two word positions.
To understand which patterns can be modelled with these skipgrams, and in particular, which of these patterns contribute positively to a better score, we analysed all $4$-grams in the test data which have a higher posterior probability with skipgram features added to the training model.
%
To this end, we analysed the results of one run in which we trained on \obw and tested on the 2.2M $4$-grams in \emea. Of those $4$-grams 473 thousand showed a lower perplexity when adding skipgram features, as compared to the standard $n$-gram model.
In this section we approach the question from three different angles. First we consider the skipgram patterns that boost the probability of test patterns the most. Second, we look at the most frequent skipgram patterns that cause a lower perplexity for the skipgram model. Third, we are interested in patterns in which high-frequency words skip over low-frequency words, because such patterns are not only frequent themselves, but also invariant across domains.
\begin{table*}
\small
\begin{tabular}{lllllll}
most helpful & & most-frequently helpful & & skip on second & & skip on third \\ \cline{1-1}\cline{3-3}\cline{5-5}\cline{7-7}
if the \{1\} blood & & , \{2\} , & & the \{1\} of the & & in the \{1\} of\\
: in \{1\} with & & ( \{2\} ) & & the \{1\} of a & & to the \{1\} of\\
treatment \{1\} cervical cancer & & see \{2\} ) & & , \{1\} , and & & on the \{1\} of\\
product \{1\} due to & & see section \{1\} ) & & The \{1\} of the & & of the \{1\} of\\
result \{1\} due to & & of \{2\} and & & and \{1\} of the & & of the \{1\} . \\
\end{tabular}
\caption[][1.5em]{The first column contains the 5 most-helpful patterns; the second column the 5 most-frequently helpful patterns. The third and fourth columns contain the top-5 most-frequent skipgrams with the skip on the second (left) and third (right) position, for cases where the skipped word is 20 or more times less frequent than the least frequent word in the skipgram.}\label{ta:examples}
\end{table*}
%(': {1} accordance with', 0.15049816926667361),
% (': in {1} with', 0.15049816926667361),
% ('treatment {1} cervical cancer', 0.10562404017518799),
% ('product {1} due to', 0.09975897235968394),
% ('result {1} due to', 0.07797193841653396),
Patterns with the largest difference in probability are typically domain-specific patterns of which the words occur in the training data, but where there is no direct $n$-gram match. See \cref{ta:examples} for 5 examples.
%The pattern \emph{such as myocardial infarction} is a very domain-specific pattern, and as such does not occur in 1bw. However, since parts of it occurs in multiple forms (\emph{bowel infarction}, \emph{celebral infarction}, etc.), having a skipgram \emph{such as \{1\} infarction} is beneficial.
%The pattern \emph{intestines ( ulcerative colitis} is a very domain-specific pattern, and as such does not occur in 1bw. However the trigram \emph{( ulcerative colitis} occurs once, and its bi-, and unigram \emph{ulcerative colitis} and \emph{colitis} occur 189 and 337 times. We showed earlier that adding skipgrams add causes the subpatterns in the skipgrams to be added: e.g. each bigram is added twice. Additionally \emph{full} places an emphasis in the backoff procedure on the unigrams, which helps improving the probability of this pattern a lot.
The pattern \emph{if the systolic blood} is a very domain-specific pattern in \emea, and is not covered by \obw. However, the pattern \emph{if the \{1\} blood} occurs in multiple forms (e.g.\ \emph{bad blood}, \emph{maternal blood}, \emph{' blood}). Although there is a match for \emph{the systolic blood} in \obw, the skipgrams help to boost the probability by over $0.15$.
%An example is \emph{: in accordance with}, whose high-frequency is also domain-specific for biomedical texts (326 times in emea). The improvement of the probability is not caused by emphasising the unigrams, but because of the skipgram patterns added. As a whole it occurs once in 1bw. The skipgram \emph{: \{2\} with} occurs 3969 times, \emph{: \{1\} accordance with} and \emph{: in \{1\} with} 6 and 22 times respectively. Leaving out the first marker gives an even better clue: 1bw contains 143909 occurrences of \emph{in \{1\} with} .
On the other hand, the patterns with the largest difference in probability occur only very sparsely. To explain the gain in perplexity when adding skipgrams, we also have to look at patterns with a smaller improvement in probability that occur often. Here we see that these patterns are less domain-specific.
%
%The first example \emph{e . g .} seems odd, but in the training data only \emph{e , g .} occurs, hence the skipgram is able to give a better estimation. Since this example occurs 1346 times, this has the potential to make a huge difference in the final perplexity.
The texts in \emea contain many subordinate clauses and parenthesised phrases, used for clarification. Since \obw covers a different domain, we do not find these clauses and phrases verbatim in the training data. However, we do frequently find patterns such as \emph{, \{2\} ,} and \emph{( \{2\} )}. Knowing that many clauses and phrases are of length 2, these skipgrams contribute a little each time an opening parenthesis occurs and the focus word is a closing parenthesis. The improvement is subtle, but accumulated over all test patterns it amounts to a noticeable change in the perplexity.
%The patterns above are mostly composed out of content words. For the last analysis we look at patterns composed of function words.
From the 473 thousand patterns that obtained a lower perplexity we first discarded the more specific patterns by imposing an occurrence ratio threshold of 20 on the words in the skips. This means that if the skip is $\{1\}$, the skipped word must occur at least 20 times less often than the least frequent of the three other words, and in the case of $\{2\}$ the two skipped words must each occur at least 20 times less often than the least frequent of the other two words. This ratio models the suspicion that skipgrams are particularly useful in situations where the skip covers a low-frequency content word.
We hardly observe any patterns other than these function-word patterns. This may be caused by the fact that \emea is very domain-specific, and that matches on content words are unlikely to occur often. It may be precisely because of this characteristic of \emea that the skipgrams extracted from \obw provide effective modelling of non-contiguous patterns: they skip unknown words and continue matching on (function) words further to the left.
In the analysis we listed all 473 thousand $4$-grams and artificially added skips in the three possible ways; e.g.\ \emph{in the previous section} has the three possible skipgrams \emph{in \{1\} previous section} (1), \emph{in the \{1\} section} (2), and \emph{in \{2\} section} (1,2). We then imposed occurrence ratio thresholds (20, 10, and 5) on the words in the skips, defined as above: for ratio 20 a skipped word must occur at least 20 times less often than the least frequent of the remaining words.
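A sketch of this ratio filter, assuming unigram frequencies in a dictionary \texttt{freq} and a $4$-gram represented as a tuple of words (all names illustrative):
\begin{verbatim}
def passes_ratio(ngram, skip_positions, freq, ratio=20):
    """True if every skipped word is at least `ratio` times less frequent
    than the least frequent of the remaining words in the 4-gram."""
    skipped = [ngram[i] for i in skip_positions]
    kept = [w for i, w in enumerate(ngram) if i not in skip_positions]
    least_kept = min(freq.get(w, 0) for w in kept)
    return all(freq.get(w, 0) * ratio <= least_kept for w in skipped)

# e.g. passes_ratio(("in", "the", "previous", "section"), {2}, freq)
# checks the skipgram variant "in the {1} section".
\end{verbatim}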
\begin{table*}
\begin{center}
\begin{tabular}{l*{3}{l}l*{3}{l}l*{3}{l}}
& \multicolumn{3}{c}{5} & & \multicolumn{3}{c}{10} & & \multicolumn{3}{c}{20} \\
& \obw & tokens & types & & \obw & tokens & types & & \obw & tokens & types \\
\cline{2-4}\cline{6-8}\cline{10-12}
1 & 7.5M & 3.6k & 1.9k & & 5.8M & 2.9k & 1.4k & & 4.4M & 2.2k & 1.0k \\
2 & 7.6M & 5.2k & 3.6k & & 5.9M & 4.2k & 2.1k & & 4.4M & 3.3k & 1.5k \\
1,2 & 5.2M & 2.9k & 526 & & 4.2M & 2.3k & 362 & & 3.3M & 1.8k & 258 \\
\end{tabular}
\caption[][0.5em]{For each skip position and occurrence ratio: the number of skipgram patterns in \obw that pass the ratio filter, the number of skipgram token matches with the \emea test set, and the number of matching skipgram types. The rows denote the position of the skip (counting positions from $0$), and the columns indicate the ratio by which the skipped words must be less frequent than their surrounding words. The table shows that only few patterns benefit from adding skipgrams.}\label{ta:skipratio}
\end{center}
\end{table*}
In the last phase we count all the skipgram types generated from the four-grams that yielded a lower perplexity when adding skipgram features. \Cref{ta:skipratio} lists the occurrence counts of skipgram tokens and types for the \obw-\emea experiment where we varied the frequency ratio in which the skipped words must occur with respect to their surrounding words. It becomes apparent that only a limited number of patterns benefit from adding skipgrams as features.
As an illustration we list in the last two columns of \cref{ta:examples} the five most frequent skipgram types with occurrence ratio 20, with the skip on position 2 and on position 3, respectively.
% If we look at which patterns are predicted better, and if we drop the concept of frequency ratio in skipgrams, then different patterns do emerge. For example, in the training data there is only one mention of `Merck Sharp \& Dohme'. Being a pharmaceutical company, there are many mentions in the test data (emea), but in more variations, such as: `Merck Sharp \& Dohme', `Merck Sharp and Dohme', `Merck Sharp E Dohme'. The pattern that attains a lower perplexity skips the third word. This is a case where skipgrams help with patterns that are not composed of function words.
Another type of effective pattern is one that skips over a numerical value. Medical prescriptions often contain mentions of quantities or concentrations, for example expressed by ``with consideration for the haemoglobin target range of 10 g / dL ( 6.2 mmol / l )''. Skipgrams are particularly useful in skipping over the number positions, as the patterns surrounding them are fixed and frequent. This step is often implicit in language modelling, when all numbers are mapped to a special token, i.e.\ \{\#\#\#\}. With skipgrams, the numerical value can be preserved.
\begin{figure*}
\centering
\begin{tikzpicture}
\begin{loglogaxis} [
width=5cm,
height=3cm,
scale only axis,
enlargelimits=false,
axis on top,
xlabel=Rank,
ylabel=Frequency,
title=$n$-grams,
ylabel shift=-0.1cm, legend cell align=left, legend style={only marks, legend pos=north east
, font=\small}
]
\addplot+ graphics [
xmin=1,
xmax=4746593,
ymin=1,
ymax=369489,
] {figures/n1.png};
\end{loglogaxis}
\begin{loglogaxis} [
width=5cm,
height=3cm,
scale only axis,
enlargelimits=false,
axis on top,
xlabel=Rank,
title=$n$-grams and skipgrams,
xshift=7cm, legend cell align=left, legend style={at={(8cm,3cm)}
, only marks
, font=\small}
]
\addplot+ graphics [
xmin=1,
xmax=7370640,
ymin=1,
ymax=4014991,
] {figures/s1.png};
\addlegendimage{only marks, blue, mark=*}
\addlegendentry{\obw train}
\addlegendimage{only marks, black, mark=*}
\addlegendentry{\obw HPYPLM}
\addlegendimage{only marks, red, mark=*}
\addlegendentry{\jrc test}
\addlegendimage{only marks, brown, mark=*}
\addlegendentry{\obw test}
\end{loglogaxis}
\end{tikzpicture}
\caption{Log-log frequency-rank curves and scatter plots for three sets (the complete \obw, the \obw heldout data, and the \jrc test data), together with the deviating frequencies estimated by the HPYPLM (black). The ranks are determined on the heldout data.
}\label{fig:freqs}
\end{figure*}
\subsection{An analysis of the $n$-gram distribution}
Because the HPYPLM is a Bayesian generalisation of the IKN language model, and in some circumstances also of MKN, we are interested in the differences between the two methods.
% %In this subsection we analyze the word distributions, for the complete corpus, the trainings data, and the posterior trigram distribution for the HPYLM.
We know that words and word $n$-grams in a language typically manifest themselves in power-law distributions: few patterns occur very often, whereas the majority of patterns are observed only very sparsely. For this reason we chose the HPYPLM, as the Pitman-Yor prior can model this power-law behaviour. The power-law behaviour of patterns in language was popularised by the early work of George Zipf\autocite{Zipf35,Zipf49}. Recently, \textcite{piantadosi2014Zipfs} asserted that most Zipfian analyses have been done under a false premise, namely that the frequency of the words is determined on the same data set as the one used to determine their rank. In this subsection we therefore determine the rank of the patterns on the heldout set of \obw, and their frequency on the training sets. The trigrams in the remainder of this section are context trigrams, the part of a fourgram before the focus word.
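The rank-frequency pairs underlying \cref{fig:freqs} can be obtained as in the following sketch, where the rank comes from the heldout data and the frequency from the data set under consideration (the training data, or the frequencies estimated by the model); the helper name is ours.
\begin{verbatim}
from collections import Counter

def rank_vs_frequency(heldout_trigrams, other_trigrams):
    """Rank trigrams by their heldout frequency and pair every rank with
    the trigram's frequency in the other data set."""
    heldout_freq = Counter(heldout_trigrams)
    other_freq = Counter(other_trigrams)
    ranked = [g for g, _ in heldout_freq.most_common()]   # rank 1, 2, ...
    return [(rank, other_freq[g]) for rank, g in enumerate(ranked, start=1)]
\end{verbatim}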
In \cref{fig:freqs} we plot the frequency for each trigram in the heldout set of \obw. The brown dots
follow a typical Zipfian curve, since the rank is determined on the same set. The blue dots are the frequencies of the trigrams in the complete \obw, ordered by the rank as found on the heldout data. Invisible in the scatter plot are the patterns that occur in the training data but not in the heldout data (since the heldout data was unseen).
The blue dots also represent the frequencies as used by MKN, which does not manipulate the data by changing the word counts. The black dots are the frequencies as estimated by the HPYPLM.
In general we observe that most of the deviating frequency estimates are in the tail, where we find many patterns that do not occur in the heldout data but have a positive frequency in the training data. Trigram tokens are mostly redistributed by the HPYPLM among those patterns.\footnote{A sum of all frequencies confirms that tokens are not lost in the process.}
If we train with only $n$-grams, the HPYPLM follows the counted frequencies, but as the rank increases, the effect of sampling and re-estimating frequencies by the HPYPLM becomes quite apparent (the black points not shadowed by the blue cloud). For the high-frequency data the frequencies are estimated to be lower, whereas the remainder of the counts is used to boost the patterns with a lower rank. If we add skipgram features, then the HPYPLM's cloud is more spread out, and its power-law behaviour is less explicit. Here, too, we see that the counts of higher-ranked patterns are diminished to boost lower-ranked patterns.
Another observation is that if we train on \obw, yielding the distributions shown in \cref{fig:freqs}, and test on a specific domain such as \jrc (the red dots in the figure), the dots for the $n$-gram-only model are quite dispersed into a cloud, and this distribution over frequencies differs substantially from the \obw distributions. For the skipgram-trained model the red dots are less dispersed and follow a Zipfian distribution somewhat more closely, with a decay similar to that of the HPYPLM-estimated frequencies of \obw.
\subsection{Variation in HPYPLMs}
In this subsection we briefly show an analysis of the variability of the perplexity of the HPYPLM approach, both in terms of the number of iterations used for sampling, and in terms of stability over multiple runs. These analyses support the idea that the reductions in perplexity mentioned earlier in this chapter are not due to some random effect, such as choosing a favourable `best' run, or halting at a specific or large number of iterations.
% One of the major concerns of Bayesian models are their intractability. To avoid these complications, clever inference schemes have been proposed. In our methods we use Gibbs sampling for the seating arrangement, and slice sampling for the hyperparameters. Still, each new run leads to a different outcome.
\begin{figure*}
\begin{tikzpicture}
\begin{axis} [
%width=0.25\textwidth,
height=4cm,
scale only axis, % Plot size does not include axes.
enlargelimits=false, % Shrink wrap the PNG.
axis on top, % Axes placed over PNG to avoid obscuring the lines.
xlabel=Iterations,
ylabel=Perplexity,
]
\addplot graphics [
xmin=1,
xmax=150,
ymin=1181.23,
ymax=1192.63,
] {figures/jrc1bwglm.pdf};
\end{axis}
% \begin{axis} [
% width=0.25\textwidth,
% scale only axis, % Plot size does not include axes.
% enlargelimits=false, % Shrink wrap the PNG.
% axis on top, % Axes placed over PNG to avoid obscuring the lines.
% xlabel=Iterations,
% xshift=0.33\textwidth
% ]
% \addplot graphics [
% xmin=1,
% xmax=150,
% ymin=61.6039,
% ymax=66.3028,
% ] {jrcjrcglm.pdf};
% \end{axis}
\begin{axis} [
%width=0.25\textwidth,
height=4cm,
scale only axis, % Plot size does not include axes.
enlargelimits=false, % Shrink wrap the PNG.
axis on top, % Axes placed over PNG to avoid obscuring the lines.
xlabel=Iterations,
xshift=8cm
]
\addplot graphics [
xmin=1,
xmax=150,
ymin=13.2062,
ymax=13.3504,
] {figures/jrcjrcngram.pdf};
\end{axis}
\end{tikzpicture}
\caption{Variation over 10 runs. On the left is \jrc-\obw-\BOF, on the right is \jrc-\jrc-\BON. The red lines show the perplexity values for one of the 10 runs, and the blue lines show the min and max values per iteration over the 10 runs.}\label{fig:iterplots}
\end{figure*}
We train on \jrc, with $n$-grams and skipgrams and with the backoff strategies \BON and \BOF, and evaluate on the test sets \jrc and \obw, in ten individual runs of up to 150 iterations each on the same training data. We measure perplexity on the respective test sets at every iteration. We find that there are only minor differences in perplexity between runs (maximal and minimal values denoted by the blue lines) with respect to the average over the 10 runs (the red line). This is the case in high-perplexity scenarios such as \jrc-\obw-\BOF, with perplexities around 1184, and in low-perplexity scenarios such as \jrc-\jrc-\BON, with perplexities around 13.2. The perplexities also converge quickly, within 50 iterations.\footnote{Although we do not take burn-in and autocorrelation between iterations into consideration.}
\section{Discussion}
In this chapter we showed that by adding skipgrams, a simple generalisation of $n$-gram word patterns, we can reduce the perplexity of a Bayesian language model in a cross-domain language modelling task. The model, a hierarchical Pitman-Yor process language model (HPYPLM), was observed to attain lower perplexities than the standard modified Kneser-Ney smoothed $n$-gram language model in all training-test conditions, confirming results reported in earlier work \autocite{teh2006hierarchical}.
Although the extension of the HPYPLM to skipgrams is relatively easy to implement, it has a considerable impact on the complexity. Since the number of patterns almost doubles, so does the run-time of the learning phase. Compared to the modified Kneser-Ney implementation of SRILM, training the HPYPLM is markedly slower. However, within a single iteration of sampling the approach already improves over modified Kneser-Ney; if one were to halt there, the run-time penalty is reduced to a factor of two compared to modified Kneser-Ney with SRILM. The advantage of our HPYPLM variant over the SRILM (and most other) modified Kneser-Ney implementations is that it can handle skipgrams in their full form. The testing phase of either method is comparable in terms of computational complexity, which puts the HPYPLM in a preferable position.