\chapter{Chapter 5\newline Improving Bayesian skipgram language models \newline with interpolation factors and backoff strategies}\label{chap:interpol}
Statistical language models have been a staple technology and workhorse of natural language processing (NLP) for decades. Many ideas have been proposed, most of them incremental improvements over existing models, while some were revolutionary in either their performance or their simplicity. In the nineties, Kneser and Ney\autocite{kneser1995improved} published their work on frequentist language models that use count-of-count information to better estimate smoothed backoff probabilities. Two decades later, Mikolov and colleagues distilled existing work on language models into word2vec,\autocite{mikolov2013distributed} currently one of the most widely used language models.
In this chapter, and more generally in this thesis, we set out to improve on the traditional count-based models in the form of their Bayesian generalisation, by adding skipgrams to the set of input features, alongside $n$-grams.
To overcome the traditional problems of overestimating the probabilities of rare occurrences and underestimating the probabilities of unseen events, a range of smoothing algorithms has been proposed in the literature\autocite{goodman2001bit}. Most methods take a heuristic-frequentist approach, combining $n$-gram probabilities for various values of $n$ using backoff schemes or interpolation.
In this chapter we expand the hierarchical Pitman-Yor process language model (HPYPLM) with skipgrams,\autocite{onrust2016Improving} introduced in the previous chapter. We add interpolation factors that weight the relative influence of skipgrams against $n$-grams, and the relative influence of the interpolated backoff probabilities.
\section{Interpolation factors}
The use of interpolation factors in a language model is not new. In the literature we find lattice-based language models \autocite{dupont1997lattice} and their generalisation, the factored language model with generalised parallel backoff \autocite{bilmes2003factored}.
Maximum entropy language models \autocite{ROSENFELD1996187} and distant bigram language models \autocite{bassiou2011long} are other related cases in point. In \autocite{gao2004long} each backoff level has its own weight, fixed for all features. All of these works implicitly use skipgram features, with variable skip sizes, spanning patterns larger than $n$.
A more recent paper on skipgram language models uses only uniform linear interpolation with a generalisation of modified Kneser-Ney \autocite{pickhardt2014generalized}. Even more recently, \autocite{pelemans2016sparse} computes a sparse non-negative feature weight matrix on the basis of an adjusted version of relative frequency.
Inspired by the previous studies, we use nine interpolation strategies (a sketch of how these weights can be computed follows the list):\marginnote{To do: add a figure that shows the weight values for a single run, and the average proportions.}
\begin{itemize}
\item \textsf{ngram}, where we ignore the skipgram probabilities (and prohibit the backoff step to skipgrams): \\
$I(\mathbf{u}) =
\begin{cases}
1 & \text{if } \mathbf{u} \text{ is }n\text{-gram} \\
0 & \text{if } \mathbf{u} \text{ is skipgram}
\end{cases}$
\item \textsf{Uninformed uniform prior (uni)}, where all the weights are 1:\\
$ I(\mathbf{u}) = 1 $
\item \textsf{Uninformed $n$-gram preference (npref)}, where we give the $n$-grams twice the importance of skipgrams:\footnote{Later in this chapter we do a more in-depth investigation to find the optimal preference ratio.} \\
$I(\mathbf{u}) =
\begin{cases}
2 & \text{if } \mathbf{u} \text{ is }n\text{-gram} \\
1 & \text{if } \mathbf{u} \text{ is skipgram}
\end{cases}$
\item \textsf{Maximum likelihood-based Linear Interpolation (mle)}, based on the maximum likelihood estimate of the context: \\[0.5ex]
$ I(\mathbf{u}) = \displaystyle \frac{c(\mathbf{u})}{c(\mathbf{u}\cdot)} $ \\
\item \textsf{Unnormalised count (count)}, based on the occurrence count of the context: \\[0.5ex]
$ I(\mathbf{u}) = \displaystyle c(\mathbf{u}) $ \\
\item \textsf{Entropy-based Linear Interpolation (ent)}, based on the entropy of the context: \\
$E(\mathbf{u}) = -\displaystyle \sum_{w,c(\mathbf{u}w)>0}^W\frac{c(\mathbf{u}w)}{c(\mathbf{u}\cdot)}\log\frac{c(\mathbf{u}w)}{c(\mathbf{u}\cdot)} $ \\
$ I(\mathbf{u}) = \displaystyle \frac{1}{1+E(\mathbf{u})}$ \\
where $c(\mathbf{u}w)$ are the counts as estimated by the model. We use the reciprocal because a higher entropy should yield a lower weight.
% although we did test also tested an increasing function, but this performed worse.
\item \textsf{Perplexity-based Linear Interpolation (ppl)}, raising 2 to the power of the negative entropy of the context: \\
$\textstyle I(\mathbf{u}) = \displaystyle 2^{-E(\mathbf{u})} $
\item \textsf{random}, where weights are uniformly distributed between 0 and 1 and assigned to the terms: \\
$ I(\mathbf{u}) = \text{rand}(0,1) $
\item \textsf{Skipgram-type based Linear Interpolation (value)}, in contrast to many of the interpolation strategies above, and more in line with \textsf{npref}, \textsf{value} assigns a predefined value based not on the content of the context, but on its shape. For example, \textsf{npref} only distinguishes two cases, $n$-gram or skipgram, whereas \textsf{value} can assign weights to individual skipgram types such as \emph{a \{1\} c} and \emph{b \{1\} \{1\}}. Using the same notation, with \emph{a}, \emph{b}, and \emph{c} as placeholders indicating that there is a word in the context at that position,\footnote{Positions 1, 2, and 3, respectively.} and \emph{\{1\}} indicating a single skip, we define the function providing the interpolation values as:\footnote{We only outline the parameters for a 4-gram model. Higher-order models are extended analogously.} \\
$\textstyle I(\mathbf{u}) = \begin{cases}
w_{d} & \text{if the context is empty}\\
w_{cd} & \text{if } \mathbf{u} = \text{\emph{c}} \\
w_{bcd} & \text{if } \mathbf{u} = \text{\emph{bc}} \\
w_{b\{1\}d} & \text{if } \mathbf{u} = \text{\emph{b\{1\}}} \\
w_{abcd} & \text{if } \mathbf{u} = \text{\emph{abc}} \\
w_{a\{1\}cd} & \text{if } \mathbf{u} = \text{\emph{a\{1\}c}} \\
w_{ab\{1\}d} & \text{if } \mathbf{u} = \text{\emph{ab\{1\}}} \\
w_{a\{1\}\{1\}d} & \text{if } \mathbf{u} = \text{\emph{a\{1\}\{1\}}} \\
\end{cases}%
$ \\ \noindent
Setting all weights to 1 results in \textsf{uni}; setting $w_{d}$, $w_{cd}$, $w_{bcd}$, and $w_{abcd}$ to 2, and the others to 1,\footnote{With the special case that setting the skipgram weights to 0 yields \textsf{ngram}.} yields the default \textsf{npref}.
\end{itemize}
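The sketch below illustrates how the (unnormalised) weight $I(\mathbf{u})$ could be computed for a single context under most of the strategies above; \textsf{value}, which looks up a predefined weight per skipgram type, is omitted. The count structures (\texttt{count\_u}, \texttt{count\_udot}, \texttt{count\_uw}) and the skip marker are assumptions for illustration only, and do not correspond one-to-one to our implementation.
\begin{verbatim}
import math
import random

# Assumed count structures (for illustration only):
#   count_u[u]     -- c(u):  occurrence count of the context pattern u
#   count_udot[u]  -- c(u.): count of all patterns starting with u
#   count_uw[u][w] -- c(uw): how often word w follows context u
# A context u is a tuple of tokens, with '{1}' marking a single skip.

def entropy(u, count_uw, count_udot):
    """E(u): entropy of the word distribution following context u."""
    total = count_udot[u]
    result = 0.0
    for w, c in count_uw[u].items():
        if c > 0:
            p = c / total
            result -= p * math.log2(p)
    return result

def interpolation_weight(strategy, u, count_u, count_udot, count_uw):
    """Unnormalised interpolation weight I(u) for a single context u."""
    is_skipgram = '{1}' in u
    if strategy == 'ngram':
        return 0.0 if is_skipgram else 1.0
    if strategy == 'uni':
        return 1.0
    if strategy == 'npref':          # n-grams twice the weight of skipgrams
        return 1.0 if is_skipgram else 2.0
    if strategy == 'count':
        return float(count_u.get(u, 0))
    if strategy == 'mle':
        return count_u.get(u, 0) / max(count_udot.get(u, 1), 1)
    if strategy == 'ent':            # reciprocal: higher entropy, lower weight
        return 1.0 / (1.0 + entropy(u, count_uw, count_udot))
    if strategy == 'ppl':
        return 2.0 ** -entropy(u, count_uw, count_udot)
    if strategy == 'random':
        return random.uniform(0.0, 1.0)
    raise ValueError('unknown strategy: ' + strategy)
\end{verbatim}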
The weights for the interpolation strategies \textsf{mle} and \textsf{ppl} are determined at test time, since precomputing and storing all these weights would be expensive. For the same reason we have not ventured into learning the weights at training time, integrated into the Bayesian paradigm of the hierarchical Pitman-Yor process.
As a compromise we have the context-based methods \textsf{ent}, \textsf{ppl}, \textsf{count}, and \textsf{mle}, as opposed to the heuristic \textsf{npref} and the learned \textsf{value}.
We extend \cref{eq:interpolform} for the word probability by adding normalised interpolation weights $I(\cdot)$. The probability of a word $w$ with context $\mathbf{u}$ is then:
\begin{equation}\begin{split}
p(w|\mathbf{u}) &=
\sum_{\mathbf{u}_m\in\boldsymbol\varsigma}
\left[
\frac{I(\mathbf{u}_m)}
{\sum_{\mathbf{x}\in\boldsymbol\varsigma}
I(\mathbf{x})}
\left(\frac{c_{\mathbf{u}_mw\cdot} - d_{|\mathbf{u}_m|}t_{\mathbf{u}_mw\cdot}}
{\theta_{|\mathbf{u}_m|} + c_{\mathbf{u}_m\cdot\cdot}} + \frac{\theta_{|\mathbf{u}_m|} + d_{|\mathbf{u}_m|}t_{\mathbf{u}_m\cdot\cdot}}
{\theta_{|\mathbf{u}_m|} + c_{\mathbf{u}_m\cdot\cdot}}
Z_{\mathbf{u}_mw}
\right)\right]
\end{split}\label{eq:newinterpolform}\end{equation}
with $c_{\mathbf{u}w\cdot}$ being the number of $\mathbf{u}w$ tokens, and $c_{\mathbf{u}\cdot\cdot}$ the number of patterns starting with context $\mathbf{u}$. Similarly, $t_{\mathbf{u}wk}$ is 1 if the $k$th draw from $G_{\mathbf{u}}$ was $w$, and 0 otherwise. $t_{\mathbf{u}w\cdot}$ then denotes whether there is a pattern $\mathbf{u}w$, and $t_{\mathbf{u}\cdot\cdot}$ is the number of types following context $\mathbf{u}$.
For the \textsf{ngram} and \textsf{full} backoff strategies
\begin{equation}
Z_{\mathbf{u}w} = p(w|\pi(\mathbf{u})),
\end{equation}
and for \textsf{lim}ited\sidenote{Computing the normalisation factor is expensive, because for each word $w$ in the vocabulary that occurs after the context $\mathbf{u}$ we have to compute its probability. Combined with the enormous search space of all contexts of length up to three, computing the normalisation factor is best done at runtime, whilst maintaining a cache.}
\begin{equation}
Z_{\mathbf{u}w} = \left.
\begin{cases}
\frac{1 - \sum_{w\in \mathcal{B}} p_{\mathrm{L}}(w|\pi(\mathbf{u}))}{|\mathcal{N}|}, & \text{if } \mathrm{count}(\mathbf{u}w) > 0 \\
p(w|\pi(\mathbf{u})), & \text{otherwise }
\end{cases}
\right.
\end{equation}
where $\mathcal{N}$ is the set of words $w$ for which the pattern $\mathbf{u}w$ has not been seen in the training data, and $\mathcal{B}$ is the set of words $w$ for which $\mathbf{u}w$ does occur in the training data.
The main difference between \cref{eq:interpolform} and \cref{eq:newinterpolform} is that in the latter we do not use an explicit discount term over the type counts, but a normalisation term. This yields a simpler strategy, and it is theoretically sound, with proper probability distributions.\footnote{See \cref{apx:proofinterpolform}.}
Note also that rather than two terms, \cref{eq:newinterpolform} only has one, because the \textsf{ngram} backoff strategy is now interpreted as an interpolation strategy.
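As a minimal sketch of how \cref{eq:newinterpolform} could be evaluated, the function below computes the normalised, weighted sum over all context variants. The interfaces for the restaurant statistics and for the backoff probability $Z_{\mathbf{u}w}$ are placeholders standing in for the actual HPYPLM implementation, not our real interface.
\begin{verbatim}
def word_probability(w, contexts, weight, stats, backoff_prob):
    """Probability of word w given all context variants u_m in varsigma
    (the n-gram context and its skipgram variants).

    weight(u)          -- the interpolation weight I(u)
    stats(u, w)        -- (c_uw, c_udd, t_uw, t_udd, d, theta) for context u
    backoff_prob(w, u) -- Z_uw: the probability of w under the backoff
                          context pi(u) (for 'full'), or its renormalised
                          variant (for 'limited')
    """
    norm = sum(weight(u) for u in contexts)   # normalise the weights I(u)
    p = 0.0
    for u in contexts:
        c_uw, c_udd, t_uw, t_udd, d, theta = stats(u, w)
        discounted = (c_uw - d * t_uw) / (theta + c_udd)
        backoff_weight = (theta + d * t_udd) / (theta + c_udd)
        p += (weight(u) / norm) * (discounted
                                   + backoff_weight * backoff_prob(w, u))
    return p
\end{verbatim}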
\section{Experiments}
In this section we investigate three hypotheses, two of which are new. First we confirm\footnote{In line with \cref{chap:shpyplm}.} that skipgrams help reduce the perplexity in an intrinsic language model evaluation.\footnote{For the extrinsic counterpart in an automatic speech recognition experiment, we refer to the next chapter.} Second, we investigate whether we can observe an additional effect of different interpolation factors and backoff strategies in a cross-domain setting, where the test set is sampled from another text genre than the training data. And finally, we look in a more qualitative way at the effect of skipgrams.
\begin{figure}
\begin{tikzpicture}%[remember picture,overlay]
\node[right] (start) at (0,-0.5) {Worst performance};
\node[left] (end) at (\linewidth-\pgflinewidth,-0.5) {Best performance};
\node[] (mid) at ($(start)!0.5!(end)$) {mid};
\path[left color=worstclr!25, right color=bestclr!25,middle color=avgclr!25]
(0,0) rectangle ++(\linewidth-\pgflinewidth,1);
\end{tikzpicture}
\caption{Throughout this section we use these colours to highlight numbers, to make them easier to compare. The range is anchored at the \textcolor{worstclr!50}{worst performance} on one end and the \textcolor{bestclr!50}{best performance} on the other. We use a linear scale (even though for perplexity a log scale might be more appropriate).}
\label{fig:colourrange}
\end{figure}
\subsection{Skipgrams and perplexity reductions}
The first comparison is between \textsf{ngram} and \textsf{uni}, since these backoff strategies embody the difference between using only $n$-gram features (\textsf{ngram}) and using both $n$-gram and skipgram features (\textsf{uni}). We report the perplexities in \cref{tab:ngramsvsskipgrams}, together with the relative difference in perplexity when choosing skipgrams\footnote{Read: skipgrams and $n$-grams. In our experiments we never use only skipgrams. We use this convention in the remainder of this thesis, except in cases where there might otherwise be some ambiguity.} over $n$-grams.
\npdecimalsign{.}
\nprounddigits{0}
\begin{table}[]
\centering
\caption{Perplexities for the \textsf{ngram} and \textsf{fulluni} models on all training--test combinations, and the relative perplexity reduction ($\Delta$\%) obtained by adding skipgrams.}
\label{tab:ngramsvsskipgrams}
\begin{tabular}{lllllllllllllll}
training & \multicolumn{4}{c}{\obw} & & \multicolumn{4}{c}{\emea} & & \multicolumn{4}{c}{\jrc} \\
test & \obw & \emea & \jrc & \wp
& & \obw & \emea & \jrc & \wp
& & \obw & \emea & \jrc & \wp \\ \cline{2-5}\cline{7-10}\cline{12-15}
\textsf{ngram} & \copr{obw}{obw}{129.47} & \copr{obw}{emea}{1123.89}
& \copr{obw}{jrc}{941.4} & \copr{obw}{wp}{456.27} &
& \copr{emea}{obw}{1761.34} & \copr{emea}{emea}{5.63033}
& \copr{emea}{jrc}{898} & \copr{emea}{wp}{1123.58} &
& \copr{jrc}{obw}{1520.1} & \copr{jrc}{emea}{1278.94}
& \copr{jrc}{jrc}{12.85} & \copr{jrc}{wp}{1249.28} \\
\textsf{fulluni} & \copr{obw}{obw}{124.69} & \copr{obw}{emea}{728.27}
& \copr{obw}{jrc}{728.98} & \copr{obw}{wp}{392.04}
& & \copr{emea}{obw}{1393.81} & \copr{emea}{emea}{5.6754}
& \copr{emea}{jrc}{773.116} & \copr{emea}{wp}{907.558} &
& \copr{jrc}{obw}{1303.66} & \copr{jrc}{emea}{1069.64}
& \copr{jrc}{jrc}{13.32} & \copr{jrc}{wp}{1067.99} \\
$\Delta$\% & \numprint{3.1} & \numprint{35.23} & \numprint{22.53} & \numprint{14.04}
& & \numprint{20.840} & \numprint{-0.800}
& \numprint{13.91982} & \numprint{19.217} &
& \numprint{14.21} & \numprint{16.34} & \numprint{-3.65} & \numprint{14.49} \\
\end{tabular}
\end{table}
\subsection{Interpolation between $n$-grams and skipgrams}
The previous results show that if we add skipgrams, we can reduce the perplexity. Since \textsf{uni} is a very naive prior weight, in this section we investigate the effect of adding weights as interpolation factors.
If we had enough training material, skipgrams might not be necessary, as all information would then be captured by the $n$-grams. This hypothesis suggests that $n$-grams carry more information, and that in cases where the $n$-grams do not cover the encountered patterns, skipgrams provide additional help.
An initial guess for the $n$-gram preference was a ratio of 2:1 in favour of $n$-grams. The results for \textsf{fullnpref},\footnote{Unless otherwise noted, the preference ratio for \textsf{fullnpref} is 2.0.} shown in \cref{tab:fullunivsfullnpref2}, yield around 5\% reductions in perplexity compared to \textsf{fulluni}. Although the reductions are not that impressive by themselves, they come on top of the reductions in \cref{tab:ngramsvsskipgrams}.
\begin{table}[]
\centering
\caption{Perplexities for \textsf{fulluni} and \textsf{fullnpref} (preference ratio 2.0) on all training--test combinations, and the relative perplexity reduction ($\Delta$\%).}
\label{tab:fullunivsfullnpref2}
\begin{tabular}{lllllllllllllll}
training & \multicolumn{4}{c}{\obw} & & \multicolumn{4}{c}{\emea} & & \multicolumn{4}{c}{\jrc} \\
test & \obw & \emea & \jrc & \wp
& & \obw & \emea & \jrc & \wp
& & \obw & \emea & \jrc & \wp \\ \cline{2-5}\cline{7-10}\cline{12-15}
\textsf{fulluni} & \copr{obw}{obw}{124.69} & \copr{obw}{emea}{728.27}
& \copr{obw}{jrc}{728.98} & \copr{obw}{wp}{392.04} &
& \copr{emea}{obw}{1393.81} & \copr{emea}{emea}{5.6754}
& \copr{emea}{jrc}{773.116} & \copr{emea}{wp}{907.558} &
& \copr{jrc}{obw}{1303.66} & \copr{jrc}{emea}{1069.64}
& \copr{jrc}{jrc}{13.32} & \copr{jrc}{wp}{1067.99} \\
\textsf{fullnpref} & \copr{obw}{obw}{118.28} & \copr{obw}{emea}{699.91}
& \copr{obw}{jrc}{694.32} & \copr{obw}{wp}{372.06}
& & \copr{emea}{obw}{1305.9} & \copr{emea}{emea}{5.59}
& \copr{emea}{jrc}{704.94} & \copr{emea}{wp}{852.52} &
& \copr{jrc}{obw}{1215.52} & \copr{jrc}{emea}{1000.72}
& \copr{jrc}{jrc}{12.84} & \copr{jrc}{wp}{1000} \\
$\Delta$\% & \numprint{5.6} & \numprint{3.85} & \numprint{4.80} & \numprint{5.10}
& &\numprint{6.312769}&\numprint{1.5047}
&\numprint{8.796895}&\numprint{6.0572687}&
& \numprint{6.75} & \numprint{6.45} & \numprint{3.6} & \numprint{6.37} \\
\end{tabular}
\end{table}
Even with these positive results, we still want to verify whether 2.0 is indeed the optimal value for \textsf{fullnpref}. In \cref{tab:nprefgrid} we show the results of a search for the lowest perplexity, in 25 steps in logarithmic space from 0.05 through 20.
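The 25 preference rates in \cref{tab:nprefgrid} correspond to a logarithmically spaced grid, which could be generated as follows (a sketch; the reported values are rounded to two decimals):
\begin{verbatim}
import numpy as np

# 25 preference rates, logarithmically spaced between 10^-1.3 (~0.05)
# and 10^1.3 (~20), as used for the fullnpref grid search
rates = np.logspace(-1.3, 1.3, num=25)
print([round(float(r), 2) for r in rates])
\end{verbatim}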
\begin{table*}\resizebox{\columnwidth}{!}{%
\begin{tabular}{llllllllllllllllllllllllll}
\textsf{fullnpref} & 0.05 & 0.06 & 0.08 & 0.11 & 0.14 & 0.17 & 0.22 & 0.29 & 0.37 & 0.47 & 0.61 & 0.78 & 1 & 1.28 & 1.65 & 2.11 & 2.71 & 3.48 & 4.47 & 5.73 & 7.36 & 9.44 & 12.12 & 15.55 & 19.95 \\
\obw & \wtc{19.4295700804}\numprint{170.844} & \wtc{17.642782244}\numprint{168.288} & \wtc{14.6752883607}\numprint{164.043} & \wtc{11.2023767913}\numprint{159.075} & \wtc{8.46696959105}\numprint{155.162} & \wtc{6.21740650122}\numprint{151.944} & \wtc{3.19328905977}\numprint{147.618} & \btc{0.045438657812}\numprint{142.985} & \btc{2.85424676686}\numprint{138.967} & \btc{5.52114645229}\numprint{135.152} & \btc{8.27123383432}\numprint{131.218} & \btc{10.6641034603}\numprint{127.795} & \btc{12.8381684726}\numprint{124.685} & \btc{14.7165326809}\numprint{121.998} & \btc{16.3257602237}\numprint{119.696} & \btc{17.5567983223}\numprint{117.935} & \btc{18.4781544914}\numprint{116.617} & \btc{19.0772457183}\numprint{115.76} & \btc{19.38273331}\numprint{115.323} & \btc{19.4295700804}\numprint{115.256} & \btc{19.2590003495}\numprint{115.5} & \btc{18.9171618315}\numprint{115.989} & \btc{18.4445997903}\numprint{116.665} & \btc{17.8818594897}\numprint{117.47} & \btc{17.2645927997}\numprint{118.353} \\
\emea & \wtc{19.7839277582}\numprint{1041.55} & \wtc{17.6333258196}\numprint{1022.85} & \wtc{14.0817274739}\numprint{991.968} & \wtc{9.95820701901}\numprint{956.113} & \wtc{6.74000947645}\numprint{928.13} & \wtc{4.1170801496}\numprint{905.323} & \wtc{0.633565030982}\numprint{875.033} & \btc{3.03234873333}\numprint{843.157} & \btc{6.13979602633}\numprint{816.137} & \btc{9.01194216606}\numprint{791.163} & \btc{11.8682175535}\numprint{766.327} & \btc{14.2343397077}\numprint{745.753} & \btc{16.2455550393}\numprint{728.265} & \btc{17.8203246834}\numprint{714.572} & \btc{18.9673890542}\numprint{704.598} & \btc{19.6067043578}\numprint{699.039} & \btc{19.7839277582}\numprint{697.498} & \btc{19.5042345007}\numprint{699.93} & \btc{18.802701248}\numprint{706.03} & \btc{17.737405753}\numprint{715.293} & \btc{16.3469898473}\numprint{727.383} & \btc{14.7095422323}\numprint{741.621} & \btc{12.8258679461}\numprint{758.0} & \btc{10.8975715449}\numprint{774.767} & \btc{8.84000901643}\numprint{792.658} \\
\jrc & \wtc{20.9393339638}\numprint{1053.15} & \wtc{18.8263550482}\numprint{1034.75} & \wtc{15.30204402}\numprint{1004.06} & \wtc{11.1624427185}\numprint{968.012} & \wtc{7.90156503985}\numprint{939.616} & \wtc{5.22577581638}\numprint{916.315} & \wtc{1.647606793}\numprint{885.156} & \btc{2.14955412126}\numprint{852.09} & \btc{5.39688117422}\numprint{823.812} & \btc{8.42774272415}\numprint{797.419} & \btc{11.4793895558}\numprint{770.845} & \btc{14.0498743409}\numprint{748.461} & \btc{16.2861869171}\numprint{728.987} & \btc{18.1019707548}\numprint{713.175} & \btc{19.5143363897}\numprint{700.876} & \btc{20.4275107558}\numprint{692.924} & \btc{20.9017826537}\numprint{688.794} & \btc{20.9393339638}\numprint{688.467} & \btc{20.5781753394}\numprint{691.612} & \btc{19.8758395215}\numprint{697.728} & \btc{18.8791795211}\numprint{706.407} & \btc{17.6627237791}\numprint{717.0} & \btc{16.2760813658}\numprint{729.075} & \btc{14.7856273796}\numprint{742.054} & \btc{13.2386741746}\numprint{755.525} \\
\wp & \wtc{20.7199743005}\numprint{558.048} & \wtc{18.7042539314}\numprint{548.73} & \wtc{15.3451526338}\numprint{533.202} & \wtc{11.4047849022}\numprint{514.987} & \wtc{8.2998659864}\numprint{500.634} & \wtc{5.74982180193}\numprint{488.846} & \wtc{2.33296161413}\numprint{473.051} & \btc{1.30671376792}\numprint{456.226} & \btc{4.43607745748}\numprint{441.76} & \btc{7.37767067265}\numprint{428.162} & \btc{10.3692350625}\numprint{414.333} & \btc{12.9246873827}\numprint{402.52} & \btc{15.1911289267}\numprint{392.043} & \btc{17.0848417525}\numprint{383.289} & \btc{18.6278910542}\numprint{376.156} & \btc{19.7157916483}\numprint{371.127} & \btc{20.4145227915}\numprint{367.897} & \btc{20.7199743005}\numprint{366.485} & \btc{20.666541919}\numprint{366.732} & \btc{20.3035478452}\numprint{368.41} & \btc{19.677502047}\numprint{371.304} & \btc{18.8515715502}\numprint{375.122} & \btc{17.8729152989}\numprint{379.646} & \btc{16.7984268815}\numprint{384.613} & \btc{15.6705060825}\numprint{389.827} \\
\end{tabular}
}
\caption{The perplexity values for different \textsf{fullnpref} preference rates with the \obw model. The 25 steps were sampled in a log space from $[10^{-1.3},10^{1.3}]$. The results show that indeed \textsf{fullnpref-2.0} was a good first guess, with optimal values somewhere between 2.71 and 4.47, depending on the test set.}
\label{tab:nprefgrid}
\end{table*}
These results to some extent weaken the position of skipgrams, as the $n$-grams are given a preference of roughly 2 up to 4 times. Nonetheless, the skipgrams contribute to a lower perplexity\footnote{See \cref{tab:ngramsvsskipgrams,tab:fullunivsfullnpref2}.} that could not be achieved with $n$-grams alone.
\subsection{Individual interpolation values per backoff step}
In the previous chapter we graphically introduced the backoff steps in \cref{fig:bof}. If we consider the directed edges in the tree going from one node to a smaller node,\footnote{Where we measure the size of a node by the length of a pattern minus the number of skips.} then for \textsf{uni} all edges are weighted 1, and for \textsf{npref} the edges $w_{d}$, $w_{cd}$, $w_{bcd}$, and $w_{abcd}$ have weight 2, with the others being 1.
In the following example in \cref{fig:value} we convert the graph into a tree for a 4-gram model, and we add the names of the backoff weights, corresponding to the terms introduced for \textsf{value} earlier this chapter.
In an attempt to find the optimal values, we only have to consider the weights that can be combined. For example, $w_d$ is never interpolated with another term, and since the weighted terms are normalised, its value does not matter. This leaves us with 6 unique weights: $w_{a\{1\}cd}$, $w_{ab\{1\}d}$, and $w_{bcd}$ for the first level; $w_{a\{1\}\{1\}d}$, $w_{cd}$, and $w_{b\{1\}d}$ for the second level.
We limit the weights to integers from 0 through 10. We optimise the value of one backoff weight at a time, setting it to the value that yields the lowest perplexity, after which we continue to the next parameter. After a period of stagnation in which no lower perplexity is found, we randomise all values to escape possible local minima (see the sketch below).
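The following sketch summarises this search procedure. The helper \texttt{evaluate\_perplexity} is hypothetical and stands in for re-evaluating the trained model on a development set with a given weight assignment; it is not part of our actual implementation.
\begin{verbatim}
import random

# The six unique weights, grouped by backoff level (cf. the discussion above).
WEIGHT_NAMES = ['a{1}cd', 'ab{1}d', 'bcd',       # first backoff level
                'a{1}{1}d', 'cd', 'b{1}d']       # second backoff level

def optimise_weights(evaluate_perplexity, patience=3):
    """Greedy per-weight search over integer weights 0..10, with random
    restarts after `patience` rounds without improvement, to escape
    possible local minima. evaluate_perplexity(weights) is assumed to
    return the development-set perplexity under the given weights."""
    current = {name: 1 for name in WEIGHT_NAMES}
    best, best_ppl = dict(current), evaluate_perplexity(current)
    stagnant = 0
    while stagnant < patience:
        improved = False
        for name in WEIGHT_NAMES:
            # pick the integer value 0..10 that minimises the perplexity
            scores = {v: evaluate_perplexity({**current, name: v})
                      for v in range(11)}
            current[name] = min(scores, key=scores.get)
            if scores[current[name]] < best_ppl:
                best_ppl, best = scores[current[name]], dict(current)
                improved = True
        if improved:
            stagnant = 0
        else:
            stagnant += 1
            # randomise all weights to escape a possible local minimum
            current = {name: random.randint(0, 10) for name in WEIGHT_NAMES}
    return best, best_ppl
\end{verbatim}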
\input{figvalue}
%\multicolumn{2}{c}{\drawtwoboxes{3em}{4em}{1em}{red}{blue}}
%\multicolumn{2}{c}{\drawtwoboxes{3em}{4em}{1em}{red}{blue}}
\begin{table}[]
\centering
\caption{The lowest perplexities found for \textsf{value} on the development sets, with \obw as training set, compared to \textsf{uni}, together with the optimal weights and the relative weights per backoff step (in \%).}
\label{tab:obwvalues}
\begin{tabular}{llllllllllllll}
& \multicolumn{2}{c}{ppl} & \multicolumn{10}{c}{weights} & \\
& \multicolumn{2}{c}{ } & \multicolumn{3}{c}{\emph{abcd}} & \multicolumn{2}{c}{\emph{a\{1\}cd}} & \multicolumn{2}{c}{\emph{ab\{1\}d}} & \multicolumn{2}{c}{\emph{bcd}} & & \\ \cline{2-3}\cline{4-6}\cline{7-8}\cline{9-10}\cline{11-12}
test & \textsf{uni} & \textsf{value} & $w_{a\{1\}cd}$ & $w_{ab\{1\}d}$ & $w_{bcd}$ & $w_{a\{2\}d}$ & $w_{cd}$ & $w_{a\{2\}d}$ & $w_{b\{1\}d}$ & $w_{cd}$ & $w_{b\{1\}d}$ & $w_{d}$ & \\
\emea & \numprint{728.265} & \numprint{717.17} & 4 & 2 & 9 & 10 & 9 & 10 & 4 & 9 & 4 & 9 & \\
& \multicolumn{2}{c}{\numprint{1.523483897}} & \wtc{9}27 & \wtc{15}13 & \btc{4}60 & \btc{1}53 & \wtc{1}47 & \btc{8}71 & \wtc{8}29 & \btc{8}69 & \wtc{8}31 & \btc{20}100 & \% \\
\jrc & \numprint{728.987} & \numprint{687.015} & 4 & 2 & 10 & 7 & 9 & 7 & 3 & 9 & 3 & 9 & \\
& \multicolumn{2}{c}{\numprint{5.757578667}} & \wtc{10}25 & \wtc{15}13 & \btc{5}62 & \wtc{2}44 & \btc{2}56 & \btc{8}70 & \wtc{8}30 & \btc{10}75 & \wtc{10}25 & \btc{20}100 & \% \\
\obw & \numprint{124.685} & \numprint{113.711} & 4 & 2 & 9 & 2 & 9 & 2 & 1 & 9 & 1 & 9 & \\
& \multicolumn{2}{c}{\numprint{8.801379476}} & \wtc{11}27 & \wtc{15}13 & \btc{4}60 & \wtc{13}18 & \btc{13}82 & \btc{7}66 & \wtc{6}34 & \btc{16}90 & \wtc{16}10 & \btc{20}100 & \% \\
\wp & \numprint{392.043} & \numprint{363.846} & 2 & 1 & 5 & 3 & 9 & 3 & 2 & 6 & 2 & 4 & \\
& \multicolumn{2}{c}{\numprint{7.192323291}} & \wtc{10}25 & \wtc{15}13 & \btc{5}62 & \wtc{7}33 & \btc{6}66 & \btc{4}60 & \wtc{4}40 & \btc{10}75 & \wtc{10}25 & \btc{20}100 & \% \\
\end{tabular}
\end{table}
\begin{table}[]
\centering
\caption{The lowest perplexities found for \textsf{value} on the development sets, with \jrc as training set, compared to \textsf{uni}, together with the optimal weights and the relative weights per backoff step (in \%).}
\label{tab:jrcvalues}
\begin{tabular}{llllllllllllll}
& \multicolumn{2}{c}{ppl} & \multicolumn{10}{c}{weights} & \\
& \multicolumn{2}{c}{ } & \multicolumn{3}{c}{\emph{abcd}} & \multicolumn{2}{c}{\emph{a\{1\}cd}} & \multicolumn{2}{c}{\emph{ab\{1\}d}} & \multicolumn{2}{c}{\emph{bcd}} & & \\ \cline{2-3}\cline{4-6}\cline{7-8}\cline{9-10}\cline{11-12}
test & \textsf{uni} & \textsf{value} & $w_{a\{1\}cd}$ & $w_{ab\{1\}d}$ & $w_{bcd}$ & $w_{a\{2\}d}$ & $w_{cd}$ & $w_{a\{2\}d}$ & $w_{b\{1\}d}$ & $w_{cd}$ & $w_{b\{1\}d}$ & $w_{d}$ & \\
\emea & \numprint{1100.26} & \numprint{971.437} & 1 & 1 & 9 & 8 & 6 & 8 & 4 & 6 & 4 & 2 & \\
& \multicolumn{2}{c}{\numprint{11.70841437}} & \wtc{16}9 & \wtc{16}9 & \btc{13}82 & \btc{3}57 & \wtc{3}43 & \btc{7}67 & \wtc{7}33 & \btc{4}60 & \wtc{4}40 & \btc{20}100 & \% \\
\jrc & \numprint{12.588} & \numprint{11.6829} & 2 & 1 & 10 & 2 & 8 & 2 & 1 & 8 & 1 & 8 & \\
& \multicolumn{2}{c}{\numprint{7.190181125}} & \wtc{14}15 & \wtc{16}8 & \btc{11}77 & \wtc{12}20 & \btc{12}80 & \btc{7}67 & \wtc{7}33 & \btc{16}89 & \wtc{16}11 & \btc{20}100 & \% \\
\obw & \numprint{1329.83} & \numprint{1166.39} & 2 & 0.1 & 10 & 2 & 1 & 2 & 1 & 1 & 1 & 1 & \\
& \multicolumn{2}{c}{\numprint{12.29029274}} & \wtc{13}17 & \wtc{20}1 & \btc{13}82 & \wtc{7}67 & \btc{7}33 & \btc{7}67 & \wtc{7}33 & \btc{0}50 & \wtc{0}50 & \btc{20}100 & \% \\
\wp & \numprint{1079.26} & \numprint{954.277} & 2 & 0.1 & 10 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & \\
& \multicolumn{2}{c}{\numprint{11.58043474}} & \wtc{13}17 & \wtc{20}1 & \btc{13}82 & \wtc{0}50 & \btc{0}50 & \wtc{0}50 & \btc{0}50 & \wtc{0}50 & \btc{0}50 & \btc{20}100 & \% \\
\end{tabular}
\end{table}
\begin{table}[]
\centering
\caption{The lowest perplexities found for \textsf{value} on the development sets, with \emea as training set, compared to \textsf{uni}, together with the optimal weights and the relative weights per backoff step (in \%).}
\label{tab:emeavalues}
\begin{tabular}{llllllllllllll}
& \multicolumn{2}{c}{ppl} & \multicolumn{10}{c}{weights} & \\
& \multicolumn{2}{c}{ } & \multicolumn{3}{c}{\emph{abcd}} & \multicolumn{2}{c}{\emph{a\{1\}cd}} & \multicolumn{2}{c}{\emph{ab\{1\}d}} & \multicolumn{2}{c}{\emph{bcd}} & & \\ \cline{2-3}\cline{4-6}\cline{7-8}\cline{9-10}\cline{11-12}
test & \textsf{uni} & \textsf{value} & $w_{a\{1\}cd}$ & $w_{ab\{1\}d}$ & $w_{bcd}$ & $w_{a\{2\}d}$ & $w_{cd}$ & $w_{a\{2\}d}$ & $w_{b\{1\}d}$ & $w_{cd}$ & $w_{b\{1\}d}$ & $w_{d}$ & \\
\emea & \numprint{5.66484} & \numprint{5.50167} & 2 & 1 & 10 & 2 & 8 & 2 & 1 & 8 & 1 & 8 & \\
& \multicolumn{2}{c}{\numprint{2.880399093}} & \wtc{14}15 & \wtc{16}8 & \btc{11}77 & \wtc{12}20 & \btc{12}80 & \btc{7}67 & \wtc{7}33 & \btc{16}89 & \wtc{16}11 & \btc{20}100 & \% \\
\jrc & \numprint{762.331} & \numprint{630.976} & 1 & 1 & 10 & 8 & 6 & 8 & 5 & 6 & 5 & 2 & \\
& \multicolumn{2}{c}{\numprint{17.23070425}} & \wtc{17}8 & \wtc{17}8 & \btc{15}88 & \btc{3}57 & \wtc{3}43 & \btc{5}62 & \wtc{5}38 & \btc{2}55 & \wtc{2}45 & \btc{20}100 & \% \\
\obw & \numprint{1389.33} & \numprint{1217.06} & 2 & 0.1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & \\
& \multicolumn{2}{c}{\numprint{12.39950192}} & \btc{6}65 & \wtc{19}3 & \wtc{7}32 & \wtc{0}50 & \btc{0}50 & \wtc{0}50 & \btc{0}50 & \wtc{0}50 & \btc{0}50 & \btc{20}100 & \% \\
\wp & \numprint{899.598} & \numprint{798.043} & 2 & 0.1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & \\
& \multicolumn{2}{c}{\numprint{11.28893128}} & \btc{6}65 & \wtc{19}3 & \wtc{7}32 & \wtc{0}50 & \btc{0}50 & \wtc{0}50 & \btc{0}50 & \wtc{0}50 & \btc{0}50 & \btc{20}100 & \% \\
\end{tabular}
\end{table}
We report the findings in \cref{tab:obwvalues}. We use \obw as training set, and report the lowest perplexities found on the development sets for \emea, \jrc, \wp, and the within-domain \obw. On the right side of the table, we list the weight values for the \textsf{value} strategy with the lowest perplexity. For each of the weights we also report the relative weight within its backoff step. For example, in the case of \jrc, the pattern \emph{abcd} can back off in three ways, to \emph{a\{1\}cd}, \emph{ab\{1\}d}, and \emph{bcd}; these respective steps have been assigned weights 4, 2, and 10. During the search for the best weights we only kept track of the first time the lowest perplexity was found; however, sets of weights with the same relative distribution per backoff step yielded the same perplexity. Since the weight values are normalised, only their relative values matter, which is why it does not make a difference whether the value for \emph{d} is 4 or 9.
From \cref{tab:obwvalues} the first thing that jumps out is that for \emph{abcd} the relative weights estimated for all 4 development sets are almost the same. But this is not the case for the three other steps \emph{a\{1\}cd}, \emph{ab\{1\}d}, and \emph{bcd}.
Reading the table from top to bottom rather than from left to right, we notice that there seems to be a distinction between within-domain and cross-domain behaviour.
\begin{table}[]
\centering
\caption{Perplexities for \textsf{uni} and for the \textsf{fullvalue} models whose weights were optimised on the \obw, \emea, \jrc, and \wp development sets, on all training--test combinations.}
\label{tab:allvalues}
\begin{tabular}{lllllllllllllll}
training & \multicolumn{4}{c}{\obw} & & \multicolumn{4}{c}{\emea} & & \multicolumn{4}{c}{\jrc} \\
test & \obw & \emea & \jrc & \wp
& & \obw & \emea & \jrc & \wp
& & \obw & \emea & \jrc & \wp \\ \cline{2-5}\cline{7-10}\cline{12-15}
\textsf{uni} & \copr{obw}{obw}{124.685} & \copr{obw}{emea}{728.265}
& \copr{obw}{jrc}{728.987} & \copr{obw}{wp}{392.043} &
& \copr{emea}{obw}{1393.81} & \copr{emea}{emea}{5.6754}
& \copr{emea}{jrc}{773.116} & \copr{emea}{wp}{908} &
& \copr{jrc}{obw}{1303.66} & \copr{jrc}{emea}{1069.64}
& \copr{jrc}{jrc}{13.32} & \copr{jrc}{wp}{1067.99} \\
\obw-\textsf{fullvalue} & \copr{obw}{obw}{114.537} & \copr{obw}{emea}{712.609}
& \copr{obw}{jrc}{694.436} & \copr{obw}{wp}{365.706}
& & \copr{emea}{obw}{1212.13} & \copr{emea}{emea}{5.56569}
& \copr{emea}{jrc}{655.143} & \copr{emea}{wp}{655.143} &
& \copr{jrc}{obw}{1155.22} & \copr{jrc}{emea}{950.893}
& \copr{jrc}{jrc}{12.6641} & \copr{jrc}{wp}{949.983} \\
\emea-\textsf{fullvalue} & \copr{obw}{obw}{115.966} & \copr{obw}{emea}{692.109}
& \copr{obw}{jrc}{685.726} & \copr{obw}{wp}{366.04}
& & \copr{emea}{obw}{1221.16} & \copr{emea}{emea}{5.55541}
& \copr{emea}{jrc}{650.849} & \copr{emea}{wp}{804.805} &
& \copr{jrc}{obw}{1234.75} & \copr{jrc}{emea}{1021.2}
& \copr{jrc}{jrc}{12.4544} & \copr{jrc}{wp}{1019.34} \\
\jrc-\textsf{fullvalue} & \copr{obw}{obw}{115.186} & \copr{obw}{emea}{694}
& \copr{obw}{jrc}{684.972} & \copr{obw}{wp}{364.5}
& & \copr{emea}{obw}{1372.8} & \copr{emea}{emea}{5.52968}
& \copr{emea}{jrc}{708.803} & \copr{emea}{wp}{890.016} &
& \copr{jrc}{obw}{1155.73} & \copr{jrc}{emea}{948.762}
& \copr{jrc}{jrc}{12.6653} & \copr{jrc}{wp}{951.25} \\
\wp-\textsf{fullvalue} & \copr{obw}{obw}{115.009} & \copr{obw}{emea}{696.297}
& \copr{obw}{jrc}{685.437} & \copr{obw}{wp}{316.727}
& & \copr{emea}{obw}{1211.78} & \copr{emea}{emea}{5.56345}
& \copr{emea}{jrc}{653.655} & \copr{emea}{wp}{653.655} &
& \copr{jrc}{obw}{1153.54} & \copr{jrc}{emea}{950.737}
& \copr{jrc}{jrc}{12.6445} & \copr{jrc}{wp}{949.004} \\
\end{tabular}
\end{table}
If we are concerned with cross-domain generalisability, we do not want to optimise the parameters for every possible set. According to \cref{tab:allvalues} the parameters learned for \wp-\textsf{fullvalue} seem to be effective on all three training sets, for almost all tests (8 out of 12). For all sets where it is not the best-performing set of parameters, it is a close second with a difference of at most 4 points in perplexity (0.6\%).
\subsection{Interpolation weights with contextual knowledge}
In contrast to parameters based on heuristics, or parameters learned on a development set, we can also use knowledge from the training corpus to derive context-based interpolation weights. Here we investigate four such strategies: \textsf{mle}, \textsf{count}, \textsf{ent}, and \textsf{ppl}. The results are shown in \cref{tab:contextbasedinterpol}.
\begin{table}[]
\centering
\caption{Perplexities for the context-based interpolation strategies \textsf{fullmle}, \textsf{fullcount}, \textsf{fullent}, and \textsf{fullppl}, compared to \textsf{fulluni}.}
\label{tab:contextbasedinterpol}
\begin{tabular}{lllllllllllllll}
training & \multicolumn{4}{c}{\obw} & & \multicolumn{4}{c}{\emea} & & \multicolumn{4}{c}{\jrc} \\
test & \obw & \emea & \jrc & \wp
& & \obw & \emea & \jrc & \wp
& & \obw & \emea & \jrc & \wp \\ \cline{2-5}\cline{7-10}\cline{12-15}
\textsf{fulluni} & \copr{obw}{obw}{124.685} & \copr{obw}{emea}{728.265}
& \copr{obw}{jrc}{728.987} & \copr{obw}{wp}{392.043} &
& \copr{emea}{obw}{1393.81} & \copr{emea}{emea}{5.6754}
& \copr{emea}{jrc}{773.116} & \copr{emea}{wp}{908} &
& \numprint{1303.66} & \numprint{1069.64}
& \numprint{13.32} & \numprint{1067.99} \\
\textsf{fullmle} & \copr{obw}{obw}{125.17} & \numprint{000}
& \numprint{000} & \numprint{000}
& & \copr{emea}{obw}{1931.25} & \copr{emea}{emea}{5.63}
& \copr{emea}{jrc}{1015.46} & \copr{emea}{wp}{1225.27} &
& \copr{jrc}{obw}{1535.75} & \copr{jrc}{emea}{1244.74}
& \numprint{000} & \numprint{000} \\
\textsf{fullcount} & \copr{obw}{obw}{122.086} & \copr{obw}{emea}{893.166}
& \copr{obw}{jrc}{885.283} & \copr{obw}{wp}{421.195}
& & \copr{emea}{obw}{1681.37} & \copr{emea}{emea}{5.61967}
& \copr{emea}{jrc}{888.956} & \copr{emea}{wp}{1075.4} &
& \copr{jrc}{obw}{1436.12} & \copr{jrc}{emea}{1168.68}
& \copr{jrc}{jrc}{12.8619} & \copr{jrc}{wp}{1192.74} \\
\textsf{fullent} & \copr{obw}{obw}{132.26} & \copr{obw}{emea}{794.05}
& \copr{obw}{jrc}{791.69} & \copr{obw}{wp}{434.24}
& & \copr{emea}{obw}{1552.49} & \copr{emea}{emea}{5.69}
& \copr{emea}{jrc}{880.78} & \copr{emea}{wp}{1032.07} &
& \copr{jrc}{obw}{1453.86} & \copr{jrc}{emea}{1179.18}
& \copr{jrc}{jrc}{13.4475} & \copr{jrc}{wp}{1197.05} \\
\textsf{fullppl} & \copr{obw}{obw}{157.065} & \copr{obw}{emea}{1002.24}
& \copr{obw}{jrc}{1027.3} & \copr{obw}{wp}{555.01}
& & \copr{emea}{obw}{2007.03} & \copr{emea}{emea}{5.82737}
& \copr{emea}{jrc}{1217.94} & \copr{emea}{wp}{1329.48} &
& \copr{jrc}{obw}{1868.78} & \copr{jrc}{emea}{1475.07}
& \copr{jrc}{jrc}{14.2414} & \copr{jrc}{wp}{1544.06} \\
$\Delta$\% & \numprint{000} & \numprint{000} & \numprint{000} & \numprint{000}
& & \numprint{000} & \numprint{000}
& \numprint{000} & \numprint{000} &
& \numprint{000} & \numprint{000} & \numprint{000} & \numprint{000} \\
\end{tabular}
\end{table}
\subsection{Random interpolation weights}
As a sanity check we have also implemented a random interpolation weight, \textsf{random}. The weights are uniformly distributed between 0 and 1. The results are shown in \cref{tab:randominterpol}.
\begin{table}[]
\centering
\caption{Perplexities for \textsf{fullrandom}, compared to \textsf{ngram}.}
\label{tab:randominterpol}
\begin{tabular}{lllllllllllllll}
training & \multicolumn{4}{c}{\obw} & & \multicolumn{4}{c}{\emea} & & \multicolumn{4}{c}{\jrc} \\
test & \obw & \emea & \jrc & \wp
& & \obw & \emea & \jrc & \wp
& & \obw & \emea & \jrc & \wp \\ \cline{2-5}\cline{7-10}\cline{12-15}
\textsf{ngram} & \copr{obw}{obw}{129.47} & \copr{obw}{emea}{1123.89}
& \copr{obw}{jrc}{941.4} & \copr{obw}{wp}{456.27} &
& \copr{emea}{obw}{1761.34} & \copr{emea}{emea}{5.63033}
& \copr{emea}{jrc}{898} & \copr{emea}{wp}{1123.58} &
& \copr{jrc}{obw}{1520.1} & \copr{jrc}{emea}{1278.94}
& \copr{jrc}{jrc}{12.85} & \copr{jrc}{wp}{1249.28} \\
\textsf{fullrandom} & \copr{obw}{obw}{129.713} & \copr{obw}{emea}{769.142}
& \copr{obw}{jrc}{769.019} & \copr{obw}{wp}{411.774}
& & \copr{emea}{obw}{1483.92} & \copr{emea}{emea}{5.72414}
& \copr{emea}{jrc}{826.277} & \copr{emea}{wp}{961.939} &
& \copr{jrc}{obw}{1372.32} & \copr{jrc}{emea}{1119.66}
& \copr{jrc}{jrc}{13.5574} & \copr{jrc}{wp}{1122.53} \\
$\Delta$\% & \numprint{000} & \numprint{000} & \numprint{000} & \numprint{000}
& & \numprint{000} & \numprint{000}
& \numprint{000} & \numprint{000} &
& \numprint{000} & \numprint{000} & \numprint{000} & \numprint{000} \\
\end{tabular}
\end{table}
\subsection{\textsf{full} backoff versus \textsf{lim}ited backoff strategies}
In the previous chapter we saw that with the discount-based \textsf{lim}ited backoff strategy there was a clear difference between testing within-domain and cross-domain, in favour of the within-domain setting. We argued that this was the case because \textsf{lim} stops the backoff procedure once it encounters a pattern that has also been seen in its entirety in the training data, and that for such already seen patterns, the estimated probability is better than a combination of estimates for smaller patterns, backing off all the way down to the uniform distribution.
With the \textsf{full} and \textsf{lim}ited backoff strategies in this chapter, however, we do not see this effect. An overview of the perplexities is given in \cref{tab:limperplexities}. The colours, ranging from white to blue, indicate how each perplexity compares to the worst and best result for that training--test combination (cf.\ \cref{fig:colourrange}).
\begin{table}[]
\centering
\caption{Perplexities for the \textsf{lim}ited backoff strategy with the various interpolation strategies, compared to \textsf{ngram}.}
\label{tab:limperplexities}
\begin{tabular}{lllllllllllllll}
training & \multicolumn{4}{c}{\obw} & & \multicolumn{4}{c}{\emea} & & \multicolumn{4}{c}{\jrc} \\
test & \obw & \emea & \jrc & \wp
& & \obw & \emea & \jrc & \wp
& & \obw & \emea & \jrc & \wp \\ \cline{2-5}\cline{7-10}\cline{12-15}
\textsf{ngram} & \copr{obw}{obw}{129.47} & \copr{obw}{emea}{1123.89}
& \copr{obw}{jrc}{941.4} & \copr{obw}{wp}{456.27} &
& \copr{emea}{obw}{1761.34} & \copr{emea}{emea}{5.63033}
& \copr{emea}{jrc}{898} & \copr{emea}{wp}{1123.58} &
& \copr{jrc}{obw}{1520.1} & \copr{jrc}{emea}{1278.94}
& \copr{jrc}{jrc}{12.85} & \copr{jrc}{wp}{1249.28} \\
\textsf{limuni} & \copr{obw}{obw}{134.17} & \copr{obw}{emea}{758.54} %limuni done
& \copr{obw}{jrc}{755.7} & \copr{obw}{wp}{406.31} &
& \copr{emea}{obw}{1421.99} & \copr{emea}{emea}{5.9}
& \copr{emea}{jrc}{793.02} & \copr{emea}{wp}{925.72} &
& \copr{jrc}{obw}{1353.05} & \copr{jrc}{emea}{1112.07}
& \copr{jrc}{jrc}{14.34} & \copr{jrc}{wp}{1103.96} \\
\textsf{limnpref} & \copr{obw}{obw}{128.32} & \copr{obw}{emea}{732.86} %limnpref done
& \copr{obw}{jrc}{723.26} & \copr{obw}{wp}{387.39} &
& \copr{emea}{obw}{1339.55} & \copr{emea}{emea}{5.83}
& \copr{emea}{jrc}{727.58} & \copr{emea}{wp}{874.17} &
& \copr{jrc}{obw}{1271.47} & \copr{jrc}{emea}{1048.3}
& \copr{jrc}{jrc}{13.89} & \copr{jrc}{wp}{1041.44} \\
% \textsf{limmle} & \copr{obw}{obw}{138.388} & \copr{obw}{emea}{1027.84}
% & \copr{obw}{jrc}{993.144} & \copr{obw}{wp}{465.52}
% & & \copr{emea}{obw}{bbb} & \copr{emea}{emea}{bbb}
% & \copr{emea}{jrc}{bbb} & \copr{emea}{wp}{bbb} &
% & \copr{jrc}{obw}{ccc} & \copr{jrc}{emea}{ccc}
% & \copr{jrc}{jrc}{ccc} & \copr{jrc}{wp}{ccc} \\
\textsf{limcount} & \copr{obw}{obw}{133.354} & \copr{obw}{emea}{941.565}
& \copr{obw}{jrc}{927.673} & \copr{obw}{wp}{441.112} &
& \copr{emea}{obw}{1745.28} & \copr{emea}{emea}{5.85979}
& \copr{emea}{jrc}{928.113} & \copr{emea}{wp}{1114.12} &
& \copr{jrc}{obw}{1528.67} & \copr{jrc}{emea}{1243.3}
& \copr{jrc}{jrc}{13.949} & \copr{jrc}{wp}{1260.12} \\
\textsf{liment} & \copr{obw}{obw}{143.67} & \copr{obw}{emea}{832.28}
& \copr{obw}{jrc}{824.78} & \copr{obw}{wp}{452.52} &
& \copr{emea}{obw}{1583.12} & \copr{emea}{emea}{5.96}
& \copr{emea}{jrc}{903.881} & \copr{emea}{wp}{1052.99} &
& \copr{jrc}{obw}{1508.13} & \copr{jrc}{emea}{1228.23}
& \copr{jrc}{jrc}{14.6535} & \copr{jrc}{wp}{1238.09} \\
\textsf{limppl} & \copr{obw}{obw}{172.141} & \copr{obw}{emea}{1055.32}
& \copr{obw}{jrc}{1074.87} & \copr{obw}{wp}{850.723} &
& \copr{emea}{obw}{2049.38} & \copr{emea}{emea}{6.13118}
& \copr{emea}{jrc}{1251.99} & \copr{emea}{wp}{1358.52} &
& \copr{jrc}{obw}{1945.12} & \copr{jrc}{emea}{1543.46}
& \copr{jrc}{jrc}{15.6463} & \copr{jrc}{wp}{1602.42} \\
\textsf{limrandom} & \copr{obw}{obw}{139.896} & \copr{obw}{emea}{804.404}
& \copr{obw}{jrc}{799.865} & \copr{obw}{wp}{427.539}
& & \copr{emea}{obw}{1522.77} & \copr{emea}{emea}{5.95858}
& \copr{emea}{jrc}{854.708} & \copr{emea}{wp}{985.087} &
& \copr{jrc}{obw}{1433.02} & \copr{jrc}{emea}{1177.73}
& \copr{jrc}{jrc}{14.611} & \copr{jrc}{wp}{1163.32} \\
\end{tabular}
\end{table}
\subsection{A qualitative analysis of the contribution of skipgrams}
%\section{Experiments}
%We train 4-gram language model on the two training corpora, the Google 1 billion word benchmark and the Mediargus corpus.\footnote{See~\cref{sec:data} for a description of the corpora.} We do not perform any preprocessing on the data except tokenisation.
% %The models are trained with a HPYLM. We do not use sentence beginning and end markers. The results for the {\sf ngram} backoff strategy are obtained by training without skipgrams; for {\sf limited} and {\sf full} we added skipgram features during training.
%
%When setting up the experimental framework, we had to decide on the basis. Earlier work on hierarchical Pitman-Yor language models by Huang and Renals had accompanying software releases. An SRILM extension with HPYPLM was proposed in \autocite{huang2007hierarchical}, and a frequentist approximation extension of the HPYPLM was described in \autocite{huang2010power}. However, at the time I started this thesis, they were no longer accessible. With further inquiries we learned that also none of the source code has survived during the period.
%
%We found an alternative in cpyp,\footnote{\url{https://github.com/redpony/cpyp}} which is an existing library for non-parametric Bayesian modelling with PY priors with histogram-based sampling \cite{blunsom2009note}. This library has an example application to showcase its performance with $n$-gram based language modelling. Limitations of the library, such as not natively supporting skipgrams, and the lack of other functionality such as thresholding and discarding of certain patterns, led us to extend the library with Colibri Core,\footnote{\url{http://proycon.github.io/colibri-core/}} a pattern modelling library. Colibri Core resolves the limitations, and together the libraries are a complete language model that handles skipgrams: cococpyp.\footnote{\url{https://github.com/naiaden/cococpyp}} This software in turn has been rewritten to allow also for reranking nbest lists, and being more in control of the underlying language model. We gave it the name SLM, for skipgram language model.\footnote{\url{https://github.com/naiaden/SLM}} Throughout the rest of the thesis the reported results were obtained with SLM.
%
% Each model is run for 50 iterations (without an explicit burn-in phase), with the initial values for hyperparameters $\theta=1.0$ and $\gamma=0.8$. The hyperparameters are resampled every 30 iterations with slice sampling \cite{walker2007sampling}.
%
% \textbf{Plot van dalende ppl over iteraties, effect resampling?}
%
% We test each model on different test sets, and we collect their intrinsic performance by means of perplexity. We compute the perplexity on all 4-grams, rather than computing the perplexity for sentences.
% Words in the test set that were unseen in the training data are ignored in computing the perplexity on test data.\footnote{This is common for perplexity. }
\subsection{PPL}
\subsection{Learning curves}
\section{Results}
\section{Discussion}