Merge pull request #3606 from vespa-engine/vekterli/add-rank-profile-weakand-tuning-doc

Add documentation for `stopword-limit` and `adjust-target` parameters
vekterli authored Jan 30, 2025
2 parents f6fd8fe + 52ac80f commit f92fe09
Showing 1 changed file with 109 additions and 1 deletion.
110 changes: 109 additions & 1 deletion en/reference/schema-reference.html
@@ -151,6 +151,9 @@ <h2 id="elements">Elements</h2>
<a href="#filter-threshold">filter-threshold</a>
<a href="#rank">rank</a>
<a href="#rank-type">rank-type</a>
<a href="#weakand">weakand</a>
<a href="#weakand-stopword-limit">stopword-limit</a>
<a href="#weakand-adjust-target">adjust-target</a>
<a href="#constant">constant</a>
<a href="#onnx-model">onnx-model</a>
<a href="#stemming">stemming</a>
@@ -1692,6 +1695,17 @@ <h2 id="rank-profile">rank-profile</h2>
<td>Zero or more</td>
<td>The rank-type of a field in this profile.</td>
</tr>
<tr><td><a href="#weakand">weakand</a></td>
<td>Zero or one</td>
<td>
<p>
Tunes the <a href="../using-wand-with-vespa.html#weakand">weakAnd</a> algorithm to automatically
exclude terms and documents with expected low query significance based on document frequency
statistics present in the document corpus. This makes matching faster at the cost of potentially
reduced recall.
</p>
</td>
</tr>
</tbody>
</table>

@@ -3944,6 +3958,101 @@ <h2 id="rank-type">rank-type</h2>
</p>


<h2 id="weakand">weakand</h2>
<p>
Contained in <a href="#rank-profile">rank-profile</a>.
</p>
<p>
Tunes the <a href="../using-wand-with-vespa.html#weakand">weakAnd</a> algorithm to automatically
exclude terms and documents with expected low query significance based on document frequency
statistics present in the document corpus. This makes matching faster at the cost of potentially
reduced recall.
</p>
<pre>
weakand {
[body]
}
</pre>
<p>
Note that all document frequency calculations use <em>content node-local</em> document
statistics (i.e. <a href="../significance.html#global-significance-model">global significance</a>
has no effect). This means results may differ across content nodes and/or
content node groups.
</p>
<p>
The body of a <code>weakand</code> statement consists of:
</p>
<table class="table">
<thead>
<tr>
<th>Property</th>
<th>Occurrence</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td style="white-space: nowrap">stopword-limit</td>
<td>Zero or one</td>
<td>
<p id="weakand-stopword-limit">
A number in the range [0, 1].
Represents the maximum normalized document frequency a query term can have in the
corpus (i.e. the ratio of all documents where the term occurs at least once) before
it's considered a stopword and dropped entirely from the
<code>weakAnd</code> evaluation. This makes matching faster at the cost of
potentially reduced recall. Dropped terms are not exposed as part of ranking.
</p>
<p>
Example:
<pre>stopword-limit: 0.60</pre>
This drops all query terms that occur in more than 60% of the documents.
</p>
<p>
Using <code>stopword-limit</code> is similar to explicitly removing stopwords
from the query up front, but has the benefit of dynamically adapting to the
actual document corpus, without having to know or specify a set of stopwords.
</p>
</td>
</tr>
<tr>
<td style="white-space: nowrap">adjust-target</td>
<td>Zero or one</td>
<td>
<p id="weakand-adjust-target">
A number in the range [0, 1] representing normalized document frequency.
Used to derive a per-query document score threshold, where documents scoring
lower than the threshold will not be considered as potential hits from the
<code>weakAnd</code> operator. The score threshold is selected to be equal to
that of the query term whose document frequency is <em>closest</em> to the
configured <code>adjust-target</code> value.
</p>
<p>
This can be used to efficiently <em>exclude</em> documents that only match terms that
occur very frequently in the document corpus. Such terms are likely to be stopwords
that have low semantic value for the query, and excluding documents only containing
them is likely to only have a minor impact on recall.
</p>
<p>
This makes overall matching faster by reducing the number of hits produced by
the <code>weakAnd</code> operator.
</p>
<p>
Example:
<pre>adjust-target: 0.01</pre>
This excludes documents that only match terms occurring in more than approximately 1%
of the document corpus. The actual threshold is query-specific, based on the score of
the query term whose document frequency is closest to 1%.
</p>
<p>
<code>adjust-target</code> can be used together with <a href="#weakand-stopword-limit">stopword-limit</a>
to efficiently prune both terms and documents with low significance when processing queries.
</p>
</td>
</tr>
</tbody>
</table>
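<p>
For illustration, both parameters can be combined in a single <code>weakand</code> block
inside a rank profile. The profile name and first-phase expression below are hypothetical,
not part of the reference above:
</p>
<pre>
rank-profile tuned-weakand {
    weakand {
        stopword-limit: 0.60
        adjust-target: 0.01
    }
    first-phase {
        expression: nativeRank
    }
}
</pre>
<p>
With this profile, query terms occurring in more than 60% of the documents are dropped
entirely, and the remaining <code>weakAnd</code> evaluation excludes documents that only
match terms occurring in more than approximately 1% of the documents.
</p>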


<h2 id="summary-to">summary-to</h2>
<p>
@@ -3961,7 +4070,6 @@ <h2 id="summary-to">summary-to</h2>
</p>



<h2 id="summary">summary</h2>
<p>
Contained in <a href="#field">field</a> or
