From 098968804f2f69de251474d598b9c2548050671a Mon Sep 17 00:00:00 2001
From: Tor Brede Vekterli
+ Tunes the weakAnd algorithm to automatically
+ exclude terms and documents with expected low query significance based on term frequency
+ statistics present in the document corpus. This makes matching faster at the cost of potentially
+ reduced recall.
+ Elements
filter-threshold
rank
rank-type
+ weakand
+ stopword-limit
+ adjust-target
constant
onnx-model
stemming
@@ -1692,6 +1695,17 @@ rank-profile
Zero or more
The rank-type of a field in this profile.
+
@@ -3944,6 +3958,101 @@ weakand
+ Zero or one
+
+
+weakand
+ Contained in rank-profile.
+ Tunes the weakAnd algorithm to automatically
+ exclude terms and documents with expected low query significance based on term frequency
+ statistics present in the document corpus. This makes matching faster at the cost of potentially
+ reduced recall.
+
+weakand {
+    [body]
+}
+
+ Note that all term frequency calculations are done using content node-local document
+ statistics (i.e. global significance
+ does not have an effect). This means results may differ across different content nodes and/or
+ content node groups.
+
+The body of a weakand statement consists of:
+
+Property | Occurrence | Description
+
+stopword-limit
+ Zero to one
+
+ A number in the range [0, 1].
+ Represents the maximum normalized document frequency a query term can have in the
+ corpus (i.e. the ratio of all documents where the term occurs at least once) before
+ it's considered a stopword and dropped entirely from being a part of the
+ weakAnd evaluation.
+
+ Example:
+ stopword-limit: 0.60
+
+ This will drop all query terms that occur in at least 60% of the documents.
+
+ Using stopword-limit is similar to explicitly removing stop words
+ from the query up front, but has the benefit of dynamically adapting to the
+ actual document corpus and not having to know—or specify—a set of stop-words.
+
+adjust-target
+ Zero to one
+
+ A number in the range [0, 1] representing normalized document frequency.
+ Used to derive a per-query document score threshold, where documents scoring
+ lower than the threshold will not be considered as potential hits from the
+ weakAnd operator.
+
+ This can be used to efficiently exclude documents that only match terms that
+ occur very frequently in the document corpus. Such terms are likely to be stop-words
+ that have low semantic value for the query, and excluding documents only containing
+ them is likely to only have a minor impact on recall.
+
+ This makes overall matching faster by reducing the number of hits produced by
+ the weakAnd operator.
+
+ Example:
+ adjust-target: 0.01
+
+ This excludes documents that only have terms that occur in more than approximately 1%
+ of the document corpus. The actual threshold is query-specific and based on the query
+ term score whose document frequency is closest to 1%.
+
+
@@ -3961,7 +4070,6 @@
Contained in field or rank-profile.
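For illustration only, a minimal sketch of how the weakand block documented in the patch above could be used in a schema. The music schema, title field, and default rank-profile are hypothetical names; only the weakand block and its two properties (with the example values 0.60 and 0.01) are taken from the documentation text itself:

schema music {
    document music {
        field title type string {
            indexing: index | summary
        }
    }

    rank-profile default {
        weakand {
            # Drop query terms that occur in at least 60% of all documents
            stopword-limit: 0.60
            # Exclude documents whose matching terms all occur in more than ~1% of the corpus
            adjust-target: 0.01
        }
    }
}

Both properties occur zero to one times, so either can be set on its own: stopword-limit drops overly frequent query terms up front, while adjust-target derives a per-query score threshold that excludes documents matching only very frequent terms.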
From 1a2259233fb4f9c064cadb137e8f6255b700c9de Mon Sep 17 00:00:00 2001
From: Tor Brede Vekterli
Tunes the weakAnd algorithm to automatically
- exclude terms and documents with expected low query significance based on term frequency
+ exclude terms and documents with expected low query significance based on document frequency
statistics present in the document corpus. This makes matching faster at the cost of potentially
reduced recall.
Tunes the weakAnd algorithm to automatically
- exclude terms and documents with expected low query significance based on term frequency
+ exclude terms and documents with expected low query significance based on document frequency
statistics present in the document corpus. This makes matching faster at the cost of potentially
reduced recall.
- Note that all term frequency calculations are done using content node-local document
+ Note that all document frequency calculations are done using content node-local document
statistics (i.e. global significance
does not have an effect). This means results may differ across different content nodes and/or
content node groups.
From 52ac80f12acb6270e05088d4bf842c8a93d3ebfa Mon Sep 17 00:00:00 2001
From: Tor Brede Vekterli
This can be used to efficiently exclude documents that only match terms that
- occur very frequently in the document corpus. Such terms are likely to be stop-words
+ occur very frequently in the document corpus. Such terms are likely to be stopwords
that have low semantic value for the query, and excluding documents only containing
them is likely to only have a minor impact on recall.
rank-profile
@@ -4029,7 +4029,7 @@ weakand
weakand
}
weakand
This will drop all query terms that occur in at least 60% of the documents.
- Using stopword-limit is similar to explicitly removing stop words
+ Using stopword-limit is similar to explicitly removing stopwords
from the query up front, but has the benefit of dynamically adapting to the
- actual document corpus and not having to know—or specify—a set of stop-words.
+ actual document corpus and not having to know—or specify—a set of stopwords.
weakand