From 098968804f2f69de251474d598b9c2548050671a Mon Sep 17 00:00:00 2001
From: Tor Brede Vekterli
+ Tunes the weakAnd algorithm to automatically
+ exclude terms and documents with expected low query significance based on term frequency
+ statistics present in the document corpus. This makes matching faster at the cost of potentially
+ reduced recall.
+ Elements
filter-threshold
rank
rank-type
+ weakand
+ stopword-limit
+ adjust-target
constant
onnx-model
stemming
@@ -1692,6 +1695,17 @@ rank-profile
Zero or more
The rank-type of a field in this profile.
+
@@ -3944,6 +3958,101 @@ weakand
+ Zero or one
+
+
+weakand
+ Contained in rank-profile.
+ Tunes the weakAnd algorithm to automatically
+ exclude terms and documents with expected low query significance based on term frequency
+ statistics present in the document corpus. This makes matching faster at the cost of potentially
+ reduced recall.
+
+weakand {
+    [body]
+}
+
+ Note that all term frequency calculations are done using content node-local document
+ statistics (i.e. global significance
+ does not have an effect). This means results may differ across different content nodes and/or
+ content node groups.
+
+The body of a weakand statement consists of:
+
+Property | Occurrence | Description
+
+stopword-limit
+ Zero to one
+
+ A number in the range [0, 1].
+ Represents the maximum normalized document frequency a query term can have in the
+ corpus (i.e. the ratio of all documents where the term occurs at least once) before
+ it's considered a stopword and dropped entirely from being a part of the
+ weakAnd evaluation.
+
+ Example:
+ stopword-limit: 0.60
+
+ This will drop all query terms that occur in at least 60% of the documents.
+
+ Using stopword-limit is similar to explicitly removing stop words
+ from the query up front, but has the benefit of dynamically adapting to the
+ actual document corpus and not having to know—or specify—a set of stop-words.
+
+adjust-target
+ Zero to one
+
+ A number in the range [0, 1] representing normalized document frequency.
+ Used to derive a per-query document score threshold, where documents scoring
+ lower than the threshold will not be considered as potential hits from the
+ weakAnd operator.
+
+ This can be used to efficiently exclude documents that only match terms that
+ occur very frequently in the document corpus. Such terms are likely to be stop-words
+ that have low semantic value for the query, and excluding documents only containing
+ them is likely to only have a minor impact on recall.
+
+ This makes overall matching faster by reducing the number of hits produced by
+ the weakAnd operator.
+
+ Example:
+ adjust-target: 0.01
+
+ This excludes documents that only have terms that occur in more than approximately 1%
+ of the document corpus. The actual threshold is query-specific and based on the query
+ term score whose document frequency is closest to 1%.
+
+
@@ -3961,7 +4070,6 @@
Contained in field or rank-profile.
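For illustration only, a minimal sketch of how the weakand block documented in the patch above could be used in a schema. The music schema, title field, and default rank-profile are hypothetical names; only the weakand block and its two properties (with the example values 0.60 and 0.01) are taken from the documentation text itself:

schema music {
    document music {
        field title type string {
            indexing: index | summary
        }
    }

    rank-profile default {
        weakand {
            # Drop query terms that occur in at least 60% of all documents
            stopword-limit: 0.60
            # Exclude documents whose matching terms all occur in more than ~1% of the corpus
            adjust-target: 0.01
        }
    }
}

Both properties occur zero to one times, so either can be set on its own: stopword-limit drops overly frequent query terms up front, while adjust-target derives a per-query score threshold that excludes documents matching only very frequent terms.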
From 1a2259233fb4f9c064cadb137e8f6255b700c9de Mon Sep 17 00:00:00 2001
From: Tor Brede Vekterli
Tunes the weakAnd algorithm to automatically
- exclude terms and documents with expected low query significance based on term frequency
+ exclude terms and documents with expected low query significance based on document frequency
statistics present in the document corpus. This makes matching faster at the cost of potentially
reduced recall.
Tunes the weakAnd algorithm to automatically
- exclude terms and documents with expected low query significance based on term frequency
+ exclude terms and documents with expected low query significance based on document frequency
statistics present in the document corpus. This makes matching faster at the cost of potentially
reduced recall.
- Note that all term frequency calculations are done using content node-local document
+ Note that all document frequency calculations are done using content node-local document
statistics (i.e. global significance
does not have an effect). This means results may differ across different content nodes and/or
content node groups.
From 52ac80f12acb6270e05088d4bf842c8a93d3ebfa Mon Sep 17 00:00:00 2001
From: Tor Brede Vekterli
This can be used to efficiently exclude documents that only match terms that
- occur very frequently in the document corpus. Such terms are likely to be stop-words
+ occur very frequently in the document corpus. Such terms are likely to be stopwords
that have low semantic value for the query, and excluding documents only containing
them is likely to only have a minor impact on recall.
rank-profile
@@ -4029,7 +4029,7 @@ weakand
weakand
}
weakand
This will drop all query terms that occur in at least 60% of the documents.
- Using stopword-limit is similar to explicitly removing stop words
+ Using stopword-limit is similar to explicitly removing stopwords
from the query up front, but has the benefit of dynamically adapting to the
- actual document corpus and not having to know—or specify—a set of stop-words.
+ actual document corpus and not having to know—or specify—a set of stopwords.
weakand