Add remove_duplicate token filter docs #8273 #8394

Open · wants to merge 2 commits into base: main
2 changes: 1 addition & 1 deletion _analyzers/token-filters/index.md
@@ -48,7 +48,7 @@ Normalization | `arabic_normalization`: [ArabicNormalizer](https://lucene.apache
`phonetic` | N/A | Uses a phonetic encoder to emit a metaphone token for each token in the token stream. Requires installing the `analysis-phonetic` plugin.
`porter_stem` | [PorterStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/en/PorterStemFilter.html) | Uses the [Porter stemming algorithm](https://tartarus.org/martin/PorterStemmer/) to perform algorithmic stemming for the English language.
`predicate_token_filter` | N/A | Removes tokens that don’t match the specified predicate script. Supports inline Painless scripts only.
`remove_duplicates` | [RemoveDuplicatesTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/RemoveDuplicatesTokenFilter.html) | Removes duplicate tokens that are in the same position.
[`remove_duplicates`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/remove-duplicates/) | [RemoveDuplicatesTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/RemoveDuplicatesTokenFilter.html) | Removes duplicate tokens that are in the same position.
`reverse` | [ReverseStringFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/reverse/ReverseStringFilter.html) | Reverses the string corresponding to each token in the token stream. For example, the token `dog` becomes `god`.
`shingle` | [ShingleFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/shingle/ShingleFilter.html) | Generates shingles of lengths between `min_shingle_size` and `max_shingle_size` for tokens in the token stream. Shingles are similar to n-grams but apply to words instead of letters. For example, two-word shingles added to the list of unigrams [`contribute`, `to`, `opensearch`] are [`contribute to`, `to opensearch`].
`snowball` | N/A | Stems words using a [Snowball-generated stemmer](https://snowballstem.org/). You can use the `snowball` token filter with the following languages in the `language` field: `Arabic`, `Armenian`, `Basque`, `Catalan`, `Danish`, `Dutch`, `English`, `Estonian`, `Finnish`, `French`, `German`, `German2`, `Hungarian`, `Irish`, `Italian`, `Kp`, `Lithuanian`, `Lovins`, `Norwegian`, `Porter`, `Portuguese`, `Romanian`, `Russian`, `Spanish`, `Swedish`, `Turkish`.
152 changes: 152 additions & 0 deletions _analyzers/token-filters/remove-duplicates.md
@@ -0,0 +1,152 @@
---
layout: default
title: Remove duplicates
parent: Token filters
nav_order: 350
---

# Remove duplicates token filter

The `remove_duplicates` token filter removes duplicate tokens that are generated at the same position during analysis.
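
If you want to see the filter's effect without creating an index, you can apply it inline using the `_analyze` API. The following is a minimal sketch that pairs the `keyword_repeat` filter, which emits a copy of every token at the same position, with `remove_duplicates`:

```json
GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["keyword_repeat", "remove_duplicates"],
  "text": "turtle"
}
```
{% include copy-curl.html %}

Without `remove_duplicates`, the response would contain two identical `turtle` tokens at position 0; with it, only one remains.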

## Example

The following example creates an index with a custom analyzer that uses the `keyword_repeat` token filter. The filter adds a `keyword` version of each token at the same position as the token itself and then uses the `kstem` filter to create a stemmed version of each token:

```json
PUT /example-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "keyword_repeat",
            "kstem"
          ]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

Use the following request to analyze the string `Slower turtle`:

```json
GET /example-index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "Slower turtle"
}
```
{% include copy-curl.html %}

The response contains the token `turtle` twice in the same position:

```json
{
  "tokens": [
    {
      "token": "slower",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "slow",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "turtle",
      "start_offset": 7,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "turtle",
      "start_offset": 7,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
```

You can remove the duplicate token by adding the `remove_duplicates` token filter to the analyzer:

```json
PUT /index-remove-duplicate
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "keyword_repeat",
            "kstem",
            "remove_duplicates"
          ]
        }
      }
    }
  }
}
```
{% include copy-curl.html %}
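
To apply the analyzer to documents and queries, reference it in a field mapping. The following is a minimal sketch; the `title` field name is only illustrative:

```json
PUT /index-remove-duplicate/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "custom_analyzer"
    }
  }
}
```
{% include copy-curl.html %}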

## Generated tokens

Use the following request to examine the tokens generated by the analyzer:

```json
GET /index-remove-duplicate/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "Slower turtle"
}
```
{% include copy-curl.html %}

The response shows that the duplicate `turtle` token has been removed:

```json
{
  "tokens": [
    {
      "token": "slower",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "slow",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "turtle",
      "start_offset": 7,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
```
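
Note that the `remove_duplicates` filter removes only tokens with identical text at the same position: `slower` and `slow` both remain at position 0 because their text differs.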