Update SPLADE notebook with new sections
NirantK committed Mar 22, 2024
1 parent bc1e238 commit c651b2b
Showing 1 changed file with 11 additions and 9 deletions.
20 changes: 11 additions & 9 deletions docs/examples/SPLADE_with_FastEmbed.ipynb
@@ -13,9 +13,10 @@
     "## Outline:\n",
     "1. [What is SPLADE?](#What-is-SPLADE?)\n",
     "2. [Setting up the environment](#Setting-up-the-environment)\n",
-    "3. Generating SPLADE vectors with FastEmbed\n",
-    "4. Understanding SPLADE vectors\n",
-    "5. Applications of SPLADE vectors\n",
+    "3. [Generating SPLADE vectors with FastEmbed](#Generating-SPLADE-vectors-with-FastEmbed)\n",
+    "4. [Understanding SPLADE vectors](#Understanding-SPLADE-vectors)\n",
+    "5. [Observations and Design Choices](#Observations-and-Model-Design-Choices)\n",
+    "\n",
     "\n",
     "## What is SPLADE?\n",
     "\n",
@@ -206,6 +207,8 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "## Understanding SPLADE vectors\n",
+    "\n",
     "This is still a little abstract, so let's use the tokenizer vocab to make sense of these indices."
    ]
   },
@@ -279,6 +282,7 @@
    }
   ],
   "source": [
+    "import json\n",
     "from transformers import AutoTokenizer\n",
     "\n",
     "tokenizer = AutoTokenizer.from_pretrained(SparseTextEmbedding.list_supported_models()[0][\"sources\"][\"hf\"])"
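The cell above resolves the tokenizer from FastEmbed's model registry. A minimal sketch of that lookup, using a stand-in registry entry shaped like what `SparseTextEmbedding.list_supported_models()` returns in the notebook (the exact field values here are illustrative assumptions, not the library's live registry):

```python
# Stand-in for SparseTextEmbedding.list_supported_models(): a list of dicts
# describing each supported sparse model. The "sources" -> "hf" entry is a
# Hugging Face repo id of the kind AutoTokenizer.from_pretrained() accepts.
def list_supported_models():
    return [
        {
            "model": "prithivida/Splade_PP_en_v1",       # illustrative entry
            "sources": {"hf": "Qdrant/SPLADE_PP_en_v1"},  # illustrative repo id
        }
    ]

# Same indexing pattern as the notebook cell: first model, HF source id.
hf_repo_id = list_supported_models()[0]["sources"]["hf"]
print(hf_repo_id)
```

The indirection keeps the notebook in sync with whatever checkpoints FastEmbed ships, instead of hard-coding a repo id next to the tokenizer call.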
@@ -333,9 +337,6 @@
    }
   ],
   "source": [
-    "import json\n",
-    "\n",
-    "\n",
     "def get_tokens_and_weights(sparse_embedding, tokenizer):\n",
     "    token_weight_dict = {}\n",
     "    for i in range(len(sparse_embedding.indices)):\n",
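The helper is truncated at the hunk boundary above. A minimal self-contained sketch of the full cell, assuming (as the notebook shows) that the sparse embedding exposes parallel `indices` and `values` arrays; the `SparseEmbedding` dataclass and `ToyTokenizer` are lightweight stand-ins so the sketch runs without downloading the real model or tokenizer:

```python
import json
from dataclasses import dataclass

@dataclass
class SparseEmbedding:
    # Stand-in for FastEmbed's sparse embedding: parallel arrays of
    # vocabulary indices and their learned weights.
    indices: list
    values: list

def get_tokens_and_weights(sparse_embedding, tokenizer):
    """Map each vocabulary index back to its token string, sorted by weight."""
    token_weight_dict = {}
    for i in range(len(sparse_embedding.indices)):
        token = tokenizer.decode([sparse_embedding.indices[i]])
        token_weight_dict[token] = float(sparse_embedding.values[i])
    # Highest-weighted tokens first, so the most important terms lead.
    return dict(sorted(token_weight_dict.items(),
                       key=lambda kv: kv[1], reverse=True))

class ToyTokenizer:
    # Stand-in for AutoTokenizer with an invented 4-entry vocabulary.
    vocab = {101: "[CLS]", 2203: "moon", 3: "3", 102: "[SEP]"}
    def decode(self, ids):
        return " ".join(self.vocab[i] for i in ids)

emb = SparseEmbedding(indices=[2203, 3], values=[1.4, 0.7])
print(json.dumps(get_tokens_and_weights(emb, ToyTokenizer()), indent=2))
```

This is also where the relocated `import json` earns its keep: pretty-printing the token-to-weight dict is what makes the sparse vector readable.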
@@ -356,15 +357,16 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Observations\n",
+    "## Observations and Model Design Choices\n",
     "\n",
     "1. The relative order of importance is quite useful. The most important tokens in the sentence have the highest weights.\n",
     "1. **Term Expansion**: The model can expand the terms in the document: it can generate weights for tokens that are related to, but not present in, the document. This powerful feature lets the model capture the document's context. Here, you'll see that the model has added the token '3' (from 'third') and 'moon' (from 'lunar') to the sparse vector.\n",
     "\n",
-    "## Design Choices\n",
+    "### Design Choices\n",
     "\n",
     "1. The weights are not normalized. This means that the sum of the weights is not 1 or 100. This is a common practice in sparse embeddings, as it allows the model to capture the importance of each token in the document.\n",
-    "1. Tokens are included in the sparse vector only if they are present in the model's vocabulary. This means that the model will not generate a weight for tokens that it has not seen during training."
+    "1. Tokens are included in the sparse vector only if they are present in the model's vocabulary. This means that the model will not generate a weight for tokens that it has not seen during training.\n",
+    "1. Tokens do not map to words directly -- allowing you to gracefully handle typos and out-of-vocabulary words."
    ]
   }
  ],
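The design choices in that last cell can be illustrated directly. A hedged sketch: the weights and the four-entry vocabulary are toy values, and the greedy splitter is a simplified WordPiece stand-in, not the model's actual tokenizer:

```python
# Toy values illustrating the design choices; not actual model output.
token_weights = {"moon": 1.38, "lunar": 1.21, "3": 0.55, "third": 0.48}

# 1. Weights are raw importance scores, not a probability distribution:
assert abs(sum(token_weights.values()) - 1.0) > 1e-6

VOCAB = {"lun", "##ar", "##nar", "moon"}  # invented subword vocabulary

def wordpiece_split(word, vocab):
    """Greedy longest-match subword split, a simplified WordPiece stand-in."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            # Non-initial pieces carry the "##" continuation prefix.
            sub = word[start:end] if start == 0 else "##" + word[start:end]
            if sub in vocab:
                piece, start = sub, end
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # nothing matched: fall back to the unknown token
        pieces.append(piece)
    return pieces

# 2. A typo ("lunnar") still maps onto known vocabulary pieces:
print(wordpiece_split("lunnar", VOCAB))  # -> ['lun', '##nar']
```

Because tokens are subwords rather than whole words, a misspelling degrades into nearby pieces instead of falling out of the vocabulary entirely, which is what makes the sparse vectors robust to typos.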