From c651b2b539308557c950188523de218f43c06d56 Mon Sep 17 00:00:00 2001
From: Nirant Kasliwal
Date: Fri, 22 Mar 2024 15:18:34 +0530
Subject: [PATCH] Update SPLADE notebook with new sections

---
 docs/examples/SPLADE_with_FastEmbed.ipynb | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/docs/examples/SPLADE_with_FastEmbed.ipynb b/docs/examples/SPLADE_with_FastEmbed.ipynb
index 5dff21c6..1ccb44e7 100644
--- a/docs/examples/SPLADE_with_FastEmbed.ipynb
+++ b/docs/examples/SPLADE_with_FastEmbed.ipynb
@@ -13,9 +13,10 @@
     "## Outline:\n",
     "1. [What is SPLADE?](#What-is-SPLADE?)\n",
     "2. [Setting up the environment](#Setting-up-the-environment)\n",
-    "3. Generating SPLADE vectors with FastEmbed\n",
-    "4. Understanding SPLADE vectors\n",
-    "5. Applications of SPLADE vectors\n",
+    "3. [Generating SPLADE vectors with FastEmbed](#Generating-SPLADE-vectors-with-FastEmbed)\n",
+    "4. [Understanding SPLADE vectors](#Understanding-SPLADE-vectors)\n",
+    "5. [Observations and Model Design Choices](#Observations-and-Model-Design-Choices)\n",
+    "\n",
     "\n",
     "## What is SPLADE?\n",
     "\n",
@@ -206,6 +207,8 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "## Understanding SPLADE vectors\n",
+    "\n",
     "This is still a little abstract, so let's use the tokenizer vocab to make sense of these indices."
    ]
   },
@@ -279,6 +282,7 @@
     }
    ],
    "source": [
+    "import json\n",
     "from transformers import AutoTokenizer\n",
     "\n",
     "tokenizer = AutoTokenizer.from_pretrained(SparseTextEmbedding.list_supported_models()[0][\"sources\"][\"hf\"])"
@@ -333,9 +337,6 @@
     }
    ],
    "source": [
-    "import json\n",
-    "\n",
-    "\n",
     "def get_tokens_and_weights(sparse_embedding, tokenizer):\n",
     "    token_weight_dict = {}\n",
     "    for i in range(len(sparse_embedding.indices)):\n",
@@ -356,15 +357,16 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Observations\n",
+    "## Observations and Model Design Choices\n",
     "\n",
     "1. The relative order of importance is quite useful. The most important tokens in the sentence have the highest weights.\n",
     "1. **Term Expansion**: The model can expand the terms in the document. This means that the model can generate weights for tokens that are not present in the document but are related to the tokens in the document. This is a powerful feature that allows the model to capture the context of the document. Here, you'll see that the model has added the tokens '3' from 'third' and 'moon' from 'lunar' to the sparse vector.\n",
     "\n",
-    "## Design Choices\n",
+    "### Design Choices\n",
     "\n",
     "1. The weights are not normalized. This means that the sum of the weights is not 1 or 100. This is a common practice in sparse embeddings, as it allows the model to capture the importance of each token in the document.\n",
-    "1. Tokens are included in the sparse vector only if they are present in the model's vocabulary. This means that the model will not generate a weight for tokens that it has not seen during training."
+    "1. Tokens are included in the sparse vector only if they are present in the model's vocabulary. This means that the model will not generate a weight for tokens that it has not seen during training.\n",
+    "1. Tokens do not map to words directly, which allows the model to handle typos and out-of-vocabulary terms gracefully."
    ]
   }
  ],