Update SPLADE notebook with new sections
NirantK committed Mar 22, 2024
1 parent bc1e238 commit c651b2b
Showing 1 changed file with 11 additions and 9 deletions.
20 changes: 11 additions & 9 deletions docs/examples/SPLADE_with_FastEmbed.ipynb
@@ -13,9 +13,10 @@
     "## Outline:\n",
     "1. [What is SPLADE?](#What-is-SPLADE?)\n",
     "2. [Setting up the environment](#Setting-up-the-environment)\n",
-    "3. Generating SPLADE vectors with FastEmbed\n",
-    "4. Understanding SPLADE vectors\n",
-    "5. Applications of SPLADE vectors\n",
+    "3. [Generating SPLADE vectors with FastEmbed](#Generating-SPLADE-vectors-with-FastEmbed)\n",
+    "4. [Understanding SPLADE vectors](#Understanding-SPLADE-vectors)\n",
+    "5. [Observations and Design Choices](#Observations-and-Model-Design-Choices)\n",
+    "\n",
     "\n",
     "## What is SPLADE?\n",
     "\n",
@@ -206,6 +207,8 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "## Understanding SPLADE vectors\n",
+    "\n",
     "This is still a little abstract, so let's use the tokenizer vocab to make sense of these indices."
    ]
   },
@@ -279,6 +282,7 @@
    }
   ],
   "source": [
+    "import json\n",
     "from transformers import AutoTokenizer\n",
     "\n",
     "tokenizer = AutoTokenizer.from_pretrained(SparseTextEmbedding.list_supported_models()[0][\"sources\"][\"hf\"])"
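The cell above resolves the tokenizer from FastEmbed's model registry. A minimal sketch of that lookup, using a stand-in registry entry shaped like what `SparseTextEmbedding.list_supported_models()` returns in the notebook (the exact field values here are illustrative assumptions, not the library's live registry):

```python
# Stand-in for SparseTextEmbedding.list_supported_models(): a list of dicts
# describing each supported sparse model. The "sources" -> "hf" entry is a
# Hugging Face repo id of the kind AutoTokenizer.from_pretrained() accepts.
def list_supported_models():
    return [
        {
            "model": "prithivida/Splade_PP_en_v1",       # illustrative entry
            "sources": {"hf": "Qdrant/SPLADE_PP_en_v1"},  # illustrative repo id
        }
    ]

# Same indexing pattern as the notebook cell: first model, HF source id.
hf_repo_id = list_supported_models()[0]["sources"]["hf"]
print(hf_repo_id)
```

The indirection keeps the notebook in sync with whatever checkpoints FastEmbed ships, instead of hard-coding a repo id next to the tokenizer call.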
@@ -333,9 +337,6 @@
    }
   ],
   "source": [
-    "import json\n",
-    "\n",
-    "\n",
     "def get_tokens_and_weights(sparse_embedding, tokenizer):\n",
     "    token_weight_dict = {}\n",
     "    for i in range(len(sparse_embedding.indices)):\n",
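The helper is truncated at the hunk boundary above. A minimal self-contained sketch of the full cell, assuming (as the notebook shows) that the sparse embedding exposes parallel `indices` and `values` arrays; the `SparseEmbedding` dataclass and `ToyTokenizer` are lightweight stand-ins so the sketch runs without downloading the real model or tokenizer:

```python
import json
from dataclasses import dataclass

@dataclass
class SparseEmbedding:
    # Stand-in for FastEmbed's sparse embedding: parallel arrays of
    # vocabulary indices and their learned weights.
    indices: list
    values: list

def get_tokens_and_weights(sparse_embedding, tokenizer):
    """Map each vocabulary index back to its token string, sorted by weight."""
    token_weight_dict = {}
    for i in range(len(sparse_embedding.indices)):
        token = tokenizer.decode([sparse_embedding.indices[i]])
        token_weight_dict[token] = float(sparse_embedding.values[i])
    # Highest-weighted tokens first, so the most important terms lead.
    return dict(sorted(token_weight_dict.items(),
                       key=lambda kv: kv[1], reverse=True))

class ToyTokenizer:
    # Stand-in for AutoTokenizer with an invented 4-entry vocabulary.
    vocab = {101: "[CLS]", 2203: "moon", 3: "3", 102: "[SEP]"}
    def decode(self, ids):
        return " ".join(self.vocab[i] for i in ids)

emb = SparseEmbedding(indices=[2203, 3], values=[1.4, 0.7])
print(json.dumps(get_tokens_and_weights(emb, ToyTokenizer()), indent=2))
```

This is also where the relocated `import json` earns its keep: pretty-printing the token-to-weight dict is what makes the sparse vector readable.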
@@ -356,15 +357,16 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Observations\n",
+    "## Observations and Model Design Choices\n",
     "\n",
     "1. The relative order of importance is quite useful. The most important tokens in the sentence have the highest weights.\n",
     "1. **Term Expansion**: The model can expand the terms in the document: it can generate weights for tokens that are related to, but not present in, the document. This powerful feature lets the model capture the document's context. Here, you'll see that the model has added the token '3' (from 'third') and 'moon' (from 'lunar') to the sparse vector.\n",
     "\n",
-    "## Design Choices\n",
+    "### Design Choices\n",
     "\n",
     "1. The weights are not normalized. This means that the sum of the weights is not 1 or 100. This is a common practice in sparse embeddings, as it allows the model to capture the importance of each token in the document.\n",
-    "1. Tokens are included in the sparse vector only if they are present in the model's vocabulary. This means that the model will not generate a weight for tokens that it has not seen during training."
+    "1. Tokens are included in the sparse vector only if they are present in the model's vocabulary. This means that the model will not generate a weight for tokens that it has not seen during training.\n",
+    "1. Tokens do not map to words directly -- allowing you to gracefully handle typos and out-of-vocabulary words."
    ]
   }
  ],
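The design choices in that last cell can be illustrated directly. A hedged sketch: the weights and the four-entry vocabulary are toy values, and the greedy splitter is a simplified WordPiece stand-in, not the model's actual tokenizer:

```python
# Toy values illustrating the design choices; not actual model output.
token_weights = {"moon": 1.38, "lunar": 1.21, "3": 0.55, "third": 0.48}

# 1. Weights are raw importance scores, not a probability distribution:
assert abs(sum(token_weights.values()) - 1.0) > 1e-6

VOCAB = {"lun", "##ar", "##nar", "moon"}  # invented subword vocabulary

def wordpiece_split(word, vocab):
    """Greedy longest-match subword split, a simplified WordPiece stand-in."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            # Non-initial pieces carry the "##" continuation prefix.
            sub = word[start:end] if start == 0 else "##" + word[start:end]
            if sub in vocab:
                piece, start = sub, end
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # nothing matched: fall back to the unknown token
        pieces.append(piece)
    return pieces

# 2. A typo ("lunnar") still maps onto known vocabulary pieces:
print(wordpiece_split("lunnar", VOCAB))  # -> ['lun', '##nar']
```

Because tokens are subwords rather than whole words, a misspelling degrades into nearby pieces instead of falling out of the vocabulary entirely, which is what makes the sparse vectors robust to typos.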