Rewrite GettingStarted to use TextEmbedding instead of DefaultEmbedding
NirantK committed Mar 7, 2024
1 parent 2a61a30 commit 019c889
Showing 1 changed file with 116 additions and 122 deletions.
238 changes: 116 additions & 122 deletions docs/Getting Started.ipynb
@@ -11,7 +11,9 @@
"\n",
"## Quick Start\n",
"\n",
"The fastembed package is designed to be easy to use. The main class is the `Embedding` class. It takes a list of strings as input and returns a list of vectors as output. The `Embedding` class is initialized with a model file."
"The fastembed package is designed to be easy to use. We'll be using `TextEmbedding` class. It takes a list of strings as input and returns an generator of vectors. If you're seeing generators for the first time, don't worry, you can convert it to a list using `list()`.\n",
"\n",
"> 💡 You can learn more about generators from [Python Wiki](https://wiki.python.org/moin/Generators)"
]
},
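If generators are new to you, here is a minimal pure-Python sketch (independent of fastembed; `fake_embed` is a hypothetical stand-in) of the pattern that `embed` follows: values are produced lazily as you iterate, and can be materialized all at once with `list()`.

```python
def fake_embed(documents):
    """Yield one stand-in 'vector' per document, lazily."""
    for doc in documents:
        # a real embedding model would return a learned vector here
        yield [float(len(doc))] * 3

gen = fake_embed(["Hello, World!", "This is an example document."])
# Nothing has been computed yet; vectors appear only as you iterate.
vectors = list(gen)  # materialize the generator into a list
print(len(vectors), len(vectors[0]))
```

The same two-step shape — get a generator, then `list()` it when you actually need all the vectors — is what the fastembed cells below use.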
{
@@ -21,15 +23,7 @@
"metadata": {},
"outputs": [],
"source": [
"!pip install fastembed --upgrade --quiet # Install fastembed "
]
},
{
"cell_type": "markdown",
"id": "ed81d725",
"metadata": {},
"source": [
"Make the necessary imports, initialize the `Embedding` class, and embed your data into vectors:"
"!pip install -Uqq fastembed # Install fastembed"
]
},
{
@@ -39,186 +33,186 @@
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 76.7M/76.7M [00:05<00:00, 15.0MiB/s]\n",
"100%|██████████| 3/3 [00:00<00:00, 455.37it/s]"
]
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "890cc3b969354eec8d149d143e301a7a",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Fetching 9 files: 0%| | 0/9 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(384,)\n"
"The model BAAI/bge-small-en-v1.5 is ready to use.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
"data": {
"text/plain": [
"384"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from typing import List\n",
"import numpy as np\n",
"from fastembed.embedding import DefaultEmbedding\n",
"from fastembed import TextEmbedding\n",
"from typing import List\n",
"\n",
"# Example list of documents\n",
"documents: List[str] = [\n",
" \"Hello, World!\",\n",
" \"This is an example document.\",\n",
" \"This is built to be faster and lighter than other embedding libraries e.g. Transformers, Sentence-Transformers, etc.\",\n",
" \"fastembed is supported by and maintained by Qdrant.\",\n",
"]\n",
"# Initialize the DefaultEmbedding class\n",
"embedding_model = DefaultEmbedding()\n",
"embeddings: List[np.ndarray] = list(embedding_model.embed(documents))\n",
"print(embeddings[0].shape)"
"\n",
"# This will trigger the model download and initialization\n",
"embedding_model = TextEmbedding()\n",
"print(\"The model BAAI/bge-small-en-v1.5 is ready to use.\")\n",
"\n",
"embeddings_generator = embedding_model.embed(documents) # reminder this is a generator\n",
"embeddings_list = list(embedding_model.embed(documents))\n",
" # you can also convert the generator to a list, and that to a numpy array\n",
"len(embeddings_list[0]) # Vector of 384 dimensions"
]
},
{
"cell_type": "markdown",
"id": "8c49ae50",
"id": "d772190b",
"metadata": {},
"source": [
"## Let's think step by step"
"> 💡 **Why do we use generators?**\n",
"> \n",
"> We use them to save memory mostly. Instead of loading all the vectors into memory, we can load them one by one. This is useful when you have a large dataset and you don't want to load all the vectors at once."
]
},
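A rough illustration of that memory argument, using only the standard library: the generator object stays tiny no matter how many vectors it will eventually yield, while the materialized list grows with the data. (These are toy zero vectors, not real embeddings.)

```python
import sys

def vectors(n, dim=384):
    """Lazily yield n placeholder vectors of the given dimension."""
    for _ in range(n):
        yield [0.0] * dim

gen = vectors(10_000)
materialized = list(vectors(10_000))

print(sys.getsizeof(gen))           # a few hundred bytes, regardless of n
print(sys.getsizeof(materialized))  # grows with the number of vectors
```

Note that `sys.getsizeof` on the list counts only the list object itself, not the inner vectors, so the true gap in total memory use is even larger than these two numbers suggest.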
{
"cell_type": "markdown",
"id": "92cf4b76",
"cell_type": "code",
"execution_count": 3,
"id": "8a225cb8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Document: This is built to be faster and lighter than other embedding libraries e.g. Transformers, Sentence-Transformers, etc.\n",
"Vector of type: <class 'numpy.ndarray'> with shape: (384,)\n",
"Document: fastembed is supported by and maintained by Qdrant.\n",
"Vector of type: <class 'numpy.ndarray'> with shape: (384,)\n"
]
}
],
"source": [
"### Setup\n",
"embeddings_generator = embedding_model.embed(documents) # reminder this is a generator\n",
"\n",
"Importing the required classes and modules:"
"for doc, vector in zip(documents, embeddings_generator):\n",
" print(\"Document:\", doc)\n",
" print(f\"Vector of type: {type(vector)} with shape: {vector.shape}\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "c0a6f634",
"execution_count": 4,
"id": "769a1be9",
"metadata": {},
"outputs": [],
"outputs": [
{
"data": {
"text/plain": [
"(2, 384)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from typing import List\n",
"import numpy as np\n",
"from fastembed.embedding import DefaultEmbedding"
"embeddings_list = np.array(\n",
" list(embedding_model.embed(documents))\n",
") # you can also convert the generator to a list, and that to a numpy array\n",
"embeddings_list.shape"
]
},
{
"cell_type": "markdown",
"id": "3fd03a71",
"id": "8c49ae50",
"metadata": {},
"source": [
"Notice that we are using the DefaultEmbedding -- which is a quantized, state of the Art Flag Embedding model which beats OpenAI's Embedding by a large margin. \n",
"\n",
"### Prepare your Documents\n",
"You can define a list of documents that you'd like to embed. These can be sentences, paragraphs, or even entire documents. \n",
"\n",
"We're using [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) a state of the art Flag Embedding model. The model does better than OpenAI text-embedding-ada-002. We've made it even faster by converting it to ONNX format and quantizing the model for you.\n",
"#### Format of the Document List\n",
"1. List of Strings: Your documents must be in a list, and each document must be a string\n",
"2. For Retrieval Tasks: If you're working with queries and passages, you can add special labels to them:\n",
"- **Queries**: Add \"query:\" at the beginning of each query string\n",
"- **Passages**: Add \"passage:\" at the beginning of each passage string"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "145a56ce",
"metadata": {},
"outputs": [],
"source": [
"# Example list of documents\n",
"documents: List[str] = [\n",
" \"passage: Hello, World!\",\n",
" \"query: Hello, World!\", # these are two different embedding\n",
" \"passage: This is an example passage.\",\n",
" # You can leave out the prefix but it's recommended\n",
" \"fastembed is supported by and maintained by Qdrant.\",\n",
"]"
]
},
{
"cell_type": "markdown",
"id": "1cb3cc87",
"metadata": {},
"source": [
"### Load the Embedding Model Weights\n",
"Next, initialize the Embedding class with the desired parameters. Here, \"BAAI/bge-small-en\" is the pre-trained model name, and max_length=512 is the maximum token length for each document.\n",
"\n",
"This will download the model weights, decompress to directory `local_cache` and load them into the Embedding class.\n",
"- **Passages**: Add \"passage:\" at the beginning of each passage string\n",
"\n",
"#### Initialize DefaultEmbedding\n",
"\n",
"We will initialize Flag Embeddings with the model name and the maximum token length. That is the DefaultEmbedding class with the model name \"BAAI/bge-small-en\" and max_length=512."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "272c8915",
"metadata": {},
"outputs": [],
"source": [
"embedding_model = DefaultEmbedding()"
]
},
{
"cell_type": "markdown",
"id": "5549d501",
"metadata": {},
"source": [
"### Embed your Documents\n",
"## Beyond the default model\n",
"\n",
"Use the embed method of the embedding model to transform the documents into a List of np.array. The method returns a generator, so we cast it to a list to get the embeddings."
"The default model is built for speed and efficiency. If you need a more accurate model, you can use the `TextEmbedding` class to load any model from our list of available models. You can find the list of available models using `TextEmbedding.list_supported_models()`."
]
},
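As a sketch of how you might pick a model from that list, here is a small pure-Python filter. The entries below are hypothetical: the real output of `TextEmbedding.list_supported_models()` may use a different schema, so treat the `model` and `dim` keys as assumptions for illustration.

```python
# Hypothetical entries; the actual schema returned by
# TextEmbedding.list_supported_models() may differ.
supported = [
    {"model": "BAAI/bge-small-en-v1.5", "dim": 384},
    {"model": "intfloat/multilingual-e5-large", "dim": 1024},
]

def models_with_dim(models, dim):
    """Return the names of models whose embeddings have the given dimension."""
    return [m["model"] for m in models if m["dim"] == dim]

print(models_with_dim(supported, 1024))
```

This kind of filter is handy when your vector store is already configured for a fixed dimension and you need a compatible model.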
{
"cell_type": "code",
"execution_count": 5,
"id": "8013eee9",
"id": "2e9c8766",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 4/4 [00:00<00:00, 361.82it/s]\n"
]
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "9470ec542f3c4400a42452c2489a1abc",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Fetching 8 files: 0%| | 0/8 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"embeddings: List[np.ndarray] = list(embedding_model.embed(documents))"
]
},
{
"cell_type": "markdown",
"id": "e5b5a6ad",
"metadata": {},
"source": [
"You can print the shape of the embeddings to understand their dimensions. Typically, the shape will indicate the number of dimensions in the vector."
"multilingual_large_model = TextEmbedding(\"intfloat/multilingual-e5-large\") # This can take a few minutes to download"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "0d8c8e08",
"id": "a9e70f0e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(384,)\n"
]
"data": {
"text/plain": [
"(4, 1024)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(embeddings[0].shape) # (384,) or similar output"
"np.array(list(multilingual_large_model.embed([\"Hello, world!\", \"你好世界\", \"¡Hola Mundo!\", \"नमस्ते!\"]))).shape # Vector of 1024 dimensions"
]
},
{
"cell_type": "markdown",
"id": "64fe20ed",
"metadata": {},
"source": [
"Next: Checkout how to use FastEmbed with Qdrant for similarity search: [FastEmbed with Qdrant](https://qdrant.github.io/fastembed/examples/Usage_With_Qdrant/)"
]
}
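Before moving on to similarity search, here is a quick sketch of what "similarity" means for these vectors: cosine similarity between two embeddings, in plain NumPy. The vectors below are made up for illustration; with fastembed you would pass real embeddings instead.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = identical direction."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = np.array([1.0, 0.0, 1.0])
v2 = np.array([1.0, 0.0, 0.0])
print(round(cosine_similarity(v1, v2), 4))  # 0.7071
```

Vector databases such as Qdrant compute this (or a related metric) at scale over millions of stored embeddings.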
],
@@ -238,7 +232,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.17"
"version": "3.10.13"
}
},
"nbformat": 4,