Rewrite GettingStarted to use TextEmbedding instead of DefaultEmbedding
NirantK committed Mar 7, 2024
1 parent 2a61a30 commit 019c889
Showing 1 changed file with 116 additions and 122 deletions.
238 changes: 116 additions & 122 deletions docs/Getting Started.ipynb
@@ -11,7 +11,9 @@
"\n",
"## Quick Start\n",
"\n",
"The fastembed package is designed to be easy to use. The main class is the `Embedding` class. It takes a list of strings as input and returns a list of vectors as output. The `Embedding` class is initialized with a model file."
"The fastembed package is designed to be easy to use. We'll be using `TextEmbedding` class. It takes a list of strings as input and returns an generator of vectors. If you're seeing generators for the first time, don't worry, you can convert it to a list using `list()`.\n",
"\n",
"> 💡 You can learn more about generators from [Python Wiki](https://wiki.python.org/moin/Generators)"
]
},
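If generators are new to you, here is a minimal pure-Python sketch (independent of fastembed; `fake_embed` is a hypothetical stand-in) of the pattern that `embed` follows: values are produced lazily as you iterate, and can be materialized all at once with `list()`.

```python
def fake_embed(documents):
    """Yield one stand-in 'vector' per document, lazily."""
    for doc in documents:
        # a real embedding model would return a learned vector here
        yield [float(len(doc))] * 3

gen = fake_embed(["Hello, World!", "This is an example document."])
# Nothing has been computed yet; vectors appear only as you iterate.
vectors = list(gen)  # materialize the generator into a list
print(len(vectors), len(vectors[0]))
```

The same two-step shape — get a generator, then `list()` it when you actually need all the vectors — is what the fastembed cells below use.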
{
@@ -21,15 +23,7 @@
"metadata": {},
"outputs": [],
"source": [
"!pip install fastembed --upgrade --quiet # Install fastembed "
]
},
{
"cell_type": "markdown",
"id": "ed81d725",
"metadata": {},
"source": [
"Make the necessary imports, initialize the `Embedding` class, and embed your data into vectors:"
"!pip install -Uqq fastembed # Install fastembed"
]
},
{
@@ -39,186 +33,186 @@
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 76.7M/76.7M [00:05<00:00, 15.0MiB/s]\n",
"100%|██████████| 3/3 [00:00<00:00, 455.37it/s]"
]
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "890cc3b969354eec8d149d143e301a7a",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Fetching 9 files: 0%| | 0/9 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"(384,)\n"
"The model BAAI/bge-small-en-v1.5 is ready to use.\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
"data": {
"text/plain": [
"384"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from typing import List\n",
"import numpy as np\n",
"from fastembed.embedding import DefaultEmbedding\n",
"from fastembed import TextEmbedding\n",
"from typing import List\n",
"\n",
"# Example list of documents\n",
"documents: List[str] = [\n",
" \"Hello, World!\",\n",
" \"This is an example document.\",\n",
" \"This is built to be faster and lighter than other embedding libraries e.g. Transformers, Sentence-Transformers, etc.\",\n",
" \"fastembed is supported by and maintained by Qdrant.\",\n",
"]\n",
"# Initialize the DefaultEmbedding class\n",
"embedding_model = DefaultEmbedding()\n",
"embeddings: List[np.ndarray] = list(embedding_model.embed(documents))\n",
"print(embeddings[0].shape)"
"\n",
"# This will trigger the model download and initialization\n",
"embedding_model = TextEmbedding()\n",
"print(\"The model BAAI/bge-small-en-v1.5 is ready to use.\")\n",
"\n",
"embeddings_generator = embedding_model.embed(documents) # reminder this is a generator\n",
"embeddings_list = list(embedding_model.embed(documents))\n",
" # you can also convert the generator to a list, and that to a numpy array\n",
"len(embeddings_list[0]) # Vector of 384 dimensions"
]
},
{
"cell_type": "markdown",
"id": "8c49ae50",
"id": "d772190b",
"metadata": {},
"source": [
"## Let's think step by step"
"> 💡 **Why do we use generators?**\n",
"> \n",
"> We use them to save memory mostly. Instead of loading all the vectors into memory, we can load them one by one. This is useful when you have a large dataset and you don't want to load all the vectors at once."
]
},
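A rough illustration of that memory argument, using only the standard library: the generator object stays tiny no matter how many vectors it will eventually yield, while the materialized list grows with the data. (These are toy zero vectors, not real embeddings.)

```python
import sys

def vectors(n, dim=384):
    """Lazily yield n placeholder vectors of the given dimension."""
    for _ in range(n):
        yield [0.0] * dim

gen = vectors(10_000)
materialized = list(vectors(10_000))

print(sys.getsizeof(gen))           # a few hundred bytes, regardless of n
print(sys.getsizeof(materialized))  # grows with the number of vectors
```

Note that `sys.getsizeof` on the list counts only the list object itself, not the inner vectors, so the true gap in total memory use is even larger than these two numbers suggest.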
{
"cell_type": "markdown",
"id": "92cf4b76",
"cell_type": "code",
"execution_count": 3,
"id": "8a225cb8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Document: This is built to be faster and lighter than other embedding libraries e.g. Transformers, Sentence-Transformers, etc.\n",
"Vector of type: <class 'numpy.ndarray'> with shape: (384,)\n",
"Document: fastembed is supported by and maintained by Qdrant.\n",
"Vector of type: <class 'numpy.ndarray'> with shape: (384,)\n"
]
}
],
"source": [
"### Setup\n",
"embeddings_generator = embedding_model.embed(documents) # reminder this is a generator\n",
"\n",
"Importing the required classes and modules:"
"for doc, vector in zip(documents, embeddings_generator):\n",
" print(\"Document:\", doc)\n",
" print(f\"Vector of type: {type(vector)} with shape: {vector.shape}\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "c0a6f634",
"execution_count": 4,
"id": "769a1be9",
"metadata": {},
"outputs": [],
"outputs": [
{
"data": {
"text/plain": [
"(2, 384)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from typing import List\n",
"import numpy as np\n",
"from fastembed.embedding import DefaultEmbedding"
"embeddings_list = np.array(\n",
" list(embedding_model.embed(documents))\n",
") # you can also convert the generator to a list, and that to a numpy array\n",
"embeddings_list.shape"
]
},
{
"cell_type": "markdown",
"id": "3fd03a71",
"id": "8c49ae50",
"metadata": {},
"source": [
"Notice that we are using the DefaultEmbedding -- which is a quantized, state of the Art Flag Embedding model which beats OpenAI's Embedding by a large margin. \n",
"\n",
"### Prepare your Documents\n",
"You can define a list of documents that you'd like to embed. These can be sentences, paragraphs, or even entire documents. \n",
"\n",
"We're using [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) a state of the art Flag Embedding model. The model does better than OpenAI text-embedding-ada-002. We've made it even faster by converting it to ONNX format and quantizing the model for you.\n",
"#### Format of the Document List\n",
"1. List of Strings: Your documents must be in a list, and each document must be a string\n",
"2. For Retrieval Tasks: If you're working with queries and passages, you can add special labels to them:\n",
"- **Queries**: Add \"query:\" at the beginning of each query string\n",
"- **Passages**: Add \"passage:\" at the beginning of each passage string"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "145a56ce",
"metadata": {},
"outputs": [],
"source": [
"# Example list of documents\n",
"documents: List[str] = [\n",
" \"passage: Hello, World!\",\n",
" \"query: Hello, World!\", # these are two different embedding\n",
" \"passage: This is an example passage.\",\n",
" # You can leave out the prefix but it's recommended\n",
" \"fastembed is supported by and maintained by Qdrant.\",\n",
"]"
]
},
{
"cell_type": "markdown",
"id": "1cb3cc87",
"metadata": {},
"source": [
"### Load the Embedding Model Weights\n",
"Next, initialize the Embedding class with the desired parameters. Here, \"BAAI/bge-small-en\" is the pre-trained model name, and max_length=512 is the maximum token length for each document.\n",
"\n",
"This will download the model weights, decompress to directory `local_cache` and load them into the Embedding class.\n",
"- **Passages**: Add \"passage:\" at the beginning of each passage string\n",
"\n",
"#### Initialize DefaultEmbedding\n",
"\n",
"We will initialize Flag Embeddings with the model name and the maximum token length. That is the DefaultEmbedding class with the model name \"BAAI/bge-small-en\" and max_length=512."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "272c8915",
"metadata": {},
"outputs": [],
"source": [
"embedding_model = DefaultEmbedding()"
]
},
{
"cell_type": "markdown",
"id": "5549d501",
"metadata": {},
"source": [
"### Embed your Documents\n",
"## Beyond the default model\n",
"\n",
"Use the embed method of the embedding model to transform the documents into a List of np.array. The method returns a generator, so we cast it to a list to get the embeddings."
"The default model is built for speed and efficiency. If you need a more accurate model, you can use the `TextEmbedding` class to load any model from our list of available models. You can find the list of available models using `TextEmbedding.list_supported_models()`."
]
},
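As a sketch of how you might pick a model from that list, here is a small pure-Python filter. The entries below are hypothetical: the real output of `TextEmbedding.list_supported_models()` may use a different schema, so treat the `model` and `dim` keys as assumptions for illustration.

```python
# Hypothetical entries; the actual schema returned by
# TextEmbedding.list_supported_models() may differ.
supported = [
    {"model": "BAAI/bge-small-en-v1.5", "dim": 384},
    {"model": "intfloat/multilingual-e5-large", "dim": 1024},
]

def models_with_dim(models, dim):
    """Return the names of models whose embeddings have the given dimension."""
    return [m["model"] for m in models if m["dim"] == dim]

print(models_with_dim(supported, 1024))
```

This kind of filter is handy when your vector store is already configured for a fixed dimension and you need a compatible model.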
{
"cell_type": "code",
"execution_count": 5,
"id": "8013eee9",
"id": "2e9c8766",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"100%|██████████| 4/4 [00:00<00:00, 361.82it/s]\n"
]
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "9470ec542f3c4400a42452c2489a1abc",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Fetching 8 files: 0%| | 0/8 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"embeddings: List[np.ndarray] = list(embedding_model.embed(documents))"
]
},
{
"cell_type": "markdown",
"id": "e5b5a6ad",
"metadata": {},
"source": [
"You can print the shape of the embeddings to understand their dimensions. Typically, the shape will indicate the number of dimensions in the vector."
"multilingual_large_model = TextEmbedding(\"intfloat/multilingual-e5-large\") # This can take a few minutes to download"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "0d8c8e08",
"id": "a9e70f0e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(384,)\n"
]
"data": {
"text/plain": [
"(4, 1024)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(embeddings[0].shape) # (384,) or similar output"
"np.array(list(multilingual_large_model.embed([\"Hello, world!\", \"你好世界\", \"¡Hola Mundo!\", \"नमस्ते!\"]))).shape # Vector of 1024 dimensions"
]
},
{
"cell_type": "markdown",
"id": "64fe20ed",
"metadata": {},
"source": [
"Next: Checkout how to use FastEmbed with Qdrant for similarity search: [FastEmbed with Qdrant](https://qdrant.github.io/fastembed/examples/Usage_With_Qdrant/)"
]
}
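Before moving on to similarity search, here is a quick sketch of what "similarity" means for these vectors: cosine similarity between two embeddings, in plain NumPy. The vectors below are made up for illustration; with fastembed you would pass real embeddings instead.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = identical direction."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = np.array([1.0, 0.0, 1.0])
v2 = np.array([1.0, 0.0, 0.0])
print(round(cosine_similarity(v1, v2), 4))  # 0.7071
```

Vector databases such as Qdrant compute this (or a related metric) at scale over millions of stored embeddings.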
],
@@ -238,7 +232,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.17"
"version": "3.10.13"
}
},
"nbformat": 4,