From 6e2d4fac5ba748cb9f650eede65011805abe1b10 Mon Sep 17 00:00:00 2001 From: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> Date: Wed, 22 Jan 2025 18:06:07 +0100 Subject: [PATCH] docs: add notebook example with XML backends Signed-off-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com> --- docs/examples/backend_xml_rag.ipynb | 926 +++++++++++++++++++++++++++- mkdocs.yml | 1 + 2 files changed, 925 insertions(+), 2 deletions(-) diff --git a/docs/examples/backend_xml_rag.ipynb b/docs/examples/backend_xml_rag.ipynb index db08f24c..6d191aaa 100644 --- a/docs/examples/backend_xml_rag.ipynb +++ b/docs/examples/backend_xml_rag.ipynb @@ -4,13 +4,935 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Converting US Patents from XML files for a RAG application" + "\"Open" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Backend converters for XML" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "| Step | Tech | Execution | \n", + "| --- | --- | --- |\n", + "| Embedding | Hugging Face / Sentence Transformers | 💻 Local |\n", + "| Vector store | Milvus | 💻 Local |\n", + "| Gen AI | Hugging Face Inference API | 🌐 Remote | " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Overview" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This is an example of using [Docling](https://ds4sd.github.io/docling/) for converting structured data (XML) into a unified document\n", + "representation format, `DoclingDocument`, and leveraging its rich structured content for RAG applications.\n", + "Data used in this example consist of patents from the [United States Patent and Trademark Office](https://www.uspto.gov/) and medical\n", + "articles from [PubMed Central®](https://pmc.ncbi.nlm.nih.gov/).\n", + "\n", + "In this notebook, we accomplish the following:\n", + "- [Set up](#setup) API access for generative AI\n", + "- [Fetch the
data](#fetch-the-data) from USPTO and PubMed Central® sites, using Docling custom backends\n", + "- [Parse, chunk, and index](#parse-chunk-and-index) the documents in a vector database\n", + "- [Perform RAG](#question-answering-with-rag) using the [LlamaIndex Docling extension](../../integrations/llamaindex/)\n", + "- [Delete the temporary files](#delete-temporary-files) used in the notebook\n", + "\n", + "For more details on document chunking with Docling, refer to the [Chunking](../../concepts/chunking/) documentation. For RAG with Docling and LlamaIndex, also check the example [RAG with LlamaIndex](../rag_llamaindex/)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Setup" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Requirements can be installed as shown below. The `--no-warn-conflicts` argument is meant for Colab's pre-populated Python environment; feel free to remove it for stricter usage." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], + "source": [ + "%pip install -q --progress-bar off --no-warn-conflicts llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-vector-stores-milvus llama-index-readers-file python-dotenv" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This notebook uses Hugging Face's Inference API. For an increased LLM quota, a token can be provided via the environment variable `HF_TOKEN`.\n", + "\n", + "If you're running this notebook in Google Colab, make sure you [add](https://medium.com/@parthdasawant/how-to-use-secrets-in-google-colab-450c38e3ec75) your API key as a secret."
+ ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "from warnings import filterwarnings\n", + "\n", + "from dotenv import load_dotenv\n", + "\n", + "\n", + "def _get_env_from_colab_or_os(key):\n", + "    try:\n", + "        from google.colab import userdata\n", + "\n", + "        try:\n", + "            return userdata.get(key)\n", + "        except userdata.SecretNotFoundError:\n", + "            pass\n", + "    except ImportError:\n", + "        pass\n", + "    return os.getenv(key)\n", + "\n", + "\n", + "load_dotenv()\n", + "\n", + "filterwarnings(action=\"ignore\", category=UserWarning, module=\"pydantic\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can now define the main parameters:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "from pathlib import Path\n", + "from tempfile import mkdtemp\n", + "\n", + "from llama_index.embeddings.huggingface import HuggingFaceEmbedding\n", + "from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI\n", + "\n", + "EMBED_MODEL_ID = \"BAAI/bge-small-en-v1.5\"\n", + "EMBED_MODEL = HuggingFaceEmbedding(model_name=EMBED_MODEL_ID)\n", + "TEMP_DIR = Path(mkdtemp())\n", + "MILVUS_URI = str(TEMP_DIR / \"docling.db\")\n", + "GEN_MODEL = HuggingFaceInferenceAPI(\n", + "    token=_get_env_from_colab_or_os(\"HF_TOKEN\"),\n", + "    model_name=\"mistralai/Mixtral-8x7B-Instruct-v0.1\",\n", + ")\n", + "embed_dim = len(EMBED_MODEL.get_text_embedding(\"hi\"))\n", + "# https://github.com/huggingface/transformers/issues/5486:\n", + "os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Fetch the data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this notebook we will use XML data from collections supported by Docling:\n", + "- Medical articles from the [PubMed Central® (PMC)](https://pmc.ncbi.nlm.nih.gov/).
They are available on an [FTP server](https://ftp.ncbi.nlm.nih.gov/pub/pmc/) as `.tar.gz` files. Each file contains the full article data in XML format, among other supplementary files like images or spreadsheets.\n", + "- Patents from the [United States Patent and Trademark Office](https://www.uspto.gov/). They are available in the [Bulk Data Storage System (BDSS)](https://bulkdata.uspto.gov/) as zip files. Each zip file may contain several patents in XML format.\n", + "\n", + "The raw files will be downloaded from the source and saved in a temporary directory." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### PMC articles\n", + "\n", + "The [OA file](https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_file_list.csv) is a manifest file of all the PMC articles, including the URL path to download the source files. As an example, this notebook uses the article [Pathogens spread by high-altitude windborne mosquitoes](https://pmc.ncbi.nlm.nih.gov/articles/PMC11703268/), which is available in the archive file [PMC11703268.tar.gz](https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/e3/6b/PMC11703268.tar.gz)."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Downloading https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/e3/6b/PMC11703268.tar.gz...\n", + "Extracting and storing the XML file containing the article text...\n", + "Stored XML file nihpp-2024.12.26.630351v1.nxml\n" + ] + } + ], + "source": [ + "import tarfile\n", + "from io import BytesIO\n", + "\n", + "import requests\n", + "\n", + "# PMC article PMC11703268\n", + "url: str = \"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/e3/6b/PMC11703268.tar.gz\"\n", + "\n", + "print(f\"Downloading {url}...\")\n", + "buf = BytesIO(requests.get(url).content)\n", + "print(\"Extracting and storing the XML file containing the article text...\")\n", + "with tarfile.open(fileobj=buf, mode=\"r:gz\") as tar_file:\n", + "    for tarinfo in tar_file:\n", + "        if tarinfo.isreg():\n", + "            file_path = Path(tarinfo.name)\n", + "            if file_path.suffix == \".nxml\":\n", + "                with open(TEMP_DIR / file_path.name, \"wb\") as file_obj:\n", + "                    file_obj.write(tar_file.extractfile(tarinfo).read())\n", + "                print(f\"Stored XML file {file_path.name}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### USPTO patents\n", + "\n", + "Since each USPTO file is a concatenation of several patents, we need to split its content into valid XML pieces. The following code downloads a sample zip file, splits its content into sections, and dumps each section as an XML file. For simplicity, this pipeline is shown here in a sequential manner, but it could be parallelized."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import zipfile\n", + "\n", + "# Patent grants from December 17-23, 2024\n", + "url: str = (\n", + " \"https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2024/ipg241217.zip\"\n", + ")\n", + "XML_SPLITTER: str = ' 0\n", + " ): # cases like 0 and is_patent:\n", + " doc_num += 1\n", + " patent_id = f\"ipg241217-{doc_num}\"\n", + " with open(TEMP_DIR / f\"{patent_id}.xml\", \"wb\") as file_obj:\n", + " file_obj.write(patent_buffer.getbuffer())\n", + " is_patent = False\n", + " patent_buffer = BytesIO()\n", + " elif decoded_line.startswith(\"" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "index.from_documents(\n", + " documents=reader.load_data(TEMP_DIR / \"nihpp-2024.12.26.630351v1.nxml\"),\n", + " transformations=[node_parser],\n", + " storage_context=StorageContext.from_defaults(vector_store=vector_store),\n", + " embed_model=EMBED_MODEL,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Question-answering with RAG" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The retriever can be used to identify highly relevant documents:" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Node ID: 36851b60-691d-4cf2-a006-67e14cc8680f\n", + "Text: The portable fitness monitoring device 102 may be a device such\n", + "as, for example, a mobile phone, a personal digital assistant, a music\n", + "file player (e.g. and MP3 player), an intelligent article for wearing\n", + "(e.g. a fitness monitoring garment, wrist band, or watch), a dongle\n", + "(e.g. 
 a small hardware device that protects software) that includes a\n", + "fitn...\n", + "Score: 0.769\n", + "\n", + "Node ID: 8ad03482-52e6-4a82-8cf3-ee376995ba1a\n", + "Text: US Patent Application US 20120071306 entitled “Portable\n", + "Multipurpose Whole Body Exercise Device” discloses a portable\n", + "multipurpose whole body exercise device which can be used for general\n", + "fitness, Pilates-type, core strengthening, therapeutic, and\n", + "rehabilitative exercises as well as stretching and physical therapy\n", + "and which includes storable acc...\n", + "Score: 0.749\n", + "\n", + "Node ID: 1c84f670-94d9-4531-aa84-338448b99746\n", + "Text: Program products, methods, and systems for providing fitness\n", + "monitoring services of the present invention can include any software\n", + "application executed by one or more computing devices. A computing\n", + "device can be any type of computing device having one or more\n", + "processors. For example, a computing device can be a workstation,\n", + "mobile device (e.g., ...\n", + "Score: 0.745\n", + "\n" + ] + } + ], + "source": [ + "retriever = index.as_retriever(similarity_top_k=3)\n", + "results = retriever.retrieve(\"What patents are related to fitness devices?\")\n", + "\n", + "for item in results:\n", + "    print(item)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "With the query engine, we can run question answering with the RAG pattern on the set of indexed documents.\n", + "\n", + "First, we can prompt the LLM directly:" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
╭──────────────────────────────────────────────────── Prompt ─────────────────────────────────────────────────────╮\n",
+       "│ Do mosquitoes in high altitude expand viruses over large distances?                                             │\n",
+       "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[1;31m╭─\u001b[0m\u001b[1;31m───────────────────────────────────────────────────\u001b[0m\u001b[1;31m Prompt \u001b[0m\u001b[1;31m────────────────────────────────────────────────────\u001b[0m\u001b[1;31m─╮\u001b[0m\n", + "\u001b[1;31m│\u001b[0m Do mosquitoes in high altitude expand viruses over large distances? \u001b[1;31m│\u001b[0m\n", + "\u001b[1;31m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
╭─────────────────────────────────────────────── Generated Content ───────────────────────────────────────────────╮\n",
+       "│ Mosquitoes are generally not as prevalent in high altitude areas due to the colder temperatures and lower       │\n",
+       "│ oxygen levels, which can make it difficult for them to fly and reproduce. However, some species of mosquitoes   │\n",
+       "│ are able to survive at higher altitudes.                                                                        │\n",
+       "│                                                                                                                 │\n",
+       "│ Mosquitoes can potentially transmit viruses, such as West Nile virus and Zika virus, over long distances by     │\n",
+       "│ feeding on the blood of a host and then traveling to a new location where they can transmit the virus to        │\n",
+       "│ another host. This is known as the dispersal capacity of mosquitoes.                                            │\n",
+       "│                                                                                                                 │\n",
+       "│ However, it is important to note that the transmission of viruses by mosquitoes is dependent on a number of     │\n",
+       "│ factors, including the presence of the virus in the mosquito population, the presence of susceptible hosts, and │\n",
+       "│ environmental conditions that are conducive to mosquito survival and reproduction.                              │\n",
+       "│                                                                                                                 │\n",
+       "│ In general, the transmission of viruses by mosquitoes is more likely to occur in areas where there is a high    │\n",
+       "│ prevalence of the virus and a large mosquito population, rather than in high altitude areas where mosquitoes    │\n",
+       "│ are less common. It is also worth noting that the transmission of viruses by mosquitoes is typically a          │\n",
+       "│ localized phenomenon, rather than something that occurs over large distances.                                   │\n",
+       "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[1;32m╭─\u001b[0m\u001b[1;32m──────────────────────────────────────────────\u001b[0m\u001b[1;32m Generated Content \u001b[0m\u001b[1;32m──────────────────────────────────────────────\u001b[0m\u001b[1;32m─╮\u001b[0m\n", + "\u001b[1;32m│\u001b[0m Mosquitoes are generally not as prevalent in high altitude areas due to the colder temperatures and lower \u001b[1;32m│\u001b[0m\n", + "\u001b[1;32m│\u001b[0m oxygen levels, which can make it difficult for them to fly and reproduce. However, some species of mosquitoes \u001b[1;32m│\u001b[0m\n", + "\u001b[1;32m│\u001b[0m are able to survive at higher altitudes. \u001b[1;32m│\u001b[0m\n", + "\u001b[1;32m│\u001b[0m \u001b[1;32m│\u001b[0m\n", + "\u001b[1;32m│\u001b[0m Mosquitoes can potentially transmit viruses, such as West Nile virus and Zika virus, over long distances by \u001b[1;32m│\u001b[0m\n", + "\u001b[1;32m│\u001b[0m feeding on the blood of a host and then traveling to a new location where they can transmit the virus to \u001b[1;32m│\u001b[0m\n", + "\u001b[1;32m│\u001b[0m another host. This is known as the dispersal capacity of mosquitoes. \u001b[1;32m│\u001b[0m\n", + "\u001b[1;32m│\u001b[0m \u001b[1;32m│\u001b[0m\n", + "\u001b[1;32m│\u001b[0m However, it is important to note that the transmission of viruses by mosquitoes is dependent on a number of \u001b[1;32m│\u001b[0m\n", + "\u001b[1;32m│\u001b[0m factors, including the presence of the virus in the mosquito population, the presence of susceptible hosts, and \u001b[1;32m│\u001b[0m\n", + "\u001b[1;32m│\u001b[0m environmental conditions that are conducive to mosquito survival and reproduction.
\u001b[1;32m│\u001b[0m\n", + "\u001b[1;32m│\u001b[0m \u001b[1;32m│\u001b[0m\n", + "\u001b[1;32m│\u001b[0m In general, the transmission of viruses by mosquitoes is more likely to occur in areas where there is a high \u001b[1;32m│\u001b[0m\n", + "\u001b[1;32m│\u001b[0m prevalence of the virus and a large mosquito population, rather than in high altitude areas where mosquitoes \u001b[1;32m│\u001b[0m\n", + "\u001b[1;32m│\u001b[0m are less common. It is also worth noting that the transmission of viruses by mosquitoes is typically a \u001b[1;32m│\u001b[0m\n", + "\u001b[1;32m│\u001b[0m localized phenomenon, rather than something that occurs over large distances. \u001b[1;32m│\u001b[0m\n", + "\u001b[1;32m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from llama_index.core.base.llms.types import ChatMessage, MessageRole\n", + "from rich.console import Console\n", + "from rich.panel import Panel\n", + "\n", + "console = Console()\n", + "query = \"Do mosquitoes in high altitude expand viruses over large distances?\"\n", + "\n", + "usr_msg = ChatMessage(role=MessageRole.USER, content=query)\n", + "response = GEN_MODEL.chat(messages=[usr_msg])\n", + "\n", + "console.print(Panel(query, title=\"Prompt\", border_style=\"bold red\"))\n", + "console.print(\n", + "    Panel(\n", + "        response.message.content.strip(),\n", + "        title=\"Generated Content\",\n", + "        border_style=\"bold green\",\n", + "    )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, we can compare the response when the model is prompted with the indexed PMC article as
supporting context:" + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
╭────────────────────────────────────────── Generated Content with RAG ───────────────────────────────────────────╮\n",
+       "│ Yes, mosquitoes in high altitude can expand viruses over large distances. A study intercepted 1,017 female      │\n",
+       "│ mosquitoes at altitudes of 120-290 m above ground over Mali and Ghana and screened them for infection with      │\n",
+       "│ arboviruses, plasmodia, and filariae. The study found that 3.5% of the mosquitoes were infected with            │\n",
+       "│ flaviviruses, and 1.1% were infectious. Additionally, the study identified 19 mosquito-borne pathogens,         │\n",
+       "│ including three arboviruses that affect humans (dengue, West Nile, and M’Poko viruses). The study provides      │\n",
+       "│ compelling evidence that mosquito-borne pathogens are often spread by windborne mosquitoes at altitude.         │\n",
+       "╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\n",
+       "
\n" + ], + "text/plain": [ + "\u001b[1;32m╭─\u001b[0m\u001b[1;32m─────────────────────────────────────────\u001b[0m\u001b[1;32m Generated Content with RAG \u001b[0m\u001b[1;32m──────────────────────────────────────────\u001b[0m\u001b[1;32m─╮\u001b[0m\n", + "\u001b[1;32m│\u001b[0m Yes, mosquitoes in high altitude can expand viruses over large distances. A study intercepted 1,017 female \u001b[1;32m│\u001b[0m\n", + "\u001b[1;32m│\u001b[0m mosquitoes at altitudes of 120-290 m above ground over Mali and Ghana and screened them for infection with \u001b[1;32m│\u001b[0m\n", + "\u001b[1;32m│\u001b[0m arboviruses, plasmodia, and filariae. The study found that 3.5% of the mosquitoes were infected with \u001b[1;32m│\u001b[0m\n", + "\u001b[1;32m│\u001b[0m flaviviruses, and 1.1% were infectious. Additionally, the study identified 19 mosquito-borne pathogens, \u001b[1;32m│\u001b[0m\n", + "\u001b[1;32m│\u001b[0m including three arboviruses that affect humans (dengue, West Nile, and M’Poko viruses). The study provides \u001b[1;32m│\u001b[0m\n", + "\u001b[1;32m│\u001b[0m compelling evidence that mosquito-borne pathogens are often spread by windborne mosquitoes at altitude.
 \u001b[1;32m│\u001b[0m\n", + "\u001b[1;32m╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯\u001b[0m\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters\n", + "\n", + "filters = MetadataFilters(\n", + "    filters=[\n", + "        ExactMatchFilter(key=\"filename\", value=\"nihpp-2024.12.26.630351v1.nxml\"),\n", + "    ]\n", + ")\n", + "\n", + "query_engine = index.as_query_engine(llm=GEN_MODEL, filter=filters, similarity_top_k=3)\n", + "result = query_engine.query(query)\n", + "\n", + "console.print(\n", + "    Panel(\n", + "        result.response.strip(),\n", + "        title=\"Generated Content with RAG\",\n", + "        border_style=\"bold green\",\n", + "    )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Delete temporary files\n", + "\n", + "The XML files used in this notebook, as well as the local Milvus database, will be removed."
+ ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "import shutil\n", + "\n", + "shutil.rmtree(TEMP_DIR)" ] } ], "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, "language_info": { - "name": "python" + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.8" } }, "nbformat": 4, diff --git a/mkdocs.yml b/mkdocs.yml index 8f8d86d9..2b5036e3 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -77,6 +77,7 @@ nav: - "Force full page OCR": examples/full_page_ocr.py - "Accelerator options": examples/run_with_accelerator.py - "Simple translation": examples/translate.py + - "Backend converters for XML files": examples/backend_xml_rag.ipynb - ✂️ Chunking: - "Hybrid chunking": examples/hybrid_chunking.ipynb - 💬 RAG / QA: