
Sparse Retrieval

Elias Bassani edited this page Feb 14, 2023 · 2 revisions

Introduction

A Sparse Retriever is a retrieval model based on lexical matching.
Classic search engines are based on sparse retrieval models, such as BM25 (used by Elasticsearch).

retriv exposes two identical classes, SparseRetriever and SearchEngine, for using the sparse retrieval model BM25.
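To make "lexical matching" concrete, here is a minimal sketch of a textbook BM25 scorer in plain Python. It is not retriv's implementation: the function name `bm25_scores`, the parameter values `k1=1.2` and `b=0.75`, the IDF variant, and the bare lowercase-and-split tokenization are all assumptions made for illustration, and retriv's own preprocessing and parameters may differ.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a textbook BM25 formula.

    Illustrative only: k1, b, and the IDF variant are assumed values,
    not settings read from retriv.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents each term appears in.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # lexical matching: unmatched terms add nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "Generals gathered in their masses",
    "Just like witches at black masses",
    "Evil minds that plot destruction",
    "Sorcerer of death's construction",
]
scores = bm25_scores("witches masses", docs)
# Document positions ordered from best to worst match.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Documents that share no term with the query score zero, which is why a sparse retriever only returns documents containing at least one query term.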

Minimal Working Example

from retriv import SearchEngine

collection = [
  {"id": "doc_1", "text": "Generals gathered in their masses"},
  {"id": "doc_2", "text": "Just like witches at black masses"},
  {"id": "doc_3", "text": "Evil minds that plot destruction"},
  {"id": "doc_4", "text": "Sorcerer of death's construction"},
]

se = SearchEngine("new-index")
se.index(collection)

se.search("witches masses")

Output:

[
  {
    "id": "doc_2",
    "text": "Just like witches at black masses",
    "score": 1.7536403
  },
  {
    "id": "doc_1",
    "text": "Generals gathered in their masses",
    "score": 0.6931472
  }
]

Create index from file

You can index a document collection from a JSONL, CSV, or TSV file. CSV and TSV files must have a header. The file type is inferred automatically. Use the callback parameter to pass a function that converts your documents into the format supported by retriv on the fly. Indexes are saved automatically. This is the preferred way of creating indexes, as it has a low memory footprint.

from retriv import SearchEngine

se = SearchEngine("new-index")

se.index_file(
  path="path/to/collection",  # File kind is automatically inferred
  show_progress=True,         # Default value
  callback=lambda doc: {      # Callback defaults to None
    "id": doc["id"],
    "text": doc["title"] + "\n" + doc["body"],
  },
)
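To see what such a callback does to each record, the following sketch writes a small JSONL file and applies the conversion line by line. It does not call retriv: the reading loop is only a stand-in for the file indexing step, and the function name `to_retriv_format`, the sample records, and the temporary file path are hypothetical.

```python
import json
import os
import tempfile


def to_retriv_format(doc):
    """Keep the id and merge title and body into a single "text" field."""
    return {"id": doc["id"], "text": doc["title"] + "\n" + doc["body"]}


# Hypothetical sample records, stored one JSON object per line (JSONL).
records = [
    {"id": "doc_1", "title": "First title", "body": "First body"},
    {"id": "doc_2", "title": "Second title", "body": "Second body"},
]

path = os.path.join(tempfile.mkdtemp(), "collection.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Stand-in for the indexing step: convert one record at a time,
# so the raw records never need to be held in memory together.
with open(path, encoding="utf-8") as f:
    converted = [to_retriv_format(json.loads(line)) for line in f]
```

A named function like this could be passed as `callback=to_retriv_format` in place of the lambda, which makes the conversion easier to test on its own.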

se = SearchEngine("new-index") is equivalent to:

se = SearchEngine(
  index_name="new-index",               # Default value
  min_df=1,                             # Min doc-frequency. Defaults to 1.
  tokenizer="whitespace",               # Default value
  stemmer="english",                    # Default value (Snowball English)
  stopwords="english",                  # Default value
  spell_corrector=None,                 # Default value
  do_lowercasing=True,                  # Default value
  do_ampersand_normalization=True,      # Default value
  do_special_chars_normalization=True,  # Default value
  do_acronyms_normalization=True,       # Default value
  do_punctuation_removal=True,          # Default value
)
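As a rough illustration of what some of these preprocessing options mean, the sketch below applies lowercasing, punctuation removal, whitespace tokenization, and stopword filtering to a string. It is an assumption-laden approximation, not retriv's pipeline: the `preprocess` function, the tiny `STOPWORDS` set, and the order of the steps are hypothetical, and stemming, spell correction, and the ampersand, special character, and acronym normalizations are left out.

```python
import string

# Hypothetical, tiny stopword list for illustration only.
STOPWORDS = {"a", "at", "in", "of", "that", "the", "their"}


def preprocess(text, do_lowercasing=True, do_punctuation_removal=True,
               stopwords=STOPWORDS):
    """Turn raw text into a list of index terms (simplified sketch)."""
    if do_lowercasing:
        text = text.lower()
    if do_punctuation_removal:
        # Delete every ASCII punctuation character.
        text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()  # whitespace tokenization
    return [tok for tok in tokens if tok not in stopwords]


tokens = preprocess("Generals gathered in their masses!")
# tokens == ["generals", "gathered", "masses"]
```

Queries would need to go through the same steps as documents, otherwise a query term such as "Masses" could not match the indexed term "masses".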

Create index from list

from retriv import SearchEngine

collection = [
  {"id": "doc_1", "title": "...", "body": "..."},
  {"id": "doc_2", "title": "...", "body": "..."},
  {"id": "doc_3", "title": "...", "body": "..."},
  {"id": "doc_4", "title": "...", "body": "..."},
]

se = SearchEngine(...)

se.index(
  collection,
  show_progress=True,         # Default value
  callback=lambda doc: {      # Callback defaults to None
    "id": doc["id"],
    "text": doc["title"] + "\n" + doc["body"],
  },
)

Load / Delete index

from retriv import SearchEngine

se = SearchEngine.load("index-name")

SearchEngine.delete("index-name")