partners, elasticsearch: Enable `ElasticsearchStore` to retrieve with the pure BM25 algorithm without vector search #19314

g-votte · 2024-03-20T05:43:08Z

Description

This pull request adds a BM25RetrievalStrategy to ElasticsearchStore. The strategy runs searches with BM25 alone, without any vector search. The usage example, motivation, and change details follow.

Usage Example of Introduced Feature

Passing BM25RetrievalStrategy as the strategy argument of the ElasticsearchStore constructor makes the store search with pure BM25 and no vector search. Note that the example below omits the embedding argument, so no embeddings are computed or used.

from langchain_elasticsearch.vectorstores import ElasticsearchStore

store = ElasticsearchStore(
    es_url="http://localhost:9200",
    index_name="test_index",
    strategy=ElasticsearchStore.BM25RetrievalStrategy(),
)

store.add_texts(
    [
        "foo",
        "foo bar",
        "foo bar baz",
        "bar",
        "bar baz",
        "baz"
    ],
)

results = store.similarity_search(query="foo", k=10)
print(results)

The example above outputs:

[Document(page_content='foo'), Document(page_content='foo bar'), Document(page_content='foo bar baz')]

Motivation

There is considerable demand for using Elasticsearch as a pure BM25 retriever without vector search. Hybrid search, which combines vector search and BM25, is already supported by ApproxRetrievalStrategy, but some cases call for BM25 alone, for example when search latency matters or when exact term matches should take priority over semantic similarity.

Pure BM25 retrievers, such as ElasticsearchBM25Retriever and BM25Retriever, have already been implemented in langchain_community. However, these classes do not offer the rich and flexible Elasticsearch features supported by ElasticsearchStore, such as various authentication options and flexible querying with custom_query. In addition, because ElasticsearchStore is a subclass of VectorStore, it works with components like RecordManager, which are useful once a system is in operation. Supporting pure BM25 search in ElasticsearchStore therefore brings significant benefits.

To achieve this, an easy-to-use interface is necessary. The ElasticsearchStore abstraction is already flexible enough to support pure BM25 without vector search, because it accepts any strategy class that inherits from BaseRetrievalStrategy. However, writing a BM25 strategy by subclassing BaseRetrievalStrategy can be challenging for general users (it took me several days). Without native support in the library, users are unlikely to arrive at this solution on their own.

Therefore, this PR suggests implementing a new retrieval strategy, BM25RetrievalStrategy, to enable ElasticsearchStore to support pure BM25 searches.
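
For illustration, the search body of a BM25-only strategy can be as small as a `match` query wrapped in a `bool` clause, with no kNN section at all. The sketch below is an assumption about the general shape of such a query, not the code in this PR: `build_bm25_query` is a hypothetical helper, and the default field name `text` and the filter field are made up for the example.

```python
from typing import Any, Dict, List, Optional


def build_bm25_query(
    query: str,
    text_field: str = "text",  # assumed field name, for illustration only
    filters: Optional[List[Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: build a lexical-only search body (no kNN clause)."""
    return {
        "query": {
            "bool": {
                # A `match` query on a text field is scored with the field's
                # similarity, which is BM25 by default in Elasticsearch.
                "must": [{"match": {text_field: {"query": query}}}],
                # Filters restrict the result set without affecting the score.
                "filter": filters or [],
            }
        }
    }


# Example: search for "foo", restricted by a made-up metadata filter.
body = build_bm25_query("foo", filters=[{"term": {"metadata.lang.keyword": "en"}}])
```

Because the body contains no vector clause, a store using a query like this has no need for an embedding model, which matches the usage example where the embedding argument is omitted.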

Change Details

- Implemented the BM25RetrievalStrategy class.
- Added a new argument, text_field, to the index method of BaseRetrievalStrategy.
  - Strictly speaking, this change could break backward compatibility for code that subclasses BaseRetrievalStrategy. I believe relatively few users inherit from BaseRetrievalStrategy directly in their projects, so the overall impact should be minimal. I am open to, and would appreciate, feedback on this point.
- Added integration test scenarios, named test_similarity_search_bm25_search*, to verify the behavior of BM25RetrievalStrategy. Following the example of the other retrieval strategy classes, the tests cover scenarios with and without the filter option.
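
One plausible reason an index step needs the text field name is to attach an explicit BM25 similarity to that field when the index is created. The sketch below assumes this use and is not taken from the PR: `build_bm25_index_body` and the similarity name `custom_bm25` are hypothetical, while `k1=1.2` and `b=0.75` are the Elasticsearch defaults for BM25.

```python
from typing import Any, Dict


def build_bm25_index_body(
    text_field: str = "text",  # assumed field name, for illustration only
    k1: float = 1.2,           # term-frequency saturation (Elasticsearch default)
    b: float = 0.75,           # length normalization (Elasticsearch default)
) -> Dict[str, Any]:
    """Hypothetical helper: index body that scores `text_field` with BM25."""
    return {
        "settings": {
            "index": {
                # Declare a named similarity so k1 and b can be tuned per index.
                "similarity": {"custom_bm25": {"type": "BM25", "k1": k1, "b": b}}
            }
        },
        "mappings": {
            "properties": {
                # The mapping must reference the field by name, which is why
                # the field name has to be available at index-creation time.
                text_field: {"type": "text", "similarity": "custom_bm25"}
            }
        },
    }


index_body = build_bm25_index_body(text_field="text")
```

A body like this could be passed to the index-creation call of the Elasticsearch client; with the default parameters it scores the same way as an unconfigured text field.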

baskaryan · 2024-03-23T00:56:09Z

cc @joemcelroy

joemcelroy · 2024-03-27T08:36:45Z