partners, elasticsearch: Enable ElasticsearchStore to retrieve with the pure BM25 algorithm without vector search #19314

Closed

@g-votte g-votte commented Mar 20, 2024

Description

This pull request proposes the implementation of the BM25RetrievalStrategy for ElasticsearchStore. This retrieval strategy enables searches purely based on BM25 without involving vector search. Below, the usage example, motivation, and details of the changes are discussed.

Usage Example of Introduced Feature

By specifying the BM25RetrievalStrategy as a constructor argument for ElasticsearchStore, users can perform searches using pure BM25 without vector search. Note that in the example below, the embedding option is not specified, indicating that the search is conducted without using embeddings.

from langchain_elasticsearch.vectorstores import ElasticsearchStore

store = ElasticsearchStore(
    es_url="http://localhost:9200",
    index_name="test_index",
    strategy=ElasticsearchStore.BM25RetrievalStrategy(),
)

store.add_texts(
    [
        "foo",
        "foo bar",
        "foo bar baz",
        "bar",
        "bar baz",
        "baz"
    ],
)

results = store.similarity_search(query="foo", k=10)
print(results)

The example above outputs:

[Document(page_content='foo'), Document(page_content='foo bar'), Document(page_content='foo bar baz')]

Motivation

There is considerable demand for using Elasticsearch as a pure BM25 retriever without vector search. Although hybrid search combining vector search and BM25 is supported through ApproxRetrievalStrategy, there are cases where pure BM25 search is needed, for example, when search speed matters or when exact matches should take priority over semantic matches.

Pure BM25 retrievers, such as ElasticsearchBM25Retriever and BM25Retriever, have already been implemented in langchain_community. However, these classes do not offer the rich and flexible Elasticsearch features supported by ElasticsearchStore, such as various authentication options and flexible querying with custom_query. Additionally, because ElasticsearchStore is a subclass of VectorStore, it works with components like RecordManager, which are useful during operational phases. Supporting pure BM25 searches in ElasticsearchStore therefore offers significant benefits.

To achieve this, an easy-to-use interface is necessary. The abstraction of ElasticsearchStore is already flexible enough to support pure BM25 without vector search, since it accepts a strategy class inheriting from BaseRetrievalStrategy. However, implementing a BM25 search by inheriting from BaseRetrievalStrategy can be challenging for general users (it took me several days). Without native support from the library, it is difficult for users to arrive at this solution.

Therefore, this PR suggests implementing a new retrieval strategy, BM25RetrievalStrategy, to enable ElasticsearchStore to support pure BM25 searches.

Change Details

  • Implemented the BM25RetrievalStrategy class.
  • Added a new argument, text_field, to the index method of BaseRetrievalStrategy.
    • Strictly speaking, this change could affect backward compatibility. However, I believe few users directly inherit from BaseRetrievalStrategy in their projects, so the overall impact should be minimal. Nonetheless, I would appreciate any feedback on this matter.
  • Added integration test scenarios to verify the behavior of BM25RetrievalStrategy, named test_similarity_search_bm25_search*. Following the example of other retrieval strategy classes, scenarios with and without using the filter option were tested.
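
As a rough illustration of what "pure BM25, with and without a filter" means at the query level, the sketch below builds a request body with a `match` clause and no kNN clause. The function name `build_bm25_query`, the default field name `text`, and the exact body shape are assumptions made for this example; they are not taken from the PR diff.

```python
from typing import Any, Dict, List, Optional


def build_bm25_query(
    query: str,
    text_field: str = "text",  # assumed field name, for illustration only
    filter_clauses: Optional[List[Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Build a search body that ranks documents by lexical relevance only."""
    return {
        "query": {
            "bool": {
                # A `match` query on a text field is scored by that field's
                # similarity setting (BM25 unless configured otherwise).
                "must": [{"match": {text_field: {"query": query}}}],
                # Filter clauses narrow the candidate set without scoring.
                "filter": filter_clauses or [],
            }
        }
    }


if __name__ == "__main__":
    print(build_bm25_query("foo"))
    print(build_bm25_query("foo", filter_clauses=[{"term": {"metadata.page": 1}}]))
```

Because no embedding is involved, a body like this can be sent without configuring an embedding model, which matches the usage example in the description.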

@efriis efriis added the partner label Mar 20, 2024
@efriis efriis self-assigned this Mar 20, 2024
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Mar 20, 2024
@dosubot dosubot bot added Ɑ: retriever Related to retriever module 🔌: elasticsearch Primarily related to elastic/elasticsearch integrations 🤖:improvement Medium size change to existing code to handle new use-cases labels Mar 20, 2024
@baskaryan
Collaborator

cc @joemcelroy

"settings": {
    "similarity": {
        "custom_bm25": {
            "type": "BM25",
Contributor

is being able to specify similarity options a requirement for you?

Comment on lines 1367 to 1368
k1: Optional. Default is 2.0. This corresponds to the BM25 parameter, k1.
b: Optional. Default is 0.75. This corresponds to the BM25 parameter, b.
Contributor

What is the reasoning behind these defaults? It might make more sense to stick to Elasticsearch's defaults:
https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html#bm25

Contributor

We do not know if these defaults can change in the future, so it's best to not apply any hardcoded defaults in here, the ES service should pick its own defaults, whatever they are.

Contributor Author

Similar to my previous comment, I agree that the default values should be None. I have made this modification in this commit: 31f7909.
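
To make the role of k1 and b in this thread concrete, here is a minimal sketch of the textbook BM25 per-term score. It is an illustration only: the function name is hypothetical, and the implementation inside Elasticsearch (Lucene) differs in details such as constant factors and how document length is encoded. Both parameters are required arguments here so that the sketch does not suggest any particular default.

```python
def bm25_term_score(
    tf: float,
    doc_len: float,
    avg_doc_len: float,
    idf: float,
    k1: float,
    b: float,
) -> float:
    """Textbook BM25 contribution of one query term to one document's score."""
    # b controls how strongly the score is normalized by document length:
    # b = 0 ignores length, b = 1 normalizes fully.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # k1 controls term-frequency saturation: the score grows with tf but
    # approaches idf * (k1 + 1) as tf becomes large.
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


if __name__ == "__main__":
    for tf in (1, 2, 5, 50):
        print(tf, bm25_term_score(tf, 100, 100, idf=1.0, k1=1.2, b=0.75))
```

Changing either parameter changes the ranking, which is one reason to leave them unset unless the user asks for specific values.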


@staticmethod
def BM25RetrievalStrategy(
    k1: float = 2.0, b: float = 0.75
Contributor

might consider setting the defaults on the init function, so that BM25RetrievalStrategy() is simple to use as well

Contributor

I suggest making these two arguments None by default, and only include them in the mapping definition when the user provided custom values. That would allow the ES service to apply its own defaults for the general case.

Contributor Author

I set these default values because ElasticsearchBM25Retriever uses them. However, I completely agree with @miguelgrinberg's point that the defaults should be None, so that Elasticsearch's own default values are used unless the user explicitly specifies others. This keeps the behavior of ElasticsearchStore consistent with Elasticsearch itself.

@@ -70,6 +70,7 @@ def index(
    self,
    dims_length: Union[int, None],
    vector_query_field: str,
    text_field: str,
Contributor

@miguelgrinberg miguelgrinberg Mar 27, 2024

Suggestion: this could be str | list[str], and when a list is provided a multi_match query is used.

Contributor

this is fine as a single field as developers are not able to search on multiple fields within a VectorStore.

Contributor Author

As @joemcelroy mentioned, this is because of the limitations of VectorStore and BaseRetrievalStrategy. I also hope ElasticsearchStore will support multi-match searches in the future, but let's keep that outside this PR and treat it as future work.
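
For reference, the multi-field suggestion discussed in this thread (and deferred as future work) could look roughly like the sketch below. This is hypothetical: `build_text_query` is not part of the PR or of ElasticsearchStore, and it only shows how a single field name could map to a `match` query while a list of field names maps to a `multi_match` query.

```python
from typing import Any, Dict, List, Union


def build_text_query(query: str, text_field: Union[str, List[str]]) -> Dict[str, Any]:
    """Return a match query for one field, or a multi_match query for several."""
    if isinstance(text_field, str):
        # Single field: plain full-text match.
        return {"match": {text_field: {"query": query}}}
    # Several fields: search all of them with one multi_match clause.
    return {"multi_match": {"query": query, "fields": list(text_field)}}


if __name__ == "__main__":
    print(build_text_query("foo", "text"))
    print(build_text_query("foo", ["title", "text"]))
```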

class BM25RetrievalStrategy(BaseRetrievalStrategy):
    """Retrieval strategy using the native BM25 algorithm of Elasticsearch."""

    def __init__(self, k1: float, b: float):
Contributor

The k1 and b arguments should default to None.

Contributor Author

Thanks. I addressed this in commit a2eb03a.

"custom_bm25": {
    "type": "BM25",
    "k1": self.k1,
    "b": self.b,
Contributor

The k1 and b settings should be given in the mapping only when custom values have been passed. If they are None then they should not be included, to allow Elasticsearch to use its own defaults.

Contributor Author

Thank you for pointing that out. I have addressed it in this commit: 31f7909
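
The behavior agreed on in this thread can be sketched as follows: k1 and b default to None and are written into the index settings only when the caller supplies them, so that Elasticsearch applies its own defaults otherwise. The function name `build_bm25_index_body` and the default field name `text` are assumptions for this example; the code in the referenced commits may be organized differently.

```python
from typing import Any, Dict, Optional


def build_bm25_index_body(
    text_field: str = "text",  # assumed field name, for illustration only
    k1: Optional[float] = None,
    b: Optional[float] = None,
) -> Dict[str, Any]:
    """Build index settings and mappings for a BM25-scored text field."""
    similarity: Dict[str, Any] = {"type": "BM25"}
    # Only include parameters the caller set explicitly; omitted keys
    # leave the choice of default to the Elasticsearch service.
    if k1 is not None:
        similarity["k1"] = k1
    if b is not None:
        similarity["b"] = b
    return {
        "settings": {"similarity": {"custom_bm25": similarity}},
        "mappings": {
            "properties": {
                text_field: {"type": "text", "similarity": "custom_bm25"}
            }
        },
    }


if __name__ == "__main__":
    print(build_bm25_index_body())
    print(build_bm25_index_body(k1=1.5))
```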

@joemcelroy
Contributor

Thanks so much for your contribution. One last thing: could you document this too? https://github.com/langchain-ai/langchain/blob/master/docs/docs/integrations/vectorstores/elasticsearch.ipynb

@g-votte
Contributor Author

g-votte commented Mar 28, 2024

@g-votte
Contributor Author

g-votte commented Mar 29, 2024

Hm, libs/partner/elasticsearch has been removed from the repository and moved to https://github.com/langchain-ai/langchain-elastic

@joemcelroy @miguelgrinberg @maxjakob @baskaryan @efriis
May I close this PR and create a new one with the same changes in the new langchain-elastic repository?
As I'm not sure about the roadmap or timeline of the repository migration, let me know if there is a preferred time to move this PR.

@baskaryan
Collaborator

> Hm, libs/partner/elasticsearch has been removed from the repository and moved to https://github.com/langchain-ai/langchain-elastic
>
> @joemcelroy @miguelgrinberg @maxjakob @baskaryan @efriis May I close this PR and create a new one with the same changes in the new langchain-elastic repository? As I'm not sure about the roadmap or timeline of the repository migration, let me know if there is a preferred time to move this PR.

yep moving to langchain-elastic makes sense!

@g-votte
Contributor Author

g-votte commented Mar 31, 2024

Thank you for your comment. I have recreated the PR in the langchain-elastic repository: langchain-ai/langchain-elastic#6
Please review it at your convenience. @joemcelroy @miguelgrinberg @maxjakob @baskaryan @efriis

I will now close this PR in the LangChain main repository.

@g-votte g-votte closed this Mar 31, 2024
baskaryan pushed a commit that referenced this pull request Apr 9, 2024
…#20098)

This pull request follows up on
#19314 and
langchain-ai/langchain-elastic#6, adding
documentation for the `ElasticsearchStore.BM25RetrievalStrategy`.

Like other retrieval strategies, we are now introducing
BM25RetrievalStrategy.

### Background
- The `BM25RetrievalStrategy` has been introduced to `langchain-elastic`
via the pull request
langchain-ai/langchain-elastic#6.
- This PR was initially created in the main `langchain` repository but
was moved to `langchain-elastic` during the review process due to the
migration of the partner package.
- The original PR can be found at
#19314.
- As
[commented](#19314 (comment))
by @joemcelroy, documenting the new retrieval strategy is part of the
requirements for its introduction.

Although the `BM25RetrievalStrategy` has been merged into
`langchain-elastic`, its documentation is still to be maintained in the
main `langchain` repository. Therefore, this pull request adds the
documentation portion of `BM25RetrievalStrategy`.

The content of the documentation remains the same as that included in
the original PR, #19314.

---------

Co-authored-by: Max Jakob <[email protected]>
hinthornw pushed a commit that referenced this pull request Apr 26, 2024
…#20098)

Labels
🔌: elasticsearch Primarily related to elastic/elasticsearch integrations 🤖:improvement Medium size change to existing code to handle new use-cases partner Ɑ: retriever Related to retriever module size:L This PR changes 100-499 lines, ignoring generated files.
6 participants