document_store.update_embeddings seems to update embeddings regardless of parameter #39

Open
theoky opened this issue Nov 27, 2023 · 1 comment

Comments

theoky commented Nov 27, 2023

I'm using qdrant-haystack 1.0.11 with farm-haystack==1.21.2 and Python 3.10.13 on Windows 10, with Qdrant running in Docker.

When updating the embeddings of a document store, document_store.update_embeddings seems to update all embeddings even when update_existing_embeddings is set to False.

I'm running this code:

import timeit
from haystack import Document
from haystack.nodes import EmbeddingRetriever
from qdrant_haystack.document_stores import QdrantDocumentStore

def update_embeddings(existing):
    document_store.update_embeddings(retriever, update_existing_embeddings=existing)
    
document_store = QdrantDocumentStore(url="localhost", index="test_update_embeddings",
                                    embedding_dim=512, similarity="cosine")

retriever = EmbeddingRetriever(document_store=document_store,
                               embedding_model="sentence-transformers/distiluse-base-multilingual-cased-v1",
                               use_gpu=False)

docs_to_index = [Document(content=str(i) + " random text"*100) for i in range(0, 50)]

document_store.write_documents(docs_to_index, duplicate_documents="skip")

res_upd = timeit.timeit(stmt='update_embeddings(True)', globals=globals(), number=2) 
res_noupd = timeit.timeit(stmt='update_embeddings(False)', globals=globals(), number=2)

print(f"Execution with update: {res_upd}, with no update: {res_noupd}")

After execution, the Qdrant database contains 50 vectors, as expected.

I would also expect update_embeddings(False) to run significantly faster than update_embeddings(True), but both calls take nearly the same time:
Execution with update: 22.15771689999383, with no update: 20.913242900016485

To me this looks like update_embeddings(..., update_existing_embeddings=False) is updating the embeddings, too.

What am I missing?

theoky commented Dec 1, 2023

I've just found this comment in the relevant source file:

:param update_existing_embeddings: Not used by QdrantDocumentStore, as all the points
                                   must have a corresponding vector in Qdrant.

So update_embeddings does not work for my use case:

  • Precondition: Qdrant contains x documents and their corresponding embeddings
  • Actions
    • Get n new documents
    • Write the n documents to Qdrant
    • Update the embeddings of only the n new documents using update_embeddings

A working alternative would be:

  • Precondition: Qdrant contains x documents and their corresponding embeddings
  • Actions
    • Get n new documents
    • Create the n new embeddings manually, one for each new document
    • Write the n documents to Qdrant together with their embeddings (as far as I understand, write_documents does not check the validity of the embeddings)
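
The working alternative above could be sketched roughly as follows. This is only a minimal sketch and has not been run against a live Qdrant instance. It assumes the retriever exposes embed_documents (as EmbeddingRetriever in farm-haystack 1.x does), returning one vector per document in input order, and that each document accepts an embedding attribute. The helper name write_with_embeddings and its batch_size parameter are hypothetical.

```python
from typing import Any, Sequence


def write_with_embeddings(
    document_store: Any,
    retriever: Any,
    new_docs: Sequence[Any],
    batch_size: int = 32,  # hypothetical default, tune as needed
) -> int:
    """Embed only `new_docs` and write them; documents already stored are not touched."""
    written = 0
    for start in range(0, len(new_docs), batch_size):
        batch = list(new_docs[start:start + batch_size])
        # Assumption: one embedding per document, returned in input order.
        embeddings = retriever.embed_documents(batch)
        if len(embeddings) != len(batch):
            raise ValueError("retriever returned a different number of embeddings than documents")
        # Attach each vector to its document before writing.
        for doc, embedding in zip(batch, embeddings):
            doc.embedding = embedding
        # Same duplicate handling as in the original report.
        document_store.write_documents(batch, duplicate_documents="skip")
        written += len(batch)
    return written
```

With the objects from the first post, this would be called as write_with_embeddings(document_store, retriever, docs_to_index), so update_embeddings is not needed for newly added documents.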

So update_embeddings is basically only useful when I change the model that generates the embeddings? To me, this seems to run somewhat against the intent of having a simple pipeline.
