I found a similar solved discussion on how to check if a document exists before inserting it into the index. Here is the relevant code snippet:

    def _handle_upserts(
        self,
        nodes: Sequence[BaseNode],
        store_doc_text: bool = True,
    ) -> Sequence[BaseNode]:
        """Handle docstore upserts by checking hashes and ids."""
        assert self.docstore is not None

        doc_ids_from_nodes = set()
        deduped_nodes_to_run = {}
        for node in nodes:
            ref_doc_id = node.ref_doc_id if node.ref_doc_id else node.id_
            doc_ids_from_nodes.add(ref_doc_id)
            existing_hash = self.docstore.get_document_hash(ref_doc_id)
            if not existing_hash:
                # document doesn't exist, so add it
                deduped_nodes_to_run[ref_doc_id] = node
            elif existing_hash and existing_hash != node.hash:
                # document exists but its content changed, so replace it
                self.docstore.delete_ref_doc(ref_doc_id, raise_error=False)
                if self.vector_store is not None:
                    self.vector_store.delete(ref_doc_id)
                deduped_nodes_to_run[ref_doc_id] = node
            else:
                continue  # document exists and is unchanged, so skip it

        nodes_to_run = list(deduped_nodes_to_run.values())
        self.docstore.add_documents(nodes_to_run, store_text=store_doc_text)
        return nodes_to_run

This method checks the document store for existing document IDs and hashes before any documents are added or updated [1][2].
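To make the three branches in the snippet easier to follow, here is a minimal, self-contained sketch of the same hash comparison. It does not use the library itself: `Node`, `InMemoryDocStore`, and `dedupe_for_upsert` are hypothetical stand-ins invented for illustration, and only the method name `get_document_hash` is borrowed from the snippet above. The sketch also records the new hash itself, a step the snippet leaves to the docstore.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a node: a document id plus a content hash."""
    doc_id: str
    hash: str


class InMemoryDocStore:
    """Hypothetical stand-in for a docstore that tracks one hash per doc id."""

    def __init__(self) -> None:
        self._hashes: Dict[str, str] = {}

    def get_document_hash(self, doc_id: str) -> Optional[str]:
        # Returns None when the doc id has never been stored.
        return self._hashes.get(doc_id)

    def set_document_hash(self, doc_id: str, doc_hash: str) -> None:
        self._hashes[doc_id] = doc_hash


def dedupe_for_upsert(nodes: List[Node], docstore: InMemoryDocStore) -> List[Node]:
    """Return only the nodes that are new or whose hash has changed."""
    to_run: Dict[str, Node] = {}
    for node in nodes:
        existing_hash = docstore.get_document_hash(node.doc_id)
        if existing_hash == node.hash:
            continue  # already stored and unchanged, so skip it
        # Either a new document (no hash yet) or a changed one (hash differs).
        to_run[node.doc_id] = node
    # Record the hashes so the next call sees these documents as existing.
    for node in to_run.values():
        docstore.set_document_hash(node.doc_id, node.hash)
    return list(to_run.values())


store = InMemoryDocStore()
first = dedupe_for_upsert([Node("a", "h1"), Node("b", "h2")], store)
second = dedupe_for_upsert([Node("a", "h1"), Node("b", "h2-edited")], store)
print([n.doc_id for n in first])   # ['a', 'b']
print([n.doc_id for n in second])  # ['b']
```

On the second call, document `a` is skipped because its hash is unchanged, while `b` is processed again because its hash differs. Checking whether a given doc ID already exists therefore amounts to asking the docstore for its hash and testing for a missing value.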
How could we check if a doc_id already exists before inserting via IngestionPipeline?