Replies: 2 comments
-
🤖 Based on your code snippet, you are already using the `insert` method. However, to ensure that the new data is saved and the existing data is not overwritten, you need to persist the index after inserting the new documents. You can do this by calling the `persist` method:

```python
for doc in docs:
    new_index.insert(doc)
    # Persist the index after inserting the new document
    new_index.storage_context.persist("./stor")
    print(f"Moving to already indexed folder {doc}")
    new_path = os.path.join(indexed_dir_path, f)
    shutil.move(new_path_for_index, new_path)
```

Please note that persisting the index after inserting each document might be slow if you have a large number of documents. If that is the case, consider persisting the index after inserting a batch of documents instead of after each one. I hope this helps! If you have any other questions, feel free to ask.
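To illustrate the batching suggestion above, here is a minimal, self-contained sketch. `FakeIndex` and `index_in_batches` are hypothetical stand-ins for illustration only; the real object would be a llama_index `VectorStoreIndex`, whose persist call is `index.storage_context.persist(path)`.

```python
class FakeIndex:
    """Hypothetical stand-in for a vector index (not the llama_index API)."""
    def __init__(self):
        self.docs = []
        self.persist_calls = 0

    def insert(self, doc):
        self.docs.append(doc)

    def persist(self, path):
        # In llama_index this would be index.storage_context.persist(path)
        self.persist_calls += 1

def index_in_batches(index, docs, batch_size, persist_dir="./stor"):
    """Insert docs one by one, but persist only once per batch."""
    for i in range(0, len(docs), batch_size):
        for doc in docs[i:i + batch_size]:
            index.insert(doc)
        index.persist(persist_dir)  # one persist per batch, not per doc

index = FakeIndex()
index_in_batches(index, [f"doc{n}" for n in range(12)], batch_size=5)
print(len(index.docs), index.persist_calls)  # 12 docs inserted, 3 persists
```

With 12 documents and a batch size of 5, this persists only 3 times instead of 12, which is the point of the batching advice.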
-
May I know how to have the documents added to the current index without overwriting it? I am using:

```python
new_index = VectorStoreIndex.from_documents([], vectorstore=vector_store,
                                            storage_context=storage_context,
                                            service_context=service_context,
                                            show_progress=True)
files = os.listdir(dir_path)
num_batches = len(files) // batch_size + (len(files) % batch_size > 0)
for i in range(num_batches):
```
-
Question Validation
Question
```python
import os
from pathlib import Path
import shutil
from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    load_index_from_storage,
    StorageContext,
)
from IPython.display import Markdown, display

new_index = VectorStoreIndex.from_documents([], vectorstore=vector_store,
                                            storage_context=storage_context,
                                            service_context=service_context,
                                            show_progress=True)
new_index.set_index_id("vector_index")
new_index.storage_context.persist("./stor")

batch_size = 5
# Get the list of files from the directory
files = os.listdir(dir_path)
# Calculate the number of batches
num_batches = len(files) // batch_size + (len(files) % batch_size > 0)
for i in range(num_batches):
    # Select files for the current batch
    batch_files = files[i*batch_size : (i+1)*batch_size]
```

Here I try to index one file at a time using `index.insert`. My question is: how do I save the new data into the current index instead of overwriting it?
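The `num_batches` expression in the snippet is a ceiling division: the integer division rounds down, and the boolean term adds 1 when a partial final batch remains. A quick standalone check, using made-up file names:

```python
# Hypothetical directory listing, for illustration only.
files = [f"file{n}.txt" for n in range(13)]
batch_size = 5

# len(files) // batch_size rounds down; (len(files) % batch_size > 0)
# contributes 1 when there is a leftover partial batch.
num_batches = len(files) // batch_size + (len(files) % batch_size > 0)
print(num_batches)  # 13 files in batches of 5 -> 3 batches

batches = [files[i*batch_size : (i+1)*batch_size] for i in range(num_batches)]
print([len(b) for b in batches])  # [5, 5, 3]
```

The same result could be written `-(-len(files) // batch_size)` or `math.ceil(len(files) / batch_size)`; the boolean-addition form works because Python's `True` equals 1 in arithmetic.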