feat: Document table for storing original loaded documents #867
base: main
@Maximilian-Winter could you please also add a test for the document store in test_load_archival.py
(maybe around here: https://github.com/cpacker/MemGPT/blob/main/tests/test_load_archival.py#L138) to ensure documents are properly inserted? I think in the test case example we'd expect a single document (for the single file) to be inserted, and we would also want to check that the passages retrieved match the document ID.
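For reference, the shape of the check requested above could be sketched as follows. This is only an illustration against a hypothetical in-memory store: `Document`, `Passage`, `InMemoryStore`, `load_file`, and `check_single_file_load` are made-up stand-ins, not MemGPT's storage connector API, and the real test would query the configured backend instead.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical record for an original, un-chunked document."""
    text: str
    id: uuid.UUID = field(default_factory=uuid.uuid4)


@dataclass
class Passage:
    """Hypothetical record for one chunk; doc_id points back to its Document."""
    text: str
    doc_id: uuid.UUID


class InMemoryStore:
    """Hypothetical stand-in for a storage connector (assumption, not MemGPT's API)."""

    def __init__(self):
        self.documents = []
        self.passages = []

    def load_file(self, text, chunk_size=20):
        # Store the full document once, then one passage per fixed-size chunk.
        doc = Document(text=text)
        self.documents.append(doc)
        for start in range(0, len(text), chunk_size):
            chunk = text[start:start + chunk_size]
            self.passages.append(Passage(text=chunk, doc_id=doc.id))
        return doc


def check_single_file_load(store, text):
    """Load one file and verify the two properties mentioned in the review."""
    doc = store.load_file(text)
    # Exactly one document is expected for the single file.
    assert len(store.documents) == 1
    # Every retrieved passage should reference that document's ID.
    assert all(p.doc_id == doc.id for p in store.passages)
    return doc
```

In a real test the two assertions would stay the same; only the store and the loader call would be replaced by the project's own.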
@sarahwooders I think we should add the test to test_storage.py, because the document store as I have implemented it is a database, not an archival memory, if I understand the term "archival memory" correctly. But the passage check would be part of the archival test, right?
Ah yeah I think ideally we could update both tests - the
@sarahwooders I checked again: SimpleWebpageReader returns one document with the complete text when used with one page, but SimpleDirectoryReader returns a list of document chunks when used with one document. I found a workaround: creating a new llama index document with the complete text and passing that to the store_docs function. It should also work as expected with multiple documents.
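A minimal sketch of the workaround described above, under stated assumptions: `Chunk` is a hypothetical stand-in for the reader's chunk object, grouping by `file_name` and joining with an empty string are guesses about how the chunks map back to files, and the actual PR builds a llama index document rather than this dataclass.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Chunk:
    """Hypothetical stand-in for one chunk returned by a directory reader."""
    text: str
    file_name: str


def merge_chunks(chunks: List[Chunk]) -> List[Chunk]:
    """Rebuild one full-text document per source file from its chunks.

    Chunks are grouped by file_name and concatenated in the order they
    were returned (dicts preserve insertion order in Python 3.7+).
    """
    parts_by_file: Dict[str, List[str]] = {}
    for chunk in chunks:
        parts_by_file.setdefault(chunk.file_name, []).append(chunk.text)
    return [
        Chunk(text="".join(parts), file_name=name)
        for name, parts in parts_by_file.items()
    ]
```

The merged objects are what would then be handed to the document store, while the original chunks continue to feed the passage table.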
@sarahwooders I also added the necessary tests.
Left a few minor comments, and we should get the tests to pass -- but should be close to merging soon!
memgpt/agent_store/db.py
Outdated
I think you need to add TableType.DOCUMENTS here and also for SQLLiteStorageConnector?
memgpt/agent_store/db.py
Outdated
This is a bug on our end (and is causing tests to fail) -- TableType.DATA_SOURCES doesn't exist anymore, so if you remove this the tests should pass.
Could you please leave a comment on why you're doing doc.text[2:] for future reference?
Also, do we not need to do the same thing for loading webpages?
I do this because the SimpleDirectoryReader adds two newlines at the start of the chunks.
@sarahwooders I'm going to add the comment about doc.text[2:] later today, and try to do the rest.
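One possible way to make the doc.text[2:] slice self-documenting, and safe to reuse for the webpage loader, is to strip the prefix only when it is actually present. This is a sketch, not the code in the PR; `strip_leading_newlines` is a hypothetical helper name.

```python
def strip_leading_newlines(text: str, count: int = 2) -> str:
    """Drop exactly `count` leading newlines, but only if they are present.

    A bare text[2:] removes the first two characters unconditionally, so it
    would eat real content from a loader that does not add the newlines.
    """
    prefix = "\n" * count
    return text[len(prefix):] if text.startswith(prefix) else text
```

With this guard, text that starts with two newlines loses them, and text without the prefix is returned unchanged.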
Please describe the purpose of this pull request.
Adds the things discussed in this issue: #770
Have you tested this PR?
No, wasn't sure about how to test it.
Related issues or PRs
#770
Is your PR over 500 lines of code?
No