Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Document table for storing original loaded documents #867

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

Maximilian-Winter
Copy link
Contributor

Please describe the purpose of this pull request.
Adds the things discussed in this issue: #770

Have you tested this PR?
No, wasn't sure about how to test it.

Related issues or PRs
#770

Is your PR over 500 lines of code?
No

@Maximilian-Winter Maximilian-Winter changed the title feat: Document table for storing original loaded documents feat: Document table for storing original loaded documents Jan 20, 2024
@cpacker cpacker marked this pull request as ready for review January 21, 2024 09:16
Copy link
Collaborator

@sarahwooders sarahwooders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Maximilian-Winter could you please also add a test for the document store in test_load_archival.py (maybe around here https://github.com/cpacker/MemGPT/blob/main/tests/test_load_archival.py#L138) to ensure documents are properly inserted? I think in the test case example, we'd expect a single document (for the single) file to be inserted, and would alos want to check if the passages retrieved match the document ID.

memgpt/cli/cli_load.py Show resolved Hide resolved
@Maximilian-Winter
Copy link
Contributor Author

@sarahwooders I think we should add the test to test_storage.py because the Document store is a database as I have implement it not a archival memory if I get the term archival memory correct. But the passage would be part of the archival test, right?

@sarahwooders
Copy link
Collaborator

@sarahwooders I think we should add the test to test_storage.py because the Document store is a database as I have implement it not a archival memory if I get the term archival memory correct. But the passage would be part of the archival test, right?

Ah yeah I think ideally we could update both tests - the test_load_archival.py, we'd just want to check to make sure the loaded documents and passages match up with the data thats loaded. And for test_storage.py would test to make sure inserting documents works fine (similar to how passages are also tested).

@Maximilian-Winter
Copy link
Contributor Author

Maximilian-Winter commented Jan 22, 2024

@sarahwooders I checked again and SimpleWebpageReader returns one document with the complete text when using it with one page. But SimpleDirectoryReader returns a list of document chunks, when using it with one document. I found a work around by creating a new llama index document with the complete text and passing that to the store_docs function. Should also work as expected with multiple documents.

@Maximilian-Winter
Copy link
Contributor Author

@sarahwooders I also added the necessary tests.

Copy link
Collaborator

@sarahwooders sarahwooders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few minor comments, and we should get the tests to pass -- but should be close to merging soon!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to add TableType.DOCUMENTS here and also for SQLLiteStorageConnector?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bug on our end (and causing tests for fail) -- TableType.DATA_SOURCES doesn't exist anymore so if you remove this the tests should pass.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please leave a comment on why you're doing doc.text[2:] for future reference?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, do we not need to do the same thing for loading webpages?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do this because the SimpleDirectoryReader adds two new lines in the chunks.

@Maximilian-Winter
Copy link
Contributor Author

Maximilian-Winter commented Jan 22, 2024

@sarahwooders I gonna add the comments about the doc.text[2:] later today. And try to do the rest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants