CrateDB vector: Refactor SQLAlchemy data model to provide two storage strategies #20

Draft · wants to merge 19 commits into cratedb from embedding-table-per-collection
Conversation

amotl commented Nov 22, 2023

About

This patch takes a stab at GH-12, aiming to implement a more advanced data model for storing collections and their associated embeddings. As suggested by @ckurze, it follows an "embedding-table-per-collection" strategy, instead of storing all embedding vectors in the same database table.

Installation

🚧 The preview version can be installed directly from the branch embedding-table-per-collection on GitHub. 🚧

pip install 'langchain[cratedb,openai] @ git+https://github.com/crate-workbench/langchain.git@embedding-table-per-collection#subdirectory=libs/langchain'

Usage

The default storage strategy is LANGCHAIN_PGVECTOR, which is the same as before. To configure an alternative storage strategy, invoke this snippet before doing any other operations on CrateDBVectorSearch:

from langchain.vectorstores import CrateDBVectorSearch
from langchain.vectorstores.cratedb import StorageStrategy

CrateDBVectorSearch.configure(
    storage_strategy=StorageStrategy.EMBEDDING_TABLE_PER_COLLECTION
)
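
After configuring the strategy, the store is used as before. A minimal follow-up sketch, assuming an OpenAI API key is available and that a CrateDB instance is reachable under the connection string shown (both are assumptions, adjust to your environment):

from langchain.embeddings import OpenAIEmbeddings

store = CrateDBVectorSearch.from_texts(
    ["foo", "bar"],
    OpenAIEmbeddings(),
    collection_name="demo",
    connection_string="crate://localhost:4200",  # assumption: local CrateDB
)
docs = store.similarity_search("foo", k=1)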

Details

The following storage strategies are implemented; the default is LANGCHAIN_PGVECTOR.

  • StorageStrategy.LANGCHAIN_PGVECTOR

    Reflects the vanilla way the pgvector adapter manages the data model: There is a single collection table and a single embedding table.

  • StorageStrategy.EMBEDDING_TABLE_PER_COLLECTION

    Reflects a more advanced way to manage the data model: There is a single collection table, and multiple embedding tables, one per collection.
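
For illustration, a small sketch of the per-collection effect with EMBEDDING_TABLE_PER_COLLECTION configured as shown above. FakeEmbeddings and the connection string are assumptions chosen to keep the example self-contained; the actual embedding table names are managed by the adapter:

from langchain.embeddings import FakeEmbeddings
from langchain.vectorstores import CrateDBVectorSearch

CONNECTION_STRING = "crate://localhost:4200"  # assumption: local CrateDB

# Each collection receives its own embedding table, so the two stores below
# keep their vectors in separate tables.
store_a = CrateDBVectorSearch.from_texts(
    ["alpha"], FakeEmbeddings(size=3),
    collection_name="collection_a", connection_string=CONNECTION_STRING,
)
store_b = CrateDBVectorSearch.from_texts(
    ["beta"], FakeEmbeddings(size=3),
    collection_name="collection_b", connection_string=CONNECTION_STRING,
)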

Backlog

  • Add a test case exercising different vector store handles with different dimension sizes at the same time.
  • Multi-collection querying, as introduced by GH-15 (CrateDB vector: Add CrateDBVectorSearchMultiCollection), is currently not supported by the new EMBEDDING_TABLE_PER_COLLECTION storage strategy.
  • Think about amending the vanilla data model to also store the associated vector/embedding table name within the CollectionStore database entity, in a new column, embedding_table.
  • For the second strategy, get rid of the collection table altogether; no index table is needed.
  • Also for the first strategy, get rid of the collection-level cmetadata field. It isn't used within the other implementations, so it is considered legacy.
  • Add documentation in the form of a Jupyter Notebook and an .mdx file.
  • Collect and evaluate user feedback.

amotl changed the title from "CrateDB vector: Refactor SQLAlchemy data model to provide two strategies" to "CrateDB vector: Refactor SQLAlchemy data model to provide two storage strategies" on Nov 22, 2023
@@ -247,6 +308,7 @@ def test_cratedb_with_filter_match() -> None:
assert output in [
[(Document(page_content="foo", metadata={"page": "0"}), 2.1307645)],
[(Document(page_content="foo", metadata={"page": "0"}), 2.3150668)],
[(Document(page_content="foo", metadata={"page": "0"}), 2.4458315)],
amotl (Author) commented Nov 22, 2023:


@matriv: That's just been a fluke on the _score-nondeterminism again ;], so I just added another variant to the list of possible outcomes.


Is there a possibility to use an approximation for the score?
https://stackoverflow.com/questions/8560131/pytest-assert-almost-equal
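
For illustration, the idea from the linked answer boils down to pytest.approx, which makes the assertion tolerant to small score differences (a sketch, not necessarily the actual patch):

import pytest

# The scores observed above (2.1307645, 2.3150668, 2.4458315) all fall within
# a narrow band, so one approximate assertion can replace the list of variants.
score = 2.3150668
assert score == pytest.approx(2.3, abs=0.2)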

amotl (Author) replied:

That is a good idea. I've submitted a corresponding patch. Thank you.

amotl force-pushed the embedding-table-per-collection branch 2 times, most recently from 4c5c422 to 5f3fcf9 on November 22, 2023
amotl added the enhancement label on November 22, 2023
amotl force-pushed the cratedb branch 2 times, most recently from 8ab8203 to ee0d1f5 on November 27, 2023
amotl force-pushed the embedding-table-per-collection branch from 5f3fcf9 to 78d6c9a on November 27, 2023
amotl changed the base branch from cratedb to fix-init-dimensionality on November 27, 2023
Base automatically changed from fix-init-dimensionality to cratedb on November 28, 2023
amotl force-pushed the cratedb branch 2 times, most recently from c39ee8d to f7ea8f0 on November 29, 2023
amotl force-pushed the embedding-table-per-collection branch from 78d6c9a to 500d755 on November 30, 2023

Commit messages

The implementation is based on the `pgvector` adapter, as both PostgreSQL and
CrateDB share similar attributes, and can be wrapped well by using the same
SQLAlchemy layer on top.

The implementation is based on the generic SQLAlchemy document loader.

The implementation is based on the generic `SQLChatMessageHistory`.

When not adding any embeddings upfront, the runtime model factory was
not able to derive the vector dimension size, because the SQLAlchemy
models had not been initialized correctly.
From now on, _all_ instances of SQLAlchemy model types will be created
at runtime through the `ModelFactory` utility.

By using `__table_args__ = {"keep_existing": True}` on the ORM entity
definitions, this seems to work well, even with multiple invocations
of `CrateDBVectorSearch.from_texts()` using different `collection_name`
argument values.
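
For context, a minimal sketch of what `keep_existing` means at the SQLAlchemy Core level; `__table_args__` simply forwards such options to the underlying `Table`. The names below are illustrative and not the adapter's actual model definitions:

from sqlalchemy import Column, Integer, MetaData, String, Table

metadata = MetaData()

def embedding_table(name: str) -> Table:
    # With keep_existing=True, a second call with the same name returns the
    # Table already registered in the metadata instead of raising
    # "Table ... is already defined for this MetaData instance".
    return Table(
        name,
        metadata,
        Column("id", Integer, primary_key=True),
        Column("document", String),
        keep_existing=True,
    )

table_1 = embedding_table("embedding_demo")
table_2 = embedding_table("embedding_demo")
assert table_1 is table_2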

While at it, this patch also fixes a few linter errors.

When deleting a collection, also delete its associated embeddings.

It is a special adapter which provides similarity search across multiple
collections. It cannot be used for indexing documents.

The CrateDB adapter works a bit differently compared to the pgvector
adapter it is building upon: the dimensionality of the vector field
needs to be specified at table creation time, while in LangChain it is
a runtime parameter, so table creation needs to be delayed.

In some cases, the tables do not exist yet, but this is only relevant
for the case when the user requests to pre-delete the collection, using
the `pre_delete_collection` argument. So, do the error handling only
there instead, and _not_ on the generic data model utility functions.
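
To make this concrete, a rough sketch of the idea with hypothetical helper names (not the adapter's actual functions): tables are created only once the vector dimensionality is known, and only the pre-delete path tolerates tables that do not exist yet.

from sqlalchemy import inspect

def create_tables(engine, model_factory, dimensions: int):
    # The vector column needs a fixed dimensionality, so table creation is
    # deferred until the first embeddings reveal the vector size.
    models = model_factory(dimensions=dimensions)
    models.metadata.create_all(engine)
    return models

def pre_delete_collection(engine, table_name: str) -> None:
    # Only this code path needs to handle "table does not exist yet".
    if inspect(engine).has_table(table_name):
        with engine.begin() as connection:
            connection.exec_driver_sql(f'DELETE FROM "{table_name}"')
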
amotl force-pushed the embedding-table-per-collection branch from 500d755 to 07ba7af on December 1, 2023
andnig commented Dec 1, 2023

When you delete a collection, would it make sense to drop the table of this collection when using this new storage strategy?
Currently when calling delete_collection(), the embeddings in this collection are deleted, but the table is still there.

A use case for why dropping might be good: someone adds embeddings, but then recognizes that they need different embedding dimensions. I assume they would delete the collection and re-add the new embeddings to a collection with the same name - this is then not possible, as the old embedding table is still there (with the now wrong embedding dimensions). What do you think about this?

Other than that I did some monkey testing and found the strategy reliable. Querying, deleting, being a little stupid all seem to result in the expected outcome. 👍

CrateDBVectorSearch.configure(
    storage_strategy=StorageStrategy.LANGCHAIN_PGVECTOR
)
vector_store = CrateDBVectorSearch.from_documents(
    pages, OpenAIEmbeddings(),
    collection_name="langchain_2023_11_30", connection_string=CONNECTION_STRING,
)
# Here I'd suspect that the table for this embedding is dropped instead of emptied
# (but only with the new storage strategy; for the default strategy this does not apply).
vector_store.delete_collection()

amotl force-pushed the cratedb branch 2 times, most recently from 2379986 to 4e00dcb on December 7, 2023
amotl commented Jan 7, 2024

> Other than that I did some monkey testing and found the strategy reliable. Querying, deleting, being a little stupid all seem to result in the expected outcome. 👍

Wonderful.

> When you delete a collection, would it make sense to drop the table of this collection when using this new storage strategy?

That's a good suggestion, thank you.
