CrateDB vector: Refactor SQLAlchemy data model to provide two storage strategies #20

Draft · wants to merge 19 commits into cratedb from embedding-table-per-collection
Conversation

amotl commented Nov 22, 2023

About

This patch takes a stab at GH-12, aiming to implement a more advanced data model for storing collections and their associated embeddings. As suggested by @ckurze, it follows an "embedding-table-per-collection" strategy, instead of storing all embedding vectors in the same database table.

Installation

🚧 The preview version can be installed directly from the branch embedding-table-per-collection on GitHub. 🚧

pip install 'langchain[cratedb,openai] @ git+https://github.com/crate-workbench/langchain.git@embedding-table-per-collection#subdirectory=libs/langchain'

Usage

The default storage strategy is LANGCHAIN_PGVECTOR, which is the same as before. To configure an alternative storage strategy, invoke this snippet before doing any other operations on CrateDBVectorSearch:

from langchain.vectorstores import CrateDBVectorSearch
from langchain.vectorstores.cratedb import StorageStrategy

CrateDBVectorSearch.configure(
    storage_strategy=StorageStrategy.EMBEDDING_TABLE_PER_COLLECTION
)
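
After configuring the strategy, the store is used as before. A minimal follow-up sketch, assuming an OpenAI API key is available and that a CrateDB instance is reachable under the connection string shown (both are assumptions, adjust to your environment):

from langchain.embeddings import OpenAIEmbeddings

store = CrateDBVectorSearch.from_texts(
    ["foo", "bar"],
    OpenAIEmbeddings(),
    collection_name="demo",
    connection_string="crate://localhost:4200",  # assumption: local CrateDB
)
docs = store.similarity_search("foo", k=1)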

Details

The following storage strategies are implemented; the default is LANGCHAIN_PGVECTOR.

  • StorageStrategy.LANGCHAIN_PGVECTOR

    Reflects the vanilla way the pgvector adapter manages the data model: There is a single collection table and a single embedding table.

  • StorageStrategy.EMBEDDING_TABLE_PER_COLLECTION

    Reflects a more advanced way to manage the data model: There is a single collection table, and multiple embedding tables, one per collection.
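
For illustration, a small sketch of the per-collection effect with EMBEDDING_TABLE_PER_COLLECTION configured as shown above. FakeEmbeddings and the connection string are assumptions chosen to keep the example self-contained; the actual embedding table names are managed by the adapter:

from langchain.embeddings import FakeEmbeddings
from langchain.vectorstores import CrateDBVectorSearch

CONNECTION_STRING = "crate://localhost:4200"  # assumption: local CrateDB

# Each collection receives its own embedding table, so the two stores below
# keep their vectors in separate tables.
store_a = CrateDBVectorSearch.from_texts(
    ["alpha"], FakeEmbeddings(size=3),
    collection_name="collection_a", connection_string=CONNECTION_STRING,
)
store_b = CrateDBVectorSearch.from_texts(
    ["beta"], FakeEmbeddings(size=3),
    collection_name="collection_b", connection_string=CONNECTION_STRING,
)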

Backlog

  • Add a test case exercising different vector store handles with different dimension sizes at the same time.
  • Multi-collection querying, as introduced by GH-15 (CrateDB vector: Add CrateDBVectorSearchMultiCollection), is currently not supported by the new EMBEDDING_TABLE_PER_COLLECTION storage strategy.
  • Think about amending the vanilla data model to also store the associated vector/embedding table name within the CollectionStore database entity, in a new column, embedding_table.
  • For the second strategy, get rid of the collection table altogether; no index table is needed.
  • Also for the first strategy, get rid of the collection-level cmetadata field. It isn't used within the other implementations, so it is considered legacy.
  • Add documentation in the form of a Jupyter Notebook and an .mdx file.
  • Collect and evaluate user feedback.

amotl changed the title from "CrateDB vector: Refactor SQLAlchemy data model to provide two strategies" to "CrateDB vector: Refactor SQLAlchemy data model to provide two storage strategies" on Nov 22, 2023
@@ -247,6 +308,7 @@ def test_cratedb_with_filter_match() -> None:
assert output in [
[(Document(page_content="foo", metadata={"page": "0"}), 2.1307645)],
[(Document(page_content="foo", metadata={"page": "0"}), 2.3150668)],
[(Document(page_content="foo", metadata={"page": "0"}), 2.4458315)],
amotl (Author) commented Nov 22, 2023:


@matriv: That's just been a fluke on the _score-nondeterminism again ;], so I just added another variant to the list of possible outcomes.


Is there a possibility to use an approximation for the score?
https://stackoverflow.com/questions/8560131/pytest-assert-almost-equal
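
For illustration, the idea from the linked answer boils down to pytest.approx, which makes the assertion tolerant to small score differences (a sketch, not necessarily the actual patch):

import pytest

# The scores observed above (2.1307645, 2.3150668, 2.4458315) all fall within
# a narrow band, so one approximate assertion can replace the list of variants.
score = 2.3150668
assert score == pytest.approx(2.3, abs=0.2)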

amotl (Author) replied:

That is a good idea. I've submitted a corresponding patch. Thank you.

amotl force-pushed the embedding-table-per-collection branch 2 times, most recently from 4c5c422 to 5f3fcf9 on November 22, 2023
amotl added the enhancement label on November 22, 2023
amotl force-pushed the cratedb branch 2 times, most recently from 8ab8203 to ee0d1f5 on November 27, 2023
amotl force-pushed the embedding-table-per-collection branch from 5f3fcf9 to 78d6c9a on November 27, 2023
amotl changed the base branch from cratedb to fix-init-dimensionality on November 27, 2023
Base automatically changed from fix-init-dimensionality to cratedb on November 28, 2023
amotl force-pushed the cratedb branch 2 times, most recently from c39ee8d to f7ea8f0 on November 29, 2023
amotl force-pushed the embedding-table-per-collection branch from 78d6c9a to 500d755 on November 30, 2023

Commit messages

The implementation is based on the `pgvector` adapter, as both PostgreSQL and
CrateDB share similar attributes, and can be wrapped well by using the same
SQLAlchemy layer on top.

The implementation is based on the generic SQLAlchemy document loader.

The implementation is based on the generic `SQLChatMessageHistory`.

When not adding any embeddings upfront, the runtime model factory was
not able to derive the vector dimension size, because the SQLAlchemy
models had not been initialized correctly.
From now on, _all_ instances of SQLAlchemy model types will be created
at runtime through the `ModelFactory` utility.

By using `__table_args__ = {"keep_existing": True}` on the ORM entity
definitions, this seems to work well, even with multiple invocations
of `CrateDBVectorSearch.from_texts()` using different `collection_name`
argument values.
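
For context, a minimal sketch of what `keep_existing` means at the SQLAlchemy Core level; `__table_args__` simply forwards such options to the underlying `Table`. The names below are illustrative and not the adapter's actual model definitions:

from sqlalchemy import Column, Integer, MetaData, String, Table

metadata = MetaData()

def embedding_table(name: str) -> Table:
    # With keep_existing=True, a second call with the same name returns the
    # Table already registered in the metadata instead of raising
    # "Table ... is already defined for this MetaData instance".
    return Table(
        name,
        metadata,
        Column("id", Integer, primary_key=True),
        Column("document", String),
        keep_existing=True,
    )

table_1 = embedding_table("embedding_demo")
table_2 = embedding_table("embedding_demo")
assert table_1 is table_2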

While at it, this patch also fixes a few linter errors.

When deleting a collection, also delete its associated embeddings.

It is a special adapter which provides similarity search across multiple
collections. It cannot be used for indexing documents.

The CrateDB adapter works a bit differently compared to the pgvector
adapter it is building upon: the dimensionality of the vector field
needs to be specified at table creation time, while in LangChain it is
a runtime parameter, so table creation needs to be delayed.

In some cases, the tables do not exist yet, but this is only relevant
for the case when the user requests to pre-delete the collection, using
the `pre_delete_collection` argument. So, do the error handling only
there instead, and _not_ on the generic data model utility functions.
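
To make this concrete, a rough sketch of the idea with hypothetical helper names (not the adapter's actual functions): tables are created only once the vector dimensionality is known, and only the pre-delete path tolerates tables that do not exist yet.

from sqlalchemy import inspect

def create_tables(engine, model_factory, dimensions: int):
    # The vector column needs a fixed dimensionality, so table creation is
    # deferred until the first embeddings reveal the vector size.
    models = model_factory(dimensions=dimensions)
    models.metadata.create_all(engine)
    return models

def pre_delete_collection(engine, table_name: str) -> None:
    # Only this code path needs to handle "table does not exist yet".
    if inspect(engine).has_table(table_name):
        with engine.begin() as connection:
            connection.exec_driver_sql(f'DELETE FROM "{table_name}"')
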
amotl force-pushed the embedding-table-per-collection branch from 500d755 to 07ba7af on December 1, 2023
andnig commented Dec 1, 2023

When you delete a collection, would it make sense to drop the table of this collection when using this new storage strategy?
Currently when calling delete_collection(), the embeddings in this collection are deleted, but the table is still there.

A use case for why dropping might be good: someone adds embeddings, but then recognizes that they need different embedding dimensions. I assume they would delete the collection and re-add the new embeddings to a collection with the same name - this is then not possible, as the old embedding table is still there (with the now wrong embedding dimensions). What do you think about this?

Other than that I did some monkey testing and found the strategy reliable. Querying, deleting, being a little stupid all seem to result in the expected outcome. 👍

CrateDBVectorSearch.configure(
    storage_strategy=StorageStrategy.LANGCHAIN_PGVECTOR
)
vector_store = CrateDBVectorSearch.from_documents(
    pages, OpenAIEmbeddings(),
    collection_name="langchain_2023_11_30", connection_string=CONNECTION_STRING,
)
# Here I'd suspect that the table for this embedding is dropped instead of emptied
# (but only with the new storage strategy; for the default strategy this does not apply).
vector_store.delete_collection()

amotl force-pushed the cratedb branch 2 times, most recently from 2379986 to 4e00dcb on December 7, 2023
amotl commented Jan 7, 2024

> Other than that I did some monkey testing and found the strategy reliable. Querying, deleting, being a little stupid all seem to result in the expected outcome. 👍

Wonderful.

> When you delete a collection, would it make sense to drop the table of this collection when using this new storage strategy?

That's a good suggestion, thank you.
