CrateDB vector: Refactor SQLAlchemy data model to provide two storage strategies #20

Draft: wants to merge 19 commits into base: cratedb

Commits on Dec 1, 2023

  1. Fix pytest option parsing

    amotl committed Dec 1, 2023
    c0e260f
  2. 249580f
  3. CrateDB vector: Add vector store support

    The implementation is based on the `pgvector` adapter, as both PostgreSQL and
    CrateDB share similar attributes, and can be wrapped well by using the same
    SQLAlchemy layer on top.
    amotl committed Dec 1, 2023
    b752717
  4. 6cab9b5
  5. Add SQLAlchemy document loader

    amotl committed Dec 1, 2023
    6444a46
  6. CrateDB loader: Add document loader support

    The implementation is based on the generic SQLAlchemy document loader.
    amotl committed Dec 1, 2023
    f494d64
  7. 5894310
  8. CrateDB memory: Add conversational memory support

    The implementation is based on the generic `SQLChatMessageHistory`.
    amotl committed Dec 1, 2023
    08f87b6
  9. CrateDB vector: Fix usage when only reading, and not storing

    When no embeddings were added upfront, the runtime model factory could
    not derive the vector dimension size, because the SQLAlchemy models
    had not been initialized correctly.
    amotl committed Dec 1, 2023
    901fdcc
  10. 33d81e3
  11. CrateDB vector: Improve SQLAlchemy model factory

    From now on, _all_ instances of SQLAlchemy model types will be created
    at runtime through the `ModelFactory` utility.
    
    By using `__table_args__ = {"keep_existing": True}` on the ORM entity
    definitions, this seems to work well, even with multiple invocations
    of `CrateDBVectorSearch.from_texts()` using different `collection_name`
    argument values.
    
    While at it, this patch also fixes a few linter errors.
    amotl committed Dec 1, 2023
    dfc9243
  12. CrateDB vector: Fix cascading deletes

    When deleting a collection, also delete its associated embeddings.
    amotl committed Dec 1, 2023
    0e7f16b
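A minimal sketch of the cascading-delete idea, using the standard library's `sqlite3` module as a stand-in for CrateDB. The table layout, the column names, and the `delete_collection` helper are assumptions for illustration, not the adapter's actual code. The delete is performed explicitly in application code, so it does not rely on any database-level cascade.

```python
import sqlite3


def delete_collection(conn: sqlite3.Connection, name: str) -> int:
    """Delete a collection and its embeddings; return the number of embeddings removed."""
    row = conn.execute(
        "SELECT uuid FROM collection WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        return 0  # unknown collection: nothing to delete
    collection_id = row[0]
    with conn:  # commit both deletes together, or roll back on error
        cursor = conn.execute(
            "DELETE FROM embedding WHERE collection_id = ?", (collection_id,)
        )
        removed = cursor.rowcount
        conn.execute("DELETE FROM collection WHERE uuid = ?", (collection_id,))
    return removed


# Hypothetical sample data: two collections, three embeddings.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE collection (uuid TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE embedding (uuid TEXT PRIMARY KEY, collection_id TEXT, document TEXT);
    INSERT INTO collection VALUES ('c1', 'docs'), ('c2', 'notes');
    INSERT INTO embedding VALUES
        ('e1', 'c1', 'foo'), ('e2', 'c1', 'bar'), ('e3', 'c2', 'baz');
    """
)
removed = delete_collection(conn, "docs")
remaining = conn.execute("SELECT COUNT(*) FROM embedding").fetchone()[0]
```

Deleting the embeddings before the collection row keeps the two tables consistent even if the second statement fails, because both run in one transaction.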
  13. CrateDB vector: Add CrateDBVectorSearchMultiCollection

    It is a special adapter which provides similarity search across multiple
    collections. It cannot be used for indexing documents.
    amotl committed Dec 1, 2023
    e5c947c
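The behaviour described in this commit can be sketched in plain Python. This is an in-memory illustration only: the `MultiCollectionSearch` class, its method signatures, the sample data, and the choice of Euclidean distance are assumptions, not the API of `CrateDBVectorSearchMultiCollection`.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Hit = Tuple[str, str, float]  # (collection name, document, distance)


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class MultiCollectionSearch:
    """Read-only search across several named collections (hypothetical sketch)."""

    def __init__(self, store: Dict[str, List[Tuple[str, Vector]]], names: List[str]):
        self.store = store
        self.names = names

    def add_texts(self, texts: List[str]) -> None:
        # With several collections selected, the target of an insert is ambiguous.
        raise NotImplementedError("This adapter can only be used for searching")

    def similarity_search(self, query: Vector, k: int = 4) -> List[Hit]:
        """Collect hits from every selected collection, then rank them together."""
        hits: List[Hit] = [
            (name, document, euclidean(query, vector))
            for name in self.names
            for document, vector in self.store[name]
        ]
        hits.sort(key=lambda hit: hit[2])  # smallest distance first
        return hits[:k]


# Hypothetical sample data: two collections with two documents each.
store = {
    "animals": [("cat", (1.0, 0.0)), ("dog", (0.9, 0.1))],
    "vehicles": [("car", (0.0, 1.0)), ("kart", (0.8, 0.0))],
}
search = MultiCollectionSearch(store, ["animals", "vehicles"])
results = search.similarity_search(query=(1.0, 0.0), k=3)
```

The ranking is computed over the union of all selected collections, so a close match in one collection can outrank weaker matches in another.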
  14. CrateDB vector: Improve SQLAlchemy data model query utility functions

    The CrateDB adapter works a bit differently from the pgvector adapter
    it builds upon: the dimensionality of the vector field needs to be
    specified at table creation time, but it is also a runtime parameter
    in LangChain, so the table creation needs to be delayed.
    
    In some cases, the tables do not exist yet, but this is only relevant
    when the user requests to pre-delete the collection, using the
    `pre_delete_collection` argument. So, do the error handling only
    there, and _not_ in the generic data model utility functions.
    amotl committed Dec 1, 2023
    2208963
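The two points of this commit message, delayed table creation and error handling only on the pre-delete path, can be sketched as follows. The sketch uses `sqlite3` as a stand-in for CrateDB; the `LazyVectorStore` class, its methods, and the table layout are hypothetical, and the `CHECK` constraint merely imitates a dimension that is fixed in the schema.

```python
import json
import sqlite3
from typing import List, Optional


class LazyVectorStore:
    """Creates its table only once the vector dimension is known (hypothetical sketch)."""

    def __init__(self, conn: sqlite3.Connection, pre_delete_collection: bool = False):
        self.conn = conn
        self.dimensions: Optional[int] = None
        if pre_delete_collection:
            try:
                self.delete_all()
            except sqlite3.OperationalError:
                # The table may not exist yet; only this code path tolerates that.
                # A real implementation would inspect the error more closely.
                pass

    def delete_all(self) -> None:
        # Generic utility: deliberately no error handling here.
        self.conn.execute("DELETE FROM embedding")

    def add_embedding(self, vector: List[float]) -> None:
        if self.dimensions is None:
            # First write: the dimension is known now, so create the table.
            self.dimensions = len(vector)
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS embedding ("
                "id INTEGER PRIMARY KEY, vector TEXT NOT NULL, "
                f"dimensions INTEGER NOT NULL CHECK (dimensions = {self.dimensions}))"
            )
        self.conn.execute(
            "INSERT INTO embedding (vector, dimensions) VALUES (?, ?)",
            (json.dumps(vector), len(vector)),
        )


conn = sqlite3.connect(":memory:")
store = LazyVectorStore(conn, pre_delete_collection=True)  # no table yet, no error
store.add_embedding([0.1, 0.2, 0.3])  # table is created with dimension 3
```

Keeping `delete_all` free of error handling means a missing table still surfaces as an error everywhere except the pre-delete path, where it is expected.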
  15. d8429f7
  16. pgvector: Use SA's bulk_save_objects method for inserting embeddings

    The performance gains can be substantial.
    amotl committed Dec 1, 2023
    02cab14
  17. CrateDB vector: Test non-deterministic values by using pytest.approx

    The test cases can be written substantially more elegantly.
    amotl committed Dec 1, 2023
    bcd304b
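To illustrate why approximate comparison helps when testing similarity scores, here is a small sketch. It uses `math.isclose` from the standard library as a stand-in for `pytest.approx`, so that it runs without pytest; the `cosine_similarity` helper and the sample vectors are assumptions for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


score = cosine_similarity((1.0, 1.0), (1.0, 0.0))  # 1 / sqrt(2)

# An exact comparison against a rounded literal fails ...
exact_match = score == 0.7071
# ... while a comparison with a relative tolerance succeeds.
approx_match = math.isclose(score, 0.7071, rel_tol=1e-4)
```

With pytest available, the tolerant comparison would typically be written as an assertion against `pytest.approx(0.7071, rel=1e-4)`.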
  18. dd64cd4
  19. CrateDB vector: Refactor SQLAlchemy data model to provide two strategies

    - StorageStrategy.LANGCHAIN_PGVECTOR
    
      Reflects the vanilla way the pgvector adapter manages the data model:
      There is a single `collection` table and a single `embedding` table.
    
    - StorageStrategy.EMBEDDING_TABLE_PER_COLLECTION
    
      Reflects a more advanced way to manage the data model: There is a
      single `collection` table, and multiple `embedding` tables, one per
      collection.
    
    The default storage strategy is `LANGCHAIN_PGVECTOR`. To configure an
    alternative storage strategy, invoke this snippet before doing any
    other operations using `CrateDBVectorSearch`:
    
      CrateDBVectorSearch.configure(
        storage_strategy=StorageStrategy.EMBEDDING_TABLE_PER_COLLECTION
      )
    amotl committed Dec 1, 2023
    07ba7af
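How the two storage strategies could map a collection to its embedding table can be sketched like this. Only the two member names of `StorageStrategy` and the `configure(storage_strategy=...)` call shape come from the commit message. The enum values, the `VectorSearchSketch` class, and the `embedding_<collection>` naming scheme are assumptions for illustration.

```python
from enum import Enum


class StorageStrategy(Enum):
    # Member names are taken from the commit message; the values are assumptions.
    LANGCHAIN_PGVECTOR = "langchain_pgvector"
    EMBEDDING_TABLE_PER_COLLECTION = "embedding_table_per_collection"


class VectorSearchSketch:
    """Hypothetical stand-in for a class-level strategy switch."""

    storage_strategy = StorageStrategy.LANGCHAIN_PGVECTOR  # the default

    @classmethod
    def configure(cls, storage_strategy: StorageStrategy) -> None:
        # Class-level setting: it affects all instances, so set it first.
        cls.storage_strategy = storage_strategy

    @classmethod
    def embedding_table_name(cls, collection_name: str) -> str:
        if cls.storage_strategy is StorageStrategy.EMBEDDING_TABLE_PER_COLLECTION:
            # One embedding table per collection (hypothetical naming scheme).
            return f"embedding_{collection_name}"
        # Single shared embedding table for all collections.
        return "embedding"


default_table = VectorSearchSketch.embedding_table_name("docs")
VectorSearchSketch.configure(
    storage_strategy=StorageStrategy.EMBEDDING_TABLE_PER_COLLECTION
)
per_collection_table = VectorSearchSketch.embedding_table_name("docs")
```

Because the strategy is held at class level in this sketch, switching it after tables have been created would point later operations at different tables, which matches the advice to configure it before any other operation.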