renamed MetadataDenormalizer into ShreddingTransformer #132

Merged
merged 1 commit into from
Feb 7, 2025
8 changes: 4 additions & 4 deletions docs/guide/adapters.md
@@ -15,7 +15,7 @@ __Supported__

__Collection Support__

: Indicates whether the store supports lists in metadata values or not. Stores which do not support it directly (:material-alert-circle:{.yellow}) can be used by applying the [MetadataDenormalizer][langchain_graph_retriever.transformers.metadata_denormalizer.MetadataDenormalizer] document transformer to documents before writing, which spreads the items of the collection into multiple metadata keys.
: Indicates whether the store supports lists in metadata values or not. Stores which do not support it directly (:material-alert-circle:{.yellow}) can be used by applying the [ShreddingTransformer][langchain_graph_retriever.transformers.ShreddingTransformer] document transformer to documents before writing, which spreads the items of the collection into multiple metadata keys.
Collaborator

Should this be formatted as code (backticks)? Or should we not use them elsewhere?

Collaborator Author

yes I can add 'em
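
The shredding idea described above, spreading each item of a list into its own metadata key, can be sketched roughly as follows. This is only an illustration: the function name `shred_metadata`, the `→` delimiter, and the `"§"` marker value are assumptions made for this example and are not taken from the `ShreddingTransformer` implementation.

```python
from typing import Any

# Hypothetical delimiter and marker value, chosen for illustration only.
DELIMITER = "→"
MARKER = "§"


def shred_metadata(metadata: dict[str, Any]) -> dict[str, Any]:
    """Spread each list-valued entry into one flat key per list item."""
    shredded: dict[str, Any] = {}
    for key, value in metadata.items():
        if isinstance(value, list):
            # One flat key per item, so a store without list support can
            # still match a single item with a plain equality filter.
            for item in value:
                shredded[f"{key}{DELIMITER}{item}"] = MARKER
        else:
            shredded[key] = value
    return shredded


example = shred_metadata({"habitat": "forest", "keywords": ["small", "agile"]})
print(example)
```

Encoding the item in the key rather than the value is what lets a store that only supports scalar metadata answer "does this list contain X" as a single key lookup.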


__Combined Adjacent Query__

@@ -43,12 +43,12 @@ supports operating on metadata containing both primitive and list values. Additi

### Apache Cassandra {: #cassandra}

[Apache Cassandra](https://cassandra.apache.org/) is supported by the [`CassandraAdapter`][langchain_graph_retriever.adapters.cassandra.CassandraAdapter]. The adapter requires denormalizing metadata containing lists in order to use them as edges. It does not combine the adjacent query.
[Apache Cassandra](https://cassandra.apache.org/) is supported by the [`CassandraAdapter`][langchain_graph_retriever.adapters.cassandra.CassandraAdapter]. The adapter requires shredding metadata containing lists in order to use them as edges. It does not combine the adjacent query.

### Chroma

[Chroma](https://www.trychroma.com/) is supported by the [`ChromaAdapter`][langchain_graph_retriever.adapters.chroma.ChromaAdapter]. The adapter requires denormalizing metadata containing lists in order to use them as edges. It does not combine the adjacent query.
[Chroma](https://www.trychroma.com/) is supported by the [`ChromaAdapter`][langchain_graph_retriever.adapters.chroma.ChromaAdapter]. The adapter requires shredding metadata containing lists in order to use them as edges. It does not combine the adjacent query.

## Implementation

The [Adapter][graph_retriever.adapters.Adapter] interface may be implemented directly. For LangChain [VectorStores][langchain_core.vectorstores.base.VectorStore], [LangchainAdapter][langchain_graph_retriever.adapters.langchain.LangchainAdapter] and [DenormalizedAdapter][langchain_graph_retriever.adapters.langchain.DenormalizedAdapter] provide much of the necessary functionality.
The [Adapter][graph_retriever.adapters.Adapter] interface may be implemented directly. For LangChain [VectorStores][langchain_core.vectorstores.base.VectorStore], [LangchainAdapter][langchain_graph_retriever.adapters.langchain.LangchainAdapter] and [ShreddedLangchainAdapter][langchain_graph_retriever.adapters.langchain.ShreddedLangchainAdapter] provide much of the necessary functionality.
Collaborator

It may make sense to be consistent on ShreddingTransformer vs. ShreddedLangchainAdapter (e.g., ShreddingLangchainAdapter?)

Collaborator Author

the adapters use already shredded documents. That is why I used Shredded instead of Shredding.

But I'll change to make consistent if you feel that is better.

36 changes: 16 additions & 20 deletions docs/guide/get-started.md
@@ -81,22 +81,20 @@ The following shows how to populate a variety of vector stores with the animal d
```python
from langchain_community.vectorstores.cassandra import Cassandra
from langchain_openai import OpenAIEmbeddings
from langchain_graph_retriever.transformers.metadata_denormalizer import (
MetadataDenormalizer,
)
from langchain_graph_retriever.transformers import ShreddingTransformer

metadata_denormalizer = MetadataDenormalizer() # (1)!
shredder = ShreddingTransformer() # (1)!
vector_store = Cassandra.from_documents(
documents=list(metadata_denormalizer.transform_documents(animals)),
documents=list(shredder.transform_documents(animals)),
embedding=OpenAIEmbeddings(),
table_name="animals",
)
```

1. Since Cassandra doesn't index items in lists for querying, it is necessary to
denormalize metadata containing list to be queried. By default, the
[MetadataDenormalizer][langchain_graph_retriever.transformers.metadata_denormalizer.MetadataDenormalizer]
denormalizes all keys. It may be configured to only denormalize those
shred metadata containing lists that are to be queried. By default, the
[ShreddingTransformer][langchain_graph_retriever.transformers.ShreddingTransformer]
shreds all keys. It may be configured to only shred those
metadata keys used as edge targets.

=== "OpenSearch"
@@ -119,22 +117,20 @@ The following shows how to populate a variety of vector stores with the animal d
```python
from langchain_chroma.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_graph_retriever.transformers.metadata_denormalizer import (
MetadataDenormalizer,
)
from langchain_graph_retriever.transformers import ShreddingTransformer

metadata_denormalizer = MetadataDenormalizer() # (1)!
shredder = ShreddingTransformer() # (1)!
vector_store = Chroma.from_documents(
documents=list(metadata_denormalizer.transform_documents(animals)),
documents=list(shredder.transform_documents(animals)),
embedding=OpenAIEmbeddings(),
collection_name="animals",
)
```

1. Since Chroma doesn't index items in lists for querying, it is necessary to
denormalize metadata containing list to be queried. By default, the
[MetadataDenormalizer][langchain_graph_retriever.transformers.metadata_denormalizer.MetadataDenormalizer]
denormalizes all keys. It may be configured to only denormalize those
shred metadata containing lists that are to be queried. By default, the
[ShreddingTransformer][langchain_graph_retriever.transformers.ShreddingTransformer]
shreds all keys. It may be configured to only shred those
metadata keys used as edge targets.

## Simple Traversal
@@ -162,7 +158,7 @@ For our first retrieval and graph traversal, we're going to start with a single
from langchain_graph_retriever.adapters.cassandra import CassandraAdapter

simple = GraphRetriever(
store = CassandraAdapter(vector_store, metadata_denormalizer, {"keywords"}),
store = CassandraAdapter(vector_store, shredder, {"keywords"}),
edges = [("habitat", "habitat"), ("origin", "origin"), ("keywords", "keywords")],
strategy = Eager(k=10, start_k=1, depth=2),
)
@@ -190,15 +186,15 @@ For our first retrieval and graph traversal, we're going to start with a single
from langchain_graph_retriever.adapters.chroma import ChromaAdapter

simple = GraphRetriever(
store = ChromaAdapter(vector_store, metadata_denormalizer, {"keywords"}),
store = ChromaAdapter(vector_store, shredder, {"keywords"}),
edges = [("habitat", "habitat"), ("origin", "origin"), ("keywords", "keywords")],
strategy = Eager(k=10, start_k=1, depth=2),
)
```

!!! note "Denormalization"
!!! note "Shredding"

The above code is exactly the same for all stores; however, adapters for denormalized stores (Chroma and Apache Cassandra) require configuration to specify which metadata fields need to be rewritten when issuing queries.
The above code is exactly the same for all stores; however, adapters for shredded stores (Chroma and Apache Cassandra) require configuration to specify which metadata fields need to be rewritten when issuing queries.

The above creates a graph traversing retriever that starts with the nearest animal (`start_k=1`), retrieves 10 documents (`k=10`) and limits the search to documents that are at most 2 steps away from the first animal (`depth=2`).
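
How `k`, `start_k`, and `depth` bound a traversal can be pictured with a small in-memory sketch. It is not the `Eager` strategy's implementation: the `LINKS` data and the `traverse` function are invented for this example, and the starting ids are passed in directly rather than found by a similarity search.

```python
from collections import deque

# Toy adjacency data: each document id maps to the ids it shares an edge with.
# The ids and links are invented for this example.
LINKS = {
    "fox": ["wolf", "hare"],
    "wolf": ["fox", "bear"],
    "hare": ["fox"],
    "bear": ["wolf", "moose"],
    "moose": ["bear"],
}


def traverse(start: list[str], k: int, depth: int) -> list[str]:
    """Breadth-first walk from the start ids, bounded by k results and depth steps."""
    seen = set(start)
    selected: list[str] = []
    queue = deque((doc_id, 0) for doc_id in start)
    while queue and len(selected) < k:
        doc_id, steps = queue.popleft()
        selected.append(doc_id)
        if steps == depth:
            continue  # Do not expand past the depth limit.
        for neighbor in LINKS.get(doc_id, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, steps + 1))
    return selected


# One starting document (like start_k=1), at most 10 results, at most 2 steps.
result = traverse(start=["fox"], k=10, depth=2)
print(result)
```

In this toy graph `moose` is three steps from `fox`, so it is left out even though fewer than `k` documents were collected.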

@@ -45,9 +45,9 @@ class MetadataEdgeFunction:
"""
Helper for extracting and encoding edges in metadata.

This class provides tools to extract incoming and outgoing edges from document
metadata and normalize metadata where needed. Both incoming and outgoing edges
use the same target name, enabling equality matching for keys.
This class provides tools to extract incoming and outgoing edges from
document metadata. Both incoming and outgoing edges use the same target
name, enabling equality matching for keys.

Parameters
----------
@@ -13,10 +13,10 @@
from langchain_core.documents import Document

from langchain_graph_retriever._conversion import METADATA_EMBEDDING_KEY
from langchain_graph_retriever.adapters.langchain import DenormalizedAdapter
from langchain_graph_retriever.adapters.langchain import ShreddedLangchainAdapter


class CassandraAdapter(DenormalizedAdapter[Cassandra]):
class CassandraAdapter(ShreddedLangchainAdapter[Cassandra]):
"""
Adapter for the [Apache Cassandra](https://cassandra.apache.org/) vector store.

@@ -28,9 +28,9 @@ class CassandraAdapter(DenormalizedAdapter[Cassandra]):
----------
vector_store :
The Cassandra vector store instance.
metadata_denormalizer: MetadataDenormalizer, optional
An instance of the MetadataDenormalizer used for doc insertion.
If not passed then a default instance of MetadataDenormalizer is used.
shredder: ShreddingTransformer, optional
An instance of the ShreddingTransformer used for doc insertion.
If not passed then a default instance of ShreddingTransformer is used.
"""

@override
@@ -7,7 +7,7 @@
from typing_extensions import override

from langchain_graph_retriever._conversion import METADATA_EMBEDDING_KEY
from langchain_graph_retriever.adapters.langchain import DenormalizedAdapter
from langchain_graph_retriever.adapters.langchain import ShreddedLangchainAdapter

try:
from langchain_chroma import Chroma
@@ -16,7 +16,7 @@
raise ImportError(msg)


class ChromaAdapter(DenormalizedAdapter[Chroma]):
class ChromaAdapter(ShreddedLangchainAdapter[Chroma]):
"""
Adapter for [Chroma](https://www.trychroma.com/) vector store.

@@ -27,9 +27,9 @@ class ChromaAdapter(DenormalizedAdapter[Chroma]):
----------
vector_store :
The Chroma vector store instance.
metadata_denormalizer: MetadataDenormalizer, optional
An instance of the MetadataDenormalizer used for doc insertion.
If not passed then a default instance of MetadataDenormalizer is used.
shredder: ShreddingTransformer, optional
An instance of the ShreddingTransformer used for doc insertion.
If not passed then a default instance of ShreddingTransformer is used.
"""

@override
@@ -16,9 +16,7 @@
)

from langchain_graph_retriever._conversion import doc_to_content
from langchain_graph_retriever.transformers.metadata_denormalizer import (
MetadataDenormalizer,
)
from langchain_graph_retriever.transformers import ShreddingTransformer

StoreT = TypeVar("StoreT", bound=VectorStore)

@@ -392,7 +390,7 @@ def _metadata_filter(
return metadata_filter


class DenormalizedAdapter(LangchainAdapter[StoreT]):
class ShreddedLangchainAdapter(LangchainAdapter[StoreT]):
"""
Base adapter for integrating vector stores with the graph retriever system.

@@ -404,26 +402,22 @@ class DenormalizedAdapter(LangchainAdapter[StoreT]):
----------
vector_store :
The vector store instance.
metadata_denormalizer: MetadataDenormalizer, optional
An instance of the MetadataDenormalizer used for doc insertion.
If not passed then a default instance of MetadataDenormalizer is used.
shredder: ShreddingTransformer, optional
An instance of the ShreddingTransformer used for doc insertion.
If not passed then a default instance of ShreddingTransformer is used.
nested_metadata_fields: set[str]
The set of metadata fields that contain nested values.
"""

def __init__(
self,
vector_store: StoreT,
metadata_denormalizer: MetadataDenormalizer | None = None,
shredder: ShreddingTransformer | None = None,
nested_metadata_fields: set[str] = set(),
):
"""Initialize the base adapter."""
super().__init__(vector_store=vector_store)
self.metadata_denormalizer = (
MetadataDenormalizer()
if metadata_denormalizer is None
else metadata_denormalizer
)
self.shredder = ShreddingTransformer() if shredder is None else shredder
self.nested_metadata_fields = nested_metadata_fields

@override
@@ -433,17 +427,17 @@ def update_filter_hook(
if filter is None:
return None

denormalized_filter = {}
shredded_filter = {}
for key, value in filter.items():
if key in self.nested_metadata_fields:
denormalized_filter[
self.metadata_denormalizer.denormalized_key(key, value)
] = self.metadata_denormalizer.denormalized_value()
shredded_filter[self.shredder.shredded_key(key, value)] = (
self.shredder.shredded_value()
)
else:
denormalized_filter[key] = value
return denormalized_filter
shredded_filter[key] = value
return shredded_filter

@override
def format_documents_hook(self, docs: list[Document]) -> list[Content]:
normalized = list(self.metadata_denormalizer.revert_documents(documents=docs))
return super().format_documents_hook(normalized)
restored = list(self.shredder.restore_documents(documents=docs))
return super().format_documents_hook(restored)
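
The two hooks above rewrite query filters into shredded form and restore shredded documents on the way out. A standalone sketch of that round trip on plain dictionaries might look like the following. The names `rewrite_filter` and `restore_metadata`, the `→` delimiter, and the `"§"` marker are assumptions made for this example; the real key and value encoding is whatever `ShreddingTransformer` defines.

```python
from typing import Any

# Hypothetical delimiter and marker value, chosen for illustration only.
DELIMITER = "→"
MARKER = "§"


def rewrite_filter(
    metadata_filter: dict[str, Any], nested_fields: set[str]
) -> dict[str, Any]:
    """Rewrite equality filters on list-valued fields to their shredded keys."""
    rewritten: dict[str, Any] = {}
    for key, value in metadata_filter.items():
        if key in nested_fields:
            # {"keywords": "agile"} becomes {"keywords→agile": "§"}.
            rewritten[f"{key}{DELIMITER}{value}"] = MARKER
        else:
            rewritten[key] = value
    return rewritten


def restore_metadata(shredded: dict[str, Any]) -> dict[str, Any]:
    """Collect shredded keys back into lists, leaving other entries untouched."""
    restored: dict[str, Any] = {}
    for key, value in shredded.items():
        if DELIMITER in key and value == MARKER:
            field, item = key.split(DELIMITER, 1)
            restored.setdefault(field, []).append(item)
        else:
            restored[key] = value
    return restored


query = rewrite_filter({"habitat": "forest", "keywords": "agile"}, {"keywords"})
stored = restore_metadata(
    {"habitat": "forest", "keywords→small": "§", "keywords→agile": "§"}
)
print(query)
print(stored)
```

Note that this sketch restores every item as a string, since the item is carried in the key; a real transformer would need extra bookkeeping to preserve the original item types.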
@@ -4,6 +4,12 @@
Many of these add metadata that could be useful for linking content, such as
extracting named entities or keywords from the page content.

Also includes a transform for denormalizing metadata, for use with stores
Also includes a transformer for shredding metadata, for use with stores
that do not support querying on elements of lists.
"""

from .shredding import ShreddingTransformer

__all__ = [
"ShreddingTransformer",
]