Use JinaAI models for embeddings (#14252)
* add generic onnx model class and use jina ai clip models for all embeddings

* fix merge conflict

* add generic onnx model class and use jina ai clip models for all embeddings

* fix merge conflict

* preferred providers

* fix paths

* disable download progress bar

* remove logging of path

* drop and recreate tables on reindex

* use cache paths

* fix model name

* use trust remote code per transformers docs

* ensure tokenizer and feature extractor are correctly loaded

* revert

* manually download and cache feature extractor config

* remove unneeded

* remove old clip and minilm code

* docs update
hawkeye217 authored Oct 9, 2024
1 parent dbeaf43 commit d492562
Showing 7 changed files with 277 additions and 331 deletions.
10 changes: 4 additions & 6 deletions docs/docs/configuration/semantic_search.md
@@ -5,7 +5,7 @@ title: Using Semantic Search

Semantic Search in Frigate allows you to find tracked objects within your review items using either the image itself, a user-defined text description, or an automatically generated one. This feature works by creating _embeddings_ — numerical vector representations — for both the images and text descriptions of your tracked objects. By comparing these embeddings, Frigate assesses their similarities to deliver relevant search results.

Frigate has support for two models to create embeddings, both of which run locally: [OpenAI CLIP](https://openai.com/research/clip) and [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2). Embeddings are then saved to Frigate's database.
Frigate has support for [Jina AI's CLIP model](https://huggingface.co/jinaai/jina-clip-v1) to create embeddings, which runs locally. Embeddings are then saved to Frigate's database.

Semantic Search is accessed via the _Explore_ view in the Frigate UI.

@@ -27,13 +27,11 @@ If you are enabling the Search feature for the first time, be advised that Friga

:::

### OpenAI CLIP
### Jina AI CLIP

This model is able to embed both images and text into the same vector space, which allows `image -> image` and `text -> image` similarity searches. Frigate uses this model on tracked objects to encode the thumbnail image and store it in the database. When searching for tracked objects via text in the search box, Frigate will perform a `text -> image` similarity search against this embedding. When clicking "Find Similar" in the tracked object detail pane, Frigate will perform an `image -> image` similarity search to retrieve the closest matching thumbnails.
The vision model is able to embed both images and text into the same vector space, which allows `image -> image` and `text -> image` similarity searches. Frigate uses this model on tracked objects to encode the thumbnail image and store it in the database. When searching for tracked objects via text in the search box, Frigate will perform a `text -> image` similarity search against this embedding. When clicking "Find Similar" in the tracked object detail pane, Frigate will perform an `image -> image` similarity search to retrieve the closest matching thumbnails.

### all-MiniLM-L6-v2

This is a sentence embedding model that has been fine tuned on over 1 billion sentence pairs. This model is used to embed tracked object descriptions and perform searches against them. Descriptions can be created, viewed, and modified on the Search page when clicking on the gray tracked object chip at the top left of each review item. See [the Generative AI docs](/configuration/genai.md) for more information on how to automatically generate tracked object descriptions.
The text model is used to embed tracked object descriptions and perform searches against them. Descriptions can be created, viewed, and modified on the Search page when clicking on the gray tracked object chip at the top left of each review item. See [the Generative AI docs](/configuration/genai.md) for more information on how to automatically generate tracked object descriptions.
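
For illustration only, the following sketch shows why a shared vector space makes `text -> image` search possible: the query text and the stored thumbnails become vectors of the same dimension, and ranking is just a similarity computation. The random vectors stand in for real model output, and `cosine_similarity` and the event names are illustrative; this is not Frigate's code.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Higher cosine similarity means a closer semantic match.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random 768-dimensional vectors standing in for CLIP text/image embeddings.
rng = np.random.default_rng(0)
query_vec = rng.standard_normal(768)  # embedding of the search text
thumb_vecs = {f"event_{i}": rng.standard_normal(768) for i in range(5)}

# text -> image search: rank thumbnails by similarity to the text query.
ranked = sorted(
    thumb_vecs,
    key=lambda k: cosine_similarity(query_vec, thumb_vecs[k]),
    reverse=True,
)
print(ranked)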

## Usage

2 changes: 1 addition & 1 deletion frigate/embeddings/__init__.py
@@ -73,7 +73,7 @@ class EmbeddingsContext:
def __init__(self, db: SqliteVecQueueDatabase):
self.embeddings = Embeddings(db)
self.thumb_stats = ZScoreNormalization()
self.desc_stats = ZScoreNormalization(scale_factor=3, bias=-2.5)
self.desc_stats = ZScoreNormalization()

# load stats from disk
try:
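For context on the change above: ZScoreNormalization rescales raw search distances so thumbnail and description results can be scored on a comparable scale, and with both embeddings now coming from the same Jina model the description statistics drop the MiniLM-specific scale factor and bias. A rough sketch of the underlying idea, assuming a running mean and standard deviation (Welford's algorithm); Frigate's actual class may differ in detail:

import math

class ZScoreNormalization:
    """Convert raw distances to z-scores using running statistics."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations

    def normalize(self, distances: list) -> list:
        for d in distances:
            self.n += 1
            delta = d - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (d - self.mean)
        std = math.sqrt(self.m2 / self.n) if self.n > 1 else 0.0
        return [(d - self.mean) / std if std > 0 else 0.0 for d in distances]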
86 changes: 66 additions & 20 deletions frigate/embeddings/embeddings.py
@@ -7,6 +7,7 @@
import time
from typing import List, Tuple, Union

import numpy as np
from PIL import Image
from playhouse.shortcuts import model_to_dict

@@ -16,8 +17,7 @@
from frigate.models import Event
from frigate.types import ModelStatusTypesEnum

from .functions.clip import ClipEmbedding
from .functions.minilm_l6_v2 import MiniLMEmbedding
from .functions.onnx import GenericONNXEmbedding

logger = logging.getLogger(__name__)

@@ -53,9 +53,23 @@ def get_metadata(event: Event) -> dict:
)


def serialize(vector: List[float]) -> bytes:
"""Serializes a list of floats into a compact "raw bytes" format"""
return struct.pack("%sf" % len(vector), *vector)
def serialize(vector: Union[List[float], np.ndarray, float]) -> bytes:
"""Serializes a list of floats, numpy array, or single float into a compact "raw bytes" format"""
if isinstance(vector, np.ndarray):
# Convert numpy array to list of floats
vector = vector.flatten().tolist()
elif isinstance(vector, (float, np.float32, np.float64)):
# Handle single float values
vector = [vector]
elif not isinstance(vector, list):
raise TypeError(
f"Input must be a list of floats, a numpy array, or a single float. Got {type(vector)}"
)

try:
return struct.pack("%sf" % len(vector), *vector)
except struct.error as e:
raise ValueError(f"Failed to pack vector: {e}. Vector: {vector}")


def deserialize(bytes_data: bytes) -> List[float]:
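The body of deserialize is folded out of this view. Assuming it is simply the inverse of serialize above (unpacking the packed float32 bytes back into a list), a round trip looks like the sketch below; the function body here is a guess at the folded code, not a quotation of it.

import struct
from typing import List

def deserialize(bytes_data: bytes) -> List[float]:
    """Unpack the raw float32 bytes produced by serialize() back into a list."""
    return list(struct.unpack("%sf" % (len(bytes_data) // 4), bytes_data))

# Round trip: pack three floats, then unpack them again.
packed = struct.pack("3f", 0.1, 0.2, 0.3)
assert [round(v, 4) for v in deserialize(packed)] == [0.1, 0.2, 0.3]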
@@ -74,10 +88,10 @@ def __init__(self, db: SqliteVecQueueDatabase) -> None:
self._create_tables()

models = [
"sentence-transformers/all-MiniLM-L6-v2-model.onnx",
"sentence-transformers/all-MiniLM-L6-v2-tokenizer",
"clip-clip_image_model_vitb32.onnx",
"clip-clip_text_model_vitb32.onnx",
"jinaai/jina-clip-v1-text_model_fp16.onnx",
"jinaai/jina-clip-v1-tokenizer",
"jinaai/jina-clip-v1-vision_model_fp16.onnx",
"jinaai/jina-clip-v1-preprocessor_config.json",
]

for model in models:
@@ -89,10 +103,33 @@ def __init__(self, db: SqliteVecQueueDatabase) -> None:
},
)

self.clip_embedding = ClipEmbedding(
preferred_providers=["CPUExecutionProvider"]
def jina_text_embedding_function(outputs):
return outputs[0]

def jina_vision_embedding_function(outputs):
return outputs[0]

self.text_embedding = GenericONNXEmbedding(
model_name="jinaai/jina-clip-v1",
model_file="text_model_fp16.onnx",
tokenizer_file="tokenizer",
download_urls={
"text_model_fp16.onnx": "https://huggingface.co/jinaai/jina-clip-v1/resolve/main/onnx/text_model_fp16.onnx",
},
embedding_function=jina_text_embedding_function,
model_type="text",
preferred_providers=["CPUExecutionProvider"],
)
self.minilm_embedding = MiniLMEmbedding(

self.vision_embedding = GenericONNXEmbedding(
model_name="jinaai/jina-clip-v1",
model_file="vision_model_fp16.onnx",
download_urls={
"vision_model_fp16.onnx": "https://huggingface.co/jinaai/jina-clip-v1/resolve/main/onnx/vision_model_fp16.onnx",
"preprocessor_config.json": "https://huggingface.co/jinaai/jina-clip-v1/resolve/main/preprocessor_config.json",
},
embedding_function=jina_vision_embedding_function,
model_type="vision",
preferred_providers=["CPUExecutionProvider"],
)
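
The GenericONNXEmbedding class used above lives in the new frigate/embeddings/functions/onnx.py, which is not expanded in this view. Based only on how it is constructed and called here, a heavily simplified sketch of such a wrapper might look like the following; the cache path, the input format, and the omission of download and tokenization handling are all assumptions, not the committed implementation.

from typing import Callable, Dict, List, Optional

import numpy as np
import onnxruntime as ort


class GenericONNXEmbedding:
    """Wrap one ONNX model file behind a callable that returns embeddings."""

    def __init__(
        self,
        model_name: str,
        model_file: str,
        download_urls: Dict[str, str],
        embedding_function: Callable[[List[np.ndarray]], np.ndarray],
        model_type: str,
        preferred_providers: List[str],
        tokenizer_file: Optional[str] = None,
    ) -> None:
        # Assumes the referenced files are already cached locally; the real
        # class downloads them (and the tokenizer/preprocessor) on first use.
        self.embedding_function = embedding_function
        self.model_type = model_type
        self.session = ort.InferenceSession(
            f"/config/model_cache/{model_name}/{model_file}",
            providers=preferred_providers,
        )

    def __call__(self, model_inputs: Dict[str, np.ndarray]) -> np.ndarray:
        # The real class tokenizes text or preprocesses PIL images itself;
        # here the caller is assumed to pass ready-made ONNX input tensors.
        outputs = self.session.run(None, model_inputs)
        return self.embedding_function(outputs)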

Expand All @@ -101,23 +138,30 @@ def _create_tables(self):
self.db.execute_sql("""
CREATE VIRTUAL TABLE IF NOT EXISTS vec_thumbnails USING vec0(
id TEXT PRIMARY KEY,
thumbnail_embedding FLOAT[512]
thumbnail_embedding FLOAT[768]
);
""")

# Create vec0 virtual table for description embeddings
self.db.execute_sql("""
CREATE VIRTUAL TABLE IF NOT EXISTS vec_descriptions USING vec0(
id TEXT PRIMARY KEY,
description_embedding FLOAT[384]
description_embedding FLOAT[768]
);
""")

def _drop_tables(self):
self.db.execute_sql("""
DROP TABLE vec_descriptions;
""")
self.db.execute_sql("""
DROP TABLE vec_thumbnails;
""")

def upsert_thumbnail(self, event_id: str, thumbnail: bytes):
# Convert thumbnail bytes to PIL Image
image = Image.open(io.BytesIO(thumbnail)).convert("RGB")
# Generate embedding using CLIP
embedding = self.clip_embedding([image])[0]
embedding = self.vision_embedding([image])[0]

self.db.execute_sql(
"""
@@ -130,8 +174,7 @@ def upsert_thumbnail(self, event_id: str, thumbnail: bytes):
return embedding

def upsert_description(self, event_id: str, description: str):
# Generate embedding using MiniLM
embedding = self.minilm_embedding([description])[0]
embedding = self.text_embedding([description])[0]

self.db.execute_sql(
"""
@@ -177,7 +220,7 @@ def search_thumbnail(
thumbnail = base64.b64decode(query.thumbnail)
query_embedding = self.upsert_thumbnail(query.id, thumbnail)
else:
query_embedding = self.clip_embedding([query])[0]
query_embedding = self.text_embedding([query])[0]

sql_query = """
SELECT
@@ -211,7 +254,7 @@ def search_thumbnail(
def search_description(
self, query_text: str, event_ids: List[str] = None
) -> List[Tuple[str, float]]:
query_embedding = self.minilm_embedding([query_text])[0]
query_embedding = self.text_embedding([query_text])[0]

# Prepare the base SQL query
sql_query = """
@@ -246,6 +289,9 @@ def search_description(
def reindex(self) -> None:
logger.info("Indexing event embeddings...")

self._drop_tables()
self._create_tables()

st = time.time()
totals = {
"thumb": 0,
166 changes: 0 additions & 166 deletions frigate/embeddings/functions/clip.py

This file was deleted.

