diff --git a/.github/workflows/CI.yml b/.github/workflows/CI.yml
index b2c58574..1e37d6c6 100644
--- a/.github/workflows/CI.yml
+++ b/.github/workflows/CI.yml
@@ -105,8 +105,6 @@ jobs:
    strategy:
      matrix:
        platform:
-          - runner: macos-12
-            target: x86_64
          - runner: macos-14
            target: aarch64
    steps:
diff --git a/README.md b/README.md
index 625e7fda..fb89f98f 100644
--- a/README.md
+++ b/README.md
@@ -21,14 +21,14 @@
- 🦀 Rust-powered Framework for Lightning-Fast Ingestion, Inference, and Indexing
+ Inference, ingestion, and indexing – supercharged by Rust 🦀
Explore the docs »
View Demo
·
- Examples
+ Benches
·
Vector Streaming Adapters
.
@@ -83,7 +83,7 @@ EmbedAnything is a minimalist, highly performant, lightning-fast, lightweight, m
## đź’ˇWhat is Vector Streaming
-Vector Streaming enables you to process and generate embeddings for files and stream them, so if you have 10 GB of file, it can continuously generate embeddings Chunk by Chunk, that you can segment semantically, and store them in the vector database of your choice, Thus it eliminates bulk embeddings storage on RAM at once.
+Vector Streaming lets you process files and generate embeddings as a stream: even with 10 GB of files, embeddings are produced continuously, chunk by chunk (with optional semantic segmentation), and stored in the vector database of your choice, so the full set of embeddings never has to sit in RAM at once. The embedding process runs separately from the main process, using Rust's MPSC channels to keep performance high.
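+
+Below is a minimal sketch of what streaming into a vector database looks like from Python. It assumes an adapter object (here called `adapter`) has already been constructed along the lines of the scripts in `examples/adapters/`; the file path and model id are placeholders.
+
+```python
+import embed_anything
+from embed_anything import EmbeddingModel, WhichModel, TextEmbedConfig
+
+# Load a local embedding model and a chunking config.
+model = EmbeddingModel.from_pretrained_hf(
+    WhichModel.Bert, model_id="sentence-transformers/all-MiniLM-L6-v2"
+)
+config = TextEmbedConfig(chunk_size=200, batch_size=32)
+
+# `adapter` is assumed to be a vector-database adapter built as in examples/adapters/.
+# With an adapter attached, embeddings are streamed to the vector store
+# chunk by chunk instead of being accumulated in RAM first.
+data = embed_anything.embed_file(
+    "path/to/large_file.pdf", embedder=model, config=config, adapter=adapter
+)
+```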
[![EmbedAnythingXWeaviate](https://res.cloudinary.com/dltwftrgc/image/upload/v1731166897/demo_o8auu4.gif)](https://www.youtube.com/watch?v=OJRWPLQ44Dw)
@@ -107,7 +107,7 @@ model = EmbeddingModel.from_pretrained_hf(
WhichModel.Bert, model_id="model link from huggingface"
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
-data = embed_anything.embed_file("file_address", embeder=model, config=config)
+data = embed_anything.embed_file("file_address", embedder=model, config=config)
```
@@ -190,7 +190,7 @@ pip install embed-anything-gpu
model = EmbeddingModel.from_pretrained_local(
WhichModel.Bert, model_id="Hugging_face_link"
)
-data = embed_anything.embed_file("test_files/test.pdf", embeder=model)
+data = embed_anything.embed_file("test_files/test.pdf", embedder=model)
```
@@ -206,11 +206,11 @@ model = embed_anything.EmbeddingModel.from_pretrained_local(
model_id="openai/clip-vit-base-patch16",
# revision="refs/pr/15",
)
-data: list[EmbedData] = embed_anything.embed_directory("test_files", embeder=model)
+data: list[EmbedData] = embed_anything.embed_directory("test_files", embedder=model)
embeddings = np.array([data.embedding for data in data])
query = ["Photo of a monkey?"]
query_embedding = np.array(
- embed_anything.embed_query(query, embeder=model)[0].embedding
+ embed_anything.embed_query(query, embedder=model)[0].embedding
)
similarities = np.dot(embeddings, query_embedding)
max_index = np.argmax(similarities)
@@ -233,7 +233,7 @@ from embed_anything import (
audio_decoder = AudioDecoderModel.from_pretrained_hf(
"openai/whisper-tiny.en", revision="main", model_type="tiny-en", quantized=False
)
-embeder = EmbeddingModel.from_pretrained_hf(
+embedder = EmbeddingModel.from_pretrained_hf(
embed_anything.WhichModel.Bert,
model_id="sentence-transformers/all-MiniLM-L6-v2",
revision="main",
@@ -242,7 +242,7 @@ config = TextEmbedConfig(chunk_size=200, batch_size=32)
data = embed_anything.embed_audio_file(
"test_files/audio/samples_hp0.wav",
audio_decoder=audio_decoder,
- embeder=embeder,
+ embedder=embedder,
text_embed_config=config,
)
print(data[0].metadata)
diff --git a/docs/blog/posts/Journey.md b/docs/blog/posts/Journey.md
new file mode 100644
index 00000000..6dd1336c
--- /dev/null
+++ b/docs/blog/posts/Journey.md
@@ -0,0 +1,77 @@
+---
+draft: false
+date: 2024-12-15
+authors:
+ - akshay
+ - sonam
+slug: embed-anything
+title: The path ahead of EmbedAnything
+---
+In March, we set out to build a local file search app. We aimed to create a tool that would make file search faster, smarter, and more efficient. However, we quickly hit a roadblock: no high-performance backend fit our needs.
+
+![image.png](https://royal-hygienic-522.notion.site/image/https%3A%2F%2Fprod-files-secure.s3.us-west-2.amazonaws.com%2Ff1bf59bf-2c3f-4b4d-a5f9-109d041ef45a%2Faa8abe48-4210-494c-af98-458b6694b09a%2Fimage.png?table=block&id=15d81b6a-6bbe-80cc-883e-fcafd65e619d&spaceId=f1bf59bf-2c3f-4b4d-a5f9-109d041ef45a&width=1420&userId=&cache=v2)
+
+### Short on a backend
+
+Initially, we experimented with LlamaIndex, hoping it would provide the required speed and reliability. Unfortunately, it fell short. Its performance didn’t meet our expectations, and its heavy dependencies added unnecessary complexity to our stack. We realized we needed a better solution.
+
+Around the same time, we discovered **Candle**, a Rust-based framework for transformer model inference. Candle stood out with its remarkable speed and minimal dependency footprint. It was exactly what we were looking for: a high-performing, lightweight backend that aligned with our vision for a seamless file search experience.
+
+### Experimentation and Breakthroughs
+
+Excited by Candle’s potential, we experimented to see how well it could handle our use case. The results were outstanding. Candle’s blazing-fast inference speeds and low resource demands enabled us to build a prototype that surpassed our initial performance goals.
+
+With a working prototype, we decided to share it with the world. We knew a compelling demonstration could capture attention and validate our efforts. The next step was to make a splash with our launch.
+
+### Demo Released
+
+On **April 2nd**, we unveiled our demo online, carefully choosing the date to avoid confusion with April Fool’s Day. We created an engaging demo video to highlight the app’s capabilities and shared it on Twitter. What happened next exceeded all our expectations.
+
+The demo received an overwhelming response. What began as a simple showcase of our prototype transformed into a pivotal moment for our project. Seeing the demand and people's interest, we released it as an open-source project within the next 30 days.
+
+[Demo](https://www.youtube.com/watch?v=HLXIuznnXcI)
+### 0.2 released
+
+Since then, we have never looked back. We kept making EmbedAnything better and better. Over the next three months, we released a more stable version, 0.2, with support for all Python versions. It ran amazingly on AWS and could support multimodality.
+
+At the same time, we realized that people wanted an end-to-end solution, not just an embedding-generation platform. So we tried to integrate a vector database, but realized it would only make our library heavier without adding the value we were looking for, which was confirmed by a discussion opened on our GitHub.
+
+[GitHub discussion](https://github.com/StarlightSearch/EmbedAnything/discussions/44#discussion-6953627)
+
+Akshay started looking for ways to index embeddings without taking on a vector database as a dependency, and he came up with a brilliant method that enhanced performance and made indexing extremely memory-efficient.
+
+And thus, vector streaming was born.
+
+[Vector streaming blog](https://starlight-search.com/blog/2024/01/31/vector-streaming/)
+
+### 0.3 release
+
+It was time to release 0.3 because we had gone through a major code refactoring. All the major functions were refactored, making model calls more intuitive and optimized. Check out our docs for usage. We also added an audio modality and more types of ingestion.
+
+Until then we had only supported dense embeddings, so we expanded the kinds of embeddings we could generate: we went for sparse and started supporting ColPali, ONNX, and Candle.
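+
+As a rough sketch of what the refactored API looks like for sparse embeddings (the model id and sentence below are placeholders, mirroring `examples/splade.py` in the repo):
+
+```python
+from embed_anything import EmbeddingModel, WhichModel, embed_query
+
+# Sparse (SPLADE-style) models load through the same entry point as dense ones.
+model = EmbeddingModel.from_pretrained_hf(
+    WhichModel.SparseBert, "prithivida/Splade_PP_en_v1"
+)
+
+# embed_query returns EmbedData objects; .embedding holds the vector.
+data = embed_query(["Do you like pizza?"], embedder=model)
+print(data[0].embedding)
+```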
+
+## What We Got Right
+
+We actively listened to our community and prioritized their needs in the library's development. When users requested support for sparse matrices in hybrid models, we delivered. When they wanted advanced indexing, we made it happen. During the critical three-month period between versions 0.2 and 0.4, our efforts were laser-focused on enhancing the product to meet and exceed expectations.
+
+We also released benches comparing EmbedAnything with other inference libraries, and to our surprise it is faster than libraries like Sentence Transformers and FastEmbed. Check out the [Benches](https://colab.research.google.com/drive/1nXvd25hDYO-j7QGOIIC0M7MDpovuPCaD?usp=sharing).
+
+
+We presented EmbedAnything at many conferences and meetups, including PyData Global, Elastic, Voxel51 meetups, and AI Builders. Additionally, we forged collaborations with major brands like Weaviate and Elastic, a strategy we’re excited to continue expanding in 2025.
+
+[Weaviate Collab](https://www.youtube.com/watch?v=OJRWPLQ44Dw)
+
+
+## What We Initially Got Wrong
+
+In hindsight, one significant mistake was prematurely releasing the library before it was ready for production. As the saying goes, “You never get a second chance to make a first impression,” and this holds true even for open-source projects.
+
+The library was unusable on macOS for the first three months, and we only released compatibility with Python 3.10. We didn’t focus enough on how we were rolling out updates, partly because we never anticipated the overwhelming rate of experimentation and interest it would receive right from the start.
+
+I intended to foster a “build in public” project, encouraging collaboration and rapid iteration. I wanted to showcase how quickly we could improve and refine this amazing library.
+
+### In the year 2025
+
+We are committed to applying everything we’ve learned from this journey and doubling down on what truly matters: our hero, the product. In the grand scheme of things, nothing else is as important. Moving forward, we’re also excited to announce even more collaborations with amazing brands, further expanding the impact and reach of our work.
+
+Heartfelt thanks to all our amazing contributors and stargazers for your unwavering support and dedication to *EmbedAnything*. Your continuous experimentation and feedback inspire us to keep refining and enhancing the library with every iteration. We deeply appreciate your efforts in making this journey truly collaborative. Let’s go from 100k+ to a million downloads this year!
\ No newline at end of file
diff --git a/docs/blog/posts/embed-anything.md b/docs/blog/posts/embed-anything.md
index f4266507..9c545325 100644
--- a/docs/blog/posts/embed-anything.md
+++ b/docs/blog/posts/embed-anything.md
@@ -1,6 +1,6 @@
---
draft: false
-date: 2024-01-31
+date: 2024-03-31
authors:
- akshay
- sonam
@@ -115,7 +115,7 @@ model = EmbeddingModel.from_pretrained_hf(
WhichModel.Bert, model_id="model link from huggingface"
)
config = TextEmbedConfig(chunk_size=200, batch_size=32)
-data = embed_anything.embed_file("file_address", embeder=model, config=config)
+data = embed_anything.embed_file("file_address", embedder=model, config=config)
```
You can check out the documentation at https://starlight-search.com/references/
diff --git a/docs/blog/posts/vector-streaming.md b/docs/blog/posts/vector-streaming.md
index c6b8a2be..abc2c35e 100644
--- a/docs/blog/posts/vector-streaming.md
+++ b/docs/blog/posts/vector-streaming.md
@@ -1,6 +1,6 @@
---
draft: false
-date: 2024-01-31
+date: 2024-03-31
authors:
- akshay
- sonam
@@ -114,7 +114,7 @@ model = embed_anything.EmbeddingModel.from_pretrained_cloud(
data = embed_anything.embed_image_directory(
"\image_directory",
- embeder=model,
+ embedder=model,
adapter=weaviate_adapter,
config=embed_anything.ImageEmbedConfig(buffer_size=100),
)
@@ -124,7 +124,7 @@ data = embed_anything.embed_image_directory(
#### Step 4: Query the Vector Database
```python
-query_vector = embed_anything.embed_query(["image of a cat"], embeder=model)[0].embedding
+query_vector = embed_anything.embed_query(["image of a cat"], embedder=model)[0].embedding
```
#### Step 5: Query the Vector Database
diff --git a/docs/index.md b/docs/index.md
index 02643ed6..e45cf78e 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -119,7 +119,7 @@ pip install embed-anything-gpu
model = EmbeddingModel.from_pretrained_local(
WhichModel.Bert, model_id="sentence-transformers/all-MiniLM-L6-v2"
)
-data = embed_anything.embed_file("test_files/test.pdf", embeder=model)
+data = embed_anything.embed_file("test_files/test.pdf", embedder=model)
```
@@ -162,11 +162,11 @@ model = embed_anything.EmbeddingModel.from_pretrained_local(
model_id="openai/clip-vit-base-patch16",
# revision="refs/pr/15",
)
-data: list[EmbedData] = embed_anything.embed_directory("test_files", embeder=model)
+data: list[EmbedData] = embed_anything.embed_directory("test_files", embedder=model)
embeddings = np.array([data.embedding for data in data])
query = ["Photo of a monkey?"]
query_embedding = np.array(
- embed_anything.embed_query(query, embeder=model)[0].embedding
+ embed_anything.embed_query(query, embedder=model)[0].embedding
)
similarities = np.dot(embeddings, query_embedding)
max_index = np.argmax(similarities)
@@ -199,7 +199,7 @@ jina_config = JinaConfig(
config = EmbedConfig(jina=jina_config, audio_decoder=audio_decoder_config)
data = embed_anything.embed_file(
- "test_files/audio/samples_hp0.wav", embeder="Audio", config=config
+ "test_files/audio/samples_hp0.wav", embedder="Audio", config=config
)
print(data[0].metadata)
end_time = time.time()
diff --git a/examples/adapters/elastic.py b/examples/adapters/elastic.py
index 0600090f..4b70f606 100644
--- a/examples/adapters/elastic.py
+++ b/examples/adapters/elastic.py
@@ -77,7 +77,7 @@ def upsert(self, data: List[Dict]):
data = embed_anything.embed_file(
"/path/to/my-file.pdf",
- embeder="Bert",
+ embedder="Bert",
adapter=elasticsearch_adapter,
config=embed_config,
)
diff --git a/examples/adapters/pinecone_db.py b/examples/adapters/pinecone_db.py
index de054f7e..7545bc7e 100644
--- a/examples/adapters/pinecone_db.py
+++ b/examples/adapters/pinecone_db.py
@@ -123,7 +123,7 @@ def upsert(self, data: List[Dict]):
data = embed_anything.embed_image_directory(
"test_files",
- embeder=clip_model,
+ embedder=clip_model,
adapter=pinecone_adapter,
config=embed_config,
)
diff --git a/examples/adapters/weaviate_db.py b/examples/adapters/weaviate_db.py
index e0dbf456..1fc54780 100644
--- a/examples/adapters/weaviate_db.py
+++ b/examples/adapters/weaviate_db.py
@@ -65,10 +65,10 @@ def delete_index(self, index_name: str):
data = embed_anything.embed_directory(
- "test_files", embeder=model, adapter=weaviate_adapter
+ "test_files", embedder=model, adapter=weaviate_adapter
)
-query_vector = embed_anything.embed_query(["What is attention"], embeder=model)[
+query_vector = embed_anything.embed_query(["What is attention"], embedder=model)[
0
].embedding
diff --git a/examples/audio.py b/examples/audio.py
index 5277d0da..ba000aee 100644
--- a/examples/audio.py
+++ b/examples/audio.py
@@ -14,7 +14,7 @@
"openai/whisper-tiny.en", revision="main", model_type="tiny-en", quantized=False
)
-embeder = EmbeddingModel.from_pretrained_hf(
+embedder = EmbeddingModel.from_pretrained_hf(
embed_anything.WhichModel.Bert,
model_id="sentence-transformers/all-MiniLM-L6-v2",
revision="main",
@@ -24,7 +24,7 @@
data = embed_anything.embed_audio_file(
"test_files/audio/samples_hp0.wav",
audio_decoder=audio_decoder,
- embeder=embeder,
+ embedder=embedder,
text_embed_config=config,
)
print(data[0].metadata)
diff --git a/examples/clip.py b/examples/clip.py
index ca4b14cd..61dcd80c 100644
--- a/examples/clip.py
+++ b/examples/clip.py
@@ -11,7 +11,7 @@
model_id="openai/clip-vit-base-patch16",
)
data: list[EmbedData] = embed_anything.embed_image_directory(
- "test_files", embeder=model
+ "test_files", embedder=model
)
# Convert the embeddings to a numpy array
@@ -22,7 +22,7 @@
# Embed a query
query = ["Photo of a monkey?"]
query_embedding = np.array(
- embed_anything.embed_query(query, embeder=model)[0].embedding
+ embed_anything.embed_query(query, embedder=model)[0].embedding
)
# Calculate the similarities between the query embedding and all the embeddings
diff --git a/examples/hybridsearch.py b/examples/hybridsearch.py
index c3d635d4..855cabc8 100644
--- a/examples/hybridsearch.py
+++ b/examples/hybridsearch.py
@@ -50,16 +50,16 @@
WhichModel.Jina, model_id="jinaai/jina-embeddings-v2-small-en"
)
-jina_embedddings = embed_anything.embed_query(sentences, embeder=jina_model)
-jina_query = embed_anything.embed_query(query_text, embeder=jina_model)[0]
+jina_embedddings = embed_anything.embed_query(sentences, embedder=jina_model)
+jina_query = embed_anything.embed_query(query_text, embedder=jina_model)[0]
splade_model = EmbeddingModel.from_pretrained_hf(
WhichModel.SparseBert, "prithivida/Splade_PP_en_v1"
)
-jina_embedddings = embed_anything.embed_query(sentences, embeder=jina_model)
+jina_embedddings = embed_anything.embed_query(sentences, embedder=jina_model)
-splade_query = embed_anything.embed_query(query_text, embeder=splade_model)
+splade_query = embed_anything.embed_query(query_text, embedder=splade_model)
client.query_points(
collection_name="my-hybrid-collection",
diff --git a/examples/onnx_models.py b/examples/onnx_models.py
index 312556cc..a6752f8f 100644
--- a/examples/onnx_models.py
+++ b/examples/onnx_models.py
@@ -25,7 +25,7 @@
"The dog is sitting in the park",
]
-embedddings = embed_query(sentences, embeder=model)
+embedddings = embed_query(sentences, embedder=model)
embed_vector = np.array([e.embedding for e in embedddings])
diff --git a/examples/semantic_chunking.py b/examples/semantic_chunking.py
index 7575f60a..20e59a5c 100644
--- a/examples/semantic_chunking.py
+++ b/examples/semantic_chunking.py
@@ -16,7 +16,7 @@
semantic_encoder=semantic_encoder,
)
-data = embed_anything.embed_file("test_files/bank.txt", embeder=model, config=config)
+data = embed_anything.embed_file("test_files/bank.txt", embedder=model, config=config)
for d in data:
print(d.text)
diff --git a/examples/splade.py b/examples/splade.py
index 4f806614..40b7738f 100644
--- a/examples/splade.py
+++ b/examples/splade.py
@@ -22,7 +22,7 @@
"Do you like pizza?",
]
-embedddings = embed_query(sentences, embeder=model)
+embedddings = embed_query(sentences, embedder=model)
embed_vector = np.array([e.embedding for e in embedddings])
diff --git a/examples/text.py b/examples/text.py
index 15225727..296bff83 100644
--- a/examples/text.py
+++ b/examples/text.py
@@ -21,7 +21,7 @@ def embed_directory_example():
# Embed all files in a directory
data: list[EmbedData] = embed_anything.embed_directory(
- "bench", embeder=model, config=config
+ "bench", embedder=model, config=config
)
# End timing
@@ -39,7 +39,7 @@ def embed_query_example():
# Embed a query
embeddings: EmbedData = embed_anything.embed_query(
- ["Hello world my"], embeder=model, config=config
+ ["Hello world my"], embedder=model, config=config
)[0]
# Print the shape of the embedding
@@ -48,7 +48,7 @@ def embed_query_example():
# Embed another query and print the result
print(
embed_anything.embed_query(
- ["What is the capital of India?"], embeder=model, config=config
+ ["What is the capital of India?"], embedder=model, config=config
)
)
@@ -62,7 +62,7 @@ def embed_file_example():
# Embed a single file
data: list[EmbedData] = embed_anything.embed_file(
- "test_files/bank.txt", embeder=model, config=config
+ "test_files/bank.txt", embedder=model, config=config
)
# Print the embedded data
diff --git a/examples/text_ocr.py b/examples/text_ocr.py
index a0db094d..f01a68bf 100644
--- a/examples/text_ocr.py
+++ b/examples/text_ocr.py
@@ -22,7 +22,7 @@
data: list[EmbedData] = embed_anything.embed_file(
"/home/akshay/projects/starlaw/src-server/test_files/court.pdf", # Replace with your file path
- embeder=model,
+ embedder=model,
config=config,
)
end = time()
diff --git a/examples/web.py b/examples/web.py
index e877e084..dcb55d88 100644
--- a/examples/web.py
+++ b/examples/web.py
@@ -1,3 +1,3 @@
import embed_anything
-data = embed_anything.embed_webpage("https://www.akshaymakes.com/", embeder="Bert")
+data = embed_anything.embed_webpage("https://www.akshaymakes.com/", embedder="Bert")
diff --git a/python/python/embed_anything/__init__.py b/python/python/embed_anything/__init__.py
index 95f0f29f..4c8506b5 100644
--- a/python/python/embed_anything/__init__.py
+++ b/python/python/embed_anything/__init__.py
@@ -21,7 +21,7 @@
model = EmbeddingModel.from_pretrained_local(
WhichModel.Bert, model_id="Hugging_face_link"
)
-data = embed_anything.embed_file("test_files/test.pdf", embeder=model)
+data = embed_anything.embed_file("test_files/test.pdf", embedder=model)
#For images
@@ -30,11 +30,11 @@
model_id="openai/clip-vit-base-patch16",
# revision="refs/pr/15",
)
-data: list[EmbedData] = embed_anything.embed_directory("test_files", embeder=model)
+data: list[EmbedData] = embed_anything.embed_directory("test_files", embedder=model)
embeddings = np.array([data.embedding for data in data])
query = ["Photo of a monkey?"]
query_embedding = np.array(
- embed_anything.embed_query(query, embeder=model)[0].embedding
+ embed_anything.embed_query(query, embedder=model)[0].embedding
)
# For audio files
from embed_anything import (
@@ -47,7 +47,7 @@
audio_decoder = AudioDecoderModel.from_pretrained_hf(
"openai/whisper-tiny.en", revision="main", model_type="tiny-en", quantized=False
)
-embeder = EmbeddingModel.from_pretrained_hf(
+embedder = EmbeddingModel.from_pretrained_hf(
embed_anything.WhichModel.Bert,
model_id="sentence-transformers/all-MiniLM-L6-v2",
revision="main",
@@ -56,7 +56,7 @@
data = embed_anything.embed_audio_file(
"test_files/audio/samples_hp0.wav",
audio_decoder=audio_decoder,
- embeder=embeder,
+ embedder=embedder,
text_embed_config=config,
)
@@ -98,7 +98,7 @@
data = embed_anything.embed_image_directory(
"test_files",
- embeder=clip_model,
+ embedder=clip_model,
adapter=pinecone_adapter,
# config=embed_config,
```
diff --git a/python/python/embed_anything/_embed_anything.pyi b/python/python/embed_anything/_embed_anything.pyi
index 50b32783..00290dd8 100644
--- a/python/python/embed_anything/_embed_anything.pyi
+++ b/python/python/embed_anything/_embed_anything.pyi
@@ -53,14 +53,14 @@ class Adapter(ABC):
"""
def embed_query(
- query: list[str], embeder: EmbeddingModel, config: TextEmbedConfig | None = None
+ query: list[str], embedder: EmbeddingModel, config: TextEmbedConfig | None = None
) -> list[EmbedData]:
"""
Embeds the given query and returns a list of EmbedData objects.
Args:
query: The query to embed.
- embeder: The embedding model to use.
+ embedder: The embedding model to use.
config: The configuration for the embedding model.
Returns:
@@ -80,7 +80,7 @@ def embed_query(
def embed_file(
file_path: str,
- embeder: EmbeddingModel,
+ embedder: EmbeddingModel,
config: TextEmbedConfig | None = None,
adapter: Adapter | None = None,
) -> list[EmbedData]:
@@ -89,7 +89,7 @@ def embed_file(
Args:
file_path: The path to the file to embed.
- embeder: The embedding model to use.
+ embedder: The embedding model to use.
config: The configuration for the embedding model.
adapter: The adapter to use for storing the embeddings in a vector database.
@@ -104,13 +104,13 @@ def embed_file(
model_id="sentence-transformers/all-MiniLM-L6-v2",
revision="main",
)
- data = embed_anything.embed_file("test_files/test.pdf", embeder=model)
+ data = embed_anything.embed_file("test_files/test.pdf", embedder=model)
```
"""
def embed_directory(
file_path: str,
- embeder: EmbeddingModel,
+ embedder: EmbeddingModel,
extensions: list[str],
config: TextEmbedConfig | None = None,
adapter: Adapter | None = None,
@@ -120,7 +120,7 @@ def embed_directory(
Args:
file_path: The path to the directory containing the files to embed.
- embeder: The embedding model to use.
+ embedder: The embedding model to use.
extensions: The list of file extensions to consider for embedding.
config: The configuration for the embedding model.
adapter: The adapter to use for storing the embeddings in a vector database.
@@ -136,13 +136,13 @@ def embed_directory(
model_id="sentence-transformers/all-MiniLM-L6-v2",
revision="main",
)
- data = embed_anything.embed_directory("test_files", embeder=model, extensions=[".pdf"])
+ data = embed_anything.embed_directory("test_files", embedder=model, extensions=[".pdf"])
```
"""
def embed_image_directory(
file_path: str,
- embeder: EmbeddingModel,
+ embedder: EmbeddingModel,
config: ImageEmbedConfig | None = None,
adapter: Adapter | None = None,
) -> list[EmbedData]:
@@ -151,7 +151,7 @@ def embed_image_directory(
Args:
file_path: The path to the directory containing the images to embed.
- embeder: The embedding model to use.
+ embedder: The embedding model to use.
config: The configuration for the embedding model.
adapter: The adapter to use for storing the embeddings in a vector database.
@@ -161,7 +161,7 @@ def embed_image_directory(
def embed_webpage(
url: str,
- embeder: EmbeddingModel,
+ embedder: EmbeddingModel,
config: TextEmbedConfig | None,
adapter: Adapter | None,
) -> list[EmbedData] | None:
@@ -170,7 +170,7 @@ def embed_webpage(
Args:
url: The URL of the webpage to embed.
- embeder: The name of the embedding model to use. Choose between "OpenAI", "Jina", "Bert"
+ embedder: The name of the embedding model to use. Choose between "OpenAI", "Jina", "Bert"
config: The configuration for the embedding model.
adapter: The adapter to use for storing the embeddings.
@@ -185,7 +185,7 @@ def embed_webpage(
openai_config=embed_anything.OpenAIConfig(model="text-embedding-3-small")
)
data = embed_anything.embed_webpage(
- "https://www.akshaymakes.com/", embeder="OpenAI", config=config
+ "https://www.akshaymakes.com/", embedder="OpenAI", config=config
)
```
"""
@@ -193,7 +193,7 @@ def embed_webpage(
def embed_audio_file(
file_path: str,
audio_decoder: AudioDecoderModel,
- embeder: EmbeddingModel,
+ embedder: EmbeddingModel,
text_embed_config: TextEmbedConfig | None = TextEmbedConfig(
chunk_size=200, batch_size=32
),
@@ -204,7 +204,7 @@ def embed_audio_file(
Args:
file_path: The path to the audio file to embed.
audio_decoder: The audio decoder model to use.
- embeder: The embedding model to use.
+ embedder: The embedding model to use.
text_embed_config: The configuration for the embedding model.
Returns:
@@ -218,7 +218,7 @@ def embed_audio_file(
"openai/whisper-tiny.en", revision="main", model_type="tiny-en", quantized=False
)
- embeder = embed_anything.EmbeddingModel.from_pretrained_hf(
+ embedder = embed_anything.EmbeddingModel.from_pretrained_hf(
embed_anything.WhichModel.Bert,
model_id="sentence-transformers/all-MiniLM-L6-v2",
revision="main",
@@ -228,7 +228,7 @@ def embed_audio_file(
data = embed_anything.embed_audio_file(
"test_files/audio/samples_hp0.wav",
audio_decoder=audio_decoder,
- embeder=embeder,
+ embedder=embedder,
text_embed_config=config,
)
```
diff --git a/python/src/lib.rs b/python/src/lib.rs
index 3fd6b1e2..d5afba28 100644
--- a/python/src/lib.rs
+++ b/python/src/lib.rs
@@ -353,14 +353,14 @@ impl AudioDecoderModel {
}
#[pyfunction]
-#[pyo3(signature = (query, embeder, config=None))]
+#[pyo3(signature = (query, embedder, config=None))]
pub fn embed_query(
query: Vec