
[Bug/Model Request]: Is slower than sentence transformer for all-minilm-l6-v2 #292

Open

0110G opened this issue Jul 9, 2024 · 10 comments
0110G commented Jul 9, 2024

What happened?

I benchmarked synchronous embedding-generation throughput:

  1. Using sentence transformers: ~1300 msgs per sec
    # (sentences, iter_count and batch_size are defined in the full script below)
    import random
    import time

    from sentence_transformers import SentenceTransformer
    model_standard = SentenceTransformer("all-MiniLM-L6-v2")

    start_time = time.time()
    for i in range(iter_count):
        model_standard.encode(random.sample(sentences, 1)[0])
    time_standard = time.time() - start_time
    print("Standard requires: {}s".format(time_standard))
    print("{} processed per sec".format(batch_size * iter_count / time_standard))

VS

  2. Using FastEmbed (synchronously): ~800 msgs per sec

    from fastembed import TextEmbedding
    fast_model = TextEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

    start_time = time.time()
    for i in range(iter_count):
        list(fast_model.embed(random.sample(sentences, 1)[0]))
    time_fast = time.time() - start_time
    print("Fast requires: {}s".format(time_fast))
    print("{} processed per sec".format(batch_size * iter_count / time_fast))

I am using fastembed 0.3.3

pip show fastembed
Name: fastembed
Version: 0.3.3
Summary: Fast, light, accurate library built for retrieval embedding generation
Home-page: https://github.com/qdrant/fastembed
Author: Qdrant Team
Author-email: [email protected]
License: Apache License
Location: /Users/<>/PycharmProjects/Voyager/venv/lib/python3.9/site-packages
Requires: tqdm, PyStemmer, numpy, mmh3, onnxruntime, pillow, onnx, loguru, tokenizers, huggingface-hub, snowballstemmer, requests
Required-by: 

Why is this so much slower than the original implementation? What can I do to improve performance?

What Python version are you on? e.g. python --version

3.9.16

Version

0.2.7 (Latest)

What os are you seeing the problem on?

MacOS

Relevant stack traces and/or logs

No response

generall (Member) commented Jul 9, 2024

For reference, our benchmark of fastembed is here - https://colab.research.google.com/github/qdrant/fastembed/blob/main/experiments/Throughput_Across_Models.ipynb

I would have to try your version to tell for sure what the difference is, but at first glance you are encoding one sentence at a time, while our benchmarks run in batches.
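To illustrate the distinction, here is a minimal sketch (not the benchmark from the Colab above; the document list and batch size are made up): encoding one document per call pays the per-call overhead for every document, while a single call over the whole list lets the library batch internally.

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    docs = ["some arbitrary sentence"] * 512

    # one document per call: the per-call overhead is paid 512 times
    one_by_one = [model.encode(d) for d in docs]

    # one call for the whole list: the library splits it into batches internally
    batched = model.encode(docs, batch_size=256)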

0110G (Author) commented Jul 9, 2024

I am also computing batch-wise (batch size = 512):

sentences = [["Some arbitrary sentence 1"]*512, ["Some arbitrary sentence 2"]*512] 

0110G (Author) commented Jul 9, 2024

Complete Python benchmarking code:

import random
import time

from sentence_transformers import SentenceTransformer
from fastembed import TextEmbedding



if __name__ == '__main__':
    iter_count = 50
    batch_size = 512
    sentences = [["biblestudytools kjv romans 6"]*512, ["MS Dhoni is one of the best wicket keeper in the world"]*512] #Standard requires: 39.150851249694824s
    
    # Load both models up-front
    model_standard = SentenceTransformer("all-MiniLM-L6-v2")
    fast_model = TextEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

    start_time = time.time()
    for i in range(iter_count):
        model_standard.encode(random.sample(sentences, 1)[0])
    time_standard = time.time() - start_time
    print("Standard requires: {}s".format(time_standard))
    print("{} processed per sec".format(batch_size*iter_count/time_standard))

    start_time = time.time()
    for i in range(iter_count):
        list(fast_model.embed(random.sample(sentences, 1)[0]))
    time_fast = time.time() - start_time
    print("Fast requires: {}s".format(time_fast))
    print("{} processed per sec".format(batch_size*iter_count/time_fast))

Output:


Standard requires: 21.204905033111572s
1207.267844870112 processed per sec
Fast requires: 25.721112966537476s
995.2913014808091 processed per sec

generall (Member) commented Jul 9, 2024

Thanks for sharing, we will look into it!

generall (Member) commented Jul 9, 2024

@0110G

I refactored the testing script a bit; here are my results: https://colab.research.google.com/drive/1SroKOUZ0iYN1vo2mRXdhIQeVyy0RWQTG?usp=sharing

It uses internal batching instead of an external loop, as both libraries provide interfaces capable of creating batches internally (see the sketch below).
If your use case requires different batching, it apparently might not work as well with fastembed.

Additionally, I tried a different scenario of inferencing individual queries, a data-parallel approach, and running on a machine with more CPUs (the default Colab has 2 CPUs, but the higher tier has 8).
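For concreteness, a minimal sketch of what internal batching means here (document count and batch size are illustrative, not the Colab's settings): the full list is passed in a single call, and each library splits it into batches itself.

    from sentence_transformers import SentenceTransformer
    from fastembed import TextEmbedding

    docs = ["some arbitrary sentence"] * (512 * 50)

    model_standard = SentenceTransformer("all-MiniLM-L6-v2")
    fast_model = TextEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

    # sentence-transformers: encode() takes the whole list and batches internally
    st_vectors = model_standard.encode(docs, batch_size=512)

    # fastembed: embed() is lazy and batches internally; list() forces full evaluation
    fe_vectors = list(fast_model.embed(docs, batch_size=512))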

0110G (Author) commented Jul 10, 2024

My use case involves constantly consuming messages from a stream in a (configurable) batch size, computing embeddings, doing some computation, and writing the results to a DB (roughly the loop sketched below). Therefore your approach does not fit my use case.

Seems like fastembed is not so fast after all.
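For context, a rough sketch of the pipeline described above; read_batch and write_to_db are hypothetical placeholders for the stream consumer and DB client, not real APIs. Note that this calls embed() once per batch, which is exactly the per-call overhead identified in the next comment.

    from fastembed import TextEmbedding

    fast_model = TextEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
    BATCH_SIZE = 512  # configurable

    def read_batch(size):
        """Placeholder: pull up to `size` messages from the stream."""
        ...

    def write_to_db(records):
        """Placeholder: persist (message, embedding) pairs."""
        ...

    while True:
        messages = read_batch(BATCH_SIZE)
        if not messages:
            break
        # embed() is called once per batch here, paying its startup cost each time
        embeddings = list(fast_model.embed(messages, batch_size=BATCH_SIZE))
        write_to_db(list(zip(messages, embeddings)))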

generall (Member) commented

@0110G
I think I understood the problem: when you call the embed function in fastembed, it spawns workers each time, which creates overhead.

I tried to convert the fastembed version into streaming with Python generators, so the embed function is only called once: https://colab.research.google.com/drive/1X03qTpBVNGDYs82CztfpqF2JOq_-75hK?usp=sharing

Please let me know if this option is closer to your use case.
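A minimal sketch of the streaming pattern described in that notebook, with a stand-in generator in place of the real message source: embed() is called exactly once, consumes the generator lazily, and batches internally, so workers are not re-created per batch.

    from fastembed import TextEmbedding

    fast_model = TextEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

    def message_stream():
        """Stand-in for the real consumer: yields messages as they arrive."""
        for i in range(100_000):  # in practice, loop over the stream/queue instead
            yield "some arbitrary message {}".format(i)

    # a single embed() call; the returned iterator yields one embedding per message
    for embedding in fast_model.embed(message_stream(), batch_size=512):
        ...  # post-process / write to the DB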

0110G (Author) commented Jul 10, 2024

This works, but I am not getting results similar to what you showed on Colab. Sentence Transformers is still faster for me.
I find it absurd: how can an ONNX model be slower than the original implementation?

joein (Member) commented Jul 10, 2024

Hi @0110G

Actually, I've encountered several cases where the ONNX model was slower on macOS; the issue might be in onnxruntime.
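If it helps to narrow this down, a quick read-only way to inspect the local onnxruntime build that fastembed ends up using; the provider names in the comment are just examples of what may appear on macOS.

    import onnxruntime as ort

    print(ort.__version__)                # onnxruntime build used by fastembed
    print(ort.get_device())               # "CPU" or "GPU"
    print(ort.get_available_providers())  # e.g. ["CoreMLExecutionProvider", "CPUExecutionProvider"]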

generall (Member) commented

I was running Colab on a higher-tier machine with 8 CPUs; that might be the reason.
