
Add Jina Embeddings #61

Closed · wants to merge 13 commits into from

Conversation

@JohannesMessner commented Nov 7, 2023

Hi Qdrant team, as promised, here comes the Jina x Fastembed integration!

The main differences from the existing FlagEmbedding class are:

  • The model files are downloaded directly from HuggingFace and leverage the HF local cache (see the sketch below this list)
  • Mean pooling is applied
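For illustration, the download path is roughly the following (a minimal sketch; the repo id and helper name are placeholders, not the exact code in this PR):

```python
from huggingface_hub import snapshot_download

def download_model(repo_id: str = "jinaai/jina-embeddings-v2-small-en") -> str:
    # snapshot_download reuses the Hugging Face local cache and only fetches
    # files that are missing, so repeated model loads stay cheap.
    return snapshot_download(repo_id=repo_id)
```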

TODO:

  • add model info
  • documentation (actually I think there is nothing to do here)
  • tests

@JohannesMessner marked this pull request as ready for review November 7, 2023 10:10
@azayarni requested a review from NirantK November 7, 2023 10:28
@NirantK (Contributor) left a comment

Hey, thanks a ton for doing this!

  1. Please avoid adding pooling to other embedding implementations like FlagEmbedding; add it to JinaEmbedding instead. FlagEmbedding is a specific embedding model, much like Jina.

  2. The intent behind FastEmbed is to continue to be light, and an important part of that is keeping the dependency graph small. Please do not add a Hugging Face Hub dependency.

The comments below are specific instances of the two points above, and they should resolve themselves once you rework the code base along those lines.

Please note that the code currently breaks, since the Hugging Face Hub dependency is not declared in pyproject.toml.

@JohannesMessner (Author)

> Hey, thanks a ton for doing this!
>
> 1. Please avoid adding pooling to other embedding implementations like `FlagEmbedding`; add it to `JinaEmbedding` instead. `FlagEmbedding` is a specific embedding model, much like Jina.
>
> 2. The intent behind FastEmbed is to continue to be light, and an important part of that is keeping the dependency graph small. Please do not add a Hugging Face Hub dependency.
>
> The comments below are specific instances of the two points above, and they should resolve themselves once you rework the code base along those lines.
>
> Please note that the code currently breaks, since the Hugging Face Hub dependency is not declared in pyproject.toml.

Thanks for your very fast review!

  • The way I am doing it right now, FlagEmbedding still doesn't use any pooling; I just chose this code structure for code sharing. But I can refactor it so that this part is completely removed from the FlagEmbedding class, if you prefer.

  • I saw huggingface_hub in the lock file, so I thought it was a transitive dependency of one of the specified ones. I'll remove it.

@JohannesMessner (Author)

I refactored the code to come closer to your suggestions. Unfortunately, this results in some code duplication between the FlagEmbedding and JinaEmbedding classes; let me know if it is OK like this!

```diff
@@ -464,7 +527,8 @@ def embed(

         if parallel is None or is_small:
             for batch in iter_batch(documents, batch_size):
-                yield from self.model.onnx_embed(batch)
+                embeddings, _ = self.model.onnx_embed(batch)
```
@NirantK (Contributor)

Hmm, I'm confused. I believe FlagEmbedding should be left untouched since all the changes are in the parent class and JinaAI Embedding class, right?

Similarly, the list_supported_models rewrite isn't needed and should be removed from all implementations now?

@JohannesMessner (Author) commented Nov 15, 2023

FlagEmbedding cannot be left entirely untouched unfortunately, unless I am missing something.

Before this PR, the EmbeddingModel.onnx_embed() method picks out the first token as a form of pooling and then applies normalization. Baked into this is the assumption that all subclasses of Embedding (that hold an EmbeddingModel instance) intend that behaviour. That assumption is broken by Jina embeddings, which require mean pooling before the normalization.
And mean pooling cannot be applied afterwards, since the existing implementation of EmbeddingModel.onnx_embed() "throws away" the token embeddings needed for it.

Therefore, the implementation of EmbeddingModel.onnx_embed() needs two small modifications:

  1. It delegates pooling and normalization to the subclasses of Embedding
  2. It returns the tokenizer's attention mask. Otherwise, without access to the attention mask, pooling schemes such as mean pooling cannot be implemented at the Embedding level.

This requires FlagEmbedding to adjust to those changes.
Just like JinaEmbedding, it now implements its own pooling scheme (just picking out the first token). The attention mask is not required for this, so it can be ignored when returned by EmbeddingModel.onnx_embed().
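To make the split concrete, here is a minimal sketch of the contract described above (the function names and signatures are illustrative, not the exact fastembed code):

```python
import numpy as np

def onnx_embed(model_output: np.ndarray, attention_mask: np.ndarray):
    # Sketch of the new contract: return raw token embeddings together with the
    # attention mask, and leave pooling + normalization to each Embedding subclass.
    return model_output, attention_mask

def flag_pooling(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    # FlagEmbedding keeps its previous behaviour: take the first ([CLS]) token.
    # The attention mask is simply ignored here.
    return token_embeddings[:, 0]

def mean_pooling(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    # JinaEmbedding averages the token embeddings, weighted by the attention mask.
    mask = np.expand_dims(attention_mask, axis=-1).astype(np.float32)
    return (token_embeddings * mask).sum(axis=1) / np.clip(mask.sum(axis=1), 1e-9, None)

def normalize(embeddings: np.ndarray) -> np.ndarray:
    # L2 normalization, applied after pooling in both cases.
    return embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
```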

@JohannesMessner (Author)

As for the list_supported_models() rewrite, yes, I can remove that. But then JinaEmbedding.list_supported_models() would return a bunch of models that are actually not supported by the JinaEmbedding class.
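One possible shape for keeping the listing per class (names are hypothetical, just to illustrate the trade-off; how fastembed actually stores model metadata may differ):

```python
from typing import Dict, List

class Embedding:
    # Hypothetical per-class registry of model metadata.
    supported_models: List[Dict] = []

    @classmethod
    def list_supported_models(cls) -> List[Dict]:
        # Each subclass reports only the models it can actually load.
        return cls.supported_models

class JinaEmbedding(Embedding):
    supported_models = [
        {"model": "jinaai/jina-embeddings-v2-base-en", "dim": 768},
        {"model": "jinaai/jina-embeddings-v2-small-en", "dim": 512},
    ]
```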

@NirantK (Contributor)

I see your point. Looks like we'll have to figure out a way to handle the normalize, attention, and pooling steps separately for each embedding implementation. For the moment, what you've proposed kinda works.

Let me think about this + test your PR and then we're good to go and merge this.

@NirantK (Contributor) commented Nov 20, 2023

Two requests:

There's an error in the pooling, specifically in the mean pooling implementation. If there is a canonical implementation from Torch or Jina itself, let's reuse that here?

This is from the pytest run I did locally with Python 3.9.17 on an M2, in a fresh poetry install:

```
fastembed/embedding.py:648: in embed
    yield from normalize(self.mean_pooling(embeddings, attn_mask)).astype(np.float32)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

model_output = array([[-0.4756149 , -0.44713712, -0.12763295, ...,  0.6697987 ,
         0.30507904,  0.28676268],
       [-0.46482596, -0.20412004, -0.27510062, ...,  0.48833442,
         0.19993246, -0.01699639]], dtype=float32)
attention_mask = array([[1, 1, 1, 1, 0, 0],
       [1, 1, 1, 1, 1, 1]])

    @staticmethod
    def mean_pooling(model_output, attention_mask):
        token_embeddings = model_output
        input_mask_expanded = (np.expand_dims(attention_mask, axis=-1)).astype(float)

>       sum_embeddings = np.sum(token_embeddings * input_mask_expanded, axis=1)
E       ValueError: operands could not be broadcast together with shapes (2,512) (2,6,1)
```
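For what it's worth, the shapes in that error suggest the model output reaching mean_pooling is already pooled down to (batch, hidden) instead of the (batch, seq_len, hidden) token embeddings the implementation expects. A quick shape check with dummy values illustrates this reading:

```python
import numpy as np

attention_mask = np.array([[1, 1, 1, 1, 0, 0],
                           [1, 1, 1, 1, 1, 1]])
mask = np.expand_dims(attention_mask, axis=-1).astype(float)   # (2, 6, 1)

pooled_output = np.zeros((2, 512), dtype=np.float32)       # shape seen in the traceback
token_embeddings = np.zeros((2, 6, 512), dtype=np.float32)  # shape mean pooling expects

token_embeddings * mask   # broadcasts to (2, 6, 512) as intended
# pooled_output * mask    # raises the "operands could not be broadcast" error above
```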

I'll also figure out how you can get these tests running on GitHub itself instead of needing to run them locally. That might help with faster dev for you.

Once this is done, let's resolve merge conflicts?

@NirantK (Contributor) commented Nov 21, 2023

Added via another PR!

@NirantK closed this Nov 21, 2023