
encode_text gives different CLIP features for the same text, single batch vs multiple batches #429

Open
KevinNWalker opened this issue Feb 27, 2024 · 1 comment


@KevinNWalker

I seem to get different results from encode_text when providing text as a single batch, or as part of several batches.

See the code sample below:

```python
import clip
import numpy

device = 'cuda'
clip_model, _ = clip.load('ViT-B/32', device)

clip_model.eval()

# Process 'I am happy' as a single batch
text_1 = clip.tokenize('I am happy', truncate=True).to(device)
feature_1 = clip_model.encode_text(text_1)
feature_1_np = feature_1.detach().cpu().numpy()
text_1_np = text_1.detach().cpu().numpy()

# Process 'I am happy' with a second batch
text_2 = clip.tokenize(['I am happy', 'I am happy'], truncate=True).to(device)
feature_2 = clip_model.encode_text(text_2)
feature_2_np = feature_2.detach().cpu().numpy()[0]
text_2_np = text_2.detach().cpu().numpy()[0]

print(f'Max diff in tokens {numpy.abs(text_2_np - text_1_np).max()}')
print(f'Max diff in features {numpy.abs(feature_2_np - feature_1_np).max()}')
```

When I run this I get the following results:

```
Max diff in tokens 0
Max diff in features 0.000732421875
```

Is this to be expected or am I using the code incorrectly?

Many thanks

@bonjour-npy

Hi there👋

I think your code is correct and the result of your code is also correct.

To the best of my knowledge (I can't be certain it's 100% right), tokenization in the CLIP model is just a lookup into a fixed vocabulary dict: if the input text is the same, the tokenizer will always return the same token IDs. That's why text_1, text_2[0] and text_2[1] are exactly the same.
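
As a quick sanity check (a minimal sketch of my own, assuming the standard openai/CLIP package), you can compare the token IDs directly:

```python
import torch
import clip

# Tokenization is a pure vocabulary lookup, so it is deterministic:
# the same string always maps to the same token IDs, however it is batched.
single = clip.tokenize('I am happy')                 # shape (1, 77)
batched = clip.tokenize(['I am happy', 'I am sad'])  # shape (2, 77)

print(torch.equal(single[0], batched[0]))  # expected: True
```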

But when it comes to the encode_text function, the token IDs are first mapped through nn.Embedding and then passed through the transformer, and that output may be affected by the batch size, the context of the input, or something else along those lines.

Here's a simple test:

```python
import torch
import numpy
import clip

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
clip_model, _ = clip.load('ViT-B/32', device)

clip_model.eval()

# Same prompt encoded with batch sizes 1, 2 and 3
text_1 = clip.tokenize('I am happy', truncate=True).to(device)
feature_1 = clip_model.encode_text(text_1)

text_2 = clip.tokenize(['I am happy', 'I am happy'], truncate=True).to(device)
feature_2 = clip_model.encode_text(text_2)

text_3 = clip.tokenize(['I am happy', 'I am happy', 'I am happy'], truncate=True).to(device)
feature_3 = clip_model.encode_text(text_3)

# Same batch size as text_3, but with different surrounding prompts
text_4 = clip.tokenize(['I am happy', 'I am sad', 'I am angry'], truncate=True).to(device)
feature_4 = clip_model.encode_text(text_4)

# Token differences are all zero; feature differences track the batch size
print((text_1 - text_2[0]).sum(), '\n', (text_1[0] - text_3[0]).sum(), '\n', (text_1[0] - text_4[0]).sum())
print((feature_1 - feature_2[0]).sum(), '\n', (feature_1[0] - feature_3[0]).sum(), '\n', (feature_1[0] - feature_4[0]).sum())
```

And my output is shown below:

```
tensor(0, device='cuda:0')
 tensor(0, device='cuda:0')
 tensor(0, device='cuda:0')
tensor(0.0089, device='cuda:0', dtype=torch.float16, grad_fn=<SumBackward0>)
 tensor(0.0180, device='cuda:0', dtype=torch.float16, grad_fn=<SumBackward0>)
 tensor(0.0180, device='cuda:0', dtype=torch.float16, grad_fn=<SumBackward0>)
```

From the results for text_3 and text_4 we can draw a conclusion (admittedly not a rigorous one): it was the batch size, not the content of the other prompts, that affected the output of encode_text.
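
A likely reason (my own addition, not something I have verified inside CLIP itself) is that the model runs in float16 on CUDA, and the GPU may choose different matrix-multiplication kernels for different batch shapes, so the "same" row can pick up tiny rounding differences. A minimal, CLIP-free sketch of that effect (assumes a CUDA device is available):

```python
import torch

# Hypothetical illustration: in half precision the matmul kernel (and hence
# the accumulation order) can depend on the batch shape, so the same row may
# come out slightly different. The gap may be zero on some backends.
torch.manual_seed(0)
x = torch.randn(8, 512, device='cuda', dtype=torch.float16)
w = torch.randn(512, 512, device='cuda', dtype=torch.float16)

row_alone = x[:1] @ w        # the row computed as a batch of one
row_in_batch = (x @ w)[:1]   # the same row computed inside a larger batch

print((row_alone - row_in_batch).abs().max())
```

Either way, a difference around 7e-4 is on the order of float16 rounding noise, so comparing the features with a tolerance (e.g. numpy.allclose(feature_1_np, feature_2_np, atol=1e-2)) or via cosine similarity is the usual way to treat them as equal.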

If you'd like to further communicate, feel free to reach out to me at [email protected].
