
encode_text gives different CLIP features for the same text, single batch vs multiple batches #429

Open
KevinNWalker opened this issue Feb 27, 2024 · 1 comment


@KevinNWalker

I seem to get different results from encode_text when providing text as a single batch, or as part of several batches.

See the code sample below:

```python
import clip
import numpy

device = 'cuda'
clip_model, _ = clip.load('ViT-B/32', device)

clip_model.eval()

# Process 'I am happy' as a single batch
text_1 = clip.tokenize('I am happy', truncate=True).to(device)
feature_1 = clip_model.encode_text(text_1)
feature_1_np = feature_1.detach().cpu().numpy()
text_1_np = text_1.detach().cpu().numpy()

# Process 'I am happy' with a second batch
text_2 = clip.tokenize(['I am happy', 'I am happy'], truncate=True).to(device)
feature_2 = clip_model.encode_text(text_2)
feature_2_np = feature_2.detach().cpu().numpy()[0]
text_2_np = text_2.detach().cpu().numpy()[0]

print(f'Max diff in tokens {numpy.abs(text_2_np - text_1_np).max()}')
print(f'Max diff in features {numpy.abs(feature_2_np - feature_1_np).max()}')
```

When I run this I get the following results:

```
Max diff in tokens 0
Max diff in features 0.000732421875
```

Is this to be expected or am I using the code incorrectly?

Many thanks

@bonjour-npy

Hi there👋

I think your code is correct and the result of your code is also correct.

To the best of my knowledge (I can't be certain it's 100% right), tokenization in the CLIP model is just a lookup into a fixed vocabulary dict: if the input text is the same, the tokenizer will always return the same token IDs. That's why text_1, text_2[0] and text_2[1] are exactly the same.
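
As a quick sanity check (a minimal sketch of my own, assuming the standard openai/CLIP package), you can compare the token IDs directly:

```python
import torch
import clip

# Tokenization is a pure vocabulary lookup, so it is deterministic:
# the same string always maps to the same token IDs, however it is batched.
single = clip.tokenize('I am happy')                 # shape (1, 77)
batched = clip.tokenize(['I am happy', 'I am sad'])  # shape (2, 77)

print(torch.equal(single[0], batched[0]))  # expected: True
```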

But when it comes to the encode_text function, the token IDs are first mapped through nn.Embedding and then passed through the transformer, and that output may be affected by the batch size, the context of the input, or something else along those lines.

Here's a simple test:

```python
import torch
import numpy
import clip

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
clip_model, _ = clip.load('ViT-B/32', device)

clip_model.eval()

# Same prompt encoded with batch sizes 1, 2 and 3
text_1 = clip.tokenize('I am happy', truncate=True).to(device)
feature_1 = clip_model.encode_text(text_1)

text_2 = clip.tokenize(['I am happy', 'I am happy'], truncate=True).to(device)
feature_2 = clip_model.encode_text(text_2)

text_3 = clip.tokenize(['I am happy', 'I am happy', 'I am happy'], truncate=True).to(device)
feature_3 = clip_model.encode_text(text_3)

# Same batch size as text_3, but with different surrounding prompts
text_4 = clip.tokenize(['I am happy', 'I am sad', 'I am angry'], truncate=True).to(device)
feature_4 = clip_model.encode_text(text_4)

# Token differences are all zero; feature differences track the batch size
print((text_1 - text_2[0]).sum(), '\n', (text_1[0] - text_3[0]).sum(), '\n', (text_1[0] - text_4[0]).sum())
print((feature_1 - feature_2[0]).sum(), '\n', (feature_1[0] - feature_3[0]).sum(), '\n', (feature_1[0] - feature_4[0]).sum())
```

And my output is shown below:

```
tensor(0, device='cuda:0')
 tensor(0, device='cuda:0')
 tensor(0, device='cuda:0')
tensor(0.0089, device='cuda:0', dtype=torch.float16, grad_fn=<SumBackward0>)
 tensor(0.0180, device='cuda:0', dtype=torch.float16, grad_fn=<SumBackward0>)
 tensor(0.0180, device='cuda:0', dtype=torch.float16, grad_fn=<SumBackward0>)
```

From the results for text_3 and text_4 we can draw a conclusion (admittedly not a rigorous one): it was the batch size, not the content of the other prompts, that affected the output of encode_text.
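
A likely reason (my own addition, not something I have verified inside CLIP itself) is that the model runs in float16 on CUDA, and the GPU may choose different matrix-multiplication kernels for different batch shapes, so the "same" row can pick up tiny rounding differences. A minimal, CLIP-free sketch of that effect (assumes a CUDA device is available):

```python
import torch

# Hypothetical illustration: in half precision the matmul kernel (and hence
# the accumulation order) can depend on the batch shape, so the same row may
# come out slightly different. The gap may be zero on some backends.
torch.manual_seed(0)
x = torch.randn(8, 512, device='cuda', dtype=torch.float16)
w = torch.randn(512, 512, device='cuda', dtype=torch.float16)

row_alone = x[:1] @ w        # the row computed as a batch of one
row_in_batch = (x @ w)[:1]   # the same row computed inside a larger batch

print((row_alone - row_in_batch).abs().max())
```

Either way, a difference around 7e-4 is on the order of float16 rounding noise, so comparing the features with a tolerance (e.g. numpy.allclose(feature_1_np, feature_2_np, atol=1e-2)) or via cosine similarity is the usual way to treat them as equal.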

If you'd like to further communicate, feel free to reach out to me at [email protected].
