I seem to get different results from encode_text when providing a text on its own as a single-item batch, versus as part of a larger batch.
See the code sample below:
```python
import clip
import numpy

device = 'cuda'
clip_model, _ = clip.load('ViT-B/32', device)
clip_model.eval()

# Process 'I am happy' on its own, as a single-item batch
text_1 = clip.tokenize('I am happy', truncate=True).to(device)
feature_1 = clip_model.encode_text(text_1)
feature_1_np = feature_1.detach().cpu().numpy()
text_1_np = text_1.detach().cpu().numpy()

# Process 'I am happy' as part of a batch of two
text_2 = clip.tokenize(['I am happy', 'I am happy'], truncate=True).to(device)
feature_2 = clip_model.encode_text(text_2)
feature_2_np = feature_2.detach().cpu().numpy()[0]
text_2_np = text_2.detach().cpu().numpy()[0]

print(f'Max diff in tokens {numpy.abs(text_2_np - text_1_np).max()}')
print(f'Max diff in features {numpy.abs(feature_2_np - feature_1_np).max()}')
```
When I run this I get the following results:
```
Max diff in tokens 0
Max diff in features 0.000732421875
```
Is this to be expected or am I using the code incorrectly?
Many thanks
I think your code is correct, and the result it gives is also correct.
To the best of my knowledge (I can't be certain it's 100% right), tokenization in the CLIP model just looks up each piece of text in a fixed vocabulary dict. In other words, if the input text is the same, the tokenizer will always return the same result. That's why text_1, text_2[0] and text_2[1] are exactly the same.
But encode_text starts from an nn.Embedding layer and the rest of the transformer, and the output of that forward pass may be affected by the batch_size, the context of the input, or something else like that.
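As a quick side check of the tokenizer point (a sketch of my own, separate from the test below, assuming the standard openai/CLIP package), the same string maps to the same token IDs no matter what else is in the batch:

```python
import torch
import clip

# clip.tokenize is a deterministic BPE lookup into a fixed vocabulary,
# so identical strings yield identical token IDs regardless of the batch.
single = clip.tokenize('I am happy', truncate=True)                 # shape [1, 77]
batched = clip.tokenize(['I am happy', 'I am sad'], truncate=True)  # shape [2, 77]
print(torch.equal(single[0], batched[0]))  # expected: True
```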
Here's a simple test:
```python
import torch
import numpy
import clip

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
clip_model, _ = clip.load('ViT-B/32', device)
clip_model.eval()

# The same sentence encoded alone and in batches of 2 and 3
text_1 = clip.tokenize('I am happy', truncate=True).to(device)
feature_1 = clip_model.encode_text(text_1)
text_2 = clip.tokenize(['I am happy', 'I am happy'], truncate=True).to(device)
feature_2 = clip_model.encode_text(text_2)
text_3 = clip.tokenize(['I am happy', 'I am happy', 'I am happy'], truncate=True).to(device)
feature_3 = clip_model.encode_text(text_3)
# Same batch size as text_3, but with different context (other sentences in the batch)
text_4 = clip.tokenize(['I am happy', 'I am sad', 'I am angry'], truncate=True).to(device)
feature_4 = clip_model.encode_text(text_4)

# Compare the token IDs and the text features of 'I am happy' across the batches
print((text_1 - text_2[0]).sum(), '\n', (text_1[0] - text_3[0]).sum(), '\n', (text_1[0] - text_4[0]).sum())
print((feature_1 - feature_2[0]).sum(), '\n', (feature_1[0] - feature_3[0]).sum(), '\n', (feature_1[0] - feature_4[0]).sum())
```
From the results for text_3 and text_4 we can jump to a conclusion (admittedly not a rigorous one): the context of the other sentences in the batch did not affect the output of encode_text; the batch size did.
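As a practical follow-up (my own rough sketch, continuing from the variables in the test above; the tolerance value is an assumption, not something defined by CLIP), you can check whether this small batch-size-dependent difference still matters after L2-normalisation, which is how CLIP features are normally compared:

```python
# Rough sketch: check that the discrepancy is negligible once features are L2-normalised.
f1 = feature_1.float() / feature_1.float().norm(dim=-1, keepdim=True)
f2 = feature_2[:1].float() / feature_2[:1].float().norm(dim=-1, keepdim=True)
print(torch.allclose(f1, f2, atol=1e-3))  # atol is an illustrative, assumed tolerance
print((f1 * f2).sum().item())             # cosine similarity; should be very close to 1.0
```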
If you'd like to further communicate, feel free to reach out to me at [email protected].