Hi, I have checked the CLIP-Vision embedding (last hidden state) of BLIP-2 and InstructBLIP on Hugging Face (instructblip-vicuna-7b); its dimension is 257x1408. However, the multi-modal matching space of ViT-Lens uses a 1x768 dimension. I wonder how to use InstructBLIP and SEED for text and image generation directly. Have they been fine-tuned?
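For reference, here is a minimal sketch (not part of the original report) of how that 257x1408 last hidden state can be inspected with Hugging Face transformers; the model id and example image URL are illustrative assumptions:

```python
# Minimal sketch: inspect the CLIP-Vision (EVA ViT-g) last hidden state of InstructBLIP.
# Model id and image URL are illustrative assumptions, not from the thread.
import requests
import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

model_id = "Salesforce/instructblip-vicuna-7b"
processor = InstructBlipProcessor.from_pretrained(model_id)
model = InstructBlipForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
pixel_values = processor.image_processor(images=image, return_tensors="pt").pixel_values.to(torch.float16)

with torch.no_grad():
    vision_out = model.vision_model(pixel_values=pixel_values)

# Shape is (batch, 257, 1408): 256 patch tokens + 1 CLS token, hidden width 1408,
# which differs from the 1x768 pooled embedding used in ViT-Lens's matching space.
print(vision_out.last_hidden_state.shape)  # torch.Size([1, 257, 1408])
```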
For integration, we used the same ViT as InstructBLIP/SEED for ViT-Lens training. FYI, the ckpt currently on Hugging Face is for 3D integration. I will upload ckpts for more modalities and open-source this part in the near future.
Thanks a lot. It seems that only 'vitlensL_processors' is available now.
By the way, I notice that SEED-LLaMA has outperformed InstructBLIP in image captioning, so it may be more concise and better-performing to use SEED-LLaMA for both text and image generation.