Hi, I have checked the CLIP-Vision embedding (last hidden state) of BLIP-2 and InstructBLIP on Hugging Face (instructblip-vicuna-7b); its dimension is 257x1408. However, the multi-modal matching space of ViT-Lens uses a 1x768 dimension. I wonder how to use InstructBLIP and SEED for text and image generation directly. Have they been fine-tuned?
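For reference, here is a minimal sketch (not part of the original report) of how that 257x1408 last hidden state can be inspected with Hugging Face transformers; the model id and example image URL are illustrative assumptions:

```python
# Minimal sketch: inspect the CLIP-Vision (EVA ViT-g) last hidden state of InstructBLIP.
# Model id and image URL are illustrative assumptions, not from the thread.
import requests
import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

model_id = "Salesforce/instructblip-vicuna-7b"
processor = InstructBlipProcessor.from_pretrained(model_id)
model = InstructBlipForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example image
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
pixel_values = processor.image_processor(images=image, return_tensors="pt").pixel_values.to(torch.float16)

with torch.no_grad():
    vision_out = model.vision_model(pixel_values=pixel_values)

# Shape is (batch, 257, 1408): 256 patch tokens + 1 CLS token, hidden width 1408,
# which differs from the 1x768 pooled embedding used in ViT-Lens's matching space.
print(vision_out.last_hidden_state.shape)  # torch.Size([1, 257, 1408])
```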
For integration, we used the same ViT as InstructBLIP/SEED for ViT-Lens training. FYI, the ckpt currently on Hugging Face is for 3D integration. I will upload ckpts for more modalities and open-source this part in the near future.
Thanks a lot. It seems that only 'vitlensL_processors' is available now.
By the way, I notice that SEED-LLaMA has outperformed InstructBLIP in image captioning, so it may be more concise and better-performing to use SEED-LLaMA for both text and image generation.