
InstructBLIP and SEED Implementation #14

Open
MichaelMaiii opened this issue Mar 10, 2024 · 2 comments

Comments

@MichaelMaiii

Hi, I have checked the CLIP-Vision embedding (last hidden state) of BLIP-2/InstructBLIP on Hugging Face (instructblip-vicuna-7b); its dimension is 257x1408. However, the multi-modal matching space of ViT-Lens is 1x768. I wonder how InstructBLIP and SEED can be used directly for text and image generation. Have they been fine-tuned?
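
For reference, a minimal sketch of how that 257x1408 last hidden state can be inspected with the `transformers` InstructBLIP classes and the `Salesforce/instructblip-vicuna-7b` checkpoint mentioned above (illustrative only and not part of ViT-Lens; loading the full model is memory-heavy):

```python
import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

model_id = "Salesforce/instructblip-vicuna-7b"
processor = InstructBlipProcessor.from_pretrained(model_id)
model = InstructBlipForConditionalGeneration.from_pretrained(model_id)  # large download

image = Image.new("RGB", (224, 224))  # dummy 224x224 image; replace with a real one
inputs = processor(images=image, text="Describe the image.", return_tensors="pt")

with torch.no_grad():
    # Run only the ViT vision tower and take its last hidden state
    vision_out = model.vision_model(pixel_values=inputs["pixel_values"])

print(vision_out.last_hidden_state.shape)  # expected: torch.Size([1, 257, 1408])
```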

@StanLei52
Collaborator

StanLei52 commented Mar 10, 2024

Hi, please check #11.

For integration, we used the same ViT as in InstructBLIP/SEED for ViT-Lens training. FYI, the checkpoint on Hugging Face is for 3D integration. I will upload checkpoints for more modalities and open-source this part in the near future.

@MichaelMaiii
Author

Thanks a lot. It seems that only 'vitlensL_processors' is available now.

By the way, I notice that SEED-LLaMA outperforms InstructBLIP on image captioning, so it may be simpler and better-performing to use SEED-LLaMA for both text and image generation.
