
Salesforce-BLIP-2-opt-2-7b-vqa

Overview

The BLIP-2 model using OPT-2.7b (a large language model with 2.7 billion parameters) was presented in the paper "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models". BLIP-2 is a generic and efficient pre-training strategy that leverages off-the-shelf frozen pre-trained image encoders and large language models (LLMs) for vision-language pre-training (VLP). This model was first made available in this repository.

BLIP-2 consists of three components: a CLIP-like image encoder, a Querying Transformer (Q-Former), and a large language model. The authors initialize the image encoder and the large language model from pre-trained checkpoints and keep them frozen while training the Q-Former, a BERT-like Transformer encoder that maps a set of learnable "query tokens" to query embeddings. These query embeddings bridge the gap between the embedding space of the image encoder and that of the large language model.

The model is trained to predict the next text token given the query embeddings and the preceding text. This allows it to perform tasks such as image captioning, visual question answering (VQA), and chat-like conversation, using the image and the previous conversation as the prompt.
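As a point of reference for what the deployed endpoint does under the hood, below is a minimal local-inference sketch using the Hugging Face transformers implementation of BLIP-2 (Blip2Processor and Blip2ForConditionalGeneration with the Salesforce/blip2-opt-2.7b checkpoint). The image URL and the prompt are illustrative only.

import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

# Any RGB image works; this COCO image URL is just an example.
image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

# VQA-style prompting: the question followed by "Answer:".
prompt = "Question: what are people doing? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, dtype)

generated_ids = model.generate(**inputs, max_new_tokens=20)
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(answer)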

Limitations and Biases

BLIP2-OPT uses off-the-shelf OPT as the language model. It shares the same potential risks and limitations outlined in Meta's model card.

As with other large language models, for which the diversity (or lack thereof) of the training data has a downstream impact on model quality, OPT-175B has limitations in terms of bias and safety. OPT-175B can also have quality issues with generation diversity and hallucination. In general, OPT-175B is not immune to the many issues that plague modern large language models.

BLIP-2 was fine-tuned on image-text datasets collected from the internet, which raises concerns about the generation of inappropriate content and the replication of biases inherent in the underlying data. The model has not been tested in real-world applications, and direct deployment is discouraged. Researchers should carefully assess the model's safety and fairness in the specific deployment context before considering its use.

License

mit

Inference Samples

Inference type | Python sample (Notebook)                        | CLI with YAML
Real time      | visual-question-answering-online-endpoint.ipynb | visual-question-answering-online-endpoint.sh
Batch          | visual-question-answering-batch-endpoint.ipynb  | visual-question-answering-batch-endpoint.sh
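The notebooks above walk through deployment and scoring end to end. As a compact illustration of the real-time path only, here is a sketch using the azure.ai.ml (SDK v2) client to invoke an already-deployed managed online endpoint; the subscription, workspace, endpoint, and deployment names, as well as request.json, are placeholders.

import json
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# request.json follows the "Sample input" schema shown below.
response = ml_client.online_endpoints.invoke(
    endpoint_name="blip2-vqa-endpoint",        # placeholder endpoint name
    deployment_name="blip2-vqa-deployment",    # placeholder deployment name
    request_file="request.json",
)
print(json.loads(response))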

Sample input and output

Sample input

{
   "input_data":{
      "columns":[
         "image",
         "text"
      ],
      "index":[0, 1],
      "data":[
         ["image1", "What is in the picture? Answer: "],
         ["image2", "what are people doing? Answer: "]
      ]
   }
}

Note:

  • "image1" and "image2" should be publicly accessible urls or strings in base64 format.

Sample output

[
   {
      "text": "a stream in the desert"
   },
   {
      "text": "they're buying coffee"
   }
]

Visualization of inference result for a sample image

For the sample image below and the text prompt "what are people doing? Answer: ", the output text is "they're buying coffee".

[Sample image: Salesforce-BLIP2-vqa]

Version: 5

Tags

Preview license : mit task : visual-question-answering SharedComputeCapacityEnabled huggingface_model_id : Salesforce/blip2-opt-2.7b author : Salesforce hiddenlayerscanned inference_compute_allow_list : ['Standard_DS5_v2', 'Standard_D8a_v4', 'Standard_D8as_v4', 'Standard_D16a_v4', 'Standard_D16as_v4', 'Standard_D32a_v4', 'Standard_D32as_v4', 'Standard_D48a_v4', 'Standard_D48as_v4', 'Standard_D64a_v4', 'Standard_D64as_v4', 'Standard_D96a_v4', 'Standard_D96as_v4', 'Standard_FX4mds', 'Standard_FX12mds', 'Standard_F16s_v2', 'Standard_F32s_v2', 'Standard_F48s_v2', 'Standard_F64s_v2', 'Standard_F72s_v2', 'Standard_FX24mds', 'Standard_FX36mds', 'Standard_FX48mds', 'Standard_E4s_v3', 'Standard_E8s_v3', 'Standard_E16s_v3', 'Standard_E32s_v3', 'Standard_E48s_v3', 'Standard_E64s_v3', 'Standard_NC6s_v3', 'Standard_NC8as_T4_v3', 'Standard_NC12s_v3', 'Standard_NC16as_T4_v3', 'Standard_NC24s_v3', 'Standard_NC64as_T4_v3', 'Standard_NC24ads_A100_v4', 'Standard_NC48ads_A100_v4', 'Standard_NC96ads_A100_v4', 'Standard_ND96asr_v4', 'Standard_ND96amsr_A100_v4', 'Standard_ND40rs_v2']

View in Studio: https://ml.azure.com/registries/azureml/models/Salesforce-BLIP-2-opt-2-7b-vqa/version/5

License: mit

Properties

SharedComputeCapacityEnabled: True

SHA: 6e723d92ee91ebcee4ba74d7017632f11ff4217b

inference-min-sku-spec: 4|0|32|64

inference-recommended-sku: Standard_DS5_v2, Standard_D8a_v4, Standard_D8as_v4, Standard_D16a_v4, Standard_D16as_v4, Standard_D32a_v4, Standard_D32as_v4, Standard_D48a_v4, Standard_D48as_v4, Standard_D64a_v4, Standard_D64as_v4, Standard_D96a_v4, Standard_D96as_v4, Standard_FX4mds, Standard_FX12mds, Standard_F16s_v2, Standard_F32s_v2, Standard_F48s_v2, Standard_F64s_v2, Standard_F72s_v2, Standard_FX24mds, Standard_FX36mds, Standard_FX48mds, Standard_E4s_v3, Standard_E8s_v3, Standard_E16s_v3, Standard_E32s_v3, Standard_E48s_v3, Standard_E64s_v3, Standard_NC6s_v3, Standard_NC8as_T4_v3, Standard_NC12s_v3, Standard_NC16as_T4_v3, Standard_NC24s_v3, Standard_NC64as_T4_v3, Standard_NC24ads_A100_v4, Standard_NC48ads_A100_v4, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2
