


Salesforce-BLIP-vqa-base

Overview

BLIP (Bootstrapping Language-Image Pre-training) is a new vision-language pre-training (VLP) framework designed for unified vision-language understanding and generation, and it transfers to a broader range of downstream tasks than existing methods. The framework makes two key contributions, one from the model perspective and one from the data perspective.

  1. BLIP incorporates the Multimodal Mixture of Encoder-Decoder (MED), a model architecture designed for effective multi-task pre-training and flexible transfer learning. MED is jointly pre-trained with three vision-language objectives: image-text contrastive learning, image-text matching, and image-conditioned language modeling (a toy sketch of how these objectives combine into a single loss follows this list).

  2. BLIP introduces Captioning and Filtering (CapFilt), a distinctive dataset bootstrapping method aimed at learning from noisy image-text pairs. The pre-trained MED is fine-tuned into a captioner that generates synthetic captions from web images, and a filter that removes noisy captions from both the original web texts and synthetic texts.
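
The sketch below is conceptual only, not the authors' implementation: it shows one way the three pre-training objectives named above could be combined into a single loss. All tensor names, shapes, and the temperature value are illustrative assumptions.

import torch
import torch.nn.functional as F

def blip_pretraining_loss(image_emb, text_emb, itm_logits, itm_labels,
                          lm_logits, lm_labels, temperature=0.07):
    """Toy combination of BLIP-style objectives (illustrative only).

    image_emb, text_emb : (batch, dim) L2-normalized global embeddings
    itm_logits          : (batch, 2) image-text matching scores from the fused encoder
    itm_labels          : (batch,) 1 = matched image-text pair, 0 = mismatched
    lm_logits           : (batch, seq_len, vocab) image-conditioned decoder outputs
    lm_labels           : (batch, seq_len) caption token ids (-100 = ignored position)
    """
    # 1) Image-text contrastive loss: matched pairs sit on the diagonal of the
    #    similarity matrix, so the target for row i (and column i) is index i.
    sim = image_emb @ text_emb.t() / temperature
    targets = torch.arange(sim.size(0), device=sim.device)
    itc = (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)) / 2

    # 2) Image-text matching loss: binary classification over fused features.
    itm = F.cross_entropy(itm_logits, itm_labels)

    # 3) Image-conditioned language modeling loss: predict the caption tokens.
    lm = F.cross_entropy(lm_logits.view(-1, lm_logits.size(-1)),
                         lm_labels.view(-1), ignore_index=-100)

    return itc + itm + lm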

The authors of BLIP report the following key observations based on extensive experiments and analysis. The collaboration between the captioner and the filter significantly enhances performance across diverse downstream tasks through caption bootstrapping, with greater diversity in the captions leading to larger gains. BLIP achieves state-of-the-art performance on a wide range of vision-language tasks, including image-text retrieval, image captioning, visual question answering, visual reasoning, and visual dialog. It also achieves state-of-the-art zero-shot performance when directly transferred to video-language tasks such as text-to-video retrieval and video question answering.

Researchers should carefully assess the safety and fairness of the model before deploying it in any real-world applications.

In the Visual Question Answering (VQA) task, the objective is to predict an answer given an image and a question. During fine-tuning, the pre-trained model is restructured so that the image-question pair is encoded into multimodal embeddings, which are then passed to an answer decoder. The VQA model is fine-tuned with the language modeling (LM) loss, using ground-truth answers as the targets. For more details on visual question answering with BLIP, see Section 5.3 of the original paper.
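
For local experimentation outside Azure ML, the same checkpoint can be loaded with the Hugging Face transformers library. The snippet below is a minimal sketch; the image URL is a placeholder.

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

# Load the processor and the VQA checkpoint from the Hugging Face Hub.
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

# Any RGB image works; the URL below is only an example placeholder.
image_url = "https://example.com/sample.jpg"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
question = "What is in the picture?"

# Encode the image-question pair and generate an answer with the decoder.
inputs = processor(image, question, return_tensors="pt")
outputs = model.generate(**inputs)
print(processor.decode(outputs[0], skip_special_tokens=True))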

License

BSD 3-Clause License

Inference Samples

Inference type | Python sample (Notebook)                        | CLI with YAML
Real time      | visual-question-answering-online-endpoint.ipynb | visual-question-answering-online-endpoint.sh
Batch          | visual-question-answering-batch-endpoint.ipynb  | visual-question-answering-batch-endpoint.sh

Sample input and output

Sample input

{
   "input_data":{
      "columns":[
         "image",
         "text"
      ],
      "index":[0, 1],
      "data":[
         ["image1", "What is in the picture?"],
         ["image2", "How many dogs are in the picture?"]
      ]
   }
}

Note:

  • "image1" and "image2" should be publicly accessible urls or strings in base64 format.

Sample output

[
   {
      "text": "sand"
   },
   {
      "text": "1"
   }
]

Visualization of inference result for a sample image

For the sample image below and the text prompt "What is in the picture?", the output text is "sand".

[Sample image]

Version: 6

Tags

Preview
license: bsd-3-clause
task: visual-question-answering
SharedComputeCapacityEnabled
huggingface_model_id: Salesforce/blip-vqa-base
author: Salesforce
hiddenlayerscanned
inference_compute_allow_list: ['Standard_DS2_v2', 'Standard_D2a_v4', 'Standard_D2as_v4', 'Standard_DS3_v2', 'Standard_D4a_v4', 'Standard_D4as_v4', 'Standard_DS4_v2', 'Standard_D8a_v4', 'Standard_D8as_v4', 'Standard_DS5_v2', 'Standard_D16a_v4', 'Standard_D16as_v4', 'Standard_D32a_v4', 'Standard_D32as_v4', 'Standard_D48a_v4', 'Standard_D48as_v4', 'Standard_D64a_v4', 'Standard_D64as_v4', 'Standard_D96a_v4', 'Standard_D96as_v4', 'Standard_F4s_v2', 'Standard_FX4mds', 'Standard_F8s_v2', 'Standard_FX12mds', 'Standard_F16s_v2', 'Standard_F32s_v2', 'Standard_F48s_v2', 'Standard_F64s_v2', 'Standard_F72s_v2', 'Standard_FX24mds', 'Standard_FX36mds', 'Standard_FX48mds', 'Standard_E2s_v3', 'Standard_E4s_v3', 'Standard_E8s_v3', 'Standard_E16s_v3', 'Standard_E32s_v3', 'Standard_E48s_v3', 'Standard_E64s_v3', 'Standard_NC4as_T4_v3', 'Standard_NC6s_v3', 'Standard_NC8as_T4_v3', 'Standard_NC12s_v3', 'Standard_NC16as_T4_v3', 'Standard_NC24s_v3', 'Standard_NC64as_T4_v3', 'Standard_NC24ads_A100_v4', 'Standard_NC48ads_A100_v4', 'Standard_NC96ads_A100_v4', 'Standard_ND96asr_v4', 'Standard_ND96amsr_A100_v4', 'Standard_ND40rs_v2']

View in Studio: https://ml.azure.com/registries/azureml/models/Salesforce-BLIP-vqa-base/version/6

License: bsd-3-clause

Properties

SharedComputeCapacityEnabled: True

SHA: 99909119248dc49e49cd698ad685b3b646595a38

inference-min-sku-spec: 2|0|7|14 (minimum of 2 vCPUs, 0 GPUs, 7 GB memory, 14 GB disk)

inference-recommended-sku: Standard_DS2_v2, Standard_D2a_v4, Standard_D2as_v4, Standard_DS3_v2, Standard_D4a_v4, Standard_D4as_v4, Standard_DS4_v2, Standard_D8a_v4, Standard_D8as_v4, Standard_DS5_v2, Standard_D16a_v4, Standard_D16as_v4, Standard_D32a_v4, Standard_D32as_v4, Standard_D48a_v4, Standard_D48as_v4, Standard_D64a_v4, Standard_D64as_v4, Standard_D96a_v4, Standard_D96as_v4, Standard_F4s_v2, Standard_FX4mds, Standard_F8s_v2, Standard_FX12mds, Standard_F16s_v2, Standard_F32s_v2, Standard_F48s_v2, Standard_F64s_v2, Standard_F72s_v2, Standard_FX24mds, Standard_FX36mds, Standard_FX48mds, Standard_E2s_v3, Standard_E4s_v3, Standard_E8s_v3, Standard_E16s_v3, Standard_E32s_v3, Standard_E48s_v3, Standard_E64s_v3, Standard_NC4as_T4_v3, Standard_NC6s_v3, Standard_NC8as_T4_v3, Standard_NC12s_v3, Standard_NC16as_T4_v3, Standard_NC24s_v3, Standard_NC64as_T4_v3, Standard_NC24ads_A100_v4, Standard_NC48ads_A100_v4, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2
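
As a hedged sketch of deploying this registry model to a managed online endpoint on one of the recommended SKUs: the endpoint and deployment names below are placeholders, and ml_client is assumed to be an authenticated MLClient as in the scoring sketch above.

from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment

# Reference the model directly from the azureml registry (version 6, as listed above).
model_id = "azureml://registries/azureml/models/Salesforce-BLIP-vqa-base/versions/6"

# Create the endpoint (placeholder name).
endpoint = ManagedOnlineEndpoint(name="blip-vqa-endpoint", auth_mode="key")
ml_client.begin_create_or_update(endpoint).result()

# Deploy the model on a SKU from the recommended list above.
deployment = ManagedOnlineDeployment(
    name="blip-vqa-deployment",
    endpoint_name="blip-vqa-endpoint",
    model=model_id,
    instance_type="Standard_DS3_v2",
    instance_count=1,
)
ml_client.begin_create_or_update(deployment).result()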
