facebook-deit-base-patch16-224
DeiT (Data-efficient image Transformers) is an image transformer that does not require very large amounts of data for training. This is achieved through a novel distillation procedure based on a teacher-student strategy, which results in high throughput and accuracy. DeiT is pre-trained and fine-tuned on ImageNet-1k (1 million images, 1,000 classes) at a resolution of 224x224. The model was first released in this repository, but the weights were converted to PyTorch from the timm repository by Ross Wightman.
An image is treated as a sequence of fixed-size patches (16x16 pixels for this model), which is processed by a standard Transformer encoder of the kind used in NLP. The patches are linearly embedded, and a [CLS] token is prepended to the sequence for classification tasks. Absolute position embeddings are added before the sequence is fed to the Transformer encoder. Through pre-training, the model learns an inner representation of images that can be used to extract features useful for downstream tasks. For instance, given a dataset of labeled images, a linear layer can be placed on top of the pre-trained encoder to train a standard classifier.
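The patch-to-token pipeline described above can be illustrated with a toy, pure-Python sketch. It is not the model's implementation: the real model uses a learned 16x16 projection into a 768-dimensional space, while the function names, the tiny 4x4 image, and the projection, [CLS], and position-embedding values below are all hypothetical placeholders chosen so the arithmetic is easy to follow.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches.

    Patches are emitted in raster order: left to right, then top to bottom.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in image[top:top + patch_size]:
                for pixel in row[left:left + patch_size]:
                    flat.extend(pixel)  # append every channel value of this pixel
            patches.append(flat)
    return patches


def linear_embed(patch, weights, bias):
    """Project one flattened patch to a D-dim token (weights: D x P, bias: length D)."""
    return [sum(w * x for w, x in zip(row, patch)) + b for row, b in zip(weights, bias)]


def build_sequence(image, patch_size, weights, bias, cls_token, pos_embed):
    """Embed the patches, prepend the [CLS] token, and add absolute position embeddings."""
    tokens = [cls_token] + [linear_embed(p, weights, bias) for p in patchify(image, patch_size)]
    if len(pos_embed) != len(tokens):
        raise ValueError("need one position embedding per token, including [CLS]")
    return [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(tokens, pos_embed)]


def sequence_length(image_size, patch_size):
    """Tokens seen by the encoder: (image_size / patch_size)^2 patches plus one [CLS] token."""
    per_side = image_size // patch_size
    return per_side * per_side + 1


# Toy usage: 4x4 single-channel image, 2x2 patches, embedding size D=2 (all values hypothetical).
image = [[[4 * r + c] for c in range(4)] for r in range(4)]
weights = [[1, 0, 0, 0], [0, 0, 0, 1]]  # D x P projection matrix, P = 2 * 2 * 1
sequence = build_sequence(image, 2, weights, [0, 0], [0, 0], [[i, i] for i in range(5)])
print(len(sequence))             # 5: four patch tokens plus [CLS]
print(sequence_length(224, 16))  # 197: 14 x 14 = 196 patches plus [CLS]
```

At 224x224 with 16x16 patches, the encoder therefore sees 197 tokens; the sequence grows quadratically with resolution, which is one reason higher-resolution inputs cost more compute.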
For more details on DeiT, review the original paper.
The DeiT model is pre-trained and fine-tuned on ImageNet 2012, which consists of roughly 1 million images in 1,000 classes, at a resolution of 224x224.
In the preprocessing step, images are resized to a common resolution of 224x224. Augmentations such as RandAugment and random erasing are used. For more details on the transformations used during training and validation, refer to this link. At inference time, images are resized to 256x256, center-cropped to 224x224, and then normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
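The inference-time crop and normalization can be sketched as plain arithmetic. This is a minimal illustration, not the actual preprocessing code: the helper names are hypothetical, and it assumes 8-bit pixel values are first scaled to [0, 1] by dividing by 255 (common practice, but not stated on this card) before the mean and standard deviation listed above are applied.

```python
def center_crop_box(width, height, crop_size=224):
    """Return (left, top, right, bottom) of a centered crop_size x crop_size window."""
    if crop_size > width or crop_size > height:
        raise ValueError("crop is larger than the image")
    left = (width - crop_size) // 2
    top = (height - crop_size) // 2
    return left, top, left + crop_size, top + crop_size


def normalize_rgb(pixel, mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)):
    """Scale 8-bit RGB values to [0, 1] (assumed step), then apply (x - mean) / std per channel."""
    return tuple((value / 255.0 - m) / s for value, m, s in zip(pixel, mean, std))


print(center_crop_box(256, 256))   # (16, 16, 240, 240)
print(normalize_rgb((0, 255, 0)))  # (-1.0, 1.0, -1.0)
```

With a mean and standard deviation of 0.5 on every channel, this normalization maps the [0, 255] pixel range onto [-1, 1].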
The model was trained on a single 8-GPU node for 3 days at a training resolution of 224x224. For more details on hyperparameters, refer to Table 9 of the original paper.
For more details on pre-training (ImageNet-1k) followed by supervised fine-tuning (ImageNet-1k), refer to Sections 2 to 5 of the original paper.
The DeiT-base model achieves a top-1 accuracy of 81.8% and a top-5 accuracy of 95.6% on ImageNet with 86M parameters at an image size of 224x224. For DeiT image classification benchmark results, refer to Table 5 of the original paper.
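The 86M figure can be sanity-checked by counting parameters layer by layer. This sketch assumes the standard ViT-Base configuration that DeiT-base shares (hidden size 768, 12 encoder blocks, MLP width 3,072, a single [CLS] token, and a 1,000-class linear head); these values are not listed on this card. The number of attention heads does not affect the count, because the heads split the same query/key/value projections.

```python
def vit_param_count(image_size=224, patch_size=16, channels=3, dim=768,
                    depth=12, mlp_dim=3072, num_classes=1000):
    """Count trainable parameters of a ViT-style classifier with a single [CLS] token."""
    num_patches = (image_size // patch_size) ** 2
    patch_embed = channels * patch_size * patch_size * dim + dim  # patch projection weights + bias
    cls_token = dim
    pos_embed = (num_patches + 1) * dim                           # one embedding per token
    per_block = (
        2 * dim                          # LayerNorm before attention (scale + shift)
        + dim * 3 * dim + 3 * dim        # query, key, and value projections
        + dim * dim + dim                # attention output projection
        + 2 * dim                        # LayerNorm before the MLP
        + dim * mlp_dim + mlp_dim        # MLP expansion layer
        + mlp_dim * dim + dim            # MLP contraction layer
    )
    final_norm = 2 * dim
    head = dim * num_classes + num_classes  # linear classifier on the [CLS] output
    return patch_embed + cls_token + pos_embed + depth * per_block + final_norm + head


print(vit_param_count())  # 86567656, i.e. about 86.6M
```

Almost all of the parameters sit in the 12 encoder blocks (about 7.09M each); the embeddings and classifier head contribute roughly 1.5M together.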
Note that fine-tuning at a higher resolution yields better performance, and increasing the model size also improves performance.
License: apache-2.0
Inference samples

Inference type | Python sample (Notebook) | CLI with YAML |
---|---|---|
Real time | image-classification-online-endpoint.ipynb | image-classification-online-endpoint.sh |
Batch | image-classification-batch-endpoint.ipynb | image-classification-batch-endpoint.sh |
Fine-tuning samples

Task | Use case | Dataset | Python sample (Notebook) | CLI with YAML |
---|---|---|---|---|
Image Multi-class classification | Image Multi-class classification | fridgeObjects | fridgeobjects-multiclass-classification.ipynb | fridgeobjects-multiclass-classification.sh |
Image Multi-label classification | Image Multi-label classification | multilabel fridgeObjects | fridgeobjects-multilabel-classification.ipynb | fridgeobjects-multilabel-classification.sh |
Model evaluation samples

Task | Use case | Dataset | Python sample (Notebook) |
---|---|---|---|
Image Multi-class classification | Image Multi-class classification | fridgeObjects | image-multiclass-classification.ipynb |
Image Multi-label classification | Image Multi-label classification | multilabel fridgeObjects | image-multilabel-classification.ipynb |
Sample input

```json
{
  "input_data": ["image1", "image2"]
}
```
Note: the "image1" and "image2" strings should be base64-encoded images or publicly accessible URLs.
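A request body in the format shown above can be assembled with the standard library. This is a minimal sketch: the helper names are hypothetical, the body shape simply mirrors this card's sample input, and sending the request (endpoint URL, authentication headers) is not covered here. Raw bytes are base64-encoded, while strings are assumed to be public URLs and passed through unchanged.

```python
import base64
import json


def encode_image(image_bytes):
    """Return raw image file bytes as a base64 text string."""
    return base64.b64encode(image_bytes).decode("ascii")


def build_request_body(images):
    """Build the JSON body: bytes are base64-encoded, strings are passed through as URLs."""
    entries = [encode_image(img) if isinstance(img, bytes) else img for img in images]
    return json.dumps({"input_data": entries})


# Hypothetical inputs: stand-in bytes (in practice, open(path, "rb").read()) and an example URL.
print(build_request_body([b"hello", "https://example.com/fridge.jpg"]))
# {"input_data": ["aGVsbG8=", "https://example.com/fridge.jpg"]}
```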
Sample output

```json
[
  [
    {
      "label": "can",
      "score": 0.91
    },
    {
      "label": "carton",
      "score": 0.09
    }
  ],
  [
    {
      "label": "carton",
      "score": 0.9
    },
    {
      "label": "can",
      "score": 0.1
    }
  ]
]
```
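A response in the format above (one list of label/score pairs per input image) can be reduced to a top prediction per image, or scored against known labels with the same top-1/top-k accuracy definition used for the ImageNet results. This is a sketch with hypothetical helper names, and the ground-truth labels in the usage are made up for illustration.

```python
import json


def ranked_labels(per_image):
    """Sort one image's predictions by descending score and return the labels."""
    ranked = sorted(per_image, key=lambda item: item["score"], reverse=True)
    return [item["label"] for item in ranked]


def top_predictions(response_text):
    """Return the highest-scoring label for each image in the response."""
    return [ranked_labels(per_image)[0] for per_image in json.loads(response_text)]


def top_k_accuracy(response_text, true_labels, k=1):
    """Fraction of images whose true label appears among their k highest-scoring labels."""
    results = json.loads(response_text)
    if len(results) != len(true_labels):
        raise ValueError("need exactly one true label per image")
    hits = sum(
        1
        for per_image, truth in zip(results, true_labels)
        if truth in ranked_labels(per_image)[:k]
    )
    return hits / len(true_labels)


# The sample response from this card, with hypothetical ground-truth labels.
response = '''[[{"label": "can", "score": 0.91}, {"label": "carton", "score": 0.09}],
               [{"label": "carton", "score": 0.9}, {"label": "can", "score": 0.1}]]'''
truth = ["can", "can"]
print(top_predictions(response))             # ['can', 'carton']
print(top_k_accuracy(response, truth, k=1))  # 0.5
print(top_k_accuracy(response, truth, k=2))  # 1.0
```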
Version: 19
Preview
huggingface_model_id : facebook/deit-base-patch16-224
license : apache-2.0
model_specific_defaults : ordereddict({'apply_deepspeed': 'true', 'apply_ort': 'true'})
task : image-classification
hiddenlayerscanned
training_dataset : imagenet-1k
SharedComputeCapacityEnabled
author : Meta
inference_compute_allow_list : ['Standard_DS3_v2', 'Standard_D4a_v4', 'Standard_D4as_v4', 'Standard_DS4_v2', 'Standard_D8a_v4', 'Standard_D8as_v4', 'Standard_DS5_v2', 'Standard_D16a_v4', 'Standard_D16as_v4', 'Standard_D32a_v4', 'Standard_D32as_v4', 'Standard_D48a_v4', 'Standard_D48as_v4', 'Standard_D64a_v4', 'Standard_D64as_v4', 'Standard_D96a_v4', 'Standard_D96as_v4', 'Standard_FX4mds', 'Standard_F8s_v2', 'Standard_FX12mds', 'Standard_F16s_v2', 'Standard_F32s_v2', 'Standard_F48s_v2', 'Standard_F64s_v2', 'Standard_F72s_v2', 'Standard_FX24mds', 'Standard_FX36mds', 'Standard_FX48mds', 'Standard_E2s_v3', 'Standard_E4s_v3', 'Standard_E8s_v3', 'Standard_E16s_v3', 'Standard_E32s_v3', 'Standard_E48s_v3', 'Standard_E64s_v3', 'Standard_NC4as_T4_v3', 'Standard_NC6s_v3', 'Standard_NC8as_T4_v3', 'Standard_NC12s_v3', 'Standard_NC16as_T4_v3', 'Standard_NC24s_v3', 'Standard_NC64as_T4_v3', 'Standard_NC24ads_A100_v4', 'Standard_NC48ads_A100_v4', 'Standard_NC96ads_A100_v4', 'Standard_ND96asr_v4', 'Standard_ND96amsr_A100_v4', 'Standard_ND40rs_v2']
evaluation_compute_allow_list : ['Standard_NC4as_T4_v3', 'Standard_NC6s_v3', 'Standard_NC8as_T4_v3', 'Standard_NC12s_v3', 'Standard_NC16as_T4_v3', 'Standard_NC24s_v3', 'Standard_NC64as_T4_v3', 'Standard_NC96ads_A100_v4', 'Standard_ND96asr_v4', 'Standard_ND96amsr_A100_v4', 'Standard_ND40rs_v2']
finetune_compute_allow_list : ['Standard_NC4as_T4_v3', 'Standard_NC6s_v3', 'Standard_NC8as_T4_v3', 'Standard_NC12s_v3', 'Standard_NC16as_T4_v3', 'Standard_NC24s_v3', 'Standard_NC64as_T4_v3', 'Standard_NC96ads_A100_v4', 'Standard_ND96asr_v4', 'Standard_ND96amsr_A100_v4', 'Standard_ND40rs_v2']
View in Studio: https://ml.azure.com/registries/azureml/models/facebook-deit-base-patch16-224/version/19
License: apache-2.0
SharedComputeCapacityEnabled: True
SHA: fb2c78a54a5637dec350432794f7b93e31f910c9
evaluation-min-sku-spec: 4|1|28|176
evaluation-recommended-sku: Standard_NC4as_T4_v3, Standard_NC6s_v3, Standard_NC8as_T4_v3, Standard_NC12s_v3, Standard_NC16as_T4_v3, Standard_NC24s_v3, Standard_NC64as_T4_v3, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2
finetune-min-sku-spec: 4|1|28|176
finetune-recommended-sku: Standard_NC4as_T4_v3, Standard_NC6s_v3, Standard_NC8as_T4_v3, Standard_NC12s_v3, Standard_NC16as_T4_v3, Standard_NC24s_v3, Standard_NC64as_T4_v3, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2
finetuning-tasks: image-classification
inference-min-sku-spec: 2|0|14|28
inference-recommended-sku: Standard_DS3_v2, Standard_D4a_v4, Standard_D4as_v4, Standard_DS4_v2, Standard_D8a_v4, Standard_D8as_v4, Standard_DS5_v2, Standard_D16a_v4, Standard_D16as_v4, Standard_D32a_v4, Standard_D32as_v4, Standard_D48a_v4, Standard_D48as_v4, Standard_D64a_v4, Standard_D64as_v4, Standard_D96a_v4, Standard_D96as_v4, Standard_FX4mds, Standard_F8s_v2, Standard_FX12mds, Standard_F16s_v2, Standard_F32s_v2, Standard_F48s_v2, Standard_F64s_v2, Standard_F72s_v2, Standard_FX24mds, Standard_FX36mds, Standard_FX48mds, Standard_E2s_v3, Standard_E4s_v3, Standard_E8s_v3, Standard_E16s_v3, Standard_E32s_v3, Standard_E48s_v3, Standard_E64s_v3, Standard_NC4as_T4_v3, Standard_NC6s_v3, Standard_NC8as_T4_v3, Standard_NC12s_v3, Standard_NC16as_T4_v3, Standard_NC24s_v3, Standard_NC64as_T4_v3, Standard_NC24ads_A100_v4, Standard_NC48ads_A100_v4, Standard_NC96ads_A100_v4, Standard_ND96asr_v4, Standard_ND96amsr_A100_v4, Standard_ND40rs_v2