Llama-3.2-Vision study

This repository contains the scripts and results for the Llama-3.2-Vision study. The study consists of two main components: qualitative testing and throughput testing.

See further details in the following sections:

Results

The Llama 11B model demonstrated reliable accuracy on simpler tasks, such as object detection, counting, and color identification, often providing responses that were direct and to the point. However, compared to the larger Llama 90B and other models like GPT-4o and Llava variants, the 11B model lacked descriptive detail and contextual awareness, making it less effective in generating nuanced interpretations of images.

  • Accurate at basic object detection, counting, and OCR with concise responses. (example)
  • Less effective at detailed scene understanding. (example)
  • Ideal for low-resource tasks needing quick, direct answers.

The Llama 90B model stood out for its rich and comprehensive descriptions, consistently offering more context and background information about the scenes depicted in the images. This level of detail often surpassed other models like GPT-4o and Llava, which either provided concise answers or occasionally struggled with specific object recognition tasks. The 90B model's ability to infer additional elements, such as environmental details and object positions, highlighted its superior scene understanding compared to its counterparts.

  • Excels in detailed scene analysis and context-rich descriptions. (example)
  • Stronger commonsense reasoning and nuanced interpretations. (example)
  • Best for tasks requiring deep understanding, despite higher resource needs.

For the Llama-3.2-11B-Vision model on a single H100 SXM GPU, the highest throughput we measured was 0.79 images per second, achieved at a batch size of 1.

Figure: Llama 3.2 11B Vision throughput with 1x NVIDIA H100 SXM Tensor Core GPU

For the Llama-3.2-90B-Vision model on 4 H100 SXM GPUs, the highest throughput we measured was 0.42 images per second, at a batch size of 16. Since we used 4 GPUs, the per-GPU throughput was 0.11 images per second, roughly 14% of the 11B model's throughput when normalizing by GPU count.

Figure: Llama 3.2 90B Vision throughput with 4x NVIDIA H100 SXM Tensor Core GPUs
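
The per-GPU comparison is simple arithmetic on the measured numbers; a quick sanity check in Python:

throughput_11b = 0.79               # images/sec, 1x H100 SXM, batch size 1
throughput_90b = 0.42               # images/sec, 4x H100 SXM, batch size 16

per_gpu_90b = throughput_90b / 4    # 0.105 images/sec per GPU (~0.11 rounded)
ratio = per_gpu_90b / throughput_11b  # ~0.13; using the rounded 0.11 gives ~0.14
print(f"{per_gpu_90b:.3f} img/s per GPU, {ratio:.0%} of the 11B model's throughput")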

Setup

Install the required dependencies by running:

pip install -r requirements.txt

Run qualitative testing:

python test_vision_tasks.py --model <model_name>

More detailed instructions can be found in the Qualitative testing README section.
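
For example, to run the qualitative tests against the 11B model (the Hugging Face model ID below is an assumption; see the Qualitative testing README section for the exact names the script accepts):

python test_vision_tasks.py --model meta-llama/Llama-3.2-11B-Vision-Instruct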

Run throughput testing:

python test_throughput.py --num-images 256 --model <model_name> --batch-size <batch_size>

More detailed instructions can be found in the Throughput testing README section.
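
For example, a run matching the 11B measurement above might look like this (the model ID is again an assumption):

python test_throughput.py --num-images 256 --model meta-llama/Llama-3.2-11B-Vision-Instruct --batch-size 1

Conceptually, the measurement reduces to timing batched inference over a fixed set of images. A minimal sketch, where generate_batch is a hypothetical stand-in for the script's actual batched model call:

import time

def measure_throughput(images, batch_size, generate_batch):
    # generate_batch(batch) is a hypothetical stand-in for the script's
    # batched inference call; only the timing structure is shown here.
    start = time.perf_counter()
    for i in range(0, len(images), batch_size):
        generate_batch(images[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return len(images) / elapsed  # images per second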
