VLMs zero-to-hero

coming: january 2025...

hello

Welcome to VLMs Zero to Hero! This series will take you on a journey from the fundamentals of NLP and Computer Vision to the cutting edge of Vision-Language Models.

tutorials

notebook	open in colab	video	paper
01.01. Word2Vec: Distributed Representations of Words and Phrases and their Compositionality	link	soon	link

roadmap

natural language processing (NLP) fundamentals

Word2Vec: Efficient Estimation of Word Representations in Vector Space (2013) and Distributed Representations of Words and Phrases and their Compositionality (2013)
Seq2Seq: Sequence to Sequence Learning with Neural Networks (2014)
Attention Is All You Need (2017)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)
GPT: Improving Language Understanding by Generative Pre-Training (2018)

computer vision (CV) fundamentals

AlexNet: ImageNet Classification with Deep Convolutional Neural Networks (2012)
VGG: Very Deep Convolutional Networks for Large-Scale Image Recognition (2014)
ResNet: Deep Residual Learning for Image Recognition (2015)

early vision-language models

Show and Tell: A Neural Image Caption Generator (2014) and Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (2015)
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2020)
CLIP: Learning Transferable Visual Models from Natural Language Supervision (2021)

scale and efficiency

Scaling Laws for Neural Language Models (2020)
LoRA: Low-Rank Adaptation of Large Language Models (2021)
QLoRA: Efficient Finetuning of Quantized LLMs (2023)

modern vision-language models

Flamingo: A Visual Language Model for Few-Shot Learning (2022)
LLaVA: Visual Instruction Tuning (2023)
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (2023)
PaliGemma: A versatile 3B VLM for transfer (2024)

extra

BLEU: a Method for Automatic Evaluation of Machine Translation (2002)

contribute and suggest more papers

Are there important papers, models, or techniques we missed? Do you have a favorite breakthrough in vision-language research that isn't listed here? We’d love to hear your suggestions!
