A collection of recent papers on Text-Image Self-supervised Learning.
- Masked Autoencoders Are Scalable Vision Learners (MAE)
- Learning Transferable Visual Models From Natural Language Supervision (CLIP)
- Zero-Shot Text-to-Image Generation (DALL-E)
- Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2)
- M3AE: Multimodal Masked Autoencoders Learn Transferable Representations
- Training Vision-Language Transformers from Captions Alone
- An Empirical Study of Training End-to-End Vision-and-Language Transformers
- MulT: An End-to-End Multitask Learning Transformer
- MultiMAE: Multi-modal Multi-task Masked Autoencoders
- How Much Can CLIP Benefit Vision-and-Language Tasks?
- ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
- VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
- SelfDoc: Self-Supervised Document Representation Learning
- L-Verse: Bidirectional Generation Between Image and Text
- Unsupervised Learning From Incomplete Measurements for Inverse Problems
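
Several entries above (MAE, M3AE, MultiMAE, VideoMAE) share the same core idea: mask a high proportion of input patches and train the model to reconstruct them. A minimal sketch of the random patch-masking step, assuming patch embeddings stored as a NumPy array (the function name, shapes, and 75% default ratio here are illustrative choices, not taken from any single paper's code):

```python
import numpy as np

def random_mask_patches(patches, mask_ratio=0.75, seed=0):
    """Randomly drop a fraction of patch embeddings, MAE-style.

    patches: (num_patches, dim) array.
    Returns the visible patches plus the kept and masked indices,
    so a decoder could later reinsert mask tokens at masked positions.
    """
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)          # random shuffle of patch indices
    keep_idx, mask_idx = perm[:n_keep], perm[n_keep:]
    return patches[keep_idx], keep_idx, mask_idx

# Example: 196 patches (a 14x14 grid) of dimension 8
patches = np.arange(196 * 8, dtype=np.float32).reshape(196, 8)
visible, keep_idx, mask_idx = random_mask_patches(patches)
print(visible.shape)  # (49, 8): only 25% of patches reach the encoder
```

Because only the visible patches are fed to the encoder, masking most of the input is also what makes these methods compute-efficient to pre-train.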