University of Tehran | Department of Electrical and Computer Engineering
Course: Deep Generative Models | Instructor: Dr. Mostafa Tavasoli | Term: Fall 1403
Author: Taha Majlesi
Email: [email protected] | [email protected]
Profiles: LinkedIn | GitHub | Hugging Face
- Introduction
- Course Information
- Assignment Details
- Sections Overview
- Implementation Details
- Mathematical Derivations
- Training and Experimentation
- Results and Analysis
- Submission Guidelines
- License
- Project Structure
This repository contains Homework 4 for the Deep Generative Models course at the University of Tehran. The assignment explores cutting-edge generative models, focusing on:
- Vision-Language Models (VLMs), particularly PaliGemma
- Fine-tuning large-scale models for Image-Question Answering (IQA)
- Evaluating generative models using ROUGE Score
- Flow Matching for continuous-time generative modeling
- Optimal Transport in generative models
This assignment provides both theoretical and practical components, allowing students to explore state-of-the-art generative techniques.
- University: University of Tehran
- Department: Electrical and Computer Engineering
- Course: Deep Generative Models
- Instructor: Dr. Mostafa Tavasoli
- Term: Fall 1403
This homework consists of two major sections:

Section 1: Vision-Language Models
- Understanding multimodal learning (vision + language)
- Fine-tuning PaliGemma for image-based question answering
- Optimizing memory usage with LoRA and QLoRA
- Evaluating models using ROUGE Score

Section 2: Flow Matching
- Mathematical derivation of Flow Matching
- Understanding Optimal Transport in flow-based generative models
- Implementing Continuous Normalizing Flows (CNFs) for data generation
VLMs integrate image and text to perform tasks such as image-based question answering, caption generation, and visual reasoning.
- Understanding Vision-Language Models (VLMs)
- Explain how PaliGemma differs from standard text-based models.
- Compare PaliGemma, DALL·E, and Imagen architectures.
- Fine-Tuning PaliGemma for Image-Question Answering
- Fine-tune PaliGemma-3B using the CLEVR dataset.
- Utilize LoRA and QLoRA for memory-efficient fine-tuning.
- Evaluating Performance using ROUGE Score
- Compute ROUGE Score for evaluating generated text responses.
- Compare model performance before and after fine-tuning.
- Memory Optimization in Fine-Tuning
- Compare full fine-tuning vs. LoRA fine-tuning in memory usage.
- Explain the benefits of Quantization (NF4 datatype) in reducing model size.
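As a rough illustration of why LoRA is memory-efficient, the sketch below freezes a weight matrix W and trains only a low-rank update B @ A. It is pure NumPy; the layer size d_in = d_out = 2048 and rank r = 8 are made-up numbers for illustration, not PaliGemma's actual dimensions.

```python
import numpy as np

# Minimal LoRA sketch (illustrative, not the PaliGemma implementation):
# instead of updating the full weight matrix W (d_out x d_in), LoRA learns
# a low-rank update B @ A with rank r << min(d_out, d_in). Only A and B
# are trained, so the trainable parameter count drops dramatically.

d_in, d_out, r = 2048, 2048, 8   # hypothetical layer size and LoRA rank

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))       # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                     # trainable up-projection (init 0)

x = rng.standard_normal(d_in)
# Forward pass: base output plus low-rank correction (zero at initialization)
y = W @ x + B @ (A @ x)

full_params = d_out * d_in
lora_params = r * d_in + d_out * r
print(f"full fine-tuning: {full_params:,} trainable parameters")
print(f"LoRA (r={r}):     {lora_params:,} trainable parameters")
print(f"reduction:        {full_params / lora_params:.0f}x")
```

QLoRA goes one step further by also storing the frozen W in a 4-bit NF4 format, shrinking the dominant memory cost while the small A and B matrices stay in higher precision.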
Flow Matching models use continuous-time transformations to map a simple distribution (e.g., Gaussian noise) to a complex data distribution.
- Mathematical Analysis of Flow Matching
- Derive the Flow Matching equation.
- Explain why Flow Matching avoids iterative sampling.
- Optimal Transport in Flow Matching
- Describe how Optimal Transport improves Flow Matching.
- Compare Flow Matching to Diffusion Models.
- Implementing Flow Matching Models
- Implement a Flow Matching generative model.
- Train the model using ODE-based continuous transformations.
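One standard formulation that these tasks typically build on (shown here as a reference point; the assignment may use different notation) is conditional flow matching with straight interpolation paths between a noise sample x_0 and a data sample x_1:

```latex
% Linear interpolation path between noise x_0 \sim \mathcal{N}(0, I) and data x_1
x_t = (1 - t)\,x_0 + t\,x_1, \qquad t \in [0, 1]

% Target (conditional) velocity field along this path
u_t(x_t \mid x_0, x_1) = \frac{d x_t}{dt} = x_1 - x_0

% Conditional Flow Matching objective: regress a network v_\theta onto u_t
\mathcal{L}_{\mathrm{CFM}}(\theta)
  = \mathbb{E}_{t,\,x_0,\,x_1}
    \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2
```

Because generation only requires solving the deterministic ODE dx/dt = v_theta(x, t) from t = 0 to t = 1, sampling is a single ODE solve rather than the iterative stochastic denoising of diffusion models; choosing the (x_0, x_1) coupling via optimal transport further straightens the paths, which is the link to the Optimal Transport questions above.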
- Vision-Language Models: CLEVR dataset (for image-question answering).
- Flow Matching Models: synthetic data with optimal transport properties.
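The synthetic-data experiment can be sketched end to end under toy assumptions: 1-D data, an arbitrarily chosen data distribution N(3, 1), and a deliberately tiny linear velocity model v_theta; none of this is the assignment's required architecture. Noise and data are paired through the optimal transport map between the two Gaussians, which here is just a shift, so the target velocity along each straight path is constant:

```python
import numpy as np

# Toy flow matching sketch in 1-D (illustrative, not the assignment's model).
# Prior: x0 ~ N(0, 1); data: x1 ~ N(3, 1). The optimal transport map between
# these two Gaussians is the shift x1 = x0 + 3, so pairing samples through it
# gives straight paths with constant target velocity x1 - x0 = 3.

rng = np.random.default_rng(0)
w = np.zeros(3)  # linear velocity model v_theta(x, t) = w[0]*x + w[1]*t + w[2]
lr = 0.05

def v_theta(x, t, w):
    return w[0] * x + w[1] * t + w[2]

losses = []
for step in range(2000):
    x0 = rng.standard_normal(256)       # noise samples
    x1 = x0 + 3.0                       # data samples, OT-coupled to the noise
    t = rng.uniform(0.0, 1.0, 256)
    xt = (1.0 - t) * x0 + t * x1        # straight interpolation path
    target = x1 - x0                    # velocity along the path
    err = v_theta(xt, t, w) - target
    losses.append(float(np.mean(err ** 2)))
    # Gradient of the mean-squared flow matching loss w.r.t. w
    grad = 2.0 * np.array([np.mean(err * xt), np.mean(err * t), np.mean(err)])
    w -= lr * grad

# Sampling: one ODE solve dx/dt = v_theta(x, t) from t=0 to t=1 (Euler steps),
# instead of the iterative stochastic denoising used by diffusion models.
x = rng.standard_normal(1000)
n_steps = 100
for i in range(n_steps):
    t_i = np.full_like(x, i / n_steps)
    x = x + v_theta(x, t_i, w) / n_steps

print(f"final loss {losses[-1]:.4f}, sample mean {x.mean():.2f} (target 3.0)")
```

With the OT coupling the regression target is exact, so the loss drops to near zero and the integrated samples land on the shifted Gaussian; with an independent coupling the same loop still works, but the learned paths are noisier and a richer model for v_theta is needed.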
| Component | Details |
| --- | --- |
| Base Model | PaliGemma-3B |
| Optimizer | AdamW |
| Fine-Tuning | LoRA, QLoRA |
| ROUGE Evaluation | Yes |
| Parameter | Value |
| --- | --- |
| Learning Rate | 0.0002 |
| Batch Size | 64 |
| Training Steps | 100,000 |
- Flow Matching Equations
- Show how Flow Matching avoids iterative sampling in normalizing flows.
- Why Use Optimal Transport?
- Explain how Optimal Transport improves generative performance.
- Computing ROUGE Score
- ROUGE measures text similarity between generated responses and ground truth:

$$\text{ROUGE} = \frac{\text{overlapping words}}{\text{total words in reference}}$$
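The formula above is unigram ROUGE recall (ROUGE-1). A minimal pure-Python sketch follows; in practice a library implementation (e.g. Google's rouge-score package, which also handles stemming and F-measure) would be used, and the question/answer strings below are hypothetical CLEVR-style examples:

```python
from collections import Counter

def rouge_1_recall(generated: str, reference: str) -> float:
    """Unigram ROUGE recall: overlapping words / total words in reference."""
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    # Clip each word's overlap count at its frequency in the generated text
    overlap = sum(min(gen[w], ref[w]) for w in ref)
    return overlap / sum(ref.values()) if ref else 0.0

# Hypothetical CLEVR-style answer pair
reference = "there are two red cubes"
generated = "there are two blue cubes"
print(rouge_1_recall(generated, reference))  # 4 of 5 reference words overlap -> 0.8
```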
- Fine-tune PaliGemma and evaluate performance on the CLEVR dataset.
- Compare LoRA vs. QLoRA for efficient fine-tuning.
- Train a Flow Matching model and evaluate generated samples.
This project is licensed under the MIT License.
For more details, see the LICENSE file.