Image Caption Generator

This project uses LLaVA (Large Language-and-Vision Assistant), an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding.

LLaVA generates a description of the image, and that description is then fed to llama3 to generate the image's caption.
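
Under the hood this is two sequential model calls, which can be sketched with the ollama Python client. This is a minimal sketch, not the repo's actual code: the prompts, the caption_image helper, and the photo.jpg path are illustrative assumptions, and it requires a running ollama server with both models pulled (see Installation below).

    # pipeline_sketch.py - illustrative two-stage captioning flow (names and prompts assumed)
    import ollama

    def caption_image(image_path: str) -> str:
        # Stage 1: LLaVA turns the raw image into a text description.
        with open(image_path, "rb") as f:
            image_bytes = f.read()
        description = ollama.generate(
            model="llava",
            prompt="Describe this image in detail.",  # assumed prompt
            images=[image_bytes],
        )["response"]

        # Stage 2: llama3 condenses the description into a caption.
        return ollama.generate(
            model="llama3",
            prompt=f"Write a short, catchy caption for an image described as:\n{description}",  # assumed prompt
        )["response"]

    if __name__ == "__main__":
        print(caption_image("photo.jpg"))  # photo.jpg is a placeholder path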

Installation

  1. Clone the repo

    git clone <URL>
  2. Create and activate a virtual environment

    python3 -m venv cenv
    source cenv/bin/activate
  3. Install requirements

    pip install -r requirements.txt
  5. Download the LLMs using the following commands

    ollama pull llama3
    ollama pull llava
  5. Start the local ollama server

    ollama serve
  6. Run the backend server (a minimal main.py sketch follows this list)

    uvicorn main:app --reload
  7. Launch the Streamlit frontend

    streamlit run app.py
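
For reference, a minimal main.py that uvicorn main:app --reload could serve might look like the sketch below. This is a hypothetical sketch, not the repo's actual backend: the /caption route, the file parameter, and the prompts are assumptions.

    # main.py - hypothetical minimal FastAPI backend (route name and prompts assumed)
    import ollama
    from fastapi import FastAPI, File, UploadFile

    app = FastAPI()

    @app.post("/caption")
    async def caption(file: UploadFile = File(...)):
        image_bytes = await file.read()
        # LLaVA describes the uploaded image ...
        description = ollama.generate(
            model="llava",
            prompt="Describe this image in detail.",
            images=[image_bytes],
        )["response"]
        # ... and llama3 turns that description into a caption.
        caption_text = ollama.generate(
            model="llama3",
            prompt=f"Write a short caption for an image described as:\n{description}",
        )["response"]
        return {"caption": caption_text}

In a setup like this, the Streamlit app would POST the uploaded image to the backend endpoint and display the returned caption.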