multimodal-LLM

Multi-modal language modeling with image, audio, and text integration, including multiple images and multiple audio clips in a single multi-turn conversation.

The goal

  1. In a single turn, the user can feed multiple images or multiple audio clips.
  2. In a multi-turn conversation, the user can feed an image or an audio clip at position N, then feed another image or audio clip at position N + k.
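
The two goals above can be illustrated with a small data-structure sketch. This is not the repository's actual format: the `<image>` and `<audio>` placeholder strings, the `Turn` class, and the helper names are all hypothetical, chosen only to show how several attachments in one turn and further attachments in later turns could be tracked in order.

```python
import re
from dataclasses import dataclass, field

# Hypothetical placeholder strings; the real models may use different tokens.
IMAGE_TOKEN = "<image>"
AUDIO_TOKEN = "<audio>"
_PLACEHOLDER = re.compile(f"{re.escape(IMAGE_TOKEN)}|{re.escape(AUDIO_TOKEN)}")


@dataclass
class Turn:
    """One message in a conversation, with optional media attachments."""
    role: str
    content: str
    images: list = field(default_factory=list)
    audios: list = field(default_factory=list)


def validate(turns):
    """Check that each turn has exactly one attachment per placeholder."""
    for index, turn in enumerate(turns):
        if turn.content.count(IMAGE_TOKEN) != len(turn.images):
            raise ValueError(f"turn {index}: image placeholders do not match attachments")
        if turn.content.count(AUDIO_TOKEN) != len(turn.audios):
            raise ValueError(f"turn {index}: audio placeholders do not match attachments")
    return True


def ordered_media(turns):
    """Return (kind, path) pairs in the order the placeholders appear."""
    validate(turns)
    ordered = []
    for turn in turns:
        images, audios = iter(turn.images), iter(turn.audios)
        for match in _PLACEHOLDER.finditer(turn.content):
            if match.group() == IMAGE_TOKEN:
                ordered.append(("image", next(images)))
            else:
                ordered.append(("audio", next(audios)))
    return ordered


# Two images in the first turn, then an audio clip two turns later.
example_turns = [
    Turn("user", "<image> <image> What differs between these pictures?",
         images=["a.jpg", "b.jpg"]),
    Turn("assistant", "The first picture shows a market, the second a beach."),
    Turn("user", "<audio> Does this clip match either picture?", audios=["c.mp3"]),
]
media = ordered_media(example_turns)
```

Keeping attachments on the turn that introduces them makes it easy to check that every placeholder has a matching file before the conversation is tokenized.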

dataset

We are from Malaysia, so our dataset focuses on the Malaysian context, but we will train the models to be bilingual.

Audio instruction

Uploaded at https://huggingface.co/collections/mesolitica/audio-malaysian-llm-6590b69ee7c71d6d9e209104

  1. We crawled YouTube videos and converted them to MP3.
  2. We pseudolabeled the audio using Whisper Large V3, then postfiltered based on a score threshold.
  3. We used Mixtral to generate multi-turn conversations.
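
The postfiltering step can be sketched as follows. The README does not say which score or threshold was used, so this is an assumption: the sketch filters on the `avg_logprob` and `no_speech_prob` fields that Whisper attaches to each transcribed segment, and both threshold values are hypothetical.

```python
def keep_segment(segment, min_avg_logprob=-0.5, max_no_speech_prob=0.5):
    """Return True when a pseudolabeled segment looks reliable.

    Both thresholds are hypothetical defaults, not the values used
    to build the published dataset.
    """
    return (
        segment["avg_logprob"] >= min_avg_logprob
        and segment["no_speech_prob"] <= max_no_speech_prob
    )


def postfilter(segments, **thresholds):
    """Keep only the segments that pass the score thresholds."""
    return [s for s in segments if keep_segment(s, **thresholds)]


# Made-up segments shaped like Whisper's per-segment output.
example_segments = [
    {"text": "selamat pagi semua", "avg_logprob": -0.21, "no_speech_prob": 0.02},
    {"text": "uh the the the", "avg_logprob": -1.30, "no_speech_prob": 0.10},
    {"text": "[muzik]", "avg_logprob": -0.30, "no_speech_prob": 0.85},
]
kept = postfilter(example_segments)
```

Only the first example segment survives: the second has a low average log-probability, and the third is most likely not speech.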

Vision instruction

Uploaded at https://huggingface.co/collections/mesolitica/vision-malaysian-llm-653a16214037a1bc4417eb3a

how-to

All pretrained and finetuned models are published at https://huggingface.co/collections/mesolitica/multimodal-malaysian-llm-65c6f893e03f78fa9e5c8859, with usage instructions in each model card's README.

1. Vision Alignment

  1. First prepare the vision alignment dataset with prepare-dataset/prepare-vision-alignment.ipynb. This is the exact alignment dataset used by LLaVA, and we also included a Malay-translated version.

  2. Train vision alignment, we trained on 3 different models,

  3. Prepare the instruction dataset; the steps must be followed in order,
  4. Finetune on the instruction dataset,

However, we finetuned on the vision instruction dataset only for Qwen 0.5B and TinyLlama; Mistral is intended for both vision and audio.

2. Audio Alignment

  1. First prepare the audio alignment dataset with prepare-dataset/prepare-audio-alignment.ipynb. It is built from the Whisper Large V3 pseudolabels and the first assistant answer.

  2. Train audio alignment, we trained on 2 different models,

  3. Prepare the instruction dataset; the steps must be followed in order,

3. Vision and Audio finetuning

  1. You must first combine the pretrained vision and audio alignment models,
  2. Prepare the dataset with prepare-dataset/combine-mosaic.ipynb, which combines the vision and audio datasets into one mosaic dataset. We only trained on 50% of the dataset due to limited resources.

  3. Finetune on the instruction dataset,
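
The dataset-combining step can be pictured with a minimal sketch. It does not reproduce the notebook or the on-disk mosaic format; it only shows the idea of merging vision and audio rows, shuffling them, and keeping a 50% subset. The function name, the `modality` field, and the seed are hypothetical.

```python
import random


def combine_and_subsample(vision_rows, audio_rows, fraction=0.5, seed=42):
    """Merge two instruction datasets, shuffle, and keep a fraction of rows.

    Each row is tagged with its modality so a training loop can tell
    which encoder the attachment belongs to.
    """
    combined = [{**row, "modality": "vision"} for row in vision_rows]
    combined += [{**row, "modality": "audio"} for row in audio_rows]
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    rng.shuffle(combined)
    keep = int(len(combined) * fraction)
    return combined[:keep]


# Made-up rows standing in for the real instruction datasets.
example_vision = [{"id": f"v{i}"} for i in range(6)]
example_audio = [{"id": f"a{i}"} for i in range(4)]
subset = combine_and_subsample(example_vision, example_audio)
```

Shuffling before truncating keeps both modalities represented in the subset, instead of dropping whichever dataset happens to be appended last.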