
dpo_vlm.py #2563

Open · 5 of 9 tasks
liuchaohu opened this issue Jan 12, 2025 · 1 comment
Labels
🐛 bug Something isn't working 🏋 DPO Related to DPO 👁️ VLM Related to Visual Language Models

Comments

@liuchaohu

System Info

trl env

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
accelerate launch examples/scripts/dpo_vlm.py \
    --dataset_name HuggingFaceH4/rlaif-v_formatted \
    --model_name_or_path HuggingFaceM4/idefics2-8b \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 32 \
    --dataset_num_proc 32 \
    --output_dir dpo_idefics_rlaif-v \
    --bf16 \
    --torch_dtype bfloat16 \
    --gradient_checkpointing \
    --use_peft \
    --lora_target_modules=all-linear
"""
import os
import torch
from datasets import load_dataset, features
from transformers import AutoModelForVision2Seq, AutoProcessor

from trl import (
    DPOConfig,
    # DPOTrainer,
    ModelConfig,
    ScriptArguments,
    TrlParser,
    get_kbit_device_map,
    get_peft_config,
    get_quantization_config,
)

# NOTE: imports a local, modified copy of dpo_trainer.py instead of trl.DPOTrainer
# (the library import above is commented out)
from dpo_trainer import DPOTrainer


if __name__ == "__main__":
    parser = TrlParser((ScriptArguments, DPOConfig, ModelConfig))
    script_args, training_args, model_args = parser.parse_args_and_config()

    ################
    # Model & Tokenizer
    ################
    torch_dtype = (
        model_args.torch_dtype if model_args.torch_dtype in ["auto", None] else getattr(torch, model_args.torch_dtype)
    )
    quantization_config = get_quantization_config(model_args)

    model_kwargs = dict(
        revision=model_args.model_revision,
        attn_implementation=model_args.attn_implementation,
        torch_dtype=torch_dtype,
        device_map=get_kbit_device_map() if quantization_config is not None else None,
        quantization_config=quantization_config,
        low_cpu_mem_usage=True,
    )
    model = AutoModelForVision2Seq.from_pretrained(
        model_args.model_name_or_path,
        trust_remote_code=model_args.trust_remote_code,
        **model_kwargs,
    )
    peft_config = get_peft_config(model_args)
    if peft_config is None:
        ref_model = AutoModelForVision2Seq.from_pretrained(
            model_args.model_name_or_path,
            trust_remote_code=model_args.trust_remote_code,
            **model_kwargs,
        )
    else:
        ref_model = None
    processor = AutoProcessor.from_pretrained(
        model_args.model_name_or_path, trust_remote_code=model_args.trust_remote_code, do_image_splitting=False
    )
    tokenizer = processor.tokenizer

    # Set up the chat template
    if model.config.model_type == "idefics2":
        pass  # the processor already has a valid chat template
    elif model.config.model_type == "paligemma":
        processor.chat_template = """{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}<|im_start|>{% if message['role'] == 'user' %}USER: {% else %}ASSISTANT: {% endif %}{% for item in message['content'] if item['type'] == 'text' %}{{ item['text'] }}<|im_end|>{% endfor %}{% if message['role'] == 'user' %} {% else %}{{eos_token}}{% endif %}{% endfor %}{% if add_generation_prompt %}ASSISTANT: {% endif %}"""
    elif model.config.model_type == "llava":
        processor.chat_template = """{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{% if message['role'] == 'user' %}USER: {% else %}ASSISTANT: {% endif %}{% for item in message['content'] %}{% if item['type'] == 'text' %}{{ item['text'] }}{% elif item['type'] == 'image' %}<image>{% endif %}{% endfor %}{% if message['role'] == 'user' %} {% else %}{{eos_token}}{% endif %}{% endfor %}{% if add_generation_prompt %}ASSISTANT: {% endif %}"""

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    if script_args.ignore_bias_buffers:
        # torch distributed hack: DDP cannot handle boolean buffers, so tell it to ignore them
        model._ddp_params_and_buffers_to_ignore = [
            name for name, buffer in model.named_buffers() if buffer.dtype == torch.bool
        ]

    ################
    # Dataset
    ################
    # dataset = load_dataset(script_args.dataset_name, name=script_args.dataset_config)

    def format(examples):
        """
        Convert prompt from "xxx" to [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "xxx"}]}]
        and chosen and rejected from "xxx" to [{"role": "assistant", "content": [{"type": "text", "text": "xxx"}]}].
        Images are wrapped in a list.
        """
        output = {"images": [], "prompt": [], "chosen": [], "rejected": []}
        for image, question, chosen, rejected in zip(examples["image"], examples["question"], examples["chosen"], examples["rejected"]):
            prompt = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": question}]}]
            chosen = [{"role": "assistant", "content": [{"type": "text", "text": chosen}]}]
            rejected = [{"role": "assistant", "content": [{"type": "text", "text": rejected}]}]
            output["images"].append([image])
            output["prompt"].append(prompt)
            output["chosen"].append(chosen)
            output["rejected"].append(rejected)
        return output


    dataset = load_dataset("data/openbmb/RLAIF-V-Dataset", split="train", num_proc=os.cpu_count())
    dataset = dataset.select(range(100))
    cols = dataset.column_names
    print(os.cpu_count())
    dataset = dataset.map(format, batched=True, writer_batch_size=4, batch_size=4, remove_columns=cols, num_proc=24)
    f = dataset.features
    f["images"] = features.Sequence(features.Image(decode=True))  # to avoid bytes
    dataset = dataset.cast(f)
    dataset = dataset.train_test_split(test_size=0.05)

    ################
    # Training
    ################
    trainer = DPOTrainer(
        model,
        ref_model,
        args=training_args,
        train_dataset=dataset[script_args.dataset_train_split],
        eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
        processing_class=processor,
        peft_config=peft_config,
    )

    trainer.train()

    # Save and push to hub
    trainer.save_model(training_args.output_dir)
    if training_args.push_to_hub:
        trainer.push_to_hub(dataset_name=script_args.dataset_name)
My VS Code launch configuration (launch.json):

{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Python: Debug with Arguments",
      "type": "python",
      "request": "launch",
      "program": "${file}", 
      "console": "integratedTerminal",
      "args": [
        "--dataset_name", "../data/openbmb/RLAIF-V-Dataset",
        "--model_name_or_path", "../data/llava-hf/llava-v1.6-vicuna-7b-hf",
        "--per_device_train_batch_size", "1",
        "--gradient_accumulation_steps", "1",
        "--output_dir", "dpo_idefics_rlaif-v",
        "--bf16",
        "--torch_dtype", "bfloat16",
        "--learning_rate", "1e-5",
        "--rpo_alpha", "0.1",
        "--gradient_checkpointing",
        "--use_peft",
        "--lora_target_modules=all-linear",
        "--dataset_num_proc", "1",
        "--attn_implementation", "flash_attention_2",
        "--logging_steps", "1",
        "--output_dir", "results/debug",
      ]
    }
  ]
}

The model is llava-v1.6-vicuna-7b-hf, and the dataset is RLAIF-V-Dataset.

In dpo_trainer.py, self.train_dataset contains:

Dataset({
    features: ['images', 'prompt_input_ids', 'pixel_values', 'chosen_input_ids', 'rejected_input_ids', 'image_sizes'],
    num_rows: 100
})

However, by the time the program reaches DataCollatorForPreference, pixel_values has disappeared.
As a result, the model never receives pixel_values during training, yet it still receives image_sizes, which is very strange.
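
For context, here is a rough, runnable sketch of the pruning mechanism that could cause this, assuming the standard transformers Trainer behavior: with remove_unused_columns=True (the default), the trainer drops every dataset column that is not on its signature-column list before the batch ever reaches the collator. The prune_columns helper and the signature list below are illustrative assumptions, not the actual TRL source; a list that contains image_sizes but not pixel_values would reproduce exactly the behavior described above.

# Illustrative sketch of the trainer's column pruning (assumed behavior, not the TRL source)
from datasets import Dataset

def prune_columns(dataset, signature_columns):
    """Drop every column the trainer does not list as an expected input."""
    ignored = [c for c in dataset.column_names if c not in signature_columns]
    return dataset.remove_columns(ignored)

toy = Dataset.from_dict({
    "prompt_input_ids": [[1, 2]],
    "chosen_input_ids": [[3]],
    "rejected_input_ids": [[4]],
    "pixel_values": [[0.0]],
    "image_sizes": [[336, 336]],
})
# Hypothetical signature list that includes image_sizes but not pixel_values
signature_columns = ["prompt_input_ids", "chosen_input_ids", "rejected_input_ids", "image_sizes"]
print(prune_columns(toy, signature_columns).column_names)
# ['prompt_input_ids', 'chosen_input_ids', 'rejected_input_ids', 'image_sizes']
# pixel_values is already gone before DataCollatorForPreference runs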

Expected behavior

I would like to know why the dataset itself contains pixel_values but the column disappears during data collation.

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks (no screenshots; more on code blocks)
  • Any traceback provided is complete
@liuchaohu (Author)

I found the reason why pixel_values disappears.
The script must be run with the parameter "--remove_unused_columns false", otherwise pixel_values is removed before collation.
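
For anyone hitting the same problem, here is a minimal sketch of the workaround in config form; remove_unused_columns is a standard TrainingArguments field that DPOConfig inherits, and the output_dir value is just a placeholder:

# Minimal sketch of the workaround: keep all dataset columns so that
# pixel_values survives until DataCollatorForPreference
from trl import DPOConfig

training_args = DPOConfig(
    output_dir="results/debug",   # placeholder path
    remove_unused_columns=False,  # same effect as passing --remove_unused_columns false on the CLI
)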

@August-murr August-murr added 🐛 bug Something isn't working 🏋 DPO Related to DPO 👁️ VLM Related to Visual Language Models labels Jan 14, 2025