Use chat templates for vision models #173

Draft: DePasqualeOrg wants to merge 3 commits into main from vision-chat-templates

DePasqualeOrg
Contributor

This is a test of my PR to Swift Jinja, which should allow chat templates to be used with vision language models that provide one. I've started to set things up, but I need some pointers on how to integrate the image into the messages.

@DePasqualeOrg DePasqualeOrg force-pushed the vision-chat-templates branch from 5c8ccfa to 3ac6296 Compare January 9, 2025 18:53
@DePasqualeOrg
Contributor Author

@davidkoski, I made some changes, and it seems to work in VLMEval. Do you have any thoughts on this?

@DePasqualeOrg DePasqualeOrg force-pushed the vision-chat-templates branch from 3ac6296 to 4547cf1 Compare January 9, 2025 21:43
@DePasqualeOrg
Contributor Author

DePasqualeOrg commented Jan 14, 2025

I think UserInput will need to be changed to include messages that look like this:

{
    'role': 'user',
    'content': [
        {'type': 'text', 'text': 'What is in this image?'},
        {'type': 'image', 'image_url': 'example.jpg'}
    ]
}
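
A minimal Swift sketch of building a message in that shape, assuming messages are represented as `[String: Any]` dictionaries. The `Message` alias and the `userMessage` helper are hypothetical names for illustration, not existing API in mlx-swift-examples or swift-transformers:

```swift
// Hypothetical alias and helper, for illustration only.
typealias Message = [String: Any]

/// Builds a user message whose content is one text part followed by one part per image.
func userMessage(text: String, imageURLs: [String]) -> Message {
    var content: [[String: String]] = [["type": "text", "text": text]]
    for url in imageURLs {
        content.append(["type": "image", "image_url": url])
    }
    return ["role": "user", "content": content]
}

let message = userMessage(text: "What is in this image?", imageURLs: ["example.jpg"])
```

Keeping the content parts as an ordered array means the text and image entries stay in the order the caller supplied them.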

@DePasqualeOrg
Contributor Author

DePasqualeOrg commented Jan 14, 2025

The solution in my latest commit uses the chat template (correctly, I think) to create a prompt like this:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Describe the image in English<|vision_start|><|image_pad|><|vision_end|><|im_end|>
<|im_start|>assistant

However, in order for the model to work, it looks like we need to replace the single <|image_pad|> with repeated padding like this for each image:

let mergeLength = config.mergeSize * config.mergeSize
let repeatedPadding = Array(repeating: "<|image_pad|>", count: thw.product / mergeLength).joined()
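
A rough sketch of how that expansion might handle one placeholder per image, working on the rendered prompt string. The `THW` struct here is a hypothetical stand-in (only its `product` appears in the snippet above), and `expandImagePadding` is an illustrative name, not a function from the PR:

```swift
import Foundation

// Hypothetical stand-in for the THW type; field names are assumptions.
struct THW {
    let t: Int
    let h: Int
    let w: Int
    var product: Int { t * h * w }
}

/// Replaces each single "<|image_pad|>" placeholder, in order, with the number
/// of padding tokens needed for the corresponding image.
func expandImagePadding(in prompt: String, frames: [THW], mergeSize: Int) -> String {
    let placeholder = "<|image_pad|>"
    let mergeLength = mergeSize * mergeSize
    let pieces = prompt.components(separatedBy: placeholder)
    // Only expand when there is exactly one placeholder per image;
    // otherwise return the prompt unchanged.
    guard pieces.count == frames.count + 1 else { return prompt }
    var result = pieces[0]
    for (index, frame) in frames.enumerated() {
        result += String(repeating: placeholder, count: frame.product / mergeLength)
        result += pieces[index + 1]
    }
    return result
}

// One image with a 1 x 4 x 4 grid and mergeSize 2: 16 / 4 = 4 padding tokens.
let prompt = "Describe<|vision_start|><|image_pad|><|vision_end|>"
let expanded = expandImagePadding(in: prompt, frames: [THW(t: 1, h: 4, w: 4)], mergeSize: 2)
```

The guard leaves the prompt untouched when the placeholder count does not match the number of images, which also avoids expanding a prompt twice.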

@DePasqualeOrg
Contributor Author

I now have something that works, although it still needs to take into account the case where multiple images are included.

@DePasqualeOrg DePasqualeOrg force-pushed the vision-chat-templates branch 2 times, most recently from 0f68fd2 to ed03ae5 Compare January 15, 2025 09:19
@DePasqualeOrg
Contributor Author

@davidkoski, I found it quite difficult to reason about the code because of how some of the variables and parameters were named. What do you think about calling an array of type [THW] frames?

@davidkoski
Collaborator

> @davidkoski, I found it quite difficult to reason about the code because of how some of the variables and parameters were named. What do you think about calling an array of type [THW] frames?

It sounds OK to me, though they aren't the frames themselves but the positions of the frames in one of the arrays (maybe not in the final array). I'd say try frames or framePositions and see how it goes.

@davidkoski
Collaborator

> The solution in my latest commit uses the chat template (correctly, I think) to create a prompt like this:
>
> <|im_start|>system
> You are a helpful assistant.<|im_end|>
> <|im_start|>user
> Describe the image in English<|vision_start|><|image_pad|><|vision_end|><|im_end|>
> <|im_start|>assistant
>
> However, in order for the model to work, it looks like we need to replace the single <|image_pad|> with repeated padding like this for each image:
>
> let mergeLength = config.mergeSize * config.mergeSize
> let repeatedPadding = Array(repeating: "<|image_pad|>", count: thw.product / mergeLength).joined()

Right, that is this part:

I think the sequence from the python side is roughly:

  1. add image tokens (https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/prompt_utils.py)
  2. transformers / processing (expand image tokens)
  3. tokenize

One issue we have on the Swift side is that steps 1 and 3 occur in the same function in swift-transformers, so we don't have a hook for step 2.
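
To illustrate what a hook for step 2 could look like, here is a toy sketch that separates the three steps into closures. Everything in it is hypothetical: `PromptPipeline`, its properties, the `<image>` token, and the stand-in renderer and tokenizer are not swift-transformers APIs, just a way to show where an expansion step would sit:

```swift
import Foundation

// All names here are hypothetical; none are swift-transformers APIs.
struct PromptPipeline {
    // Step 1: render messages into a prompt string (stands in for the chat template).
    var render: ([[String: Any]]) -> String
    // Step 2: optional hook that rewrites the rendered prompt, e.g. to expand image tokens.
    var expand: (String) -> String = { $0 }
    // Step 3: turn the final prompt into token IDs.
    var tokenize: (String) -> [Int]

    func callAsFunction(_ messages: [[String: Any]]) -> [Int] {
        tokenize(expand(render(messages)))
    }
}

// Toy stand-ins: join the string contents, double the image token,
// and use UTF-8 bytes as "token IDs".
let pipeline = PromptPipeline(
    render: { messages in
        messages.compactMap { $0["content"] as? String }.joined(separator: "\n")
    },
    expand: { prompt in
        prompt.replacingOccurrences(of: "<image>", with: "<image><image>")
    },
    tokenize: { prompt in
        prompt.utf8.map { Int($0) }
    }
)
let tokens = pipeline([["role": "user", "content": "hi <image>"]])
```

Because `expand` defaults to the identity function, text-only callers would not need to supply it.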

@DePasqualeOrg DePasqualeOrg force-pushed the vision-chat-templates branch 2 times, most recently from e9c7a02 to 8cb233b Compare January 19, 2025 14:18
@DePasqualeOrg DePasqualeOrg force-pushed the vision-chat-templates branch 4 times, most recently from 8959c45 to 3e50263 Compare January 27, 2025 19:32
@davidkoski
Collaborator

@DePasqualeOrg it looks like the swift-transformers side (which includes Jinja) is ready to go and would solve some issues with text models.

Do you want to prepare a PR for picking that up (since it is mostly your work)? If you are busy I can get that ready.

@DePasqualeOrg
Contributor Author

DePasqualeOrg commented Jan 28, 2025

I think #185 accomplishes that. Xcode is showing the latest patch versions of the packages when I open mlx-swift-examples. Or is there something I'm missing?

huggingface/swift-transformers#151 still needs to be merged before this PR, since it expands the type of a message from [String: String] to [String: Any].

@DePasqualeOrg DePasqualeOrg force-pushed the vision-chat-templates branch 2 times, most recently from 031e47f to db97052 Compare January 28, 2025 09:10
@DePasqualeOrg
Contributor Author

DePasqualeOrg commented Jan 28, 2025

I've verified that this also works with multiple images, although I'll need to do further testing to check the model's performance. I noticed that Qwen 2 VL tends to respond in Mandarin unless prompted otherwise.

@davidkoski
Collaborator

> I've verified that this also works with multiple images, although I'll need to do further testing to check the model's performance. I noticed that Qwen 2 VL tends to respond in Mandarin unless prompted otherwise.

Yeah, I noticed that too. At least the responses seemed correct per Google Translate :-)

@DePasqualeOrg DePasqualeOrg force-pushed the vision-chat-templates branch 2 times, most recently from 0b746f4 to db883ff Compare January 30, 2025 21:16
@DePasqualeOrg DePasqualeOrg marked this pull request as ready for review January 30, 2025 21:17
@DePasqualeOrg
Contributor Author

This is now ready for review.

@DePasqualeOrg
Contributor Author

I now need to make some significant changes because the video pull request was merged before this one.

@DePasqualeOrg DePasqualeOrg marked this pull request as draft February 5, 2025 09:02