How to run Qwen using Executorch? #7467

Open
Arya-Hari opened this issue Jan 2, 2025 · 6 comments
Assignees
Labels
module: llm LLM examples and apps, and the extensions/llm libraries

Comments

@Arya-Hari

📚 The doc issue

Hi! I just wanted to know how I would go about running Qwen using ExecuTorch. I was able to create the .pte file for Qwen. The example for Llama had a step "Create a llama runner for Android". Do we have to do something similar for Qwen by creating a custom runner? Also, the Qwen repository on the Hugging Face Hub does not have a 'tokenizer.model' file, but the Llama example requires it for running inference via the adb shell. How can I navigate around this?

Suggest a potential alternative/fix

No response

@kimishpatel
Contributor

I don't know the details of how to run Qwen, or whether there is any significant difference compared to Llama as far as the model's interface is concerned.

Also, when you say you were able to export the model, can you detail the steps you took? If you can run the exported Qwen model using https://github.com/pytorch/executorch/blob/main/examples/models/llama/runner/eager.py#L103, then it is highly likely that you can run it via the C++ runner. But you do need a tokenizer, so I'm not sure how HF runs this model.
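As a quick sanity check before wiring up a runner, something like the following should load the exported .pte directly from Python via ExecuTorch's pybindings. This is only a sketch: it assumes the llama-style KV-cache signature forward(tokens, input_pos), that the installed Python bindings include whatever backend the export targets, and the file name and token id are placeholders.

```python
# Hedged sketch: load an exported .pte and run one forward pass.
import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch

# File name is just an example; point this at your exported program.
module = _load_for_executorch("qwen2_5-0_5b.pte")

# Assumption: llama-style KV-cache interface (tokens, input_pos);
# adjust shapes/dtypes to match your export options.
tokens = torch.tensor([[0]], dtype=torch.long)    # [1, 1] placeholder token id
input_pos = torch.tensor([0], dtype=torch.long)   # position of that token

outputs = module.forward([tokens, input_pos])
print(outputs[0].shape)  # logits for the next token
```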

@kimishpatel kimishpatel added the module: llm LLM examples and apps, and the extensions/llm libraries label Jan 3, 2025
@SS-JIA SS-JIA moved this to To triage in ExecuTorch DevX improvements Jan 6, 2025
@SS-JIA SS-JIA self-assigned this Jan 6, 2025
@SS-JIA
Contributor

SS-JIA commented Jan 6, 2025

@Arya-Hari for some more context, the llama_runner binary used in our examples is heavily tailored to the Llama model architecture. So, as Kimish mentioned, depending on how Qwen's interface compares to Llama's, you may not be able to re-use the llama_runner binary. If you are familiar with the interface of the model, then the best way would be to fork or modify the llama_runner binary for the Qwen model; essentially creating a custom runner, as you mentioned.

@mergennachin
Contributor

@guangy10, Are there guidelines on how to leverage the recent Hugging Face (huggingface/transformers#32253, huggingface/transformers#34102) and Optimum integrations (https://huggingface.co/docs/optimum/main/en/exporters/executorch/usage_guides/export_a_model)?

@Arya-Hari
Author

Hi @kimishpatel. To generate the .pte file, I followed the instructions given in the repository (the same one @mergennachin mentioned). Contrary to what the repository states, the ExecuTorch export isn't actually part of the latest released version of the library, so I had to clone the repository and install it with pip install -e . I then ran the instructions given in the README, and the .pte file was generated. I tried it for both Qwen 2.5 0.5B and OLMo. Qwen 2.5 is said to have an architecture similar to Llama's, so I'll try to run the .pte file with the script you mentioned.

@SS-JIA Ahh okay, I understand. I actually tried running Qwen using the Llama runner for Android and didn't face any errors. However, the output wasn't very good, essentially gibberish, but that might come down to the model itself not being very good, and to tokenizer issues.

Is there any specific type of tokenizer required for running the models, like SentencePiece or BPE?

@SS-JIA
Contributor

SS-JIA commented Jan 9, 2025

@Arya-Hari I see, the good news is that the runner can execute the Qwen model without error. To me this suggests that the model interface is similar enough to Llama's that you won't need to make any significant changes to the binary, if any at all.

Since the model executes successfully but you are seeing gibberish, I would suspect that the wrong tokenizer is being used. That said, 0.5B (assuming this is the number of parameters) is a pretty small model, so I wouldn't expect the quality to be very high, but I would still expect the output to make some sense. I found this in the Qwen GitHub docs describing the tokenizer that should be used for the model, so maybe you can refer to that to figure out which tokenizer to use; as a disclaimer, that page is for the 7B model.
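One hedged way to check what the model expects is to load the tokenizer with transformers and inspect it: Qwen 2.5 ships a byte-level BPE tokenizer (tokenizer.json / vocab + merges) rather than a SentencePiece tokenizer.model, which is likely why the Llama instructions don't line up. The model id and prompt below are just examples.

```python
# Hedged check of the tokenizer Qwen expects; model id is an example.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
print(type(tok).__name__)   # expected: a BPE-based fast tokenizer (Qwen2TokenizerFast)
print(tok.vocab_size)

# Round-trip a prompt to confirm encoding/decoding look sane
# before blaming the model for gibberish output.
ids = tok("Hello, world!")["input_ids"]
print(ids, tok.decode(ids))
```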

@guangy10
Contributor

guangy10 commented Jan 9, 2025

> @guangy10, Are there guidelines on how to leverage the recent Hugging Face (huggingface/transformers#32253, huggingface/transformers#34102) and Optimum integrations (https://huggingface.co/docs/optimum/main/en/exporters/executorch/usage_guides/export_a_model)?

@Arya-Hari Qwen is one of the popular models we have enabled on Hugging Face. You should be able to run Qwen 2.5 end-to-end via Optimum through the common Hugging Face APIs in Python. Here is a pointer to the e2e test you can take a look at: https://github.com/huggingface/optimum/blob/b9fa9aa8de7772d00d96e8b8489f560218f4a865/tests/executorch/runtime/test_modeling_qwen2.py
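Roughly, that flow looks like the sketch below, modeled on the linked test. The class and method names, the export/recipe keyword arguments, and the model id should all be checked against the test file for the Optimum version you have installed.

```python
# Hedged sketch of the Optimum/ExecuTorch e2e flow from the linked test.
# The from_pretrained kwargs and text_generation helper follow that test and
# may differ across versions; the model id and prompt are examples.
from transformers import AutoTokenizer
from optimum.executorch import ExecuTorchModelForCausalLM

model_id = "Qwen/Qwen2.5-0.5B"
model = ExecuTorchModelForCausalLM.from_pretrained(
    model_id,
    export=True,                # export to ExecuTorch on the fly
    task="text-generation",
    recipe="xnnpack",           # CPU delegate used in the e2e test
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

output = model.text_generation(
    tokenizer=tokenizer,
    prompt="My favourite condiment is",
    max_seq_len=32,
)
print(output)
```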

Regarding running it with the C++ llama_runner for integration into Android/iOS applications: yes, there is a known gap (#6813) in converting arbitrary tokenizers to a format recognizable by the llama_runner.
