How to run Qwen using Executorch? #7467

Open
Arya-Hari opened this issue Jan 2, 2025 · 6 comments
Assignees
Labels
module: llm LLM examples and apps, and the extensions/llm libraries

Comments

@Arya-Hari

📚 The doc issue

Hi! I just wanted to know how I would go about running Qwen using ExecuTorch. I was able to create the .pte file for Qwen. The example for Llama had a step "Create a llama runner for Android". Do we have to do something similar for Qwen by creating a custom runner? Also, the Qwen repository on the Hugging Face Hub does not have a 'tokenizer.model' file, but the Llama example requires it for running inference via the adb shell. How can I navigate around this?

Suggest a potential alternative/fix

No response

@kimishpatel
Contributor

I don't know the details of how to run Qwen, or whether there is any significant difference compared to Llama as far as the model's interface is concerned.

Also, when you say you were able to export the model, can you detail the steps you took? If you can run the exported Qwen model using https://github.com/pytorch/executorch/blob/main/examples/models/llama/runner/eager.py#L103, then it is highly likely that you can run it via the C++ runner. But you do need a tokenizer, so I'm not sure how HF runs this model.
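As a quick sanity check before wiring up a runner, something like the following should load the exported .pte directly from Python via ExecuTorch's pybindings. This is only a sketch: it assumes the llama-style KV-cache signature forward(tokens, input_pos), that the installed Python bindings include whatever backend the export targets, and the file name and token id are placeholders.

```python
# Hedged sketch: load an exported .pte and run one forward pass.
import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch

# File name is just an example; point this at your exported program.
module = _load_for_executorch("qwen2_5-0_5b.pte")

# Assumption: llama-style KV-cache interface (tokens, input_pos);
# adjust shapes/dtypes to match your export options.
tokens = torch.tensor([[0]], dtype=torch.long)    # [1, 1] placeholder token id
input_pos = torch.tensor([0], dtype=torch.long)   # position of that token

outputs = module.forward([tokens, input_pos])
print(outputs[0].shape)  # logits for the next token
```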

@kimishpatel kimishpatel added the module: llm LLM examples and apps, and the extensions/llm libraries label Jan 3, 2025
@SS-JIA SS-JIA moved this to To triage in ExecuTorch DevX improvements Jan 6, 2025
@SS-JIA SS-JIA self-assigned this Jan 6, 2025
@SS-JIA
Contributor

SS-JIA commented Jan 6, 2025

@Arya-Hari for some more context, the llama_runner binary used in our examples is heavily tailored to the Llama model architecture. So, as Kimish mentioned, depending on how Qwen's interface compares to Llama's, you may not be able to re-use the llama_runner binary. If you are familiar with the interface of the model, then the best way would be to fork or modify the llama_runner binary for the Qwen model; essentially creating a custom runner, as you mentioned.

@mergennachin
Contributor

@guangy10, Are there guidelines on how to leverage the recent Hugging Face (huggingface/transformers#32253, huggingface/transformers#34102) and Optimum integrations (https://huggingface.co/docs/optimum/main/en/exporters/executorch/usage_guides/export_a_model)?

@Arya-Hari
Author

Hi @kimishpatel. To generate the .pte file, I followed the instructions given in the repository (the same one @mergennachin mentioned). Contrary to what the repository states, the ExecuTorch export isn't actually part of the latest released version of the library, so I had to clone the repository and install it with pip install -e . I then ran the instructions given in the README, and the .pte file was generated. I tried it for both Qwen 2.5 0.5B and OLMo. Qwen 2.5 is said to have an architecture similar to Llama's, so I'll try to run the .pte file with the script you mentioned.

@SS-JIA Ahh okay, I understand. I actually tried running Qwen using the Llama runner for Android and didn't face any errors. However, the output wasn't very good, essentially gibberish, but that might come down to the model itself not being very good, and to tokenizer issues.

Is there any specific type of tokenizer required for running the models, like SentencePiece or BPE?

@SS-JIA
Contributor

SS-JIA commented Jan 9, 2025

@Arya-Hari I see, the good news is that the runner can execute the Qwen model without error. To me this suggests that the model interface is similar enough to Llama's that you won't need to make any significant changes to the binary, if any at all.

Since the model executes successfully but you are seeing gibberish, I would suspect that the wrong tokenizer is being used. That said, 0.5B (assuming this is the number of parameters) is a pretty small model, so I wouldn't expect the quality to be very high, but I would still expect the output to make some sense. I found this in the Qwen GitHub docs describing the tokenizer that should be used for the model, so maybe you can refer to that to figure out which tokenizer to use; as a disclaimer, that page is for the 7B model.
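One hedged way to check what the model expects is to load the tokenizer with transformers and inspect it: Qwen 2.5 ships a byte-level BPE tokenizer (tokenizer.json / vocab + merges) rather than a SentencePiece tokenizer.model, which is likely why the Llama instructions don't line up. The model id and prompt below are just examples.

```python
# Hedged check of the tokenizer Qwen expects; model id is an example.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
print(type(tok).__name__)   # expected: a BPE-based fast tokenizer (Qwen2TokenizerFast)
print(tok.vocab_size)

# Round-trip a prompt to confirm encoding/decoding look sane
# before blaming the model for gibberish output.
ids = tok("Hello, world!")["input_ids"]
print(ids, tok.decode(ids))
```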

@guangy10
Contributor

guangy10 commented Jan 9, 2025

> @guangy10, Are there guidelines on how to leverage the recent Hugging Face (huggingface/transformers#32253, huggingface/transformers#34102) and Optimum integrations (https://huggingface.co/docs/optimum/main/en/exporters/executorch/usage_guides/export_a_model)?

@Arya-Hari Qwen is one of the popular models we have enabled on Hugging Face. You should be able to run Qwen 2.5 end-to-end via Optimum through the common Hugging Face APIs in Python. Here is a pointer to the e2e test you can take a look at: https://github.com/huggingface/optimum/blob/b9fa9aa8de7772d00d96e8b8489f560218f4a865/tests/executorch/runtime/test_modeling_qwen2.py
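Roughly, that flow looks like the sketch below, modeled on the linked test. The class and method names, the export/recipe keyword arguments, and the model id should all be checked against the test file for the Optimum version you have installed.

```python
# Hedged sketch of the Optimum/ExecuTorch e2e flow from the linked test.
# The from_pretrained kwargs and text_generation helper follow that test and
# may differ across versions; the model id and prompt are examples.
from transformers import AutoTokenizer
from optimum.executorch import ExecuTorchModelForCausalLM

model_id = "Qwen/Qwen2.5-0.5B"
model = ExecuTorchModelForCausalLM.from_pretrained(
    model_id,
    export=True,                # export to ExecuTorch on the fly
    task="text-generation",
    recipe="xnnpack",           # CPU delegate used in the e2e test
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

output = model.text_generation(
    tokenizer=tokenizer,
    prompt="My favourite condiment is",
    max_seq_len=32,
)
print(output)
```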

Regarding running it with the C++ llama_runner for integration into Android/iOS applications: yes, there is a known gap (#6813) in converting arbitrary tokenizers to a format recognizable by the llama_runner.
