How to run Qwen using Executorch? #7467
Comments
I don't know the details of how to run Qwen, or whether there is any significant difference compared to Llama as far as the model's interface is concerned. Also, when you say you were able to export the model, can you detail the steps you took? If you can run the exported Qwen model using https://github.com/pytorch/executorch/blob/main/examples/models/llama/runner/eager.py#L103, then it is highly likely that you can run it via the cpp runner. But you do need a tokenizer, so I'm not sure how HF runs this model. As a quick sanity check before involving any runner, something like the sketch below should exercise the exported .pte through the ExecuTorch Python bindings (the file name and input shape are assumptions, not your actual export settings):
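```python
# Minimal sketch: load an exported .pte and run one forward pass through the
# ExecuTorch Python bindings. "qwen.pte" and the input below are placeholders.
import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch

module = _load_for_executorch("qwen.pte")  # hypothetical path

# A llama-style export typically takes token ids (and, with kv-cache enabled,
# an input position); the exact signature depends on how you exported.
tokens = torch.tensor([[1, 2, 3]], dtype=torch.long)
outputs = module.forward([tokens])
print(outputs[0].shape)  # logits, e.g. [1, seq_len, vocab_size]
```

If this produces sensible logits, the remaining work is wiring up the right tokenizer around the model.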
@Arya-Hari for some more context, the …
@guangy10, are there guidelines on how to leverage the recent Hugging Face (huggingface/transformers#32253, huggingface/transformers#34102) and Optimum integrations (https://huggingface.co/docs/optimum/main/en/exporters/executorch/usage_guides/export_a_model)?
Hi @kimishpatel. To generate the .pte file, I followed the instructions given in the repository (the same one @mergennachin mentioned). Unlike what the repository states, the ExecuTorch export isn't actually part of the latest library modules, so I had to clone the repository and …

@SS-JIA Ahh, okay, I understand. I actually tried running Qwen using the llama runner for Android and didn't face any errors. However, the output wasn't very good, essentially gibberish, but that might boil down to the model itself not being very good, plus tokenizer issues. Is there any specific type of tokenizer required for running the models, like SentencePiece or BPE?
@Arya-Hari I see. The good news is that the runner can execute the Qwen model without error. To me this suggests that the model interface is similar enough to Llama's that you won't need to make any significant changes to the binary, if any at all. Since you are seeing gibberish but the model executes successfully, I would suspect that the wrong tokenizer is being used. Also, 0.5B (assuming this is the number of parameters) is a pretty small model, so I wouldn't expect the quality to be that high, but I would expect the output to still make some sense. I found a page in the Qwen GitHub docs that describes the tokenizer that should be used for the model, so maybe you can refer to that to figure out which tokenizer to use. As a disclaimer, though, that page is for the 7B model. One quick way to confirm the tokenizer family is to inspect what the Hub ships, roughly as below (the model id is an assumption):
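```python
# Check which tokenizer family the Hub model uses. Qwen2 models ship a
# byte-level BPE tokenizer (tokenizer.json), not a SentencePiece
# tokenizer.model file, which is why the Llama instructions don't map 1:1.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")  # assumed model id
print(type(tok).__name__)            # e.g. Qwen2TokenizerFast (BPE-based)
print(tok("Hello world").input_ids)  # compare against what your runner feeds in
```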
@Arya-Hari Qwen is one of the popular models we have enabled on Hugging Face. You should be able to run Qwen 2.5 end-to-end via Optimum through common Hugging Face APIs in Python. Here is a pointer to the e2e test you can take a look at: https://github.com/huggingface/optimum/blob/b9fa9aa8de7772d00d96e8b8489f560218f4a865/tests/executorch/runtime/test_modeling_qwen2.py. In rough outline, that flow looks like the sketch below (names follow the linked test and may differ across Optimum versions). Regarding running it with the C++ llama_runner for integration into Android/iOS applications: yes, there is a known gap (#6813) in converting arbitrary tokenizers to the format that is recognizable by the llama_runner.
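```python
# Rough sketch of the end-to-end Optimum flow from the linked test; the exact
# class/method names and arguments are based on that test and may change
# between versions, so treat this as illustrative rather than definitive.
from optimum.executorch import ExecuTorchModelForCausalLM
from transformers import AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B"  # assumed model id
model = ExecuTorchModelForCausalLM.from_pretrained(
    model_id, export=True, recipe="xnnpack"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
generated = model.text_generation(
    tokenizer=tokenizer,
    prompt="My favourite condiment is ",
    max_seq_len=32,
)
print(generated)
```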
📚 The doc issue
Hi! I just wanted to ask, how would I go about running Qwen using ExecuTorch? I was able to create the .pte file for Qwen. The example for Llama had a step 'Create a llama runner for Android'. Do we have to do something similar for Qwen by creating a custom runner? Also, the Qwen repository on the Hugging Face Hub does not have a 'tokenizer.model' file, but the Llama example requires it for running inference using the adb shell. How do I navigate around this?
Suggest a potential alternative/fix
No response