Performance Issue while using Metal backend. #2659
Comments
BTW, if I run inference on this model with PyTorch it is very fast, but when I translate the model code to candle it is about 5 times slower than the torch model. Is it possible that this issue is the cause?
Just to mention something that may be a bit obvious: metal is an asynchronous api (same as cuda), so in order to time things properly you have to insert some "synchronize" calls to ensure that the code has fully run up to the point where you make the time measurements.
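A minimal sketch of that kind of timed measurement, assuming candle's `Device::new_metal` and `Device::synchronize` APIs; the matmul and shapes here are placeholders standing in for the actual model's forward pass:

```rust
use candle_core::{Device, Result, Tensor};
use std::time::Instant;

fn main() -> Result<()> {
    let device = Device::new_metal(0)?;
    let a = Tensor::randn(0f32, 1.0, (1024, 1024), &device)?;
    let b = Tensor::randn(0f32, 1.0, (1024, 1024), &device)?;

    // Without a synchronize, `elapsed()` only measures how long it takes to
    // *enqueue* the kernels, not to run them; later calls then look slower
    // because they end up waiting behind the previously queued work.
    let start = Instant::now();
    let _c = a.matmul(&b)?;
    device.synchronize()?; // wait for the GPU to finish before reading the clock
    println!("matmul took {:?}", start.elapsed());
    Ok(())
}
```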
I have the same concern. The same model with Ollama works pretty fast. As far as I can see, the problem is in one method. Is it possible to use it with a cache?
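On the cache question, candle-nn ships a `kv_cache` helper; a rough sketch of reusing keys/values across decoding steps could look like the following. The tensor shapes, head counts, and the exact `KvCache::new` / `append` signatures are assumptions here rather than something confirmed against the issue's model, so treat this as an illustration only:

```rust
use candle_core::{Device, Result, Tensor};
use candle_nn::kv_cache::KvCache;

fn main() -> Result<()> {
    let device = Device::new_metal(0)?;
    // Cache keys/values along the sequence dimension (dim 2 for a
    // [batch, heads, seq, head_dim] layout), with room for 512 positions.
    let mut cache = KvCache::new(2, 512);
    for _step in 0..4 {
        // One new token's worth of keys/values per decoding step.
        let k = Tensor::randn(0f32, 1.0, (1, 8, 1, 64), &device)?;
        let v = Tensor::randn(0f32, 1.0, (1, 8, 1, 64), &device)?;
        // `append` returns the full cached keys/values, so attention can look
        // at every previous position without recomputing them each step.
        let (k_all, _v_all) = cache.append(&k, &v)?;
        println!("cached seq len: {}", k_all.dim(2)?);
    }
    Ok(())
}
```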
@LaurentMazare do you know of any example code around? It seems a few folks are hitting this; I see 3% GPU use and a 9-second response to "what is 2+2" from a 7b Qwen model.
Hi there,
Candle is a really great project for working in Rust, but I'm currently facing a performance issue when I use it to run inference on my model.
Here is some of my model code:
After I call this forward function many times, the time it takes keeps increasing. Here is the log:
Does anyone know why the time per call can grow from 1ms to 500ms?