-
Notifications
You must be signed in to change notification settings - Fork 138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimized rope_rotation_llama and apply temperature to logits with vectorization #59
base: master
Are you sure you want to change the base?
Conversation
Hi @andresnowak thanks for your PR. I'll review it shortly.. |
Not really, at least when testing with the 15m and 110m models i didn't see a difference or the difference was to little, probably the matmul operation just takes the majority of the time and you can't see to much of a difference optimizing other operations (but i was checking if it would be possible to tile the matmul operation), but i think in bigger models there should be somewhat of a difference. But i can't also test to well for improvements, because each run gives very different toks/s, i sometimes get 330 and sometimes 410 in a ryzen 3600x. |
I vectorized ROPE as you have a while back but the networks were much too small to see any impact. I also did micro benchmarks focused on the ROPE function and even head size 25000 (512 * 512) had no improvement. That is already far bigger than any LLM uses. At bigger sizes it did get faster. I am just not sure it is relevant. But it wasn't slower as far as I could tell. My tests were on 4 core 32 GB ram Ubuntu and also the playground which is usually 32 cores. |
hmm interesting, but shouldn't vectorized be faster with a 1000 values already, or is strided load and store not as efficient because it has to access to separated parts in memory?, or is the speedup just not noticeable until you get to bigger iteration sizes |
I think this will get a performance improvement by removing the parallelize call of the loop over heads in ROPE and replacing with a simple for loop. Setting up threads with parallelize is time consuming and I think removing it is a net performance gain. Also there is a version of
|
Regarding the parallelize part, Why would it be faster to remove it, isn't parallelize supposedly using a cached runtime?, so it shouldn't be creating threads each time it is called no?. Or am I not understanding how the implementation works? |
Doing some tests in a 3600x @mikowals, i only saw a difference in the 15m model when comparing rope with parallelization and without it, for 110m and tiny_llama i didn't see a difference. And doing some isolated benchmarking for the rope function i saw that the version without parallelization was slower than the parallelized one when an amount of values was that was 5000 or more (ex. head_size = 1000, n_heads = 6), but with less values, the version without parallelization was faster, so if in general we always have less than 5000 values, then we could remove the parallelization |
Maybe i didn't understand correctly, in these benchmarks you are comparing the original implementation and the rope implementation with simd and without parallelization no?, what i was comparing was rope with simd and parallelization and rope with simd and no parallelization. But yeah as you say if the sizes are always in the range of the size of baby llama, then it can be better to remove parallelization. But one last thing, when you where doing the benchmarks did you compare parallelization using all cores on the m1 pro or only the performance cores, because i think using all cores in the m1 can be slower than using only the performance cores, because we are dividing the work exactly for the amount of cores in the machine, and from what i understand the efficiency cores are a lot slower than the performance cores |
I put the comparisons I did in this branch. The graphs above are done where "V1 is current master and V2 is with the two line change to remove parallelize and replace with a for-loop." I used lamatune and set V1 to current master and V2 to the branch I just linked to. The isolated benchmarks I ran are in this file. I run all tests on my system with My comments on this PR are really just that there seems to be easy performance improvement from removing |
@mikowals Doing the lamatune and doing your isolated benchmarks I have found the same conclusion as you, simple vanilla for loops are faster than parallelize, vectorize and vectorize parallelize, for all the available models. In the benchmarks (v1 is current master, v2 is your implementation removing the parallelization, and v3 is my implementation with parallelize and vectorize). And for the isolated benchmarks, vanilla for loops was faster than all the other implementations for all the models up to tinyllama, in tinyllama it was the same speed as vectorize_parallelize and parallelize, and for the imaginary networks the fastest one was parallelize. The only thing is that i don't understand why is vectorization slower than vanilla for loops, i don't know if it has to do with cache misses or that maybe simd_strided is a slower instruction, or i just don't know. But yeah, it seems it would be better to remove parallelization for rope_llama if we don't get to the sizes of the imaginary network, or maybe we can add an if inside the function to run rope_llama with parallelization or not depending on the size
|
Probably for simple loops Mojo already has some levels of optimizations by default. Similarly as gcc optimizes |
Hmm, but vectorize also uses for loops no?, so it should have the same optimizations no?, or is it not applying the same optimizations |
That's hard to say. Possibly we can manually tune the source code to make it "more optimal" than standard optimizations. But if there is no manually optimized loops, the compiler could attempt to do so. |
hhmmm, okay, I understand. |
No description provided.