Use interleaved SHA-NI for SHA-256 on some CPUs lacking AVX-512 #5437
Comments
I experimented with this and the performance dropped dramatically: about 11M/s when mixing SHA1-NI with AVX2, versus about 49M/s with AVX2 alone. I also tested 3x interleaving for SHA1-NI, and it improves performance by 5% with Clang on my AMD CPU, but the gain depended on the compiler used, and I am not sure how it will behave on other CPUs with different caches. I didn't try 3x interleaving for SHA256-NI.
I don't have much historical data, but I bet you are correct. All of my last 5 (GitHub Actions) runs have been in:
I was testing it on a 13900. According to agner.org we have the following latencies in cycles respectively for sha256rnds2, sha256msg1, sha256msg2: Intel This doesn't match the 13900 at all, so I wanted to measure it myself, but I lost access to it for a while, so it will have to wait.
If I measured it correctly, on the 14700K it is 3, 2, 2. 13th gen should be similar.
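For reference, the usual back-of-the-envelope reasoning behind choosing an interleave factor: an instruction with a latency of L cycles that the core can start once every T cycles needs roughly ceil(L/T) independent dependency chains in flight to keep the unit busy. Below is a minimal sketch of that arithmetic. The function name is made up for illustration, and any reciprocal-throughput value passed to it is an assumption, since only latencies are quoted in this thread.

```c
/*
 * Rough estimate of how many independent dependency chains (i.e. how many
 * hash computations to interleave) are needed to hide an instruction's
 * latency: ceil(latency / reciprocal_throughput).
 *
 * Hypothetical helper for illustration only. The reciprocal throughput
 * (cycles between successive issues of the instruction) has to come from
 * measurement or instruction tables; it is not given in this thread.
 */
unsigned chains_to_hide_latency(unsigned latency_cycles,
                                unsigned recip_throughput_cycles)
{
    if (recip_throughput_cycles == 0)
        return 0; /* invalid input: avoid dividing by zero */

    /* Integer ceiling division. */
    return (latency_cycles + recip_throughput_cycles - 1)
           / recip_throughput_cycles;
}
```

With the 3-cycle sha256rnds2 latency measured above, `chains_to_hide_latency(3, 1)` gives 3 and `chains_to_hide_latency(3, 2)` gives 2. This only bounds the useful interleave factor; the actual gain also depends on the message-schedule instructions, register pressure, and the compiler, so it still has to be benchmarked.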
In #5435 (comment) @ukasz wrote:
As it happens, @alainesp was also experimenting with that just recently:
https://github.com/alainesp/fast-small-crypto
The preliminary results we have suggest that on some AMD CPUs, 2x interleaved SHA-NI can be almost twice as fast as 1x, and ~75% faster than AVX2: https://github.com/alainesp/fast-small-crypto/actions/runs/7876924916/job/21491982542 (I am only guessing that this ran on an AMD CPU, but the results look similar to Alain's testing on his known AMD CPU).
However, in my testing of Alain's code on Intel Tiger Lake (11th gen), built with gcc 11, 1x and 2x SHA-NI run at similar speeds to each other, very slightly slower than AVX2 and almost 3 times slower than AVX-512.
There's no improvement from SHA-NI for SHA-1 anywhere we tested.
@ukasz What CPUs did you see improved latencies for, and what CPU are you testing on? Maybe things improved on newer Intel CPUs. If any of those lack AVX-512, it could be reasonable to use SHA-NI there as well.
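To make the idea concrete, here is a rough sketch of how such a choice could be made at run time. It is not code from this project: the enum, struct, and function names are hypothetical, and the ordering (AVX-512 first, then SHA-NI only when a benchmark says it beats AVX2) is an assumption based on the mixed AMD/Intel results above. The feature bits are read from CPUID leaf 7, sub-leaf 0, EBX (bit 5 AVX2, bit 16 AVX-512F, bit 29 SHA) via `__get_cpuid_count` from the GCC/Clang `<cpuid.h>` header.

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang only; provides __get_cpuid_count() */
#endif

/* Hypothetical backend identifiers, for illustration only. */
enum sha256_backend {
    BACKEND_SCALAR,
    BACKEND_AVX2,
    BACKEND_SHA_NI_2X,   /* 2x interleaved SHA-NI */
    BACKEND_AVX512
};

struct cpu_features {
    bool avx2;
    bool avx512f;
    bool sha_ni;
};

/*
 * Read the relevant CPUID bits. Simplification: production code should
 * also confirm OS support for the AVX/AVX-512 register state (XGETBV),
 * which this sketch omits. On non-x86 builds all flags stay false.
 */
struct cpu_features detect_cpu_features(void)
{
    struct cpu_features f = { false, false, false };
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        f.avx2    = (ebx >> 5) & 1;
        f.avx512f = (ebx >> 16) & 1;
        f.sha_ni  = (ebx >> 29) & 1;
    }
#endif
    return f;
}

/*
 * Assumed policy: AVX-512 wins where available; without it, SHA-NI is
 * used only if it was found faster than AVX2 on this CPU (true on the
 * AMD results above, false on Tiger Lake), or if AVX2 is absent.
 */
enum sha256_backend choose_sha256_backend(struct cpu_features f,
                                          bool sha_ni_beats_avx2)
{
    if (f.avx512f)
        return BACKEND_AVX512;
    if (f.sha_ni && (sha_ni_beats_avx2 || !f.avx2))
        return BACKEND_SHA_NI_2X;
    if (f.avx2)
        return BACKEND_AVX2;
    return BACKEND_SCALAR;
}
```

The `sha_ni_beats_avx2` flag is left to the caller on purpose: given the results so far, it would probably have to come from a quick self-benchmark or a per-vendor table rather than from CPUID alone.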
I wonder if it would make sense to mix SHA-NI and AVX2 or AVX-512 instructions on any CPUs. I guess this depends on what execution ports these groups of instructions utilize.
Separately, I hear that similar instructions for SHA-512 are coming in near-future CPUs. I guess those will outperform AVX2, but not necessarily AVX-512.