Improve VectorUtil::xorBitCount perf on ARM #13545
This commit improves the performance of VectorUtil::xorBitCount on ARM by ~4x. This change is effectively a workaround for the lack of vectorization of Long::bitCount on ARM. On x64 there is no issue; the long variant of xorBitCount outperforms the int variant by ~15%.
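For readers following along, here is a minimal sketch of the two stride widths being compared. It is not the code from this PR: the class name `XorBitCountSketch`, the helper names, and the `os.arch` check used to pick a variant are all assumptions made for illustration, and the actual dispatch in Lucene may differ.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Hypothetical illustration class; names are not taken from the PR.
final class XorBitCountSketch {
  // Views over a byte[] that read 8 or 4 bytes at a time.
  private static final VarHandle VH_LONG =
      MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);
  private static final VarHandle VH_INT =
      MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);

  // Assumption: detect ARM via the os.arch system property.
  private static final boolean IS_AARCH64 =
      "aarch64".equals(System.getProperty("os.arch"));

  /** Strides 64 bits at a time and counts set bits with Long::bitCount. */
  static int xorBitCountLong(byte[] a, byte[] b) {
    int distance = 0;
    int i = 0;
    // Largest multiple of 8 that fits in the array.
    for (int upper = a.length & ~(Long.BYTES - 1); i < upper; i += Long.BYTES) {
      distance += Long.bitCount((long) VH_LONG.get(a, i) ^ (long) VH_LONG.get(b, i));
    }
    // Tail: remaining bytes, one at a time.
    for (; i < a.length; i++) {
      distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    return distance;
  }

  /** Strides 32 bits at a time and counts set bits with Integer::bitCount. */
  static int xorBitCountInt(byte[] a, byte[] b) {
    int distance = 0;
    int i = 0;
    // Largest multiple of 4 that fits in the array.
    for (int upper = a.length & ~(Integer.BYTES - 1); i < upper; i += Integer.BYTES) {
      distance += Integer.bitCount((int) VH_INT.get(a, i) ^ (int) VH_INT.get(b, i));
    }
    for (; i < a.length; i++) {
      distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    return distance;
  }

  /** Picks the int variant on ARM and the long variant elsewhere. */
  static int xorBitCount(byte[] a, byte[] b) {
    return IS_AARCH64 ? xorBitCountInt(a, b) : xorBitCountLong(a, b);
  }
}
```

Both variants return the same Hamming distance for equal-length inputs; only the stride width, and therefore the bitCount intrinsic the JIT has to vectorize, changes.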
* For xorBitCount we stride over the values as either 64-bits (long) or 32-bits (int) at a time.
* On ARM Long::bitCount is not vectorized, and therefore produces less than optimal code, when
* compared to Integer::bitCount. While Long::bitCount is optimal on x64. TODO: include the
* OpenJDK JIRA url
Do you have the JIRA issue number already?
Oops. I just added it in 4baaeda
I reverted the addition of the file to 9.x branch: 86d080a
@uschindler Apologies, I didn't notice this when cherry-picking. Thanks for reverting (while I was sleeping ;-) )
This commit improves the performance of VectorUtil::xorBitCount on ARM by ~4x.

This change is effectively a workaround for the lack of vectorization of Long::bitCount on ARM, see https://github.com/ChrisHegarty/hammingBench/. I'll get an issue filed against Hotspot for this. (JDK bug tracking this issue: https://bugs.openjdk.org/browse/JDK-8336000)

On x64 there is no issue; the long variant of xorBitCount outperforms the int variant by ~15%.
Before (measures throughput in seconds, so bigger numbers are better)
After