-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Java: AVX capability - check vector API JEP 414/417/426/438/448/460/469/489 in Java 16/17/18/19/20/21/22/23/24 - out of incubator status in Java 25 LTS #37
Comments
Quoted from JEP 489 for Java 25 START CITE - https://openjdk.org/jeps/489
(We assume that the array arguments are of the same length.) Here is an equivalent vector computation, using the Vector API:
To start, we obtain a preferred species whose shape is optimal for the current architecture from FloatVector. We store it in a static final field so that the runtime compiler treats the value as constant and can therefore better optimize the vector computation. The main loop then iterates over the input arrays in strides of the vector length, i.e., the species length. It loads float vectors of the given species from arrays a and b at the corresponding index, fluently performs the arithmetic operations, and then stores the result into array c. If any array elements are left over after the last iteration then the results for those tail elements are computed with an ordinary scalar loop. This implementation achieves optimal performance on large arrays. The HotSpot C2 compiler generates machine code similar to the following on an Intel x64 processor supporting AVX:
This is the output of a JMH micro-benchmark for the above code using the prototype of the Vector API and implementation found on the vectorIntrinsics branch of Project Panama's development repository. These hot areas of generated machine code show a clear translation to vector registers and vector instructions. We disabled loop unrolling (via the HotSpot option -XX:LoopUnrollLimit=0) in order to make the translation clearer; otherwise, HotSpot would unroll this code using existing C2 loop optimizations. All Java object allocations are elided. (HotSpot is capable of auto-vectorizing the scalar computation in this particular example, and it will generate a similar sequence of vector instructions. The main difference is that the auto-vectorizer generates a vector multiply instruction for the multiplication by -1.0f, whereas the Vector API implementation generates a vector XOR instruction that flips the sign bit. However, the key point of this example is to present the Vector API and show how its implementation generates vector instructions, rather than to compare it to the auto-vectorizer.) On platforms supporting predicate registers, the example above could be written more simply, without the scalar loop to process the tail elements, while still achieving optimal performance:
In the loop body we obtain a loop dependent mask for input to the load and store operations. When i < SPECIES.loopBound(a.length) the mask, m, declares all lanes are set. For the last iteration of the loop, when SPECIES.loopBound(a.length) <= i < a.length and (a.length - i) <= SPECIES.length(), the mask may declare a suffix of unset lanes. The load and store operations will not throw out-of-bounds exceptions since the mask prevents access to the array beyond its length. " |
https://openjdk.org/jeps/338
https://openjdk.org/jeps/414
https://openjdk.org/jeps/417
https://openjdk.org/jeps/426
https://openjdk.org/jeps/438
https://openjdk.org/jeps/448
https://openjdk.org/jeps/460
https://openjdk.org/jeps/469
https://openjdk.org/jeps/489
supports Arm Scalable Vector Extension - https://arxiv.org/pdf/1803.06185
via
https://en.wikipedia.org/wiki/Java_version_history#Java_SE_17
The text was updated successfully, but these errors were encountered: