Java: AVX capability - check vector API JEP 414/417/426/438/448/460/469/489 in Java 16/17/18/19/20/21/22/23/24 - out of incubator status in Java 25 LTS #37

obriensystems · 2025-01-18T18:51:17Z

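https://openjdk.org/jeps/338 - JEP 338, Vector API (Incubator), JDK 16
https://openjdk.org/jeps/414 - JEP 414, Second Incubator, JDK 17
https://openjdk.org/jeps/417 - JEP 417, Third Incubator, JDK 18
https://openjdk.org/jeps/426 - JEP 426, Fourth Incubator, JDK 19
https://openjdk.org/jeps/438 - JEP 438, Fifth Incubator, JDK 20
https://openjdk.org/jeps/448 - JEP 448, Sixth Incubator, JDK 21
https://openjdk.org/jeps/460 - JEP 460, Seventh Incubator, JDK 22
https://openjdk.org/jeps/469 - JEP 469, Eighth Incubator, JDK 23
https://openjdk.org/jeps/489 - JEP 489, Ninth Incubator, JDK 24
The Vector API also supports the Arm Scalable Vector Extension (SVE) - https://arxiv.org/pdf/1803.06185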
via
https://en.wikipedia.org/wiki/Java_version_history#Java_SE_17
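
For the capability check itself, one option is to ask the Vector API which vector shape it prefers on the current machine. The sketch below is an assumption about how that could be done, not something taken from the JEPs: the class name `VectorCapabilityProbe` and the method `preferredVectorBits` are hypothetical, and reflection is used only so the file compiles and runs even when the incubator module is not enabled. Run it with `--add-modules jdk.incubator.vector` to get a real answer.

```java
import java.lang.reflect.Method;

// Hypothetical helper, not part of the JDK: reports the preferred float vector width.
class VectorCapabilityProbe {

    /**
     * Returns the bit size of FloatVector.SPECIES_PREFERRED, or -1 when the
     * jdk.incubator.vector module is not resolved in the running JVM.
     */
    static int preferredVectorBits() {
        try {
            Class<?> floatVector = Class.forName("jdk.incubator.vector.FloatVector");
            // Public static field: VectorSpecies<Float> SPECIES_PREFERRED
            Object species = floatVector.getField("SPECIES_PREFERRED").get(null);
            // Look the method up on the public interface, since the
            // implementing class of the species object is not public.
            Class<?> speciesType = Class.forName("jdk.incubator.vector.VectorSpecies");
            Method vectorBitSize = speciesType.getMethod("vectorBitSize");
            return (Integer) vectorBitSize.invoke(species);
        } catch (ReflectiveOperationException | LinkageError e) {
            // Module not enabled, or the API is missing in this JDK.
            return -1;
        }
    }

    public static void main(String[] args) {
        int bits = preferredVectorBits();
        if (bits < 0) {
            System.out.println("Vector API not available; try --add-modules jdk.incubator.vector");
        } else {
            System.out.println("Preferred float vector: " + bits + " bits, "
                    + (bits / Float.SIZE) + " lanes");
        }
    }
}
```

On x64, a result of 256 bits would typically point to AVX2-class registers and 512 bits to AVX-512, but the preferred shape is chosen by the JVM, so treat the number as a hint rather than a direct CPUID readout.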

obriensystems · 2025-01-18T19:00:28Z

Quoted from JEP 489 (Vector API, Ninth Incubator), which targets JDK 24

START CITE - https://openjdk.org/jeps/489
"
Here is a simple scalar computation over elements of arrays:

void scalarComputation(float[] a, float[] b, float[] c) {
    for (int i = 0; i < a.length; i++) {
        c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
    }
}

(We assume that the array arguments are of the same length.)

Here is an equivalent vector computation, using the Vector API:

static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

void vectorComputation(float[] a, float[] b, float[] c) {
    int i = 0;
    int upperBound = SPECIES.loopBound(a.length);
    for (; i < upperBound; i += SPECIES.length()) {
        // FloatVector va, vb, vc;
        var va = FloatVector.fromArray(SPECIES, a, i);
        var vb = FloatVector.fromArray(SPECIES, b, i);
        var vc = va.mul(va)
                   .add(vb.mul(vb))
                   .neg();
        vc.intoArray(c, i);
    }
    for (; i < a.length; i++) {
        c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
    }
}

To start, we obtain a preferred species whose shape is optimal for the current architecture from FloatVector. We store it in a static final field so that the runtime compiler treats the value as constant and can therefore better optimize the vector computation. The main loop then iterates over the input arrays in strides of the vector length, i.e., the species length. It loads float vectors of the given species from arrays a and b at the corresponding index, fluently performs the arithmetic operations, and then stores the result into array c. If any array elements are left over after the last iteration then the results for those tail elements are computed with an ordinary scalar loop.

This implementation achieves optimal performance on large arrays. The HotSpot C2 compiler generates machine code similar to the following on an Intel x64 processor supporting AVX:

  0.43%  / │  0x0000000113d43890: vmovdqu 0x10(%r8,%rbx,4),%ymm0
  7.38%  │ │  0x0000000113d43897: vmovdqu 0x10(%r10,%rbx,4),%ymm1
  8.70%  │ │  0x0000000113d4389e: vmulps %ymm0,%ymm0,%ymm0
  5.60%  │ │  0x0000000113d438a2: vmulps %ymm1,%ymm1,%ymm1
 13.16%  │ │  0x0000000113d438a6: vaddps %ymm0,%ymm1,%ymm0
 21.86%  │ │  0x0000000113d438aa: vxorps -0x7ad76b2(%rip),%ymm0,%ymm0
  7.66%  │ │  0x0000000113d438b2: vmovdqu %ymm0,0x10(%r9,%rbx,4)
 26.20%  │ │  0x0000000113d438b9: add    $0x8,%ebx
  6.44%  │ │  0x0000000113d438bc: cmp    %r11d,%ebx
         \ │  0x0000000113d438bf: jl     0x0000000113d43890

This is the output of a JMH micro-benchmark for the above code using the prototype of the Vector API and implementation found on the vectorIntrinsics branch of Project Panama's development repository. These hot areas of generated machine code show a clear translation to vector registers and vector instructions. We disabled loop unrolling (via the HotSpot option -XX:LoopUnrollLimit=0) in order to make the translation clearer; otherwise, HotSpot would unroll this code using existing C2 loop optimizations. All Java object allocations are elided.

(HotSpot is capable of auto-vectorizing the scalar computation in this particular example, and it will generate a similar sequence of vector instructions. The main difference is that the auto-vectorizer generates a vector multiply instruction for the multiplication by -1.0f, whereas the Vector API implementation generates a vector XOR instruction that flips the sign bit. However, the key point of this example is to present the Vector API and show how its implementation generates vector instructions, rather than to compare it to the auto-vectorizer.)

On platforms supporting predicate registers, the example above could be written more simply, without the scalar loop to process the tail elements, while still achieving optimal performance:

void vectorComputation(float[] a, float[] b, float[] c) {
    for (int i = 0; i < a.length; i += SPECIES.length()) {
        // VectorMask<Float>  m;
        var m = SPECIES.indexInRange(i, a.length);
        // FloatVector va, vb, vc;
        var va = FloatVector.fromArray(SPECIES, a, i, m);
        var vb = FloatVector.fromArray(SPECIES, b, i, m);
        var vc = va.mul(va)
                   .add(vb.mul(vb))
                   .neg();
        vc.intoArray(c, i, m);
    }
}

In the loop body we obtain a loop dependent mask for input to the load and store operations. When i < SPECIES.loopBound(a.length) the mask, m, declares all lanes are set. For the last iteration of the loop, when SPECIES.loopBound(a.length) <= i < a.length and (a.length - i) <= SPECIES.length(), the mask may declare a suffix of unset lanes. The load and store operations will not throw out-of-bounds exceptions since the mask prevents access to the array beyond its length.

"
END CITE https://openjdk.org/jeps/489
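
To make the loop-bound and mask behaviour described in the quote easier to check without enabling the incubator module, here is a plain-Java emulation. It is a sketch under stated assumptions, not the Vector API itself: `VectorLoopSketch`, `LANES`, and the helper methods are hypothetical names, the lane count is fixed at 8 (one 256-bit register of floats), and the helpers only mirror what `SPECIES.loopBound` and `SPECIES.indexInRange` are documented to compute for non-negative offsets. The `negate` helper flips the IEEE 754 sign bit, which is the effect of the `vxorps` instruction in the listing.

```java
import java.util.Arrays;

// Plain-Java emulation of the quoted masked loop; no jdk.incubator.vector needed.
class VectorLoopSketch {
    // Assumed lane count: 8 floats fit in one 256-bit register.
    static final int LANES = 8;

    // Largest multiple of LANES that is <= length (what loopBound returns).
    static int loopBound(int length) {
        return length - (length % LANES);
    }

    // Lane k is set when offset + k is still inside the array (offset >= 0 assumed).
    static boolean[] indexInRange(int offset, int limit) {
        boolean[] mask = new boolean[LANES];
        for (int k = 0; k < LANES; k++) {
            mask[k] = offset + k < limit;
        }
        return mask;
    }

    // Negate by XOR-ing the sign bit instead of multiplying by -1.0f.
    static float negate(float x) {
        return Float.intBitsToFloat(Float.floatToRawIntBits(x) ^ 0x80000000);
    }

    // Scalar reference, as in the quoted JEP text.
    static void scalarComputation(float[] a, float[] b, float[] c) {
        for (int i = 0; i < a.length; i++) {
            c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
        }
    }

    // Emulates the masked vector loop: unset lanes are neither read nor written.
    static void maskedComputation(float[] a, float[] b, float[] c) {
        for (int i = 0; i < a.length; i += LANES) {
            boolean[] m = indexInRange(i, a.length);
            for (int k = 0; k < LANES; k++) {
                if (m[k]) {
                    int j = i + k;
                    c[j] = negate(a[j] * a[j] + b[j] * b[j]);
                }
            }
        }
    }

    public static void main(String[] args) {
        float[] a = new float[10];
        float[] b = new float[10];
        for (int i = 0; i < a.length; i++) {
            a[i] = i;
            b[i] = 0.5f * i;
        }
        float[] expected = new float[a.length];
        float[] actual = new float[a.length];
        scalarComputation(a, b, expected);
        maskedComputation(a, b, actual);
        System.out.println("loopBound(10) = " + loopBound(a.length));
        System.out.println("tail mask     = " + Arrays.toString(indexInRange(8, a.length)));
        System.out.println("results match = " + Arrays.equals(expected, actual));
    }
}
```

With 10 elements and 8 lanes, the first iteration runs with every lane set and the second with only the first two lanes set, which is the "suffix of unset lanes" case from the quoted explanation.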

obriensystems added the Java label Jan 18, 2025

obriensystems self-assigned this Jan 18, 2025

obriensystems changed the title ~~Java: check vector API JEP 414 in Java 17~~ Java: check vector API JEP 414/417 in Java 17 - out of incubator status Jan 18, 2025

obriensystems changed the title ~~Java: check vector API JEP 414/417 in Java 17 - out of incubator status~~ Java: check vector API JEP 414/417/426/438/448/460/469/489 in Java 16/17/18/19/20/21/22/23/24 - out of incubator status in Java 25 LTS Jan 18, 2025

obriensystems added 128 duplicate avx and removed duplicate labels Jan 18, 2025
