Java: AVX capability - check vector API JEP 338/414/417/426/438/448/460/469/489 in Java 16/17/18/19/20/21/22/23/24 - still incubating in Java 25 LTS (JEP 508) #37

Open
obriensystems opened this issue Jan 18, 2025 · 1 comment
obriensystems commented Jan 18, 2025

Quoted from JEP 489 (Vector API, Ninth Incubator), delivered in JDK 24

START CITE - https://openjdk.org/jeps/489
"
Here is a simple scalar computation over elements of arrays:

void scalarComputation(float[] a, float[] b, float[] c) {
    for (int i = 0; i < a.length; i++) {
        c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
    }
}

(We assume that the array arguments are of the same length.)

Here is an equivalent vector computation, using the Vector API:

static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

void vectorComputation(float[] a, float[] b, float[] c) {
    int i = 0;
    int upperBound = SPECIES.loopBound(a.length);
    for (; i < upperBound; i += SPECIES.length()) {
        // FloatVector va, vb, vc;
        var va = FloatVector.fromArray(SPECIES, a, i);
        var vb = FloatVector.fromArray(SPECIES, b, i);
        var vc = va.mul(va)
                   .add(vb.mul(vb))
                   .neg();
        vc.intoArray(c, i);
    }
    for (; i < a.length; i++) {
        c[i] = (a[i] * a[i] + b[i] * b[i]) * -1.0f;
    }
}

To start, we obtain a preferred species whose shape is optimal for the current architecture from FloatVector. We store it in a static final field so that the runtime compiler treats the value as constant and can therefore better optimize the vector computation. The main loop then iterates over the input arrays in strides of the vector length, i.e., the species length. It loads float vectors of the given species from arrays a and b at the corresponding index, fluently performs the arithmetic operations, and then stores the result into array c. If any array elements are left over after the last iteration then the results for those tail elements are computed with an ordinary scalar loop.
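
Note outside the quoted JEP text: the split between the strided main loop and the scalar tail can be modeled in plain Java, without the incubator module. This is only a sketch under assumptions: `LANES = 8` is an assumed lane count (eight floats in a 256-bit AVX register), and the `loopBound` method here is a hypothetical stand-in for `SPECIES.loopBound`, not the real API.

```java
class LoopBoundSketch {
    // Assumed lane count; stands in for SPECIES.length() on a 256-bit float species.
    static final int LANES = 8;

    // Models loopBound: the largest multiple of LANES that is <= length
    // (assumes length >= 0).
    static int loopBound(int length) {
        return length - (length % LANES);
    }

    public static void main(String[] args) {
        int length = 21;
        int upperBound = loopBound(length);          // end of the vector loop
        int vectorIterations = upperBound / LANES;   // full-width strides
        int tailElements = length - upperBound;      // left for the scalar loop
        System.out.println("upperBound=" + upperBound
                + " vectorIterations=" + vectorIterations
                + " tailElements=" + tailElements);
        // prints "upperBound=16 vectorIterations=2 tailElements=5"
    }
}
```

With 21 elements and 8 lanes, two vector iterations cover indices 0 to 15 and the scalar loop handles the remaining five.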

This implementation achieves optimal performance on large arrays. The HotSpot C2 compiler generates machine code similar to the following on an Intel x64 processor supporting AVX:

  0.43%  / │  0x0000000113d43890: vmovdqu 0x10(%r8,%rbx,4),%ymm0
  7.38%  │ │  0x0000000113d43897: vmovdqu 0x10(%r10,%rbx,4),%ymm1
  8.70%  │ │  0x0000000113d4389e: vmulps %ymm0,%ymm0,%ymm0
  5.60%  │ │  0x0000000113d438a2: vmulps %ymm1,%ymm1,%ymm1
 13.16%  │ │  0x0000000113d438a6: vaddps %ymm0,%ymm1,%ymm0
 21.86%  │ │  0x0000000113d438aa: vxorps -0x7ad76b2(%rip),%ymm0,%ymm0
  7.66%  │ │  0x0000000113d438b2: vmovdqu %ymm0,0x10(%r9,%rbx,4)
 26.20%  │ │  0x0000000113d438b9: add    $0x8,%ebx
  6.44%  │ │  0x0000000113d438bc: cmp    %r11d,%ebx
         \ │  0x0000000113d438bf: jl     0x0000000113d43890

This is the output of a JMH micro-benchmark for the above code using the prototype of the Vector API and implementation found on the vectorIntrinsics branch of Project Panama's development repository. These hot areas of generated machine code show a clear translation to vector registers and vector instructions. We disabled loop unrolling (via the HotSpot option -XX:LoopUnrollLimit=0) in order to make the translation clearer; otherwise, HotSpot would unroll this code using existing C2 loop optimizations. All Java object allocations are elided.

(HotSpot is capable of auto-vectorizing the scalar computation in this particular example, and it will generate a similar sequence of vector instructions. The main difference is that the auto-vectorizer generates a vector multiply instruction for the multiplication by -1.0f, whereas the Vector API implementation generates a vector XOR instruction that flips the sign bit. However, the key point of this example is to present the Vector API and show how its implementation generates vector instructions, rather than to compare it to the auto-vectorizer.)
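
Note outside the quoted JEP text: the equivalence between the XOR and the multiply by -1.0f can be checked per element in plain Java. The sketch below is a scalar illustration, not the code HotSpot generates; the class and method names are hypothetical. It flips only the IEEE 754 sign bit and compares the bit pattern with `x * -1.0f`. NaN inputs are left out on purpose, since NaN bit patterns are not guaranteed to be preserved.

```java
class SignFlipSketch {
    // Negates x by flipping bit 31 (the IEEE 754 sign bit), mirroring what a
    // vector XOR against a sign-bit mask does in each lane.
    static float negateByXor(float x) {
        return Float.intBitsToFloat(Float.floatToRawIntBits(x) ^ 0x8000_0000);
    }

    // True when both floats have exactly the same bit pattern
    // (distinguishes 0.0f from -0.0f, unlike ==).
    static boolean sameBits(float x, float y) {
        return Float.floatToRawIntBits(x) == Float.floatToRawIntBits(y);
    }

    public static void main(String[] args) {
        float[] samples = {1.5f, -2.25f, 0.0f, Float.POSITIVE_INFINITY};
        for (float x : samples) {
            float viaXor = negateByXor(x);
            float viaMul = x * -1.0f;
            System.out.println(x + " -> xor: " + viaXor + ", multiply: " + viaMul
                    + ", same bits: " + sameBits(viaXor, viaMul));
        }
    }
}
```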

On platforms supporting predicate registers, the example above could be written more simply, without the scalar loop to process the tail elements, while still achieving optimal performance:

void vectorComputation(float[] a, float[] b, float[] c) {
    for (int i = 0; i < a.length; i += SPECIES.length()) {
        // VectorMask<Float>  m;
        var m = SPECIES.indexInRange(i, a.length);
        // FloatVector va, vb, vc;
        var va = FloatVector.fromArray(SPECIES, a, i, m);
        var vb = FloatVector.fromArray(SPECIES, b, i, m);
        var vc = va.mul(va)
                   .add(vb.mul(vb))
                   .neg();
        vc.intoArray(c, i, m);
    }
}

In the loop body we obtain a loop dependent mask for input to the load and store operations. When i < SPECIES.loopBound(a.length) the mask, m, declares all lanes are set. For the last iteration of the loop, when SPECIES.loopBound(a.length) <= i < a.length and (a.length - i) <= SPECIES.length(), the mask may declare a suffix of unset lanes. The load and store operations will not throw out-of-bounds exceptions since the mask prevents access to the array beyond its length.
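
Note outside the quoted JEP text: the mask behavior described above can be modeled with a plain boolean array. This is a scalar sketch, not the Vector API itself; `LANES = 8` is an assumed lane count and `indexInRange` here is a hypothetical stand-in for `SPECIES.indexInRange`, valid only for non-negative offsets.

```java
class IndexInRangeSketch {
    // Assumed lane count; stands in for SPECIES.length().
    static final int LANES = 8;

    // Models indexInRange(offset, limit) for offset >= 0:
    // lane n is set exactly when offset + n < limit.
    static boolean[] indexInRange(int offset, int limit) {
        boolean[] mask = new boolean[LANES];
        for (int lane = 0; lane < LANES; lane++) {
            mask[lane] = offset + lane < limit;
        }
        return mask;
    }

    // Counts the set lanes in a mask.
    static int trueCount(boolean[] mask) {
        int count = 0;
        for (boolean set : mask) {
            if (set) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int length = 21;
        for (int i = 0; i < length; i += LANES) {
            System.out.println("i=" + i + " set lanes=" + trueCount(indexInRange(i, length)));
        }
        // prints:
        // i=0 set lanes=8
        // i=8 set lanes=8
        // i=16 set lanes=5
    }
}
```

Only the last iteration produces a partial mask: at `i = 16` the three lanes that would touch indices 21 to 23 are unset, so no access goes past the end of the array.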

"
END CITE https://openjdk.org/jeps/489
