Canonical sign-replication operation #437
Comments
@Maratyszcza Interesting. Do I understand correctly that an engine would have to check whether the shift count matches the magic value, and then emit the corresponding code? It seems that might hit branch mispredicts if mixed with regular non-magic shift operations?
I would hope that engines would be able to recognize that e.g. ...
Engines already treat shifts with static count differently than shifts with dynamic count. What I suggest is an additional check + codegen specialization when the shift count is static. Since this is a compile-time decision, no branches will be generated in the translated code.
We already locked the instruction proposals in November, so any new instruction proposal would have to wait until SIMD post-MVP.
We don't have a way to standardize certain immediate patterns as canonical forms of operations; this is something engines could optimize as described. I'll label this as perf documentation for now, while we figure out the best way to surface this information. Perhaps in #436, if we run through the other things already on the agenda.
For this particular case, adding new operations does not seem necessary, as the optimization is a special case of the shift operation and can be handled in a relatively straightforward way in the engine.
@Maratyszcza @dtig @tlively I understand, thanks for explaining. Documentation + internal optimization in engines sounds good!
Sign-replication is an often-used operation that replicates the sign bit of a SIMD lane into all bits of the lane. There are two reasons why we need to pay attention to sign-replication. First, it can be encoded in WebAssembly SIMD in several equivalent ways (a scalar model of these encodings follows the list):
i8x16.shr_s(v, -1) / i16x8.shr_s(v, -1) / i32x4.shr_s(v, -1) / i64x2.shr_s(v, -1)
i8x16.shr_s(v, 7) / i16x8.shr_s(v, 15) / i32x4.shr_s(v, 31) / i64x2.shr_s(v, 63)
i8x16.neg(i8x16.shr_u(v, -1)) / i16x8.neg(i16x8.shr_u(v, -1)) / i32x4.neg(i32x4.shr_u(v, -1)) / i64x2.neg(i64x2.shr_u(v, -1))
i8x16.neg(i8x16.shr_u(v, 7)) / i16x8.neg(i16x8.shr_u(v, 15)) / i32x4.neg(i32x4.shr_u(v, 31)) / i64x2.neg(i64x2.shr_u(v, 63))
i8x16.lt_s(v, v128.const(0)) / i16x8.lt_s(v, v128.const(0)) / i32x4.lt_s(v, v128.const(0)) / i64x2.lt_s(v, v128.const(0))
(Shift counts are taken modulo the lane width, so a count of -1 is equivalent to 7/15/31/63.)
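For a single lane, all of these encodings compute the same function. A minimal scalar model in C (the helper names are mine; note that `>>` on a negative signed value is technically implementation-defined in C, though arithmetic on all mainstream compilers):

```c
#include <stdint.h>

/* Scalar model of the Wasm encodings for one i32 lane: each returns 0 for
   non-negative inputs and -1 (all bits set) for negative inputs. */
int32_t via_shr_s(int32_t x)     { return x >> 31; }                        /* i32x4.shr_s(v, 31) */
int32_t via_neg_shr_u(int32_t x) { return -(int32_t)((uint32_t)x >> 31); }  /* i32x4.neg(i32x4.shr_u(v, 31)) */
int32_t via_lt_s(int32_t x)      { return x < 0 ? -1 : 0; }                 /* i32x4.lt_s(v, v128.const(0)) */
```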
Second, sign-replication can be lowered in many ways depending on the data type and the target instruction set, as noted by @jan-wassenberg in Sign Select instructions #124.
My suggestion is to designate
i8x16.shr_s(v, -1) / i16x8.shr_s(v, -1) / i32x4.shr_s(v, -1) / i64x2.shr_s(v, -1)
and
i8x16.shr_s(v, 7) / i16x8.shr_s(v, 15) / i32x4.shr_s(v, 31) / i64x2.shr_s(v, 63)
as the canonical sign-replication instructions, and to recommend that WebAssembly engines lower them differently than other arithmetic shift instructions.
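A sketch of how a toolchain can emit the canonical form, assuming clang's wasm_simd128.h intrinsics (wasm_iNxM_shr lowers to the corresponding shr_s); the sign_mask_* helper names are hypothetical:

```c
#include <wasm_simd128.h>

/* Emit the canonical sign-replication form for each lane type. */
v128_t sign_mask_i8x16(v128_t v) { return wasm_i8x16_shr(v, 7);  }  /* i8x16.shr_s(v, 7)  */
v128_t sign_mask_i16x8(v128_t v) { return wasm_i16x8_shr(v, 15); }  /* i16x8.shr_s(v, 15) */
v128_t sign_mask_i32x4(v128_t v) { return wasm_i32x4_shr(v, 31); }  /* i32x4.shr_s(v, 31) */
v128_t sign_mask_i64x2(v128_t v) { return wasm_i64x2_shr(v, 63); }  /* i64x2.shr_s(v, 63) */
```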
Mapping to Common Instruction Sets
This section illustrates how the canonical sign-replication instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
x86/x86-64 processors with AVX512F and AVX512VL instruction sets
y = i64x2.shr_s(x, 63) is lowered to VPSRAQ xmm_y, xmm_x, 63
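The same lowering expressed with x86 intrinsics, assuming a compiler targeting AVX512F+AVX512VL (_mm_srai_epi64 requires both); the helper name is mine:

```c
#include <immintrin.h>

/* i64x2 sign replication in one instruction: VPSRAQ xmm, xmm, 63. */
__m128i sign_mask_i64x2_avx512(__m128i x) {
    return _mm_srai_epi64(x, 63);
}
```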
x86/x86-64 processors with AVX instruction set
y = i8x16.shr_s(x, 7) (where y is not x) is lowered to:
VPXOR xmm_y, xmm_y, xmm_y
VPCMPGTB xmm_y, xmm_y, xmm_x
x = i8x16.shr_s(x, 7) is lowered to:
VPXOR xmm_tmp, xmm_tmp, xmm_tmp
VPCMPGTB xmm_x, xmm_tmp, xmm_x
y = i16x8.shr_s(x, 15) is lowered to VPSRAW xmm_y, xmm_x, 15
y = i32x4.shr_s(x, 31) is lowered to VPSRAD xmm_y, xmm_x, 31
y = i64x2.shr_s(x, 63) is lowered to:
VPSRAD xmm_y, xmm_x, 31
VPSHUFD xmm_y, xmm_y, 0xF5
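An intrinsics rendering of the two non-trivial AVX cases (the i16x8 and i32x4 forms map directly to _mm_srai_epi16/_mm_srai_epi32); helper names are mine:

```c
#include <immintrin.h>

/* i8x16: x86 has no 8-bit arithmetic shift, so compare against zero instead
   (VPXOR + VPCMPGTB): 0 > x is true exactly for negative lanes. */
__m128i sign_mask_i8x16_avx(__m128i x) {
    return _mm_cmpgt_epi8(_mm_setzero_si128(), x);
}

/* i64x2 without AVX512: shift each 32-bit half by 31 (VPSRAD), then
   broadcast the high half of each 64-bit lane (VPSHUFD, imm 0xF5). */
__m128i sign_mask_i64x2_avx(__m128i x) {
    return _mm_shuffle_epi32(_mm_srai_epi32(x, 31), _MM_SHUFFLE(3, 3, 1, 1));
}
```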
x86/x86-64 processors with SSE2 instruction set
y = i8x16.shr_s(x, 7) (where y is not x) is lowered to:
PXOR xmm_y, xmm_y
PCMPGTB xmm_y, xmm_x
x = i8x16.shr_s(x, 7) is lowered to:
MOVDQA xmm_tmp, xmm_x
PXOR xmm_x, xmm_x
PCMPGTB xmm_x, xmm_tmp
y = i16x8.shr_s(x, 15) is lowered to:
PXOR xmm_y, xmm_y
PCMPGTW xmm_y, xmm_x
x = i16x8.shr_s(x, 15) is lowered to PSRAW xmm_x, 15
y = i32x4.shr_s(x, 31) is lowered to:
PXOR xmm_y, xmm_y
PCMPGTD xmm_y, xmm_x
x = i32x4.shr_s(x, 31) is lowered to PSRAD xmm_x, 31
y = i64x2.shr_s(x, 63) is lowered to:
PSHUFD xmm_y, xmm_x, 0xF5
PSRAD xmm_y, 31
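With plain SSE2 the i8x16/i16x8/i32x4 cases use the same intrinsics as the AVX sketch above; the only structurally different case is i64x2, written shuffle-first here to match the PSHUFD + PSRAD sequence (helper name is mine):

```c
#include <emmintrin.h>

/* i64x2: broadcast the high 32 bits of each 64-bit lane (PSHUFD, imm 0xF5),
   then arithmetic-shift each 32-bit lane by 31 (PSRAD). */
__m128i sign_mask_i64x2_sse2(__m128i x) {
    return _mm_srai_epi32(_mm_shuffle_epi32(x, _MM_SHUFFLE(3, 3, 1, 1)), 31);
}
```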
ARM64 processors
y = i8x16.shr_s(x, 7) is lowered to CMLT Vy.16B, Vx.16B, #0
y = i16x8.shr_s(x, 15) is lowered to CMLT Vy.8H, Vx.8H, #0
y = i32x4.shr_s(x, 31) is lowered to CMLT Vy.4S, Vx.4S, #0
y = i64x2.shr_s(x, 63) is lowered to CMLT Vy.2D, Vx.2D, #0
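The AArch64 lowering in NEON intrinsics, assuming arm_neon.h and the A64-only vcltz* family (CMLT against zero); the compares return unsigned masks, so a reinterpret is shown for the signed view, and helper names are mine:

```c
#include <arm_neon.h>

/* i8x16: CMLT Vd.16B, Vn.16B, #0 sets a lane to all ones iff it is negative. */
int8x16_t sign_mask_i8x16_a64(int8x16_t x) {
    return vreinterpretq_s8_u8(vcltzq_s8(x));
}

/* i64x2: CMLT Vd.2D, Vn.2D, #0. */
int64x2_t sign_mask_i64x2_a64(int64x2_t x) {
    return vreinterpretq_s64_u64(vcltzq_s64(x));
}
```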
ARMv7 processors with NEON extension
y = i8x16.shr_s(x, 7) is lowered to VCLT.S8 Qy, Qx, #0
y = i16x8.shr_s(x, 15) is lowered to VCLT.S16 Qy, Qx, #0
y = i32x4.shr_s(x, 31) is lowered to VCLT.S32 Qy, Qx, #0
y = i64x2.shr_s(x, 63) is lowered to VSHR.S64 Qy, Qx, #63
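For ARMv7, where the vcltz* intrinsics are unavailable, a compare against an explicit zero vector yields VCLT.S8 with #0, and the missing 64-bit compare is replaced by an arithmetic shift (VSHR.S64); a sketch with hypothetical helper names:

```c
#include <arm_neon.h>

/* i8x16: VCLT.S8 Qd, Qn, #0 via a compare with a zero vector. */
int8x16_t sign_mask_i8x16_v7(int8x16_t x) {
    return vreinterpretq_s8_u8(vcltq_s8(x, vdupq_n_s8(0)));
}

/* i64x2: ARMv7 NEON has no 64-bit compare, so shift arithmetically by 63. */
int64x2_t sign_mask_i64x2_v7(int64x2_t x) {
    return vshrq_n_s64(x, 63);
}
```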