Skip to content

Commit

Permalink
x86/x64: Add more red zone checks to assembler backend.
Browse files Browse the repository at this point in the history
Thanks to Peter Cawley.

(cherry picked from commit d854d00)

Assembling some instructions (like `IR_CONV int.num`, for example) with
many mcode to be emitted may overflow the `MCLIM_REDZONE` (64) at once
due to the huge mcode emitting.

For example `IR_CONV` in this test requires 66 bytes of the
machine code:
|  cvttsd2si r15d, xmm5
|  xorps xmm9, xmm9
|  cvtsi2sd xmm9, r15d
|  ucomisd xmm5, xmm9
|  jnz 0x11edb00e5       ->37
|  jpe 0x11edb00e5       ->37
|  mov [rsp+0x80], r15d
|  mov r15, [rsp+0xe8]
|  movsd xmm9, [rsp+0xe0]
|  movsd xmm5, [rsp+0xd8]

The reproducer needs sufficient register pressure as to immediately
spill the result of the instruction to the stack and then reload the
three registers used by the instruction, and to have chosen enough
registers with numbers >=8 (because shaving off a REX prefix [1] or two
would get 66 back down to <= `MCLIM_REDZONE`), and to be using lots of
spill slots (because memory offsets <= 0x7f are shorter to encode
compared to those >= 0x80. So, each reload instruction consumes 9 bytes.
This makes this reproducer unstable (regarding the register allocator
changes). Thus, only original test case is added as a regression test.

This patch adds the red zone overflow checks more often for the IRs with
many instructions to be emitted.

Sergey Kaplun:
* added the description and the test for the problem

[1]: https://wiki.osdev.org/X86-64_Instruction_Encoding#REX_prefix

Part of tarantool/tarantool#10709
  • Loading branch information
Mike Pall authored and Buristan committed Jan 20, 2025
1 parent 58b013a commit e1d956c
Show file tree
Hide file tree
Showing 2 changed files with 124 additions and 2 deletions.
8 changes: 6 additions & 2 deletions src/lj_asm_x86.h
Original file line number Diff line number Diff line change
Expand Up @@ -794,6 +794,7 @@ static void asm_tointg(ASMState *as, IRIns *ir, Reg left)
emit_rr(as, XO_UCOMISD, left, tmp);
emit_rr(as, XO_CVTSI2SD, tmp, dest);
emit_rr(as, XO_XORPS, tmp, tmp); /* Avoid partial register stall. */
checkmclim(as);
emit_rr(as, XO_CVTTSD2SI, dest, left);
/* Can't fuse since left is needed twice. */
}
Expand Down Expand Up @@ -836,6 +837,7 @@ static void asm_conv(ASMState *as, IRIns *ir)
emit_rr(as, XO_SUBSD, dest, bias); /* Subtract 2^52+2^51 bias. */
emit_rr(as, XO_XORPS, dest, bias); /* Merge bias and integer. */
emit_rma(as, XO_MOVSD, bias, k);
checkmclim(as);
emit_mrm(as, XO_MOVD, dest, asm_fuseload(as, lref, RSET_GPR));
return;
} else { /* Integer to FP conversion. */
Expand Down Expand Up @@ -1151,6 +1153,7 @@ static void asm_href(ASMState *as, IRIns *ir, IROp merge)
asm_guardcc(as, CC_E);
else
emit_sjcc(as, CC_E, l_end);
checkmclim(as);
if (irt_isnum(kt)) {
if (isk) {
/* Assumes -0.0 is already canonicalized to +0.0. */
Expand Down Expand Up @@ -1210,7 +1213,6 @@ static void asm_href(ASMState *as, IRIns *ir, IROp merge)
#endif
}
emit_sfixup(as, l_loop);
checkmclim(as);
#if LJ_GC64
if (!isk && irt_isaddr(kt)) {
emit_rr(as, XO_OR, tmp|REX_64, key);
Expand All @@ -1237,6 +1239,7 @@ static void asm_href(ASMState *as, IRIns *ir, IROp merge)
emit_rr(as, XO_ARITH(XOg_SUB), dest, tmp);
emit_shifti(as, XOg_ROL, tmp, HASH_ROT3);
emit_rr(as, XO_ARITH(XOg_XOR), dest, tmp);
checkmclim(as);
emit_shifti(as, XOg_ROL, dest, HASH_ROT2);
emit_rr(as, XO_ARITH(XOg_SUB), tmp, dest);
emit_shifti(as, XOg_ROL, dest, HASH_ROT1);
Expand All @@ -1254,7 +1257,6 @@ static void asm_href(ASMState *as, IRIns *ir, IROp merge)
} else {
emit_rr(as, XO_MOV, tmp, key);
#if LJ_GC64
checkmclim(as);
emit_gri(as, XG_ARITHi(XOg_XOR), dest, irt_toitype(kt) << 15);
if ((as->flags & JIT_F_BMI2)) {
emit_i8(as, 32);
Expand Down Expand Up @@ -1525,6 +1527,7 @@ static void asm_ahuvload(ASMState *as, IRIns *ir)
if (irt_islightud(ir->t)) {
Reg dest = asm_load_lightud64(as, ir, 1);
if (ra_hasreg(dest)) {
checkmclim(as);
asm_fuseahuref(as, ir->op1, RSET_GPR);
emit_mrm(as, XO_MOV, dest|REX_64, RID_MRM);
}
Expand Down Expand Up @@ -1569,6 +1572,7 @@ static void asm_ahuvload(ASMState *as, IRIns *ir)
if (LJ_64 && irt_type(ir->t) >= IRT_NUM) {
lj_assertA(irt_isinteger(ir->t) || irt_isnum(ir->t),
"bad load type %d", irt_type(ir->t));
checkmclim(as);
#if LJ_GC64
emit_u32(as, LJ_TISNUM << 15);
#else
Expand Down
118 changes: 118 additions & 0 deletions test/tarantool-tests/lj-1116-redzones-checks.test.lua
Original file line number Diff line number Diff line change
@@ -0,0 +1,118 @@
local tap = require('tap')
-- Test file to demonstrate mcode area overflow during recording a
-- trace with the high FPR pressure.
-- See also, https://github.com/LuaJIT/LuaJIT/issues/1116.
--
-- XXX: Test fails with reverted fix and enabled GC64 mode.
local test = tap.test('lj-1116-redzones-checks'):skipcond({
['Test requires JIT enabled'] = not jit.status(),
})

test:plan(1)

jit.opt.start('hotloop=1')

-- XXX: This test snippet was originally created by the fuzzer.
-- See https://oss-fuzz.com/testcase-detail/5622965122170880.
--
-- Unfortunately, it's impossible to reduce the testcase further.
-- Before the patch, assembling some instructions (like `IR_CONV
-- int.num`, for example) with many mcode to be emitted may
-- overflow the `MCLIM_REDZONE` (64) at once due to the huge
-- mcode emitting.
-- For example `IR_CONV` in this test requires 66 bytes of the
-- machine code:
-- | cvttsd2si r15d, xmm5
-- | xorps xmm9, xmm9
-- | cvtsi2sd xmm9, r15d
-- | ucomisd xmm5, xmm9
-- | jnz 0x11edb00e5 ->37
-- | jpe 0x11edb00e5 ->37
-- | mov [rsp+0x80], r15d
-- | mov r15, [rsp+0xe8]
-- | movsd xmm9, [rsp+0xe0]
-- | movsd xmm5, [rsp+0xd8]
--
-- The reproducer needs sufficient register pressure as to
-- immediately spill the result of the instruction to the stack
-- and then reload the three registers used by the instruction,
-- and to have chosen enough registers with numbers >=8 (because
-- shaving off a REX prefix [1] or two would get 66 back down
-- to <= `MCLIM_REDZONE`), and to be using lots of spill slots
-- (because memory offsets <= 0x7f are shorter to encode compared
-- to those >= 0x80. So, each reload instruction consumes 9 bytes.
-- This makes this reproducer unstable (regarding the register
-- allocator changes). So, lets use this as a regression test.
--
-- [1]: https://wiki.osdev.org/X86-64_Instruction_Encoding#REX_prefix

_G.a = 0
_G.b = 0
_G.c = 0
_G.d = 0
_G.e = 0
_G.f = 0
_G.g = 0
_G.h = 0
-- Skip `i` -- it is used for the loop index.
_G.j = 0
_G.k = 0
_G.l = 0
_G.m = 0
_G.n = 0
_G.o = 0
_G.p = 0
_G.q = 0
_G.r = 0
_G.s = 0
_G.t = 0
_G.u = 0
_G.v = 0
_G.w = 0
_G.x = 0
_G.y = 0
_G.z = 0

-- XXX: Need here not 4, but 4.5 top border of the cycle to create
-- FPR pressure.
for i = 1, 4.5 do
_G.a = _G.a + 1
_G.b = _G.b + 1
_G.c = _G.c + 1
_G.d = _G.d + 1
for _ = i, 2 do
_G.e = _G.e + 1
end
-- Here we emit `IR_CONV int.num`. This loop is inlined.
-- Assertion failed after emitting the variant part of the
-- big loop.
for _ = 2, i do
_G.f = _G.f + 1
_G.g = _G.g + 1
_G.h = _G.h + 1
_G.j = _G.j + 1
_G.k = _G.k + 1
_G.l = _G.l + 1
end
_G.m = _G.m + 1
_G.n = _G.n + 1
_G.o = _G.o + 1
_G.p = _G.p + 1
_G.q = _G.q + 1
_G.r = _G.r + 1
_G.s = _G.s + 1
_G.t = _G.t + 1
_G.u = _G.u + 1
_G.v = _G.v + 1
for _ = i, 2.1 do
_G.aa = _G.a
_G.w = _G.w + 1
_G.x = _G.x + 1
_G.y = _G.y + 1
_G.z = _G.z + 1
end
end

test:ok(true, 'no mcode limit assertion failed during recording')

test:done(true)

0 comments on commit e1d956c

Please sign in to comment.