Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Timeouts For Lua Scripts #983

Open
wants to merge 21 commits into
base: main
Choose a base branch
from
Open

Conversation

kevin-montrose
Copy link
Contributor

@kevin-montrose kevin-montrose commented Jan 30, 2025

For perf reasons this became a little more complicated than I would like, but I think it's justifiable.

Introduces LuaTimeoutManager which spins up a dedicated Thread. When SessionScriptCaches are populated (either for the first time, or after a previous Clear), they register themselves with LuaTimeoutManager. That registration is used to track when a Lua script is actually running, and to trigger best-effort timeouts.

The timeouts themselves use Lua's debug interface which provides a thread-safe way to clear/set hooks. This hook is set dynamically, because always having a hook registered causes a significant performance hit.

Performance issues are also why LuaTimeoutManager spawns a Thread. Using built-in timers or Tasks can hurt perf by putting more pressure on the shared thread pool - and most .NET timers also have some internal locking that hurts in practice.

To claw back some of the unavoidable overhead, I made three unrelated performance improvements:

Correctness

There are a couple concerns to handle in this design.

The first is to make sure we're not timing out spuriously - that's done by conceptually splitting the timeout into 10ths and waiting ~11 / 10ths before triggering a timeout. This deals with the case where registration happens in the middle of one of the 10ths.

The second is timeouts trigger after a Lua script has finished but before LuaTimeoutManager has been informed of that completion. If handled incorrectly, this could cause a subsequent script to be spuriously cancelled. This is handled by generating a "cookie" value when a script starts, and having it passed back on cancellation - if the cookies do not match expected values, the cancellation signal is ignored.

I added a stress test that runs 16 sessions in parallel with timeouts for an hour, and ran it for ~12 hours, to check the implementation. So it doesn't seem obviously broken, at least. This test is [Ignored] by default.

Performance

All runs are as of d148f51740d79548ad78dc19008fca0e58c0de43 for main and 28e1129eaba16cf60fdc101ed91d455e4a123df3 for luaTimeLimits.

ScriptOperations

Here timeouts are disabled for both branches, reporting just Native workloads.

There's a little bit of motion in either direction, but I'd say it's all within the margin of error.

main

Method Params Mean Error StdDev Gen0 Gen1 Gen2 Allocated
ScriptLoad Native,None 84.44 us 0.484 us 0.429 us - - - 9600 B
ScriptExistsTrue Native,None 19.27 us 0.313 us 0.384 us - - - -
ScriptExistsFalse Native,None 18.66 us 0.140 us 0.131 us - - - -
Eval Native,None 62.04 us 0.461 us 0.408 us - - - -
EvalSha Native,None 27.86 us 0.184 us 0.163 us - - - -
SmallScript Native,None 52.87 us 0.550 us 0.488 us - - - -
LargeScript Native,None 4,147.26 us 25.975 us 23.026 us - - - -
ArrayReturn Native,None 108.74 us 0.692 us 0.613 us - - - -

luaTimeLimits

Method Params Mean Error StdDev Gen0 Gen1 Gen2 Allocated
ScriptLoad Native,None 83.41 us 0.579 us 0.542 us - - - 9600 B
ScriptExistsTrue Native,None 19.67 us 0.177 us 0.157 us - - - -
ScriptExistsFalse Native,None 17.95 us 0.195 us 0.182 us - - - -
Eval Native,None 63.32 us 0.581 us 0.544 us - - - -
EvalSha Native,None 26.76 us 0.121 us 0.107 us - - - -
SmallScript Native,None 55.63 us 0.936 us 0.875 us - - - -
LargeScript Native,None 4,079.82 us 42.402 us 39.663 us - - - 8 B
ArrayReturn Native,None 111.97 us 1.296 us 1.149 us - - - -

LuaTimeouts

A new benchmark focused on just timeouts, meant for checking improvements there. I'm benchmarking a 1 sec timeout (so checking for timeouts ever 100ms) and a 1 minute (so every 6 seconds there's a check), about the bounds of "reasonable" for timeouts.

These runs have 500K EVALSHAs (to tamp down variance). Naively there was about a 10% loss - but with some profiling and the aforementioned performance improvements I clawed that back. With timeouts enabled we're basically the same, with them disabled this is a ~2% improvement.

main

This was run by copying the benchmark on top of d148f51740d79548ad78dc19008fca0e58c0de43. Obviously the two with timeout cases are not included.

Method Params Mean Error StdDev
RunScript Native,None 162.3 ms 1.53 ms 1.43 ms

luaTimeLimits

Method Params Mean Error StdDev
RunScript Native,None,- 158.7 ms 1.09 ms 0.85 ms
RunScript Native,None,00:00:01 159.6 ms 2.55 ms 2.26 ms
RunScript Native,None,00:01:00 163.3 ms 1.78 ms 1.57 ms

… and expensive, so sketch out an on-demand option
…kless; there's some finese in the concurrency around Lua, but the Debug interfaces are threadsafe; all tests passing; still needs benchmarking
…ists and whatnot, since sessions script is already synchronized
… attempting to upper case the command; speed that up to claw back some of the timeout overhead
@kevin-montrose kevin-montrose marked this pull request as ready for review February 3, 2025 15:17
@kevin-montrose kevin-montrose requested a review from vazois February 3, 2025 15:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant