
Your test duration measurement is inaccurate #143

Open
dead-claudia opened this issue Oct 18, 2024 · 4 comments
Labels
question Further information is requested

Comments

@dead-claudia

Related: #65

Performance measurement is unfortunately not as simple as performance.now() in browsers. Further, operating systems can and do sometimes have resolution limits of their own.

Here are some considerations that need to be addressed, in general:

The benchmarks currently naively use start and end performance.now()/process.hrtime.bigint() calls. The precision issues can give you bad data, but they're not insurmountable:

  • Any test run that completes in less than half of the minimum timer precision is statistically indistinguishable from a test where the JIT optimized everything out. You could compare that to a null test, count the difference in timer ticks, and use that to build your time estimate, but even then, you'd need a massive number of runs to get much confidence from that. And for tests like ours where we need to measure frame times, in browsers like Brave and Safari where timer resolution is 1ms, this could require a very long time for some tests. (This isn't theoretical - one of our render tests in this PR ran 147169 times at an average of 0.03 ms/run or ~33k ops/sec, and that many frames at 60 FPS would come out to about 41 minutes.)
  • If the test completes in more than half a unit of precision but less than 1.5 units, you could use similar tricks to gain more confidence, but you'd still need a decent number of runs if you're only measuring spans. You could expect to need roughly half as many runs, since there's a near-100% chance of at least one tick occurring inside the code block.
  • Even the act of measurement entails overhead (on the order of nanoseconds), so a null test is needed to measure accurately at rates above roughly 100k ops/sec.
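A quick way to see these precision limits in practice is to probe the clock directly. This sketch (the function name and sampling approach are mine, not from any library) estimates a clock's effective tick size by spinning until it advances:

```javascript
// Estimate a timer's effective granularity by sampling until the clock
// ticks. Works with any millisecond-returning clock (Date.now,
// performance.now, a custom timestamping function, ...).
function estimateGranularity(now = Date.now, samples = 10) {
  let min = Infinity;
  for (let i = 0; i < samples; i++) {
    const start = now();
    let next = now();
    while (next === start) next = now(); // spin until the clock advances
    min = Math.min(min, next - start);
  }
  return min; // smallest observed tick: an estimate of the resolution
}
```

Any single measurement shorter than this value quantizes to zero, which is exactly the failure mode described above.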

It doesn't appear your benchmark execution code currently takes any of this into account, hence this issue.

@jerome-benoit
Collaborator

jerome-benoit commented Oct 19, 2024

That's why the API:

  • allows defining a custom timestamping function to fit environment needs, such as
    • () => $.agent.monotonicNow()
    • () => $262.agent.monotonicNow()
    • ...
  • allows defining the benchmark behavior, such as minimum benchmark iterations and time
  • integrates JIT deoptimization
  • integrates advanced statistics to refine accuracy via the API tunables
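A custom timestamping function along the lines of the first bullet can be made portable across environments. This is a sketch of my own, not tinybench code; the test262-host helpers mentioned above ($.agent.monotonicNow, $262.agent.monotonicNow) are the host-specific equivalents:

```javascript
// A portable millisecond `now` shim, usable wherever a custom
// timestamping function is accepted: prefers performance.now in
// browsers/Node, falls back to process.hrtime.bigint, then Date.now.
const now =
  typeof performance !== 'undefined' && typeof performance.now === 'function'
    ? () => performance.now()
    : typeof process !== 'undefined' && typeof process.hrtime === 'function'
      ? () => Number(process.hrtime.bigint()) / 1e6
      : () => Date.now();
```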

There's no one-size-fits-all benchmarking design in the JS ecosystem, only APIs that offer sane defaults and allow refining the benchmark accordingly.

What is actually missing in tinybench API to adapt it to browsers?

@dead-claudia
Author

dead-claudia commented Oct 19, 2024

@jerome-benoit I didn't realize you also used confidence - that means the jitter isn't of much concern.

But for the rest, in short, duration is inaccurate if the code executes faster than timer resolution:

let start = now()
fn()
let end = now()
let duration = end - start

The concrete fix is this:

  1. Run multiple fn() iterations instead of just one, repeating until you sufficiently exceed the clock's granularity. (I chose 15x, but that was just a random guesstimate that turned out to be good enough for my needs.) The time per run becomes duration / iterations.
    • Unfortunately, to avoid measuring beforeEach and afterEach, it'd be a breaking change. If you want to ensure accuracy while just doing one (to not break people), you'll need to alert people of this benchmarking limitation so they're aware, and possibly display a warning if it measures a zero-duration sample.
  2. Use the iteration count as a weight for your statistical analysis, including your confidence analysis (for when to stop looping).
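The two steps above can be sketched as follows. The function name, option names, and the 15x factor are illustrative, not tinybench's API; the returned iteration count is what step 2 proposes using as a weight in the statistical analysis:

```javascript
// Batching fix: double the iteration count until the measured span
// comfortably exceeds the clock's granularity, then report the
// per-run time as duration / iterations.
function measure(fn, { now = Date.now, granularityMs = 1, factor = 15 } = {}) {
  let iterations = 1;
  for (;;) {
    const start = now();
    for (let i = 0; i < iterations; i++) fn();
    const duration = now() - start;
    if (duration >= granularityMs * factor) {
      // iterations doubles until the span is measurable; it can serve
      // as a sample weight in later statistical analysis
      return { perRun: duration / iterations, iterations };
    }
    iterations *= 2;
  }
}
```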

Changing the now() function is not a valid workaround - this issue is inherent to the granularity of that function's result. Not even adding a durationSince(timestamp) is sufficient.
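The quantization problem is easy to demonstrate with any coarse clock, which is why swapping the now() function doesn't help. A small sketch (names are mine, for illustration):

```javascript
// Time a sub-resolution operation repeatedly with a coarse clock and
// count how often the measured span quantizes to exactly zero.
function zeroFraction(fn, now = Date.now, runs = 1000) {
  let zeros = 0;
  for (let i = 0; i < runs; i++) {
    const start = now();
    fn();
    if (now() - start === 0) zeros++;
  }
  return zeros / runs; // close to 1.0 when fn is far faster than one tick
}
```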

@dead-claudia
Author

On a related note, maybe they should've called it performance.subtle.now(), not performance.now(), to align with crypto.subtle. Both seem to have a class of bugs that neither raise errors nor return data that can be easily validated through automated means. 😅

@jerome-benoit
Collaborator

jerome-benoit commented Oct 19, 2024

The concrete fix is this:

  1. Run multiple fn() iterations instead of just one, repeating until you sufficiently exceed the clock's granularity. (I chose 15x, but that was just a random guesstimate that turned out to be good enough for my needs.) The time per run becomes duration / iterations.

Benchmark warmup at the Bench and Task level is supported; it has sane defaults for most cases, and those can be tuned.
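As a rough illustration of what such a warmup phase does (this is a generic sketch, not tinybench's internals; the function and option names are assumptions):

```javascript
// Warmup sketch: execute the benchmarked function untimed until both a
// minimum duration and a minimum iteration count are met, giving the
// JIT a chance to reach a steady state before measurement starts.
function warmup(fn, { minTimeMs = 100, minIterations = 5, now = Date.now } = {}) {
  const start = now();
  let count = 0;
  while (count < minIterations || now() - start < minTimeMs) {
    fn();
    count++;
  }
  return count; // iterations executed during warmup (discarded from results)
}
```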
The case where the required resolution is finer than what the JS runtime's timestamping offers is not solved in any way by an incorrect measurement methodology, such as measuring the latency of repeated runs of the benchmark function and using the average as the latency of one run:

  • using an average as the primary source for statistical analysis introduces non-anecdotal bias, cf. below
  • a correct benchmark methodology must not modify the original experiment; it's a golden rule
  • Unfortunately, to avoid measuring beforeEach and afterEach, it'd be a breaking change.

Tinybench measures only the latency of the benchmarking function's execution, with JIT deoptimization applied.

If you want to ensure accuracy while just doing one (to not break people), you'll need to alert people of this benchmarking limitation so they're aware, and possibly display a warning if it measures a zero-duration sample.

A correct benchmark methodology means not modifying the experiment being timed. Timing a runner over 500m is not done by repeatedly measuring the time and distance of a single step and using the average of those measurements to estimate the 500m time. That approach is utterly wrong in so many ways ...

I've seen benchmarking tools such as mitata use a similarly flawed methodology. Tinybench will never go down that path, as we care about using an unbiased measurement methodology. That's why I forked mitata into tatami-ng: the maintainer was not inclined to accept external contributions about it. I'm now pushing the relevant bits of that fork into tinybench, where they will show up in version 3.x.x.

  1. Use the iteration count as a weight for your statistical analysis, including your confidence analysis (for when to stop looping).

Tinybench is meant to be a lean library using state-of-the-art benchmarking methods and advanced statistics. Analyzing them (determining whether the margin of error is acceptable, whether the median absolute deviation is acceptable, and, more globally, the statistical significance of the result) will not be part of tinybench. It's up to the user to analyze them and eventually automate the detection of anomalies in the measurement.
The statistical indicator analysis can be documented, and more state-of-the-art statistical indicators can be added (z-score, IQR, ...); we take PRs.
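The kind of user-side analysis described above can be done from the raw latency samples with standard formulas. A sketch (these are textbook statistics, not tinybench internals; the function name is mine):

```javascript
// Post-hoc analysis of latency samples: mean, median, standard
// deviation, a 95% margin of error via the normal approximation
// (z = 1.96), and the median absolute deviation (MAD).
function analyze(samples) {
  const n = samples.length;
  const mean = samples.reduce((a, b) => a + b, 0) / n;
  const sorted = [...samples].sort((a, b) => a - b);
  const median = sorted[Math.floor(n / 2)];
  const variance = samples.reduce((a, x) => a + (x - mean) ** 2, 0) / (n - 1);
  const sd = Math.sqrt(variance);
  const moe = 1.96 * sd / Math.sqrt(n); // 95% margin of error
  const deviations = samples
    .map(x => Math.abs(x - median))
    .sort((a, b) => a - b);
  const mad = deviations[Math.floor(n / 2)];
  return { mean, median, sd, moe, mad };
}
```

A high moe relative to mean, or a large mad, is a signal that the measurement (including quantization to zero) cannot be trusted.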

Changing the now() function is not a valid workaround - this issue is inherent to the granularity of that function's result. Not even adding a durationSince(timestamp) is sufficient.

The analysis of the results is meant to tell whether a measurement is correct or not: for example, the presence of many zero measurements will drive the margin of error up for latency, so the results cannot be trusted.

And using a totally flawed benchmarking methodology (opening a wide door to the premature-optimization disease) as a workaround for too-coarse resolution in the JS runtime's timestamping is not an acceptable solution. The root cause must be fixed: not offering an optional high-resolution-timer mode in a JS runtime is considered a bug nowadays, and browsers can be started with high-resolution timers for benchmarking purposes.

So I repeat: what is actually missing in tinybench to run accurate benchmarks using state-of-the-art methodology in browsers?

@jerome-benoit jerome-benoit added the question Further information is requested label Oct 19, 2024