
Your test duration measurement is inaccurate #143

Open
dead-claudia opened this issue Oct 18, 2024 · 4 comments
Labels
question Further information is requested

Comments

@dead-claudia

Related: #65

Performance measurement is unfortunately not as simple as performance.now() in browsers. Further, operating systems can and do sometimes have resolution limits of their own.

Here are some considerations that need to be addressed, in general:

The benchmarks currently naively use start and end performance.now()/process.hrtime.bigint() calls. The precision issues can give you bad data, but they're not insurmountable:

  • Any test run that completes in less than half of the minimum timer precision is statistically indistinguishable from a test where the JIT optimized everything out. You could compare that to a null test, count the difference in timer ticks, and use that to build your time estimate, but even then, you'd need a massive number of runs to get much confidence from that. And for tests like ours where we need to measure frame times, in browsers like Brave and Safari where timer resolution is 1ms, this could require a very long time for some tests. (This isn't theoretical - one of our render tests in this PR ran 147169 times at an average of 0.03 ms/run or ~33k ops/sec, and that many frames at 60 FPS would come out to about 41 minutes.)
  • If the test completes in more than half a unit of precision but less than 1.5 units, you could use similar tricks to gain more confidence, but you'd still need a decent number of runs if you're only measuring spans. You could expect to need roughly half as many runs, since there's a near-100% chance of at least one tick occurring inside the code block.
  • Even the act of measurement entails overhead (on the order of nanoseconds), so a null test is needed to measure accurately at rates above roughly 100k ops/sec.
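A quick way to see these precision limits in practice is to probe the clock directly. This sketch (the function name and sampling approach are mine, not from any library) estimates a clock's effective tick size by spinning until it advances:

```javascript
// Estimate a timer's effective granularity by sampling until the clock
// ticks. Works with any millisecond-returning clock (Date.now,
// performance.now, a custom timestamping function, ...).
function estimateGranularity(now = Date.now, samples = 10) {
  let min = Infinity;
  for (let i = 0; i < samples; i++) {
    const start = now();
    let next = now();
    while (next === start) next = now(); // spin until the clock advances
    min = Math.min(min, next - start);
  }
  return min; // smallest observed tick: an estimate of the resolution
}
```

Any single measurement shorter than this value quantizes to zero, which is exactly the failure mode described above.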

It doesn't appear your benchmark execution code currently takes any of this into account, hence this issue.

@jerome-benoit
Collaborator

jerome-benoit commented Oct 19, 2024

That's why the API:

  • allows defining a custom timestamping function to fit environment needs, such as
    • () => $.agent.monotonicNow()
    • () => $262.agent.monotonicNow()
    • ...
  • allows defining the benchmark behavior, such as minimum benchmark iterations and time
  • integrates JIT deoptimization
  • integrates advanced statistics to refine accuracy via the API tunables
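A custom timestamping function along the lines of the first bullet can be made portable across environments. This is a sketch of my own, not tinybench code; the test262-host helpers mentioned above ($.agent.monotonicNow, $262.agent.monotonicNow) are the host-specific equivalents:

```javascript
// A portable millisecond `now` shim, usable wherever a custom
// timestamping function is accepted: prefers performance.now in
// browsers/Node, falls back to process.hrtime.bigint, then Date.now.
const now =
  typeof performance !== 'undefined' && typeof performance.now === 'function'
    ? () => performance.now()
    : typeof process !== 'undefined' && typeof process.hrtime === 'function'
      ? () => Number(process.hrtime.bigint()) / 1e6
      : () => Date.now();
```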

There's no one-size-fits-all benchmarking design in the JS ecosystem, only APIs that offer sane defaults and allow refining the benchmark accordingly.

What is actually missing in tinybench API to adapt it to browsers?

@dead-claudia
Author

dead-claudia commented Oct 19, 2024

@jerome-benoit I didn't realize you also used confidence - that means the jitter isn't of much concern.

But for the rest, in short, duration is inaccurate if the code executes faster than timer resolution:

let start = now()
fn()
let end = now()
let duration = end - start

The concrete fix is this:

  1. Run multiple fn() iterations instead of just one, repeating until you sufficiently exceed the clock's granularity. (I chose 15x, but that was just a random guesstimate that turned out to be good enough for my needs.) The time per run becomes duration / iterations.
    • Unfortunately, to avoid measuring beforeEach and afterEach, it'd be a breaking change. If you want to ensure accuracy while just doing one (to not break people), you'll need to alert people of this benchmarking limitation so they're aware, and possibly display a warning if it measures a zero-duration sample.
  2. Use the iteration count as a weight for your statistical analysis, including your confidence analysis (for when to stop looping).
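The two steps above can be sketched as follows. The function name, option names, and the 15x factor are illustrative, not tinybench's API; the returned iteration count is what step 2 proposes using as a weight in the statistical analysis:

```javascript
// Batching fix: double the iteration count until the measured span
// comfortably exceeds the clock's granularity, then report the
// per-run time as duration / iterations.
function measure(fn, { now = Date.now, granularityMs = 1, factor = 15 } = {}) {
  let iterations = 1;
  for (;;) {
    const start = now();
    for (let i = 0; i < iterations; i++) fn();
    const duration = now() - start;
    if (duration >= granularityMs * factor) {
      // iterations doubles until the span is measurable; it can serve
      // as a sample weight in later statistical analysis
      return { perRun: duration / iterations, iterations };
    }
    iterations *= 2;
  }
}
```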

Changing the now() function is not a valid workaround - this issue is inherent to the granularity of that function's result. Not even adding a durationSince(timestamp) is sufficient.
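The quantization problem is easy to demonstrate with any coarse clock, which is why swapping the now() function doesn't help. A small sketch (names are mine, for illustration):

```javascript
// Time a sub-resolution operation repeatedly with a coarse clock and
// count how often the measured span quantizes to exactly zero.
function zeroFraction(fn, now = Date.now, runs = 1000) {
  let zeros = 0;
  for (let i = 0; i < runs; i++) {
    const start = now();
    fn();
    if (now() - start === 0) zeros++;
  }
  return zeros / runs; // close to 1.0 when fn is far faster than one tick
}
```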

@dead-claudia
Author

On a related note, maybe they should've called it performance.subtle.now(), not performance.now(), to align with crypto.subtle. Both seem to have a class of bugs that neither raise errors nor return data that can be easily validated through automated means. 😅

@jerome-benoit
Collaborator

jerome-benoit commented Oct 19, 2024

The concrete fix is this:

  1. Run multiple fn() iterations instead of just one, repeating until you sufficiently exceed the clock's granularity. (I chose 15x, but that was just a random guesstimate that turned out to be good enough for my needs.) The time per run becomes duration / iterations.

Benchmark warmup at the Bench and Task level is supported; it has sane defaults for most cases, and those can be tuned.
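As a rough illustration of what such a warmup phase does (this is a generic sketch, not tinybench's internals; the function and option names are assumptions):

```javascript
// Warmup sketch: execute the benchmarked function untimed until both a
// minimum duration and a minimum iteration count are met, giving the
// JIT a chance to reach a steady state before measurement starts.
function warmup(fn, { minTimeMs = 100, minIterations = 5, now = Date.now } = {}) {
  const start = now();
  let count = 0;
  while (count < minIterations || now() - start < minTimeMs) {
    fn();
    count++;
  }
  return count; // iterations executed during warmup (discarded from results)
}
```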
The case where the required resolution is finer than what the JS runtime's timestamping offers is not solved in any way by an incorrect measurement methodology, such as measuring the latency of repeated runs of the benchmark function and using the average as the latency of one run:

  • using an average as the primary source for statistical analysis introduces non-anecdotal bias, cf. below
  • a correct benchmark methodology must not modify the original experiment; it's a golden rule
  • Unfortunately, to avoid measuring beforeEach and afterEach, it'd be a breaking change.

Tinybench measures only the latency of the benchmarking function's execution, with JIT deoptimization applied.

If you want to ensure accuracy while just doing one (to not break people), you'll need to alert people of this benchmarking limitation so they're aware, and possibly display a warning if it measures a zero-duration sample.

A correct benchmark methodology means not modifying the experiment being timed. Timing a runner over 500m is not done by repeatedly measuring the time and distance of a single step and using the average of those measurements to estimate the 500m time. That approach is utterly wrong in so many ways ...

I've seen benchmarking tools such as mitata use a similarly flawed methodology. Tinybench will never go down that path, as we care about using an unbiased measurement methodology. That's why I forked mitata into tatami-ng: the maintainer was not inclined to accept external contributions about it. I'm now pushing the relevant bits of that fork into tinybench, where they will show up in version 3.x.x.

  1. Use the iteration count as a weight for your statistical analysis, including your confidence analysis (for when to stop looping).

Tinybench is meant to be a lean library using state-of-the-art benchmarking methods and advanced statistics. Analyzing them (determining whether the margin of error is acceptable, whether the median absolute deviation is acceptable, and, more globally, the statistical significance of the result) will not be part of tinybench. It's up to the user to analyze them and eventually automate the detection of anomalies in the measurement.
The statistical indicator analysis can be documented, and more state-of-the-art statistical indicators can be added (z-score, IQR, ...); we take PRs.
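The kind of user-side analysis described above can be done from the raw latency samples with standard formulas. A sketch (these are textbook statistics, not tinybench internals; the function name is mine):

```javascript
// Post-hoc analysis of latency samples: mean, median, standard
// deviation, a 95% margin of error via the normal approximation
// (z = 1.96), and the median absolute deviation (MAD).
function analyze(samples) {
  const n = samples.length;
  const mean = samples.reduce((a, b) => a + b, 0) / n;
  const sorted = [...samples].sort((a, b) => a - b);
  const median = sorted[Math.floor(n / 2)];
  const variance = samples.reduce((a, x) => a + (x - mean) ** 2, 0) / (n - 1);
  const sd = Math.sqrt(variance);
  const moe = 1.96 * sd / Math.sqrt(n); // 95% margin of error
  const deviations = samples
    .map(x => Math.abs(x - median))
    .sort((a, b) => a - b);
  const mad = deviations[Math.floor(n / 2)];
  return { mean, median, sd, moe, mad };
}
```

A high moe relative to mean, or a large mad, is a signal that the measurement (including quantization to zero) cannot be trusted.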

Changing the now() function is not a valid workaround - this issue is inherent to the granularity of that function's result. Not even adding a durationSince(timestamp) is sufficient.

The analysis of the results is meant to tell whether a measurement is correct or not: for example, the presence of many zero measurements will drive the margin of error up for latency, so the results cannot be trusted.

And using a totally flawed benchmarking methodology (opening a wide door to the premature-optimization disease) as a workaround for too-coarse resolution in the JS runtime's timestamping is not an acceptable solution. The root cause must be fixed: not offering an optional high-resolution-timer mode in a JS runtime is considered a bug nowadays, and browsers can be started with high-resolution timers for benchmarking purposes.

So I repeat: what is actually missing in tinybench to run accurate benchmarks using state-of-the-art methodology in browsers?

@jerome-benoit jerome-benoit added the question Further information is requested label Oct 19, 2024