
Double-check harness and warm up heuristics #316

Open · maximecb opened this issue Sep 23, 2024 · 12 comments

@maximecb (Contributor) commented Sep 23, 2024

Kokubun's compilation time investigation made me want to check whether we're including compilation time during benchmarking iterations in yjit-bench. The answer is that for the lobsters benchmark, which is the largest of our headline benchmarks, yes we are. We're using a fixed number of warm-up iterations (15), which is enough for smaller benchmarks such as railsbench, but not for lobsters.

Just by printing the delta in compiled_block_count after each iteration for railsbench (a sketch of that instrumentation follows the log below), we can see that it warms up in just 3 iterations:

itr #1: 4830ms
compiled blocks delta: 32303
itr #2: 1265ms
compiled blocks delta: 61
itr #3: 1376ms
compiled blocks delta: 392
itr #4: 1248ms
compiled blocks delta: 0
itr #5: 1287ms
compiled blocks delta: 0
itr #6: 1249ms
compiled blocks delta: 0
itr #7: 1243ms
compiled blocks delta: 0
itr #8: 1302ms
compiled blocks delta: 0
itr #9: 1253ms
compiled blocks delta: 0
itr #10: 1318ms
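
For reference, deltas like the ones above can be obtained with instrumentation along these lines. This is only a sketch, not the actual yjit-bench harness code: `N_ITRS` and `run_one_iteration` are placeholders, and reading `:compiled_block_count` from `RubyVM::YJIT.runtime_stats` may require running with `--yjit-stats` depending on the Ruby version.

```ruby
# Sketch only: assumes YJIT is enabled and that runtime_stats exposes
# :compiled_block_count (nil.to_i falls back to 0 if the key is missing).
N_ITRS = 10 # placeholder iteration count
prev_blocks = 0

N_ITRS.times do |i|
  t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  run_one_iteration # placeholder for the benchmark's per-iteration entry point
  t1 = Process.clock_gettime(Process::CLOCK_MONOTONIC)

  blocks = RubyVM::YJIT.runtime_stats[:compiled_block_count].to_i
  puts "itr ##{i + 1}: #{((t1 - t0) * 1000).round}ms"
  puts "compiled blocks delta: #{blocks - prev_blocks}"
  prev_blocks = blocks
end
```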

The situation is the same for many small benchmarks.

However, for lobsters, this value doesn't reach a stable zero for a long while. It slowly trends towards zero, but there's a long tail, with a few blocks still being compiled even after 300 iterations. My guess is that low-probability branch stubs keep slowly being hit.

Lobsters blocks compiled per iteration
itr #1: 8372ms
compiled blocks delta: 51708
itr #2: 3024ms
compiled blocks delta: 8638
itr #3: 1042ms
compiled blocks delta: 1165
itr #4: 961ms
compiled blocks delta: 1208
itr #5: 733ms
compiled blocks delta: 357
itr #6: 931ms
compiled blocks delta: 865
itr #7: 887ms
compiled blocks delta: 900
itr #8: 1068ms
compiled blocks delta: 1590
itr #9: 863ms
compiled blocks delta: 683
itr #10: 757ms
compiled blocks delta: 424
itr #11: 689ms
compiled blocks delta: 143
itr #12: 789ms
compiled blocks delta: 360
itr #13: 698ms
compiled blocks delta: 199
itr #14: 684ms
compiled blocks delta: 106
itr #15: 767ms
compiled blocks delta: 403
itr #16: 737ms
compiled blocks delta: 115
itr #17: 638ms
compiled blocks delta: 5
itr #18: 739ms
compiled blocks delta: 327
itr #19: 657ms
compiled blocks delta: 17
itr #20: 879ms
compiled blocks delta: 673
itr #21: 655ms
compiled blocks delta: 60
itr #22: 641ms
compiled blocks delta: 8
itr #23: 645ms
compiled blocks delta: 0
itr #24: 859ms
compiled blocks delta: 501
itr #25: 648ms
compiled blocks delta: 45
itr #26: 654ms
compiled blocks delta: 15
itr #27: 641ms
compiled blocks delta: 18
itr #28: 669ms
compiled blocks delta: 32
itr #29: 694ms
compiled blocks delta: 1
itr #30: 792ms
compiled blocks delta: 475
itr #31: 668ms
compiled blocks delta: 56
itr #32: 641ms
compiled blocks delta: 9
itr #33: 703ms
compiled blocks delta: 32
itr #34: 660ms
compiled blocks delta: 28
itr #35: 654ms
compiled blocks delta: 26
itr #36: 646ms
compiled blocks delta: 4
itr #37: 694ms
compiled blocks delta: 18
itr #38: 637ms
compiled blocks delta: 7
itr #39: 647ms
compiled blocks delta: 22
itr #40: 1074ms
compiled blocks delta: 1414
itr #41: 721ms
compiled blocks delta: 6
itr #42: 667ms
compiled blocks delta: 23
itr #43: 703ms
compiled blocks delta: 2
itr #44: 691ms
compiled blocks delta: 13
itr #45: 680ms
compiled blocks delta: 3
itr #46: 646ms
compiled blocks delta: 11
itr #47: 634ms
compiled blocks delta: 0
itr #48: 633ms
compiled blocks delta: 0
itr #49: 701ms
compiled blocks delta: 6
itr #50: 649ms
compiled blocks delta: 3
itr #51: 641ms
compiled blocks delta: 14
itr #52: 636ms
compiled blocks delta: 3
itr #53: 672ms
compiled blocks delta: 1
itr #54: 641ms
compiled blocks delta: 1
itr #55: 633ms
compiled blocks delta: 2
itr #56: 626ms
compiled blocks delta: 0
itr #57: 640ms
compiled blocks delta: 8
itr #58: 695ms
compiled blocks delta: 1
itr #59: 651ms
compiled blocks delta: 2
itr #60: 719ms
compiled blocks delta: 234
itr #61: 651ms
compiled blocks delta: 7
itr #62: 695ms
compiled blocks delta: 10
itr #63: 649ms
compiled blocks delta: 14
itr #64: 645ms
compiled blocks delta: 1
itr #65: 628ms
compiled blocks delta: 1
itr #66: 689ms
compiled blocks delta: 2
itr #67: 638ms
compiled blocks delta: 2
itr #68: 634ms
compiled blocks delta: 0
itr #69: 633ms
compiled blocks delta: 1
itr #70: 651ms
compiled blocks delta: 0
itr #71: 681ms
compiled blocks delta: 0
itr #72: 650ms
compiled blocks delta: 3
itr #73: 641ms
compiled blocks delta: 8
itr #74: 651ms
compiled blocks delta: 0
itr #75: 693ms
compiled blocks delta: 0
itr #76: 631ms
compiled blocks delta: 0
itr #77: 626ms
compiled blocks delta: 0
itr #78: 638ms
compiled blocks delta: 0
itr #79: 629ms
compiled blocks delta: 0
itr #80: 675ms
compiled blocks delta: 0
itr #81: 622ms
compiled blocks delta: 0
itr #82: 636ms
compiled blocks delta: 1
itr #83: 634ms
compiled blocks delta: 4
itr #84: 666ms
compiled blocks delta: 2
itr #85: 622ms
compiled blocks delta: 0
itr #86: 648ms
compiled blocks delta: 0
itr #87: 634ms
compiled blocks delta: 6
itr #88: 676ms
compiled blocks delta: 7
itr #89: 633ms
compiled blocks delta: 0
itr #90: 638ms
compiled blocks delta: 0
itr #91: 640ms
compiled blocks delta: 0
itr #92: 633ms
compiled blocks delta: 0
itr #93: 702ms
compiled blocks delta: 0
itr #94: 652ms
compiled blocks delta: 0
itr #95: 644ms
compiled blocks delta: 0
itr #96: 638ms
compiled blocks delta: 0
itr #97: 685ms
compiled blocks delta: 0
itr #98: 650ms
compiled blocks delta: 0
itr #99: 641ms
compiled blocks delta: 0
itr #100: 643ms
compiled blocks delta: 12

@rwstauner My first question would be: what are we currently doing for warm-up in yjit-metrics? Are we using the same fixed 15 warm-up iterations?

@k0kubun The second question is what do we want to do about this? We may want to implement the same solution in both yjit-metrics and yjit-bench, even though they don't use the same harness.

In theory we could just bump the number of warm-up iterations to a bigger value, but this doesn't seem great because it would slow down smaller benchmarks. Another solution is to have some kind of adaptive mechanism: we could decide that a benchmark is warmed up when it hasn't compiled any new blocks for 5 iterations, for example, while also requiring a minimum of 8-10 warm-up iterations (see the sketch below). That may actually speed things up for smaller benchmarks. I guess as a first step I will check against all of the headline benchmarks that this heuristic would actually work.
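
A rough sketch of what such an adaptive warm-up loop could look like (the constants and the `run_one_iteration` call are placeholders, not something either harness does today):

```ruby
# Sketch only: stop warming up once a minimum iteration count is reached AND no
# new blocks have been compiled for several consecutive iterations.
MIN_WARMUP_ITRS = 8
STABLE_ITRS     = 5    # consecutive iterations with zero newly compiled blocks
MAX_WARMUP_ITRS = 200  # safety cap for benchmarks that never stabilize

def warmup_iterations
  prev_blocks = 0
  zero_streak = 0

  (1..MAX_WARMUP_ITRS).each do |itr|
    run_one_iteration # placeholder for the benchmark's per-iteration entry point

    blocks = RubyVM::YJIT.runtime_stats[:compiled_block_count].to_i
    zero_streak = (blocks == prev_blocks) ? zero_streak + 1 : 0
    prev_blocks = blocks

    return itr if itr >= MIN_WARMUP_ITRS && zero_streak >= STABLE_ITRS
  end

  MAX_WARMUP_ITRS
end
```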

@nirvdrum A third question is: what are we actually compiling in lobsters after 300+ iterations? Am I correct in assuming that it's branch stubs being hit? This is a question to be answered using the compilation log, and once we have that answer, we may want to think of heuristics to stop compiling new code.

@maximecb (Contributor Author) commented Sep 23, 2024

Update: doing some quick empirical testing on our headline benchmarks, it looks like an adaptive heuristic that goes something like "we are done warming up after a minimum of 8 iterations, and after we've compiled zero blocks for 4 consecutive iterations" would probably work decently well in practice, though it wouldn't be 100% perfect.

The hexapdf benchmark exhibits similar long-tail behavior to lobsters. It has a weird spike in compiled blocks at iteration #31, and that spike occurs after 8 iterations with zero blocks compiled.

I'm actually seeing that a number of benchmarks have a spike in compiled blocks after 30 iterations, which makes sense given our call threshold heuristic. An alternative way to do adaptive warm-up would be to benchmark for a minimum of 30 iterations and then check, in hindsight, whether we stopped compiling after the 10th iteration. If not, count the first 30 iterations as warm-up, run another 30 iterations, check how much we compiled during those, and keep going until we get a minimum number of benchmarking iterations with zero or near-zero new blocks compiled (sketched below).
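
A simplified sketch of that chunked, check-in-hindsight approach (placeholder names and made-up thresholds; not how either harness currently works):

```ruby
# Sketch only: run in chunks of 30 iterations and only accept a chunk as real
# benchmarking once (nearly) nothing new was compiled during it.
CHUNK_ITRS     = 30
MAX_NEW_BLOCKS = 10   # "near-zero" threshold per chunk; illustrative only
MIN_BENCH_ITRS = 30   # how many accepted benchmarking iterations we want

def run_chunk
  blocks_before = RubyVM::YJIT.runtime_stats[:compiled_block_count].to_i
  times = Array.new(CHUNK_ITRS) do
    t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    run_one_iteration # placeholder for the benchmark's per-iteration entry point
    Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0
  end
  blocks_after = RubyVM::YJIT.runtime_stats[:compiled_block_count].to_i
  [times, blocks_after - blocks_before]
end

bench_times = []
loop do
  times, new_blocks = run_chunk
  # A chunk only counts as benchmarking if (almost) nothing new was compiled;
  # otherwise the whole chunk is retroactively treated as warm-up.
  bench_times.concat(times) if new_blocks <= MAX_NEW_BLOCKS
  break if bench_times.size >= MIN_BENCH_ITRS
end
```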

A third way to go would be to just increase the number of warm-up iterations to 40 and call it good enough (though that could obviously slow down some benchmarks).

@k0kubun (Member) commented Sep 23, 2024

Just by printing the delta in compiled_block_count after each iteration for railsbench

Off topic: I often need to manually modify yjit-bench locally to achieve things like this. Now that you're doing it too, I filed a PR to add that feature: Shopify/yjit-bench#330. It helped with the following investigation:

The second question is what do we want to do about this?

At first, I thought the two-step call threshold might be causing this. However, forcing it to --yjit-call-threshold=10 did not fix the problem; it just kept compiling encode_basic@(eval at /opt/rubies/ruby/lib/ruby/gems/3.4.0+0/gems/htmlentities-4.3.4/lib/htmlentities/encoder.rb:81) even with an increased number of benchmark iterations.

Given that it's coming from an eval, I suspect that the warm-up of lobsters may never end (until it hits the --yjit-exec-mem-size limit). If that's the case, we can't fix this from yjit-bench's side. We should see if we can fix how they use the htmlentities gem.
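
As a generic illustration of why eval'd code can keep warm-up going (this is not the actual htmlentities source): every time a string is eval'd, Ruby compiles a fresh ISEQ, so the JIT treats it as code it has never seen before, even if the source text is identical.

```ruby
# Illustration only, not the htmlentities code: a method (re)defined via eval
# per instance produces a fresh ISEQ every time, so previously compiled copies
# don't help the next instance.
class Encoder
  def initialize
    instance_eval(<<~RUBY, __FILE__, __LINE__ + 1)
      def encode_basic(str)
        str.gsub("&", "&amp;")
      end
    RUBY
  end
end

# Every new Encoder re-evals the definition, creating new blocks for the JIT:
3.times { Encoder.new.encode_basic("a & b") }
```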

Hexapdf does stop compiling after a reasonably small number of iterations, so it makes sense to increase the number of warm-up iterations for hexapdf if necessary.

@maximecb (Contributor Author)

I guess I'd prefer not to define the number of warm-up iterations by hand for each benchmark, as that seems failure-prone: we could change how YJIT decides to compile code and break our manually set iteration counts without noticing. I could try experimenting with different heuristics for adaptive warm-up.

Given that it's coming from an eval, I suspect that the warm-up of lobsters may never end (until it hits the --yjit-exec-mem-size limit). If that's the case, we can't fix this from yjit-bench's side. We should see if we can fix how they use the htmlentities gem.

Hmm, yeah. If there were a way to get rid of that eval, it would be ideal. If not, then we would have to settle for "still compiling, but not compiling much proportionally" as good enough.

@maximecb (Contributor Author)

@rwstauner can you provide some context on how warm-up is handled in yjit-metrics?

@rwstauner (Contributor)

it just kept compiling encode_basic@(eval at /opt/rubies/ruby/lib/ruby/gems/3.4.0+0/gems/htmlentities-4.3.4/lib/htmlentities/encoder.rb:81)

I think this is what Jean fixed:
lobsters/lobsters#1320
That PR has been merged, but we haven't synchronized that change over to yjit-bench yet.

@rwstauner (Contributor)

The current default that yjit-metrics is using is a static 10 warmups:

"--warmup-itrs=10 --min-bench-time=30.0 --min-bench-itrs=10") +

@maximecb (Contributor Author)

Thank you Randy. I'm still not sure what to do here. I feel like ideally we should figure out a better criterion and make things more uniform between yjit-bench and yjit-metrics (use the same warm-up criteria). I may experiment with some other warm-up approaches tomorrow.

@rwstauner (Contributor) commented Sep 26, 2024

Back when we were using the "variable warmup report" (prior to May 2024), it would default to 30 warmups, which made the whole thing take several more hours.

@maximecb (Contributor Author)

OK, maybe there is a way we can do something like 30 warm-up iterations, capped at a maximum of 30 seconds or 1 minute, or something like that.

@rwstauner (Contributor)

I think it's worth looking at the differences between the two repos.
I'd love to make the yjit-bench harness modular enough to handle whatever differences we need here, but I haven't compared them yet (and I know we want to keep the harness slim to avoid altering memory usage, etc.).

@rwstauner (Contributor)

Looking at the 2024-09-18 results for arm, these are the benchmarks where warmup (10 iterations) is already taking more than 30 seconds:

                     max (s)    avg (s)  total (s)
       30k_methods: 5.46        3.85    38.51
     fannkuchredux: 6.46        4.30    43.01
        psych-load: 4.62        4.56    45.60
           fluentd: 5.76        4.63    46.30
           hexapdf: 6.97        5.42    54.21
        railsbench: 7.50        5.99    59.88
         optcarrot: 8.18        7.98    79.83
         ruby-json: 8.23        8.22    82.15
           graphql: 9.67        9.30    93.01
           rubykon: 18.87       18.60   185.97

Only 4 of those are currently taking more than 1 minute to do 10 warmup itrs.

Most warm-up iterations currently look to be under 1s, so doing up to 30 warmups or 60 seconds would probably get us pretty far (a sketch of that kind of cap follows).
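
As a sketch, warming up until either an iteration limit or a wall-clock budget is hit could look like this (illustrative constants, placeholder `run_one_iteration`):

```ruby
# Sketch only: stop warm-up after 30 iterations or 60 seconds of wall-clock
# time, whichever comes first.
WARMUP_ITRS_CAP = 30
WARMUP_SECS_CAP = 60.0

def capped_warmup
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  itrs  = 0

  while itrs < WARMUP_ITRS_CAP
    run_one_iteration # placeholder for the benchmark's per-iteration entry point
    itrs += 1
    break if Process.clock_gettime(Process::CLOCK_MONOTONIC) - start >= WARMUP_SECS_CAP
  end

  itrs
end
```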

@maximecb (Contributor Author)

With respect to Rubykon: that benchmark seems to take especially long to run with CRuby, but it is 3x faster with YJIT. Maybe one simple thing we could do is cap the warm-up iterations for CRuby at 4 or 5, and use a variable warm-up for YJIT?

I'm also going to look at whether I can edit Rubykon to have a faster per-iteration time, because it is super slow on CRuby.
