
[RFC] Considerations for "Skill Level" #3635

Open
Sopel97 opened this issue Jul 31, 2021 · 35 comments
@Sopel97
Member

Sopel97 commented Jul 31, 2021

There was some discussion recently on Discord about possible improvements to the implementation of "Skill Level" in Stockfish. @vondele suggested trying eval randomization, in particular interpolating the NNUE eval with N(0, RookValueEg) using some parameter. I've implemented it and tested it briefly, but more work is needed to assess the quality of the games played and to calibrate the Elo rating. The initial results suggest that it might be a good direction.
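As a sketch of what such randomization could look like (illustrative only, not the actual patch; the percentage-based `RandomEvalPerturb` option and the `RookValueEg` constant are assumptions based on the description above):

```cpp
#include <cassert>
#include <random>

// Illustrative sketch, not the actual Stockfish patch: blend the NNUE eval
// with Gaussian noise drawn from N(0, RookValueEg). RandomEvalPerturb is
// treated as a percentage: 0 = pure NNUE eval, 100 = pure noise.
constexpr double RookValueEg = 512;  // assumed endgame rook value, in centipawns

int perturbed_eval(int nnueEval, int randomEvalPerturb, std::mt19937& rng) {
    std::normal_distribution<double> noise(0.0, RookValueEg);
    double r = noise(rng);
    // Linear interpolation between the true eval and the random value.
    return static_cast<int>(((100 - randomEvalPerturb) * nnueEval
                             + randomEvalPerturb * r) / 100);
}
```

At `RandomEvalPerturb=0` the eval passes through unchanged; at 100 the search sees pure noise, which matches the observation in this thread that level 100 plays essentially randomly.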

The experiment I performed used pure NNUE evaluation at fixed nodes while varying the interpolation parameter. 46 configurations played a round-robin tournament with 50 games per pairing. The following c-chess-cli command was used:

#!/bin/bash
 
c-chess-cli \
    -concurrency 16 \
    -rounds 1 \
    -games 50 \
    -openings file=/home/sopel/nnue/c-chess-cli/noob_3moves.epd order=random -repeat -resign 3 700 -draw 8 10 \
    -pgn tournament.pgn 2 \
    -each tc=1000+1 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_0_100k option.RandomEvalPerturb=0 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_0_90k option.RandomEvalPerturb=0 nodes=90000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_0_80k option.RandomEvalPerturb=0 nodes=80000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_0_70k option.RandomEvalPerturb=0 nodes=70000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_0_60k option.RandomEvalPerturb=0 nodes=60000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_0_50k option.RandomEvalPerturb=0 nodes=50000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_0_40k option.RandomEvalPerturb=0 nodes=40000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_0_30k option.RandomEvalPerturb=0 nodes=30000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_0_20k option.RandomEvalPerturb=0 nodes=20000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_0_10k option.RandomEvalPerturb=0 nodes=10000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_0_5k option.RandomEvalPerturb=0 nodes=5000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_0_4k option.RandomEvalPerturb=0 nodes=4000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_0_3k option.RandomEvalPerturb=0 nodes=3000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_0_2k option.RandomEvalPerturb=0 nodes=2000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_0_1k option.RandomEvalPerturb=0 nodes=1000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_1_100k option.RandomEvalPerturb=1 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_2_100k option.RandomEvalPerturb=2 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_3_100k option.RandomEvalPerturb=3 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_4_100k option.RandomEvalPerturb=4 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_5_100k option.RandomEvalPerturb=5 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_6_100k option.RandomEvalPerturb=6 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_7_100k option.RandomEvalPerturb=7 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_8_100k option.RandomEvalPerturb=8 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_9_100k option.RandomEvalPerturb=9 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_10_100k option.RandomEvalPerturb=10 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_12_100k option.RandomEvalPerturb=12 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_14_100k option.RandomEvalPerturb=14 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_16_100k option.RandomEvalPerturb=16 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_18_100k option.RandomEvalPerturb=18 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_20_100k option.RandomEvalPerturb=20 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_25_100k option.RandomEvalPerturb=25 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_30_100k option.RandomEvalPerturb=30 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_35_100k option.RandomEvalPerturb=35 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_40_100k option.RandomEvalPerturb=40 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_45_100k option.RandomEvalPerturb=45 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_50_100k option.RandomEvalPerturb=50 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_55_100k option.RandomEvalPerturb=55 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_60_100k option.RandomEvalPerturb=60 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_65_100k option.RandomEvalPerturb=65 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_70_100k option.RandomEvalPerturb=70 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_75_100k option.RandomEvalPerturb=75 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_80_100k option.RandomEvalPerturb=80 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_85_100k option.RandomEvalPerturb=85 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_90_100k option.RandomEvalPerturb=90 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_95_100k option.RandomEvalPerturb=95 nodes=100000 \
    -engine cmd=./engines/stockfish/stockfish name=stockfish_pure_100_100k option.RandomEvalPerturb=100 nodes=100000

Naming: stockfish_pure_{RandomEvalPerturb}_{nodes}
Code: Sopel97@56a8a4f
Experiment results: https://drive.google.com/drive/folders/14SZEV6TICedYtNZ2Ym2sFCQynoTBMeQ0?usp=sharing (includes a .pgn with moves saved).
Ordo:

   # PLAYER                     :   RATING  ERROR  POINTS  PLAYED   (%)  CFS(%)
   1 stockfish_pure_1_100k      :      9.8   19.7  1826.0    2250    81      84
   2 stockfish_pure_0_100k      :      0.0   ----  1810.5    2250    80      79
   3 stockfish_pure_2_100k      :     -8.1   19.6  1797.5    2250    80      51
   4 stockfish_pure_4_100k      :     -8.4   20.0  1797.0    2250    80      68
   5 stockfish_pure_0_90k       :    -13.1   19.7  1789.5    2250    80      65
   6 stockfish_pure_3_100k      :    -16.8   19.0  1783.5    2250    79      75
   7 stockfish_pure_5_100k      :    -23.2   19.7  1773.0    2250    79      93
   8 stockfish_pure_6_100k      :    -37.8   19.1  1749.0    2250    78      72
   9 stockfish_pure_0_80k       :    -43.2   19.0  1740.0    2250    77      56
  10 stockfish_pure_8_100k      :    -44.7   19.2  1737.5    2250    77      86
  11 stockfish_pure_7_100k      :    -54.8   19.0  1720.5    2250    76      85
  12 stockfish_pure_0_70k       :    -65.2   19.4  1703.0    2250    76      67
  13 stockfish_pure_9_100k      :    -69.6   19.0  1695.5    2250    75      55
  14 stockfish_pure_10_100k     :    -70.8   20.1  1693.5    2250    75      98
  15 stockfish_pure_0_60k       :    -89.6   19.2  1661.5    2250    74      63
  16 stockfish_pure_12_100k     :    -92.8   19.9  1656.0    2250    74     100
  17 stockfish_pure_0_50k       :   -126.1   19.1  1599.0    2250    71      71
  18 stockfish_pure_14_100k     :   -131.6   19.2  1589.5    2250    71     100
  19 stockfish_pure_16_100k     :   -174.4   19.8  1517.0    2250    67      59
  20 stockfish_pure_0_40k       :   -176.5   19.4  1513.5    2250    67     100
  21 stockfish_pure_18_100k     :   -210.8   19.9  1457.0    2250    65     100
  22 stockfish_pure_0_30k       :   -246.6   19.5  1400.0    2250    62      64
  23 stockfish_pure_20_100k     :   -250.2   20.2  1394.5    2250    62     100
  24 stockfish_pure_0_20k       :   -352.1   21.2  1248.5    2250    55      87
  25 stockfish_pure_25_100k     :   -365.2   20.6  1231.5    2250    55     100
  26 stockfish_pure_30_100k     :   -513.1   24.1  1067.0    2250    47     100
  27 stockfish_pure_0_10k       :   -587.6   25.5   999.5    2250    44     100
  28 stockfish_pure_35_100k     :   -669.2   27.1   933.5    2250    41     100
  29 stockfish_pure_40_100k     :   -831.1   30.0   816.5    2250    36      63
  30 stockfish_pure_0_5k        :   -836.2   29.1   813.0    2250    36     100
  31 stockfish_pure_0_4k        :   -927.1   31.5   752.0    2250    33      99
  32 stockfish_pure_45_100k     :   -966.2   31.4   726.5    2250    32     100
  33 stockfish_pure_0_3k        :  -1031.4   33.1   685.0    2250    30     100
  34 stockfish_pure_50_100k     :  -1161.1   34.6   607.0    2250    27     100
  35 stockfish_pure_0_2k        :  -1209.6   36.0   579.5    2250    26     100
  36 stockfish_pure_55_100k     :  -1333.4   38.0   514.0    2250    23     100
  37 stockfish_pure_0_1k        :  -1426.8   41.5   469.5    2250    21      97
  38 stockfish_pure_60_100k     :  -1464.0   42.1   453.0    2250    20     100
  39 stockfish_pure_65_100k     :  -1690.6   49.2   368.5    2250    16     100
  40 stockfish_pure_70_100k     :  -1936.4   62.2   298.5    2250    13     100
  41 stockfish_pure_75_100k     :  -2076.9   68.5   263.5    2250    12     100
  42 stockfish_pure_80_100k     :  -2360.9   84.0   201.0    2250     9     100
  43 stockfish_pure_85_100k     :  -2581.0   93.3   158.5    2250     7     100
  44 stockfish_pure_90_100k     :  -2913.0  115.3   106.0    2250     5     100
  45 stockfish_pure_95_100k     :  -3417.6  175.1    44.0    2250     2     100
  46 stockfish_pure_100_100k    :  -3677.9  186.7    10.0    2250     0     ---
 
White advantage = 8.06 +/- 1.85
Draw rate (equal opponents) = 55.79 % +/- 0.45

Plots: (plot images omitted; see the results folder linked above)

@vondele
Member

vondele commented Jul 31, 2021

I've played a few games against it with RandomEvalPerturb in the upper quarter, and could at least win.

Instead of percent I would probably use per mille, so it can be tuned a bit more finely.

If we see this as an alternative to Skill Level (which has clear deficiencies), it would be interesting to know whether other places rely on Skill Level (e.g. Lichess?).

It would be nice if some people tried playing a few games against it, to see whether it makes an interesting opponent for a suitable value of RandomEvalPerturb.

@scchess

scchess commented Jul 31, 2021

Thanks! I'm going to play a few test games and report back here. Also, we need to establish an anchor chess engine to measure the Elo. What about measuring it against one of the Elo-calibrated Maia versions?

@scchess

scchess commented Jul 31, 2021

I played a few games. I could easily beat level 100, but suffered losses to level 50 after some close games. Level 50 looked to be somewhere around the 2000-2200 level. Overall it looks cool.

@Sopel97
Member Author

Sopel97 commented Jul 31, 2021

100 is effectively a completely random evaluation, so that's about as expected. Did you play the games at some time control, or at fixed nodes like in my tests? Anything particularly visible about the "style"?

@scchess

scchess commented Jul 31, 2021

I tried it with fixed 5-minute games, not nodes. As a human vs. the AI, I could only play a few games. It was clearly still "computer style", meaning there was strong consistency throughout the games; not sure there is anything you can do about that. However, its ability to assess the position also became a little worse, which is welcome. Next, we should calibrate against an engine with a well-defined human rating, like Maia.

@tillchess

tillchess commented Aug 1, 2021

If we see that as an alternative for Skill Level (which has clear deficiencies), would be interesting to know if other places rely on Skill Level (e.g. Lichess?).

Lichess's play-vs-computer levels 1-8 rely on search depth, UCI_Elo, and move-time values.
Since UCI_Elo and UCI_LimitStrength are used, Lichess therefore relies on Skill Level.

MichaelB7 added a commit to MichaelB7/Stockfish that referenced this issue Aug 17, 2021
…s her to connect the RandomEvalPerturb value and loosely tie it to a desired Elo, based on his quick test results. Through testing, I was able to determine that the random function produces more reliable results at 400K nodes/move (as opposed to 100K nodes/move). Adjusted accordingly. This will be used in my Android Beth Harmon chess app (a clone of DroidFish). The beauty of this method is that it uses move randomization, so a book is not needed, but one can still be used.

Also, turned off pure mode as that had some undesired weakness at weaker levels.

									Calculated RandomEvalPerturb (REP) value, based on Elo
(A)	(B)	(C)	(D)	(E)	(F)	(G)	(H)	(I)	(J)
Factor 1	UCI_Elo	(A)-(B)	Factor 2	(C)/(D)	Factor 3	(E)/(F)	Factor 4	(B)/(H)	(G)+(I)	Sopel Tests official-stockfish#3635
3200	3000	200	2.8	71	10	7	225	13	20	  23 stockfish_pure_20_100k     	20	3011
3200	2900	300	2.8	107	10	10	225	12	22
3200	2800	400	2.8	142	10	14	225	12	26	  25 stockfish_pure_25_100k     	25	2896
3200	2700	500	2.8	178	10	17	225	12	29	  26 stockfish_pure_30_100k     	30	2748
3200	2600	600	2.8	214	10	21	225	11	32
3200	2500	700	2.8	250	10	25	225	11	36	  28 stockfish_pure_35_100k     	35	2592
3200	2400	800	2.8	285	10	28	225	10	38
3200	2300	900	2.8	321	10	32	225	10	42	  29 stockfish_pure_40_100k     	40	2430
3200	2200	1000	2.8	357	10	35	225	9	44	  32 stockfish_pure_45_100k     	45	2295
3200	2100	1100	2.8	392	10	39	225	9	48
3200	2000	1200	2.8	428	10	42	225	8	50	  34 stockfish_pure_50_100k     	50	2100	 * Elo 2000 Anchored to REP 50
3200	1900	1300	2.8	464	10	46	225	8	54	  36 stockfish_pure_55_100k     	55	1928
3200	1800	1400	2.8	500	10	50	225	8	58
3200	1700	1500	2.8	535	10	53	225	7	60	  38 stockfish_pure_60_100k     	60	1797
3200	1600	1600	2.8	571	10	57	225	7	64
3200	1500	1700	2.8	607	10	60	225	6	66	  39 stockfish_pure_65_100k     	65	1570
3200	1400	1800	2.8	642	10	64	225	6	70	  40 stockfish_pure_70_100k     	70	1325
3200	1300	1900	2.8	678	10	67	225	5	72
3200	1200	2000	2.8	714	10	71	225	5	76	  41 stockfish_pure_75_100k     	75	1184
3200	1100	2100	2.8	750	10	75	225	4	79
3200	1000	2200	2.8	785	10	78	225	4	82
vondele added a commit to vondele/Stockfish that referenced this issue Sep 11, 2021
based on Sopel's initial implementation discussed in official-stockfish#3635

In this new scheme, the strength of the engine is limited by replacing a (varying) part of the evaluation with a random perturbation. This scheme is easier to implement than our current skill level implementation, and has the advantage of a wider Elo range, being both weaker than skill level 1 and stronger than skill level 19.

The skill level option is removed; instead, UCI_Elo and UCI_LimitStrength are the only options available.

UCI_Elo is calibrated such that 1500 Elo is equivalent in strength to the engine maia1 (https://lichess.org/@/maia1), which has a blitz rating on lichess of 1500 (based on nearly 600k human games). The full Elo range (750 - 5200) is obtained by playing games between engines roughly 100-200 Elo apart, with the perturbation going from 0 to 1000, and fitting the ordo results. With this fit, a conversion from UCI_Elo to the magnitude of the random perturbation is possible.
All games are played at the lichess blitz TC (5m+3s); playing strength differs at different TCs. Indeed, maia1 is a fixed 1-node leela 'search', independent of TC, whereas this scheme searches normally and improves with TC.

There are a few caveats: it is unclear what the playing style of the engine is; the old skill level was not really satisfactory, and it remains to be seen whether this approach fixes that. Furthermore, while in engine-to-engine matches maia1 and SF@1500Elo are equivalent in strength (at blitz TC), it is not certain that its rating against humans will be the same (engine Elo and human Elo can be very different).

No functional change
@lucasart

lucasart commented Sep 24, 2021

My experience is as follows:

  • Blitz rating 1793 on lichess.
  • I score about 50% against SF level 5 on lichess. That's playing in real blitz conditions, 5'+3", no takebacks, no restarting games when blundering (just counter-attack until SF counter-blunders, same as I would against a human).

For most of the game, the computer plays OK. Weird moves, but nothing terrible. I win by playing the way humans should play against a computer: by stifling the opponent like a boa constrictor.

Eventually, when I break through and the materialistic computer can't find a way out without losing material, it stops resisting and just throws away its pieces.

That is lame. A key skill in chess is learning to win a winning position in the most irrefutable way, leaving no counterplay to the opponent (i.e. so that even full-strength SF would eventually be forced to bend the knee). And you don't learn that if the opponent goes hara-kiri on you.

@Sopel97
Member Author

Sopel97 commented Sep 24, 2021

@lucasart have you tried playing the version described in this issue to see whether there's a difference in style?

@lucasart

lucasart commented Sep 27, 2021

@Sopel97 I will try when my computer is fixed. One thing I notice is that random eval perturbation plus a depth (or nodes?) limit should suffice. Running a multi-PV search behind the scenes complicates things for no reason; I can't see the purpose of it once random eval is introduced. This could be an important simplification.

*Nodes might be better than depth limits because of the endgame, where branching factors shrink and even humans can easily out-calculate a depth-limited engine.

@vondele
Member

vondele commented Sep 27, 2021

There is a more up-to-date branch which gets rid of MultiPV and has a calibrated scheme to set the user Elo here:
https://github.com/vondele/Stockfish/commits/rep
It is calibrated to 1500 Elo against the maia engine on lichess.

@lucasart

lucasart commented Sep 27, 2021

@vondele I tried it at 1500 Elo: 5+3 blitz, no takebacks, FRC (to reduce the opening-knowledge factor and focus purely on gameplay; SF is bookless here). As expected, SF played slightly odd moves (positionally), but destroyed me tactically. I think we still need depth or node limits in combination with eval pollution.

Another problem is aspiration search. Because the search returns polluted scores going all over the place, there is a lot of search inconsistency and re-searching going on. When using skill level, it probably makes sense to switch off aspiration and just search the full +/-INF window directly. The user doesn't see the logs, so you might say it doesn't matter. But it changes time management, and that's the part visible to the user (i.e. lots of search instability makes SF use time more aggressively).
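The aspiration point can be illustrated with a toy model (hypothetical code, not Stockfish's actual search): when a polluted score lands far outside the window centered on the previous score, the aspiration loop has to widen and re-search repeatedly, while a full-window search finishes in one pass.

```cpp
#include <cassert>

// Toy illustration of search instability from noisy evals. search() stands
// in for a real alpha-beta search: it returns the "true score", clamped
// fail-soft style to the (alpha, beta) window.
constexpr int INF = 32000;

int search(int alpha, int beta, int trueScore) {
    if (trueScore <= alpha) return alpha;  // fail low
    if (trueScore >= beta)  return beta;   // fail high
    return trueScore;
}

// Aspiration loop around the previous iteration's score: widen the window
// until the score fits. Returns the number of (re-)searches needed.
int aspiration_searches(int prevScore, int trueScore) {
    int delta = 20, count = 0;
    for (;;) {
        int alpha = prevScore - delta, beta = prevScore + delta;
        ++count;
        int s = search(alpha, beta, trueScore);
        if (s > alpha && s < beta) return count;  // inside the window: done
        delta *= 2;                               // failed: widen and re-search
    }
}
```

A score near the previous one needs a single search, but a noise-driven swing of a few hundred centipawns forces several re-searches, which is the instability that then leaks into time management.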

[Date "2021.09.27"]
[White "Human"]
[Black "vondele_rep"]
[Result "0-1"]
[FEN "bbqnnrkr/pppppppp/8/8/8/8/PPPPPPPP/BBQNNRKR w KQkq - 0 1"]
[GameDuration "00:11:41"]
[GameEndTime "2021-09-27T20:51:29.068 HKT"]
[GameStartTime "2021-09-27T20:39:47.861 HKT"]
[PlyCount "30"]
[SetUp "1"]
[Termination "time forfeit"]
[TimeControl "300+3"]

1. e4 Ne6 {-0.35/21 31s} 2. Nf3 {17s} Nf4 {0.00/21 28s} 3. Re1 {20s}
f5 {+0.74/20 26s} 4. exf5 {19s} e6 {+1.09/20 32s} 5. fxe6 {8.3s}
dxe6 {+1.30/19 25s} 6. Ne3 {15s} b5 {+1.10/20 17s} 7. Qd1 {35s} h5 {0.00/19 39s}
8. d4 {73s} Nd6 {+1.31/20 37s} 9. b3 {6.4s} Rh6 {+0.77/19 35s} 10. O-O {65s}
Ne4 {+3.19/20 23s} 11. c4 {35s} c5 {+0.96/20 29s} 12. dxc5 {2.8s}
Nh3+ {+1.57/17 8.3s} 13. gxh3 {20s} Rg6+ {+4.78/14 0.45s} 14. Ng2 {8.9s}
Rxg2+ {+1.94/19 6.1s} 15. Kxg2 {3.4s} Ng5 {+2.44/15 0.43s, White loses on time}
0-1

Strong chess players are probably laughing at this game ;-)

For the sake of comparison, here is the last game I played against maia5 on lichess in blitz 5+3: https://lichess.org/9pNVarMAimZ6. Basically a draw, and the engine blundered the king+pawn endgame.

@vondele
Member

vondele commented Sep 27, 2021

@lucasart thanks for playing. Yes, I agree the engine is still tactically pretty strong, i.e. it will likely notice if a piece is hanging, and will definitely see if there is a mate in N available (with N much larger than what, e.g., a 1500 Elo player would see). I've also seen the time management aspect.

Maybe limiting depth is an option; one could also think about excessive pruning/LMR.

The Elo calibration is really tricky. 1500 Elo matches maia5 in a direct match quite well; however, for humans it seems clearly stronger.

@tillchess

@vondele I played your version at 1500 with a fixed 1 sec/move time control. At this TC it blundered pieces very often. At 1700 UCI_Elo the engine played better.

@lucasart

@tillchess in this context we are referring to 1500 Elo on the lichess blitz rating scale. This translates to about 1000 Elo blitz on chess.com. Perhaps you are much stronger than that?

@kayn1208

Here's my game against UCI_Elo=1900 https://lichess.org/IGGdlkJB

@lucasart

lucasart commented Sep 29, 2021

Here's my game against UCI_Elo=1900 https://lichess.org/IGGdlkJB

25... Qg3?? is a seppuku move again. You were already winning, of course, but at least the computer was making you work for it, creating threats against your king and preventing exchanges to hinder your progress.

@kayn1208

@lucasart I noticed that it uses more time than usual (it got really low on time around move 20+). Also, the pawn blunders in the opening are kind of strange; I don't think a 1900 would miss that.

@lucasart

lucasart commented Oct 7, 2021

I've done some experimentation in Demolito, and got some fairly satisfying results with the following scheme:

  • Level L goes from 1 to 15. Not using UCI_Elo, but a transformation function (or mapping table) could be calibrated.
  • Depth limit: L <= 10 ? L : 2 * L - 10. This prevents low levels from finding inhuman tactics and deep mates. It also means much less eval noise is needed to achieve the same Elo, resulting in a more "normal" playing style.
  • Nodes limit: 2 ^ (L + 5). This is only meaningful for high levels, L >= 9 or so, because node limits are only checked every 5 ms in Demolito (in a separate thread). Stockfish checks nodes differently, but still, I don't think it can enforce small node limits, so both depth and node limits are needed. The point of nodes is that between L=9 and L=15 we want the node count to only double while the depth limit is incremented by 2. So less depth limit, more node limit, which means more adaptive depth (i.e. long sequences can be calculated in king+pawn endgames with a low effective branching factor, but not so long in complex middlegames; this emulates human tactical ability a bit more).
  • Noise is added to each eval, drawn from a logistic distribution with mean zero and scale factor s = PawnEndgameValue * 0.8 ^ (L - 1) * phaseFactor, where phaseFactor = 1.0 when the side to move has all its pieces and phaseFactor = 0.5 when the stm has no pieces (excl. king and pawns), linearly in between. The point of phaseFactor is to strengthen endgame play relative to the middlegame, to mitigate the fact that depth limitation makes the engine weaker in the endgame relative to the middlegame. The reason for using a logistic distribution is that it has fat tails, so you don't need to make the scale absurdly large to draw large tail values. That way, instead of polluting most evals, you only pollute a few, which is more human-like (sporadic large mistakes rather than pervasive medium mistakes). There are many distributions out there; this one is just the simplest I know that can be formulated mathematically and does the job (fat tails).
  • Another important thing is that it must be non-deterministic, so you want the seed of the PRNG to be unpredictable, and not reset to a fixed value on every game. That way you get an opponent that always plays differently, and you don't just end up repeating the same opening moves until you can win by trial and error (restart the game when you blunder, memorize, repeat). Of course, you could argue that non-determinism should come from the GUI + opening-book random selection. But I wanted it to work nicely without an opening book. After all, opening-book knowledge is part of chess skill in human play, so you don't want level 1 to play with state-of-the-art opening theory and then utter garbage when out of book.
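The scheme above can be sketched as follows (illustrative names only, not Demolito's actual code; `PawnEndgameValue` in centipawns and the piece-counting convention are assumptions):

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Hypothetical sketch of the Demolito-style strength-limiting scheme.
constexpr double PawnEndgameValue = 100.0;  // assumed, in centipawns

// phaseFactor: 1.0 when the side to move has all its pieces, 0.5 with none
// (kings and pawns excluded), linear in between.
double phase_factor(int stmPieces, int maxPieces) {
    return 0.5 + 0.5 * static_cast<double>(stmPieces) / maxPieces;
}

// Logistic noise with mean 0 and scale s, via the inverse CDF:
// x = s * ln(u / (1 - u)) for u uniform on (0, 1). Fat tails give
// sporadic large mistakes rather than pervasive medium ones.
double eval_noise(int level, double phase, std::mt19937& rng) {
    double s = PawnEndgameValue * std::pow(0.8, level - 1) * phase;
    std::uniform_real_distribution<double> uni(1e-9, 1.0 - 1e-9);
    double u = uni(rng);
    return s * std::log(u / (1.0 - u));
}

// Depth limit as described: L for L <= 10, else 2L - 10.
int depth_limit(int level) { return level <= 10 ? level : 2 * level - 10; }

// Node limit 2^(L + 5), only binding at high levels.
long node_limit(int level) { return 1L << (level + 5); }
```

Between L=9 and L=15 the node limit only doubles per level while the depth limit grows by 2, which yields the "adaptive depth" behavior described above.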

Now, we must have realistic expectations about this. It will never be as good as Maia at being human-like. The point here is just to code something that is simple, and somewhat reasonable. Applying a Maia strategy is certainly very interesting, but is a separate project on its own.

@niklasf
Contributor

niklasf commented Oct 7, 2021

By the way playing against weak levels of Stockfish is extremely popular on lichess.org, especially with beginners. I think it's because facing an AI in non-competitive play adds much less pressure than facing a human opponent.

Even level 0 is currently too strong for players that are just starting out. So we're using a patched version that extends the range (fairy-stockfish/Fairy-Stockfish@2329160 + fairy-stockfish/Fairy-Stockfish@f451358).

@lucasart

lucasart commented Oct 8, 2021

Yes, Skill Level=0 is too strong for beginners. It's hard to put ourselves in the shoes of beginners once we have some experience in chess. So I asked an actual beginner to play 2 games against Demolito Level=1, and she couldn't even beat it (although she came close). Stockfish Level=0 is >500 Elo stronger than that...

Rank Name           Elo    +    - games score oppo. draws 
   1 Demolito_L4    789   30   30  1000   86%   396    5% 
   2 Stockfish_L1   618   30   30   800   68%   405    6% 
   3 Stockfish_L0   533   29   29   800   61%   405    5% 
   4 Demolito_L3    527   24   24  1000   56%   449    9% 
   5 Demolito_L2    303   27   27  1000   30%   493    7% 
   6 Demolito_L1      0   43   43  1000    5%   554    3% 

@tillchess

@lucasart This thread started with a new scheme to weaken Stockfish and remove the current pick_best method (which chooses a suboptimal move).

@vondele
Member

vondele commented Apr 30, 2022

For reference, this is the setup used on LiChess https://github.com/lichess-org/fishnet/blob/master/src/api.rs#L208

@tillchess

For reference, this is the setup used on LiChess https://github.com/lichess-org/fishnet/blob/master/src/api.rs#L208

The Lichess website uses Fairy-Stockfish, with an extended Skill Level range from -20 to 20.

@niklasf
Contributor

niklasf commented Apr 30, 2022

Skill levels 0 to 20 in Fairy-Stockfish are equivalent to Stockfish; the negative values extend the range in a straightforward way.

We didn't put too much thought into the parameters on Lichess, basically just trying to get to a point where complete beginners have a shot at beating the lowest level, and then building a mostly arbitrary progression from the lowest to the strongest level.

The weakened play at -20 certainly doesn't feel particularly natural. Patches to improve it (like this one, and maybe #3777, cc @xefoci7612) would be awesome. We could deploy patched versions on Lichess for human testing, but I am not sure we could get actionable feedback from that.

@xefoci7612

@niklasf in case you decide to give #3777 a try, note that there is this little fix to apply.

@PedanticHacker
Contributor

PedanticHacker commented Jun 13, 2022

It would be even more practical for people to have Stockfish skill levels offered in terms of Novice (1000 Elo), Experienced (1600 Elo), Master (2200 Elo), Grandmaster (2600 Elo), World Champion (maximum Elo). Something like that.

Does this idea sound interesting to you?

@PedanticHacker
Contributor

My rationale is that “Skill Level 0”, for example, doesn’t clearly express what strength Stockfish will play at. Maybe “Skill Level: Novice (1000 Elo)” is more human-readable?

@tillchess

> My rationale is that “Skill Level 0”, for example, doesn’t express clearly what kind of strength will Stockfish play at. Maybe “Skill Level: Novice (1000 Elo)” is more human-readable?

Skill 0 isn't 1000 Elo at all, and it makes more sense to leave skill values as numbers.

@PedanticHacker
Contributor

Okay, but then everyone will have to guess what Elo Stockfish plays at when in Skill Level XY mode. What Elo does, for example, Skill Level 0 represent? If not 1000, what then? This will quickly lead to confusion.


Also, I have a question just to clear up a confusion of mine, not related to Skill Level. What is this?
option name UCI_Elo type spin default 1350 min 1350 max 2850
Does this option mean that Stockfish plays at Elo 1350 by default if the option is not overridden to, say, 2850?

@dav1312
Contributor

dav1312 commented Jun 18, 2022

> What Elo does, for example, Skill Level 0 represent? If not 1000, what then?

Skill  Elo
    0  1347
    1  1490
    2  1597
    3  1694
    4  1785
    5  1871
    6  1954
    7  2035
    8  2113
    9  2189
   10  2264
   11  2337
   12  2409
   13  2480
   14  2550
   15  2619
   16  2686
   17  2754
   18  2820
   19  2886
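These measured values suggest a simple way to pick a (possibly fractional) level for a target Elo. The sketch below is illustrative only, using the table above; it is not Stockfish's internal UCI_Elo formula:

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Measured Elo per integer Skill Level (values from the table above).
constexpr std::array<int, 20> kSkillElo = {
    1347, 1490, 1597, 1694, 1785, 1871, 1954, 2035, 2113, 2189,
    2264, 2337, 2409, 2480, 2550, 2619, 2686, 2754, 2820, 2886};

// Map a target Elo to a fractional skill level by linear interpolation
// between the two nearest table entries, clamped to [0, 19].
double skill_for_elo(int elo) {
    if (elo <= kSkillElo.front()) return 0.0;
    if (elo >= kSkillElo.back())  return 19.0;
    auto it = std::upper_bound(kSkillElo.begin(), kSkillElo.end(), elo);
    std::size_t hi = it - kSkillElo.begin();  // first level rated above elo
    std::size_t lo = hi - 1;
    return lo + double(elo - kSkillElo[lo]) / (kSkillElo[hi] - kSkillElo[lo]);
}
```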

> Does this option mean that Stockfish plays at Elo 1350 by default

No: UCI_LimitStrength is false by default, and it needs to be set to true for UCI_Elo to take effect.
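Concretely, both options have to be set before starting a search. A typical UCI session would include something like:

```
setoption name UCI_LimitStrength value true
setoption name UCI_Elo value 1500
go movetime 1000
```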

@PedanticHacker
Contributor

Ah, thank you for all of the given information. This is very helpful.

@PedanticHacker
Contributor

I have, however, one additional question now. Since Skill Level 19 measures at 2886 Elo while Stockfish's UCI_Elo maximum is 2850, how can Skill Level 19 exceed that maximum?

@GerryHickman

I'm using the 'stockfish-15-1.el8.x86_64' rpm from my distro and the 'gnome-chess' front end with skill set to "easy". I get a good game, sometimes winning, sometimes losing. But when I start winning, Stockfish seems to "give up" and starts throwing away pieces. This is counterintuitive; it would be good if it could play at normal strength (or stronger) when it's losing. That would make it more realistic.

@SchulzKilian

I know we are going not only for playing "character" but more for strength, but a big difference I see between beginner play and very low-level engines is that the engine's moves seem almost random, while beginner play mostly follows plans, just ones that aren't thought out long-term: "How can I attack the opponent's queen now? How can I give check?" So I do think that the suggested way of limiting depth might simulate something in that direction.

@chromi

chromi commented Apr 23, 2023

I think we can learn something from how other (open source) engines approach strength limitation. One striking example is Rodent IV, an engine designed very specifically to exhibit recognisable playing styles (defined by personalities) rather than aiming for maximum playing strength. Unfettered, CCRL puts it right at 3000 Elo, which ain't bad, considering.

Rodent can also be configured through UCI options for an Elo rating as low as 800, which it implements by three primary mechanisms:

  1. Limiting the depth of opening-book knowledge it may use. Basically the engine enforces dropping out of book at a game move dependent on the selected Elo. Lower-rated players tend to have less detailed opening knowledge, so this mechanism makes sense as part of strength reduction.
  2. Once out of book, a search-speed limit is implemented by inserting millisecond sleeps whenever the actual search rate exceeds the appropriate value. This value is calculated as an exponential function of the selected Elo rating; it ranges from dozens to millions of nodes per second over the supported range, and results in a reduced search depth that varies fairly appropriately with the time control. It's perhaps noteworthy that this mechanism is inherently more energy-efficient than Stockfish's "artificial centipawn loss" method.
  3. For very low ratings (below about 1500), "evaluation blurring" is also applied at the leaves of the search tree, rather than at the root as Stockfish does. This is accomplished by adding a function of the Zobrist hash to the normal evaluation, so it's consistent even on re-searches of the same node. Again, the scale of the blurring is a function of the selected Elo.
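Mechanisms 2 and 3 are easy to prototype. Below is a minimal sketch; the constants (the nps endpoints, the hash mix, the `magnitude` knob) are my own assumptions for illustration, not Rodent's actual values:

```cpp
#include <cmath>
#include <cstdint>

// Mechanism 2 (sketch): an exponential nodes-per-second budget as a
// function of Elo. Endpoints are assumed, chosen only to span
// "dozens to millions" of nps over the supported range.
double nps_budget(int elo) {
    const double lo_elo = 800.0,  lo_nps = 50.0;   // assumed floor
    const double hi_elo = 2800.0, hi_nps = 2.0e6;  // assumed ceiling
    double t = (elo - lo_elo) / (hi_elo - lo_elo);
    return lo_nps * std::pow(hi_nps / lo_nps, t);  // log-linear in Elo
}

// Mechanism 3 (sketch): deterministic "evaluation blurring" at the leaves,
// derived from the Zobrist key so re-searches of the same node see the
// same perturbation. `magnitude` (centipawns) would shrink as Elo rises.
int blurred_eval(int eval, std::uint64_t zobrist_key, int magnitude) {
    std::uint64_t z = zobrist_key + 0x9E3779B97F4A7C15ULL;  // SplitMix64-style mix
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    z ^= z >> 31;
    // map the mixed key to a symmetric offset in [-magnitude, +magnitude]
    int noise = int(z % (2 * std::uint64_t(magnitude) + 1)) - magnitude;
    return eval + noise;
}
```

An engine would throttle by sleeping whenever its measured search rate exceeds `nps_budget(elo)`, and add the blurred noise only below some Elo threshold.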

Rodent does not support tablebases, but instead has specific knowledge of certain simple endgames (KQK, KRK, KPK, KBNK, and KBBK, IIRC). Since this knowledge is implemented by way of specialised evaluation functions, it is affected to some extent by the second and third strength-limitation mechanisms noted above.

The above mechanisms work fairly well, I think, for limiting strength to the club-player range. Club players have usually developed a definite sense of how to play chess, but are limited in how deeply they can analyse a position or recall their theory. So an engine that plays good, but not so deeply analysed, moves is a good match. Nobody paying attention would be fooled that the playing style is human-like, but it shouldn't feel outright wrong to play against.

Novice players are, I think, a different matter, and could require an entirely different approach for satisfactory results. Novices might not even be familiar with all the basic rules of chess yet, such as en-passant capture, underpromotion of pawns, or even castling; the only opening theory they might know is one or two recommended first moves for White (and what then?). Perhaps they've heard some vague guidelines such as the basic material values, pushing pawns forward and arranging them diagonally for protection, moving pieces into the centre, and keeping the king safely in the corner. A whole lot of the nuances that are encoded into a strong engine's evaluation functions just wouldn't occur to them.

So, for the weakest levels of play, I think you would get a much more realistic style of play with some of the following ideas:

  1. Exclude underpromotion, en-passant capture, and/or castling from the move generator during the search (unless they are somehow the only legal moves), but still accept them as legal moves when supplied or booked. En-passant and underpromotion occur fairly rarely in any case, so castling is probably the "special" move to cut out at the lowest Elo threshold relative to the others.
  2. Consider only "natural" moves: immediate checks, captures, threats of undefended pieces, pawn advances, defences of hanging pieces, or simply developing pieces into the centre or towards the opposing king. Don't bother looking for moves which control weak squares or increase mobility; those are advanced concepts. This mirrors the kinds of moves that a novice would focus on, before they learn about things like piece coordination or positional play.
  3. Assume the opponent plays some proportion of null-moves during the search; use a consistent PRNG such as derivation from the Zobrist hash. This allows the engine to exhibit a coherent "plan" several moves deep, while still neglecting threats and responses that an engine would otherwise find difficult to ignore.
  4. Do not consult tablebases at all. Rely on the search and the middlegame heuristics to resolve the endgame.
  5. Opening books should be shallow, and be tuned to lead to the kinds of openings taught in introductory texts, not to grandmaster lines.
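Idea 3 above could be prototyped in a few lines. In this sketch the multiplicative mixing constant and the `blind_permille` knob are illustrative assumptions, not an existing engine parameter:

```cpp
#include <cstdint>

// Sketch of idea 3: decide deterministically, from the position's Zobrist
// key, whether to model the opponent as "passing" (playing a null move) at
// this node. blind_permille = 0 means a fully attentive opponent; larger
// values make the simulated opponent overlook more threats, while staying
// consistent across re-searches of the same position.
bool opponent_plays_null(std::uint64_t zobrist_key, int blind_permille) {
    std::uint64_t z = zobrist_key * 0x2545F4914F6CDD1DULL;  // cheap multiplicative mix
    return int((z >> 32) % 1000) < blind_permille;
}
```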

Whether it even makes sense to graft such things into Stockfish, I have no idea.

amchess added a commit to amchess/ShashChess that referenced this issue Nov 9, 2023
Fix native build on linux
Eliminated the Stockfish handicap mode and replaced it with
a better one, based on ideas of Tomasz Sobczyk and Michael Byrne
Thanks to
Tomasz Sobczyk official-stockfish/Stockfish#3635
Michael Byrne MichaelB7/Stockfish@18480ca

15 participants