Implement weighted average Q backup operator #1041

DanielUranga · 2019-12-17T01:28:40Z

Changed the way Q averaging works so newer values have a higher weight than the older ones.
Added a new parameter WeightedAverageAlpha. Setting it to 1.0 the old behavior is used, higher values increase the weight newer values have over the older ones.

Attaching graphs showing how the two different backup algorithms adapt to a simulated visit distribution.

WeightedAverageAlpha=2.0:

WeightedAverageAlpha=5.0:

oscardssmith · 2019-12-17T02:13:43Z

I'm pretty sure that this change breaks theoretical convergence of mcts.

Tilps · 2019-12-17T09:16:48Z

I think I've seen a very similar PR tried before.

Unless you've got solid data showing how this actually improves elo vs optimized baseline, I doubt its going anywhere.

ASilver · 2019-12-17T14:15:55Z

I think I've seen a very similar PR tried before.

Unless you've got solid data showing how this actually improves elo vs optimized baseline, I doubt its going anywhere.

Obviously the idea is to not have it get locked in a specific move at deeper nodes due to the visits issue. I ran 1000 games at 6400 nodes with a 128x10 vs LD2 (not selfplay) and got a 9 Elo improvement vs defaults without. I then ran a similar test at 25,600 nodes, only 254 games though, and got a 22 Elo improvement. This was with the value set at 1.5. Other values or longer tests have not been conducted, but it certainly warranted more exploration.

dtracers · 2019-12-19T19:39:07Z

   # PLAYER                  :  RATING  ERROR  POINTS  PLAYED   (%)  CFS(%)    W     D    L  D(%)
   1 lc058613-alpha2.5       :    28.4   15.0   540.0    1000    54      80  303   474  223    47
   2 lc058613-alpha1.5       :    18.6   16.4   534.0    1015    53      55  298   472  245    47
   3 lc058613                :    17.1   15.6   532.5    1016    52      98  317   431  268    42
   4 lc0LD2                  :     0.0   ----  1424.5    3031    47     ---  736  1377  918    45

White advantage = 38.00 +/- 4.66
Draw rate (equal opponents) = 46.09 % +/- 0.91

Tests by aart showing an improvement vs default value at 800 nodes.
So even in low nodes this is showing an improvement.
Direct Link to test results

ASilver · 2019-12-25T22:44:23Z

I decided to not spend a lot of time guessing and ran a CLOP for 1600 nodes of an old 10b against LD2. No more, no less. The only thing that was sought was the ideal value of this setting, and after 15000 samples it was more or less stabilized at 2.1. I then ran a lengthy match (also at 1600 nodes) to see the result.

1.0

Score of DU-lc0-q2 -10b vs lc0-v23 LD2: 333 - 505 - 748 [0.446]
Elo difference: -37.83 +/- 12.42

1586 of 1586 games finished.

2.1 CLOP

Score of DU-lc0-q2 -10b vs lc0-v23 LD2: 357 - 443 - 786 [0.473]
Elo difference: -18.86 +/- 12.13

1586 of 1586 games finished.

zz4032 · 2019-12-27T14:40:07Z

Didn't have much success in my test match vs. Stockfish with WeightedAverageAlpha=2.0.
Also in CLOP tuning values above 1.0 are continuously fading (showing no improvement over default).

   # ENGINE         :  RATING  ERROR  CFS(%)   GAMES  DRAWS(%)  SCORE(%)
   1 lc0-default    :    18.2   22.7    94.3     500      44.8      52.6
   2 stockfish      :     0.0   ----    69.5    1000      45.2      49.1
   3 lc0-alpha2.0   :    -5.6   21.5     ---     500      45.6      49.2

White advantage = 3.5 +/- 8.3
Draw rate (equal opponents) = 45.3 % +/- 1.6

Settings for both lc0: tc=3+0.03 / MinibatchSize=64 / CPuct=2.70 / FpuValue=0.50 / PolicyTemperature=1.5
Average (median) nodes/move: 4734 (lc0-alpha2.0), 5042 (lc0-default)

DanielUranga · 2019-12-27T14:48:15Z

It seems higher alpha value helps more in lower nodes/higher minibatch size situation (this effect was also observed with backup operators with a minimax component).
The optimal overall value is likely lower than 2.0, even lower than 1.5.

@zz4032 Thanks for posting the nps results! This PR is not optimized for speed at all, but it's good to know the difference it makes.

Tilps · 2019-12-27T14:54:42Z

Things helping better with lower nodes seems to be a pattern that is mostly explained by the fact that our default parameters are so bad for lower nodes. Its easy to accidentally emulate some aspect that regains some of the ~60 easy elo available on the table from using default instead of optimized parameters.
ZZ appears to be the first to provide results comparing against an optimized baseline that was optimized for the scenario tested. And unfortunately the do seem consistent with that theory.

DanielUranga · 2019-12-27T15:13:59Z

ZZ's result is very appreciated. The Elo loss is attributable to a combination of:
a) An excessively high alpha parameter.
b) The NPS slowdown (since it was not a fixed nodes match).
c) Time management algorithm assuming "regular" averaging (I'm not sure about this one).

If the newer visits a node gets represent more accurately the "true" value of a position than the older ones, then an alpha value higher than the default 1.0 should warrant a more accurate evaluation of the positions.

An excessive alpha value will weight old values too little in comparison to the newer ones causing the search to be too noisy.

dtracers · 2020-01-13T21:54:33Z

It sounds like this needs:
#1 optimized baseline against SF at nodes (not time)
#2 a change to timing to take into account the new weighting algorithm and how it will effect node catchup ability.

Naphthalin · 2020-04-28T12:29:58Z

As a weighted update is of results is relatively costly and the evidence in this PR is against this implementation (theoretical: violates the convergence assumptions of UCT, empirical: zz's test), this PR should be closed.

DanielUranga · 2020-04-28T14:27:47Z

I don't think this violates the convergence, but I agree this is not going to be ever merged. So closing the PR for cleanup.

Implement weighted average Q update

34610bf

DanielUranga requested a review from Tilps December 17, 2019 01:33

DanielUranga closed this Apr 28, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement weighted average Q backup operator #1041

Implement weighted average Q backup operator #1041

DanielUranga commented Dec 17, 2019

oscardssmith commented Dec 17, 2019

Tilps commented Dec 17, 2019

ASilver commented Dec 17, 2019

dtracers commented Dec 19, 2019 •

edited

Loading

ASilver commented Dec 25, 2019 •

edited

Loading

zz4032 commented Dec 27, 2019

DanielUranga commented Dec 27, 2019

Tilps commented Dec 27, 2019

DanielUranga commented Dec 27, 2019

dtracers commented Jan 13, 2020

Naphthalin commented Apr 28, 2020

DanielUranga commented Apr 28, 2020

Implement weighted average Q backup operator #1041

Implement weighted average Q backup operator #1041

Conversation

DanielUranga commented Dec 17, 2019

oscardssmith commented Dec 17, 2019

Tilps commented Dec 17, 2019

ASilver commented Dec 17, 2019

dtracers commented Dec 19, 2019 • edited Loading

ASilver commented Dec 25, 2019 • edited Loading

zz4032 commented Dec 27, 2019

DanielUranga commented Dec 27, 2019

Tilps commented Dec 27, 2019

DanielUranga commented Dec 27, 2019

dtracers commented Jan 13, 2020

Naphthalin commented Apr 28, 2020

DanielUranga commented Apr 28, 2020

dtracers commented Dec 19, 2019 •

edited

Loading

ASilver commented Dec 25, 2019 •

edited

Loading