
Implement weighted average Q backup operator #1041

Closed

Conversation

DanielUranga
Member

Changed the way Q averaging works so that newer values have a higher weight than older ones.
Added a new parameter, WeightedAverageAlpha. Setting it to 1.0 keeps the old behavior; higher values increase the weight newer values have over older ones.
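For illustration, here is a minimal standalone sketch of how such a weighted backup could look. The i^(alpha-1) visit weighting and the NodeQ/Backup names are assumptions for this example, not the PR's actual code; the sketch only demonstrates the property described above: alpha = 1.0 reproduces the plain running average, and larger alpha weights newer values more heavily.

```cpp
// Sketch only: one plausible weighted Q backup, assuming the i-th visit is
// weighted by i^(alpha - 1), so alpha = 1.0 reduces to the plain running mean.
#include <cmath>
#include <cstdint>
#include <iostream>

struct NodeQ {
  double q = 0.0;           // weighted average of backed-up values
  double weight_sum = 0.0;  // running sum of visit weights
  uint32_t n = 0;           // visit count

  void Backup(double v, double alpha) {
    ++n;
    const double w = std::pow(static_cast<double>(n), alpha - 1.0);
    weight_sum += w;
    q += (w / weight_sum) * (v - q);  // incremental weighted mean
  }
};

int main() {
  NodeQ plain, weighted;
  const double values[] = {-0.2, -0.1, 0.0, 0.3, 0.4};  // later visits trend up
  for (double v : values) {
    plain.Backup(v, /*alpha=*/1.0);
    weighted.Backup(v, /*alpha=*/2.0);
  }
  std::cout << "alpha=1.0 (plain mean):     " << plain.q << "\n"
            << "alpha=2.0 (newer-weighted): " << weighted.q << "\n";
}
```

With the example values above, the plain mean is 0.08 while the alpha=2.0 average is about 0.19, i.e. the weighted backup tracks the later (improving) evaluations faster.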

Attaching graphs showing how the two different backup algorithms adapt to a simulated visit distribution.

WeightedAverageAlpha=2.0: [graph "w2_0"]

WeightedAverageAlpha=5.0: [graph "w5_0"]

@oscardssmith
Contributor

I'm pretty sure that this change breaks the theoretical convergence of MCTS.

@Tilps
Contributor

Tilps commented Dec 17, 2019

I think I've seen a very similar PR tried before.

Unless you've got solid data showing how this actually improves Elo vs an optimized baseline, I doubt it's going anywhere.

@ASilver

ASilver commented Dec 17, 2019

> I think I've seen a very similar PR tried before.
>
> Unless you've got solid data showing how this actually improves Elo vs an optimized baseline, I doubt it's going anywhere.

Obviously the idea is to avoid getting locked into a specific move at deeper nodes due to the visits issue. I ran 1000 games at 6400 nodes with a 128x10 vs LD2 (not selfplay) and got a 9 Elo improvement over defaults. I then ran a similar test at 25,600 nodes (only 254 games, though) and got a 22 Elo improvement. This was with the value set at 1.5. Other values or longer tests have not been run, but it certainly warrants more exploration.

@dtracers

dtracers commented Dec 19, 2019

   # PLAYER                  :  RATING  ERROR  POINTS  PLAYED   (%)  CFS(%)    W     D    L  D(%)
   1 lc058613-alpha2.5       :    28.4   15.0   540.0    1000    54      80  303   474  223    47
   2 lc058613-alpha1.5       :    18.6   16.4   534.0    1015    53      55  298   472  245    47
   3 lc058613                :    17.1   15.6   532.5    1016    52      98  317   431  268    42
   4 lc0LD2                  :     0.0   ----  1424.5    3031    47     ---  736  1377  918    45

White advantage = 38.00 +/- 4.66
Draw rate (equal opponents) = 46.09 % +/- 0.91

Tests by aart show an improvement over the default value at 800 nodes, so even at low node counts this is showing an improvement.
Direct link to test results

@ASilver

ASilver commented Dec 25, 2019

I decided not to spend a lot of time guessing and ran a CLOP tune of an old 10b against LD2 at 1600 nodes. No more, no less. The only thing being tuned was the value of this setting, and after 15,000 samples it had more or less stabilized at 2.1. I then ran a lengthy match (also at 1600 nodes) to see the result.

WeightedAverageAlpha=1.0 (default):

Score of DU-lc0-q2 -10b vs lc0-v23 LD2: 333 - 505 - 748 [0.446]
Elo difference: -37.83 +/- 12.42

1586 of 1586 games finished.

WeightedAverageAlpha=2.1 (CLOP result):

Score of DU-lc0-q2 -10b vs lc0-v23 LD2: 357 - 443 - 786 [0.473]
Elo difference: -18.86 +/- 12.13

1586 of 1586 games finished.
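For readers unfamiliar with this output format: the Elo differences are derived from the score fraction in brackets via the standard logistic model (this conversion is generic, not Lc0-specific); a small sketch:

```cpp
// Converts a match score fraction into an Elo difference using the standard
// logistic model: elo = -400 * log10(1/score - 1).
// E.g. 0.446 -> about -37.7 and 0.473 -> about -18.8, matching (to within
// rounding) the numbers quoted above.
#include <cmath>
#include <cstdio>

double EloFromScore(double score) {
  return -400.0 * std::log10(1.0 / score - 1.0);
}

int main() {
  std::printf("score 0.446 -> %+.1f Elo\n", EloFromScore(0.446));
  std::printf("score 0.473 -> %+.1f Elo\n", EloFromScore(0.473));
}
```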

@zz4032
Contributor

zz4032 commented Dec 27, 2019

I didn't have much success in my test match vs. Stockfish with WeightedAverageAlpha=2.0.
Also, in CLOP tuning, values above 1.0 keep fading (showing no improvement over the default).

   # ENGINE         :  RATING  ERROR  CFS(%)   GAMES  DRAWS(%)  SCORE(%)
   1 lc0-default    :    18.2   22.7    94.3     500      44.8      52.6
   2 stockfish      :     0.0   ----    69.5    1000      45.2      49.1
   3 lc0-alpha2.0   :    -5.6   21.5     ---     500      45.6      49.2

White advantage = 3.5 +/- 8.3
Draw rate (equal opponents) = 45.3 % +/- 1.6

Settings for both lc0: tc=3+0.03 / MinibatchSize=64 / CPuct=2.70 / FpuValue=0.50 / PolicyTemperature=1.5
Average (median) nodes/move: 4734 (lc0-alpha2.0), 5042 (lc0-default)

@DanielUranga
Member Author

It seems a higher alpha value helps more in low-nodes / high-minibatch-size situations (this effect was also observed with backup operators that have a minimax component).
The optimal overall value is likely lower than 2.0, possibly even lower than 1.5.

@zz4032 Thanks for posting the NPS results! This PR is not optimized for speed at all, but it's good to know the difference it makes.

@Tilps
Contributor

Tilps commented Dec 27, 2019

Things helping more at lower nodes is a pattern that is mostly explained by the fact that our default parameters are so bad for low node counts. It's easy to accidentally emulate some aspect that regains some of the ~60 easy Elo left on the table by using default instead of optimized parameters.
ZZ appears to be the first to provide results against a baseline that was optimized for the scenario being tested, and unfortunately they do seem consistent with that theory.

@DanielUranga
Member Author

ZZ's result is much appreciated. The Elo loss is attributable to a combination of:
a) An excessively high alpha parameter.
b) The NPS slowdown (since it was not a fixed-nodes match).
c) The time management algorithm assuming "regular" averaging (I'm not sure about this one).

If the newer visits a node gets represent the "true" value of a position more accurately than the older ones, then an alpha value higher than the default 1.0 should yield a more accurate evaluation of the position.

An excessive alpha value will weight old values too little compared to the newer ones, making the search too noisy.
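To make the noise trade-off concrete, here is an illustrative calculation (again assuming the hypothetical i^(alpha-1) weighting from the earlier sketch, not the PR's exact formula) of how much of the total weight the most recent visits carry as alpha grows:

```cpp
// Illustration: fraction of total backup weight carried by the most recent
// 10% of visits, under a hypothetical i^(alpha - 1) visit weighting.
// Larger alpha discounts older values more aggressively, so a few recent
// (noisier) evaluations dominate the node's Q.
#include <cmath>
#include <cstdio>
#include <initializer_list>

double TailWeightFraction(int n, double alpha, double tail_fraction) {
  double total = 0.0, tail = 0.0;
  const int tail_start = static_cast<int>(n * (1.0 - tail_fraction));
  for (int i = 1; i <= n; ++i) {
    const double w = std::pow(i, alpha - 1.0);
    total += w;
    if (i > tail_start) tail += w;
  }
  return tail / total;
}

int main() {
  for (double alpha : {1.0, 1.5, 2.0, 5.0}) {
    std::printf("alpha=%.1f: last 10%% of 1000 visits carry %4.1f%% of the weight\n",
                alpha, 100.0 * TailWeightFraction(1000, alpha, 0.10));
  }
}
```

Under this assumed weighting, the last 10% of visits carry roughly 10%, 15%, 19%, and 41% of the total weight for alpha = 1.0, 1.5, 2.0, and 5.0 respectively, which matches the intuition that a large alpha makes the average much more sensitive to recent noise.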

@dtracers

It sounds like this needs:
#1 An optimized baseline against SF at fixed nodes (not time).
#2 A change to time management to take the new weighting algorithm into account, including how it will affect the ability to catch up on nodes.

@Naphthalin
Contributor

As a weighted update of results is relatively costly and the evidence in this PR is against this implementation (theoretical: it violates the convergence assumptions of UCT; empirical: zz's test), this PR should be closed.

@DanielUranga
Member Author

I don't think this violates convergence, but I agree it is never going to be merged, so I'm closing the PR for cleanup.
