Implement weighted average Q backup operator #1041
I'm pretty sure that this change breaks the theoretical convergence of MCTS.
I think I've seen a very similar PR tried before. Unless you've got solid data showing how this actually improves Elo vs. an optimized baseline, I doubt it's going anywhere.
Obviously the idea is to not have it get locked into a specific move at deeper nodes due to the visits issue. I ran 1000 games at 6400 nodes with a 128x10 vs LD2 (not selfplay) and got a 9 Elo improvement over the defaults without this change. I then ran a similar test at 25,600 nodes, only 254 games though, and got a 22 Elo improvement. This was with the value set at 1.5. Other values and longer tests have not been tried, but it certainly warrants more exploration.
Tests by aart showing an improvement vs. the default value at 800 nodes.
I decided to not spend a lot of time guessing and ran a CLOP for 1600 nodes of an old 10b against LD2. No more, no less. The only thing that was sought was the ideal value of this setting, and after 15000 samples it was more or less stabilized at 2.1. I then ran a lengthy match (also at 1600 nodes) to see the result.
1.0
Score of DU-lc0-q2 -10b vs lc0-v23 LD2: 333 - 505 - 748 [0.446] 1586 of 1586 games finished.
2.1 CLOP
Score of DU-lc0-q2 -10b vs lc0-v23 LD2: 357 - 443 - 786 [0.473] 1586 of 1586 games finished.
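For readers who want to compare the two match results above on an Elo scale, here is a minimal sketch. It assumes the three numbers in each score line are wins - losses - draws (this reproduces the bracketed scores 0.446 and 0.473) and uses the standard logistic Elo formula. The function names are made up for illustration, and error bars are ignored.

```python
import math


def match_score(wins, losses, draws):
    """Fraction of points scored: a win counts 1, a draw 0.5."""
    games = wins + losses + draws
    return (wins + 0.5 * draws) / games


def elo_from_score(score):
    """Elo difference implied by a score fraction (logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


# Results quoted in the comment above (assumed order: wins, losses, draws).
baseline = elo_from_score(match_score(333, 505, 748))  # alpha = 1.0
tuned = elo_from_score(match_score(357, 443, 786))     # alpha = 2.1 (CLOP)
print(f"alpha=1.0: {baseline:+.1f} Elo, alpha=2.1: {tuned:+.1f} Elo, "
      f"difference: {tuned - baseline:+.1f}")
```

Both runs score below 50% against LD2, so both Elo figures are negative; the quantity of interest is the gap between them, which comes out at roughly 19 Elo in favor of 2.1.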
Didn't have much success in my test match vs. Stockfish with
Settings for both lc0:
It seems a higher alpha value helps more in lower-node/higher-minibatch-size situations (this effect was also observed with backup operators with a minimax component). @zz4032 Thanks for posting the nps results! This PR is not optimized for speed at all, but it's good to know the difference it makes.
Things helping more at lower nodes seems to be a pattern that is mostly explained by the fact that our default parameters are so bad for lower nodes. It's easy to accidentally emulate some aspect that regains some of the ~60 easy Elo left on the table by using default instead of optimized parameters.
ZZ's result is much appreciated. The Elo loss is attributable to a combination of: If the newer visits a node gets represent the "true" value of a position more accurately than the older ones, then an alpha value higher than the default 1.0 should yield a more accurate evaluation of the positions. An excessive alpha value will weight old values too little in comparison to the newer ones, causing the search to be too noisy.
As a weighted update of results is relatively costly and the evidence in this PR is against this implementation (theoretical: violates the convergence assumptions of UCT, empirical: zz's test), this PR should be closed.
I don't think this violates the convergence, but I agree this is not ever going to be merged. So closing the PR for cleanup.
Changed the way Q averaging works so newer values have a higher weight than the older ones.
Added a new parameter WeightedAverageAlpha. Setting it to 1.0 keeps the old behavior; higher values increase the weight newer values have over the older ones.
Attaching graphs showing how the two different backup algorithms adapt to a simulated visit distribution.
WeightedAverageAlpha=2.0:
WeightedAverageAlpha=5.0:
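To make the description above concrete, here is a minimal sketch of one way a recency-weighted Q backup could behave. The exact formula used in this PR is not shown in the thread, so the step size below, `alpha / (n + alpha - 1)`, is purely an assumption chosen because it reduces to the plain running average at alpha = 1.0. The function names are hypothetical and do not come from the lc0 source.

```python
def weighted_backup(q, n, v, alpha=1.0):
    """Back up one new value v into a node's running estimate.

    q: current Q estimate, n: visits so far, v: newly backed-up value.
    The step size alpha / (n_new + alpha - 1) is an illustrative
    assumption, not the PR's actual update rule. Requires alpha > 0.
    """
    n_new = n + 1
    step = alpha / (n_new + alpha - 1.0)
    return q + step * (v - q), n_new


def run_backups(values, alpha=1.0):
    """Apply weighted_backup to a sequence of values and return final Q."""
    q, n = 0.0, 0
    for v in values:
        q, n = weighted_backup(q, n, v, alpha)
    return q


# Simulated visit distribution: the position's value shifts from 0 to 1
# halfway through, as if a deeper search found a refutation.
values = [0.0] * 10 + [1.0] * 10
for alpha in (1.0, 2.0, 5.0):
    print(alpha, run_backups(values, alpha))
```

Under this assumed rule the first visit always sets Q to the backed-up value, alpha = 1.0 gives the ordinary mean, and alpha = 2.0 is equivalent to weighting the i-th visit proportionally to i. Larger alpha values follow the shift faster, at the cost of more noise from recent visits, which is the trade-off discussed in the comments.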