Add AtRoot versions of FpuStrategy and FpuValue (and convert FpuReduction to FpuValue). #750
Conversation
That's intended only for testing, not for merge, right?
Could you elaborate a bit on what you mean by "wider initially like a0"?
This could be for merge if it's less desirable to implement virtual loss and batching more like AlphaZero on TPUs. Root FPU=1 makes it more explicit, matching Matthew's self-play 800-visit expectation of visiting every root child rather than leaving it to the chance of asynchronous parallelism.

@sergiovieri Matthew from DeepMind says: Specifically for the stalemate situation, AZ would probably have looked at the top 16 prior moves, so the 99.90%, 0.05% and half of the 0.00% prior moves.

With unvisited Q=loss and an extremely-close-to-0% prior leading to U~=0, a sequential search would stick with the stalemate move because it thinks all other moves are losing; however, because AZ happens to include 15 other moves, it realizes Q=loss is wrong and Q=near-win. Root FPU=1 here makes all root children get visited, allowing search to then realize every move other than the 99.90% prior is winning.
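Roughly, the selection math is the usual Q+U pick; a minimal sketch (not lc0's actual code, and cpuct/field names are only illustrative) of how a root FPU of 1 changes which child gets chosen:

```cpp
// Sketch of PUCT child selection. Unvisited children take their Q from the FPU
// value; with fpu = -1 (loss) a near-zero-prior child is effectively never
// picked, while fpu = +1 at the root lets every root child win the comparison
// as soon as the visited children's Q+U drops below 1.
#include <cmath>
#include <vector>

struct Child {
  float prior;   // P from the policy head
  int visits;    // N for this child
  float q;       // average value, only meaningful when visits > 0
};

int SelectChild(const std::vector<Child>& children, int parent_visits,
                float cpuct, float fpu) {
  const float sqrt_n = std::sqrt(static_cast<float>(parent_visits));
  int best = 0;
  float best_score = -1e9f;
  for (int i = 0; i < static_cast<int>(children.size()); ++i) {
    const Child& c = children[i];
    const float q = c.visits > 0 ? c.q : fpu;
    const float u = cpuct * c.prior * sqrt_n / (1 + c.visits);
    if (q + u > best_score) {
      best_score = q + u;
      best = i;
    }
  }
  return best;
}
```

With fpu = -1, a 0.00%-prior root child scores about -1 and never outranks the 99.90% move; with fpu = +1 it scores at least 1 and gets its single visit once the visited children's Q+U falls below that.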
This also seems good for training. Getting at least 1 visit will partially prevent self-reinforcing policy sharpening (the policy shaping the tree, the tree shaping the policy).
Instead of setting FPU=1, AZ might have done something similar to LZ Go, namely setting the evaluation of expanding nodes (i.e., nodes whose evaluations/policy aren't ready yet) to a very low value to avoid selecting them (but instead select their siblings):

(However, if you're going to expand all root children, then you don't even need policy at the root and can batch together the evaluation of the root with the evaluations of its children, which is handy.)
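Something like the following is what I mean, just a sketch of the general idea (not LZ's or lc0's actual code; the struct and function names are made up):

```cpp
// Sketch: nodes whose NN evaluation has been requested but hasn't returned yet
// get an effectively -infinite score, so the next selection pass picks one of
// their siblings instead and the batch naturally spreads over many children.
struct PendingAwareChild {
  float prior = 0.0f;
  int visits = 0;
  float q = 0.0f;
  bool eval_pending = false;  // evaluation/policy not ready yet
};

float SelectionScore(const PendingAwareChild& c, float u_term, float fpu) {
  if (c.eval_pending) return -1e9f;     // avoid re-selecting an in-flight node
  const float q = c.visits > 0 ? c.q : fpu;
  return q + u_term;                    // usual Q + U otherwise
}
```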
Yup, AZ's batching behavior is very similar to root FPU=1 for Chess because their eager batching happens to grab most/all of the ~35 children, whereas for Go the ~250 children probably aren't all visited.
I updated the PR realizing we don't want to affect t40 with a new client that didn't support the existing self-play. Edited into the original comment: added fpu-strategy=root-absolute.
Ah, I guess we probably want to also support FPU=-1 for self-play, similar to t40, except at the root. So many configurations… ;) I guess the cleanest is to just make a separate param that does root FPU=1 while allowing absolute and reduction as they are now, and keep the existing noise checks, etc. (with …)
Does this change gain significant Elo over the current search?
@mooskagh Updated approach to add FpuStrategyAtRoot and FpuValueAtRoot.
I ran a selfplay tournament with 50105 with mostly training run 2 params, plus the additional new flag for player1.
The training data is likely a bit improved, and here are some interesting games where the new flag allowed player1 to find very low prior moves that player2 didn't expect, resulting in quick resigns. The training data would likely result in training these priors towards 80% instead of the near-0% they had been. Yes, Dirichlet noise helps find these too, so the change here just makes it a bit more consistent. As usual with training data, the below are just some games that end quickly with wild swings, but the outcome isn't the only interesting result: even for already-won or already-lost positions, the network learns from MCTS providing better search probabilities to generalize.

White thought it was winning (so black thought it was losing) and didn't expect black to move the rook for a discovered check to win a free rook. White resigns without another move.
White thought it was slightly behind and definitely didn't expect a 0.02% prior rook sacrifice to win a queen. White resigns without another move.
Both sides were thinking the game was drawn in this equal-material QRPP endgame, but white, searching wide, finds the lowest-prior (0.01%) winning move, and black resigns a couple of moves later.
White thought it was losing initially before searching wide to find this 0.09% rook sacrifice to win a queen. Black resigns without another move.
To be clear, in the positions in the previous comment, the resigning side played a blundering move because it didn't expect the tactic. The idea is that training improves the priors for the tactical move so that the network learns to avoid putting itself in a bad situation. The overall behavior is that search goes as deep as it can according to the network's knowledge, and searches wide at the root to find the network's blind spots. The root is special because that's what is used to train future move probabilities.
One potential open question is how early/urgent visiting children should be with this change. Right now I've set it to 1.0f, but U for already-visited winning positions would result in Q+U greater than that of very-low-prior unvisited moves. Making the value 2.0f would make it more likely but still not guaranteed; e.g., the stalemate capture-promote-to-queen position has Q+U of almost 4.
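For concreteness, here are some made-up numbers (cpuct, visit counts, and priors are all illustrative, not measured) showing how a visited near-winning child can still edge out an unvisited low-prior child at root FPU = 1.0:

```cpp
#include <cmath>
#include <cstdio>

int main() {
  const float cpuct = 3.0f;        // illustrative exploration constant
  const int parent_visits = 800;
  const float sqrt_n = std::sqrt(static_cast<float>(parent_visits));  // ~28.3

  // Visited child: dominant prior, near-winning Q, most of the visits.
  const float visited = 0.9f + cpuct * 0.999f * sqrt_n / (1 + 700);   // ~1.02

  // Unvisited child: tiny prior, Q comes from the root FPU value.
  const float fpu_root = 1.0f;
  const float unvisited = fpu_root + cpuct * 0.0001f * sqrt_n / (1 + 0);  // ~1.01

  std::printf("visited Q+U = %.3f, unvisited Q+U = %.3f\n", visited, unvisited);
  return 0;
}
```

Bumping the root FPU to 2.0 flips this particular comparison, but as noted above, a strongly winning visited child with a big U term can still exceed it.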
The only thing I don't like about this is that it doesn't really transfer to other games like Go. In Go, exploring every move at the root would be bad; a balance would need to be found, which would probably be some kind of FPU = running parent value. But in Chess I don't see any issues.
Indeed, visiting all root children for Go isn't the behavior we noticed in #748. It's just that, as a simplification, AZ's behavior of visiting 26 root children out of 33 for that position is pretty close to visiting all of them without trying to emulate the TPU behavior more precisely. I suppose a closer approximation would be to allow some number of root children to always be visited, and that could then be more directly reused for Go. But then it's unclear if this behavior is desired for Go anyway…
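For what it's worth, the "always visit some number of root children" variant could look roughly like this (a sketch of the idea only, not something in this PR; the helper name and the k cutoff are made up):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct RootChild {
  float prior;
  int visits;
};

// Returns the index of the highest-prior root child among the top k that has
// no visits yet, or -1 if they all have at least one (fall back to normal Q+U).
int ForcedRootChild(const std::vector<RootChild>& children, int k) {
  std::vector<int> order(children.size());
  for (std::size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
  std::sort(order.begin(), order.end(), [&](int a, int b) {
    return children[a].prior > children[b].prior;
  });
  const int limit = std::min<int>(k, static_cast<int>(order.size()));
  for (int i = 0; i < limit; ++i) {
    if (children[order[i]].visits == 0) return order[i];
  }
  return -1;
}
```

For chess, k could cover essentially all legal moves; for Go, something in the range AZ was observed to visit (the top ~16-26 children) would avoid touching all ~250.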
This PR makes the search wider at the root, which could interact badly with endgame temp, as any move with visits could be picked by temp. On the other hand, searching wider at the root should help find good moves with low prior.
In addition to these settings, master used …

Then, to test if this is caused by endgame temp, I ran a match with …

It indeed seems that pr750 might not play that well together with endgame temp, but the effect is not big.
Thanks for running the numbers with 32930. Just making sure: in the latter numbers, without endgame temperature and with more match-like settings over 2500 games, a very mature network is able to gain 9 Elo when visiting root children earlier? There are many ways to adjust temperature depending on how the first set of games was lost. If they were lost purely from 1-visit moves getting picked, then a lower temperature or a more negative visit-offset could be used. If they were lost from playing down bad lines because the network overvalues positions that would have been hidden by search not visiting low priors, then it's actually a beneficial outcome for the network to lessen the value of those positions for future networks.
Yes, 9 +/- 4 Elo, but the setting is still training-like, just with zero endgame temp. Also, my next question was whether the temp could be counteracted with a visit offset of -1, since presumably most of the root blunders get no more than one visit. I have a match running with offset -0.999 (for some reason -1 is not possible).
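My understanding of why the offset counteracts temp, as a rough sketch (my reading of the discussion, not lc0's exact sampling code; the max(0, visits + offset)^(1/T) weighting is an assumption here):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Picks a root move index with probability proportional to
// max(0, visits + offset)^(1/temperature). With an offset near -1, moves that
// got only the single forced root visit end up with essentially zero weight.
int PickMoveWithTemperature(const std::vector<int>& visits, double temperature,
                            double visit_offset, std::mt19937& rng) {
  std::vector<double> weights(visits.size());
  for (std::size_t i = 0; i < visits.size(); ++i) {
    const double base =
        std::max(0.0, static_cast<double>(visits[i]) + visit_offset);
    weights[i] = std::pow(base, 1.0 / temperature);  // assumes temperature > 0
  }
  // Assumes at least one move keeps a positive weight (the most-visited one does).
  std::discrete_distribution<std::size_t> dist(weights.begin(), weights.end());
  return static_cast<int>(dist(rng));
}
```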
@jkormu, are you sure you aren't testing one version against itself? 9 Elo seems way too small.
@oscardssmith Double-checked, and it seems to be master vs pr750. What kind of Elo gains are you expecting with this PR in a training-like setting? Please note that in the above tests master and pr750 had identical settings (including all noise settings), with the only difference being --fpu-strategy. The test did not state that dropping endgame temp from 0.45 to 0.0 gains 9 Elo; in the first match both used 0.45 endgame temp and in the latter both used 0.0.
It's not so much that I have specific expectations as that I get suspicious when I see multiple tests within 10 Elo of each other, as I've been burned by this before.
Results for temp …

The test was with 800 nodes and a repeated 2-move book, and both engines shared the following settings: … The only setting that was different was …
Here's another 95+ LOS/CFS result from a more recent t50 network (50180), but this time with accelerated selfplay settings (lower visits, higher resign, temperature cutoff; all games were different):
@mooskagh Why is this marked "not for merge"? Are there other plans to implement a more AlphaZero-like search, which happens to improve training of tactics? It seems possible that lc0 has tried to work around tactical deficiency by additionally diverging from AlphaZero with a 2.2 policy softmax temperature. So is there an explicit decision to not behave more like AlphaZero?
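For context on the 2.2 figure: a softmax temperature above 1 flattens the network's move priors so low-prior moves keep more probability mass. A generic sketch of the transform (where exactly lc0 applies it, and to what, I'm not asserting here):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Generic softmax with temperature: T = 1 reproduces the raw policy,
// T > 1 (e.g. 2.2) flattens it, pushing mass toward low-prior moves.
std::vector<float> SoftmaxWithTemperature(const std::vector<float>& logits,
                                          float temperature) {
  std::vector<float> probs(logits.size());
  if (logits.empty()) return probs;
  const float max_logit = *std::max_element(logits.begin(), logits.end());
  float sum = 0.0f;
  for (std::size_t i = 0; i < logits.size(); ++i) {
    probs[i] = std::exp((logits[i] - max_logit) / temperature);  // stable exp
    sum += probs[i];
  }
  for (float& p : probs) p /= sum;
  return probs;
}
```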
It didn't have that behaviour switchable off in the params, which is why I decided it was just for testing.
@mooskagh Updated PR with your suggestion of FpuStrategyAtRoot and FpuValueAtRoot, and at the same time converted FpuReduction to FpuValue. I made "same" ignore FpuValueAtRoot and documented the help as such to avoid the complexity of thinking "same" might mean "if strategy is reduction, reduce eval by --fpu-value for most nodes except reduce root children by --fpu-value-at-root instead." This maintains the default selfplay behavior of using fpu reduction 0, and for plain defaults, keeps the default fpu reduction of 1.2 and adds fpu root absolute 1. This does change the behavior of only specifying --fpu-strategy=absolute: before, --fpu-value defaulted to -1, and it now defaults to 1.2 (for the default reduction strategy), so both active training runs should use an explicit --fpu-value=-1 to maintain fpu absolute -1. @Tilps If the server wants to try training with an fpu reduction of 0.5 while leaving root children unchanged, it should set …
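To spell out the intended semantics, a simplified sketch of what's described above (not the PR's actual code; the real reduction strategy also involves other terms):

```cpp
#include <string>

// "absolute" uses the FPU value directly as the unvisited-child eval;
// "reduction" subtracts the value from the parent's eval (simplified here).
float FpuFor(const std::string& strategy, float value, float parent_value) {
  if (strategy == "absolute") return value;
  return parent_value - value;  // "reduction"
}

// At-root strategy "same" falls back to the non-root strategy *and* value,
// ignoring the at-root value entirely.
float GetFpu(bool at_root, const std::string& strategy, float value,
             const std::string& strategy_at_root, float value_at_root,
             float parent_value) {
  if (at_root && strategy_at_root != "same") {
    return FpuFor(strategy_at_root, value_at_root, parent_value);
  }
  return FpuFor(strategy, value, parent_value);
}
```

So with the plain defaults described above (reduction with value 1.2, plus root absolute 1), non-root unvisited children are reduced from the parent eval while root children start at 1.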
@Tilps The server should probably set --fpu-value=-1.0.
--fpu-value=-1.0 added to training parameters.
@Mardak So are you saying that in your preliminary tests, FPUR at 0.5 in training conditions actually increases exploration and strength?
r? @mooskagh

Wasn't sure if this should be conditional on self-play / noise, or if it needs another param -- without one, the logic simplifies a little bit.

Edit: Added fpu-strategy=root-absolute to do absolute at root and reduction otherwise. Self-play maintains the previous default fpu-strategy=reduction and still sets fpu-reduction=0, so that's unaffected by the code simplification of not checking noise (but it would prevent wanting to use reduction with noise except at root -- do we want to keep that?). To turn on the new behavior for training, the server would set fpu-strategy=root-absolute (with fpu-value=1 fpu-reduction=0 still set on the client). Added an --early-root-widening flag which changes FPU behavior at root only for search (leaving verbose, temperature, and other GetFPU callers unaffected).

Edit: Added the suggested FpuStrategyAtRoot and FpuValueAtRoot, and at the same time converted FpuReduction to FpuValue. I made "same" ignore FpuValueAtRoot and documented the help as such to avoid the complexity of thinking "same" might mean "if strategy is reduction, reduce eval by --fpu-value for most nodes except reduce root children by --fpu-value-at-root instead."
Analyzing the position from page 81 (fix #748) again with 32930 with minibatch-size=1 smart-pruning-factor=0 (expecting to explore more than the top 4):

This patch also happens to avoid @sergiovieri’s stalemate capture to queen promotion with t40 (41000 used here with policy-softmax-temp=1; expecting to find the non-stalemate move):

Bonus: 32930 trying to find this “houdini tactic” from the tactics list (fix #8; expecting the Qxe7 sacrifice):