Add AtRoot versions of FpuStrategy and FpuValue (and convert FpuReduction to FpuValue). #750

Merged · 2 commits · Mar 9, 2019

Conversation

@Mardak (Contributor) commented Feb 21, 2019

r? @mooskagh I wasn't sure whether this should be conditional on self-play / noise, or whether it needs another param -- without one, the logic simplifies a little.

Edit: Added fpu-strategy=root-absolute, which uses absolute FPU at the root and reduction otherwise. Self-play keeps the previous default fpu-strategy=reduction and still sets fpu-reduction=0, so it's unaffected by the code simplification of not checking noise (though this prevents using reduction with noise anywhere except the root -- do we want to keep that option?). To turn on the new behavior for training, the server would set fpu-strategy=root-absolute (with fpu-value=1 fpu-reduction=0 still set on the client).

Edit: Added an --early-root-widening flag, which changes FPU behavior at the root only for search (leaving verbose output, temperature, and other GetFPU callers unaffected).

Edit: Added the suggested FpuStrategyAtRoot and FpuValueAtRoot, and at the same time converted FpuReduction to FpuValue. I made "same" ignore FpuValueAtRoot and documented the help text accordingly, to avoid the confusion of "same" possibly meaning "if the strategy is reduction, reduce the eval by --fpu-value for most nodes but reduce root children by --fpu-value-at-root instead."
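
For intuition, here is a minimal sketch of how the strategy/value pairs could compose. The struct, accessor names, and the simplified reduction formula are illustrative assumptions for this comment, not the PR's actual code:

```cpp
#include <string>

// Hypothetical bundle of the four FPU options discussed in this PR.
struct FpuParams {
  std::string strategy = "reduction";         // "reduction" or "absolute"
  float value = 1.2f;                         // reduction amount or absolute Q
  std::string strategy_at_root = "absolute";  // "reduction", "absolute", or "same"
  float value_at_root = 1.0f;                 // ignored when strategy_at_root == "same"
};

// First-play-urgency Q assigned to an unvisited child (simplified: the real
// reduction strategy also accounts for already-visited policy mass).
float GetFpuSketch(const FpuParams& p, float parent_q, bool is_root_node) {
  const bool at_root = is_root_node && p.strategy_at_root != "same";
  const std::string& strategy = at_root ? p.strategy_at_root : p.strategy;
  const float value = at_root ? p.value_at_root : p.value;
  return strategy == "absolute" ? value : parent_q - value;
}
```

With these defaults, a non-root unvisited child starts at roughly parent_q - 1.2, while every unvisited root child starts at Q = 1 (a win), which is what forces the root to widen.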

Analyzing the position from page 81 (fix #748) again with 32930, using minibatch-size=1 smart-pruning-factor=0 (expecting it to explore more than the top 4 moves):
[board screenshot]

position startpos moves g1f3 g8f6 c2c4 e7e6 b1c3 f8b4 d1c2 e8g8 a2a3 b4c3 c2c3 a7a5 b2b4 d7d6 e2e3 f6e4 c3c2 e4g5 b4b5 g5f3 g2f3 d8f6 d2d4 f6f3 h1g1 b8d7 f1e2 f3f6 c1b2 f6h4 g1g4 h4h2 g4g3 f7f5 e1c1 f8f7 e2f3 h2h4 d1h1 h4f6 c1b1 g7g6 g3g1 a5a4 b1a1 f7g7 e3e4 f5f4 c4c5 f6e7 g1c1 d7f6 e4e5 d6e5 h1e1 e5e4 f3e4 e7f8
go nodes 68

before:
…
c2c4  (264 ) N:   0 (+ 0) (P:  3.59%) (Q: -0.66368) (U: 0.88393) (Q+U:  0.22025) (V:  -.----) 
e4d3  (783 ) N:   9 (+ 0) (P: 16.80%) (Q:  0.01363) (U: 0.41335) (Q+U:  0.42698) (V:  0.1071) 
d4d5  (761 ) N:  10 (+ 0) (P: 12.22%) (Q:  0.09005) (U: 0.27343) (Q+U:  0.36348) (V:  0.6982) 
e4f3  (785 ) N:  22 (+ 0) (P: 11.60%) (Q:  0.29520) (U: 0.12411) (Q+U:  0.41931) (V:  0.2686) 
c5c6  (973 ) N:  26 (+ 0) (P: 13.87%) (Q:  0.28326) (U: 0.12639) (Q+U:  0.40965) (V:  0.3732) 

after:
c2b3  (258 ) N:   1 (+ 0) (P:  0.59%) (Q: -0.97691) (U: 0.07363) (Q+U: -0.90328) (V: -0.9769) 
e4d5  (795 ) N:   1 (+ 0) (P:  0.83%) (Q: -0.98659) (U: 0.10328) (Q+U: -0.88331) (V: -0.9866) 
e4f5  (797 ) N:   1 (+ 0) (P:  0.86%) (Q: -0.96944) (U: 0.10715) (Q+U: -0.86229) (V: -0.9694) 
e1e3  (111 ) N:   1 (+ 0) (P:  0.84%) (Q: -0.92453) (U: 0.10422) (Q+U: -0.82031) (V: -0.9245) 
c2a4  (262 ) N:   1 (+ 0) (P:  0.55%) (Q: -0.85741) (U: 0.06828) (Q+U: -0.78913) (V: -0.8574) 
e4c6  (799 ) N:   1 (+ 0) (P:  0.71%) (Q: -0.71045) (U: 0.08829) (Q+U: -0.62217) (V: -0.7105) 
e4g6  (803 ) N:   1 (+ 0) (P:  0.72%) (Q: -0.47933) (U: 0.08895) (Q+U: -0.39038) (V: -0.4793) 
a1b1  (0   ) N:   1 (+ 0) (P:  1.11%) (Q: -0.00074) (U: 0.13756) (Q+U:  0.13682) (V: -0.0007) 
b2c3  (231 ) N:   1 (+ 0) (P:  1.29%) (Q: -0.02199) (U: 0.15988) (Q+U:  0.13789) (V: -0.0220) 
a1a2  (7   ) N:   1 (+ 0) (P:  1.12%) (Q:  0.03331) (U: 0.13893) (Q+U:  0.17224) (V:  0.0333) 
c1b1  (48  ) N:   1 (+ 0) (P:  1.23%) (Q:  0.07281) (U: 0.15246) (Q+U:  0.22526) (V:  0.0728) 
e1f1  (101 ) N:   1 (+ 0) (P:  1.27%) (Q:  0.11431) (U: 0.15737) (Q+U:  0.27169) (V:  0.1143) 
b5b6  (941 ) N:   1 (+ 0) (P:  1.64%) (Q:  0.10347) (U: 0.20353) (Q+U:  0.30699) (V:  0.1035) 
e1d1  (100 ) N:   1 (+ 0) (P:  1.41%) (Q:  0.17411) (U: 0.17497) (Q+U:  0.34907) (V:  0.1741) 
c2c3  (259 ) N:   1 (+ 0) (P:  1.44%) (Q:  0.19155) (U: 0.17794) (Q+U:  0.36950) (V:  0.1916) 
c2d2  (252 ) N:   1 (+ 0) (P:  1.80%) (Q:  0.16386) (U: 0.22263) (Q+U:  0.38649) (V:  0.1639) 
c2b1  (246 ) N:   1 (+ 0) (P:  1.63%) (Q:  0.19083) (U: 0.20183) (Q+U:  0.39265) (V:  0.1908) 
c2d1  (248 ) N:   1 (+ 0) (P:  1.62%) (Q:  0.22839) (U: 0.20069) (Q+U:  0.42908) (V:  0.2284) 
e1e2  (106 ) N:   1 (+ 0) (P:  1.55%) (Q:  0.25640) (U: 0.19218) (Q+U:  0.44858) (V:  0.2564) 
c2d3  (260 ) N:   1 (+ 0) (P:  1.75%) (Q:  0.24711) (U: 0.21667) (Q+U:  0.46379) (V:  0.2471) 
e4b7  (804 ) N:   1 (+ 0) (P:  0.71%) (Q:  0.38252) (U: 0.08786) (Q+U:  0.47038) (V:  0.3825) 
c1d1  (49  ) N:   1 (+ 0) (P:  1.54%) (Q:  0.28285) (U: 0.19137) (Q+U:  0.47423) (V:  0.2829) 
e1h1  (103 ) N:   1 (+ 0) (P:  1.97%) (Q:  0.24280) (U: 0.24410) (Q+U:  0.48690) (V:  0.2428) 
e1g1  (102 ) N:   1 (+ 0) (P:  2.36%) (Q:  0.21439) (U: 0.29309) (Q+U:  0.50748) (V:  0.2144) 
f2f3  (346 ) N:   1 (+ 0) (P:  2.47%) (Q:  0.22276) (U: 0.30605) (Q+U:  0.52881) (V:  0.2228) 
e4g2  (781 ) N:   1 (+ 0) (P:  2.99%) (Q:  0.16545) (U: 0.37093) (Q+U:  0.53638) (V:  0.1654) 
c2e2  (253 ) N:   2 (+ 0) (P:  2.34%) (Q:  0.19584) (U: 0.19319) (Q+U:  0.38903) (V:  0.2921) 
e4h1  (776 ) N:   2 (+ 0) (P:  3.57%) (Q:  0.11043) (U: 0.29533) (Q+U:  0.40576) (V:  0.1988) 
c2c4  (264 ) N:   2 (+ 0) (P:  3.59%) (Q:  0.18332) (U: 0.29684) (Q+U:  0.48016) (V:  0.2804) 
e4f3  (785 ) N:   5 (+ 0) (P: 11.60%) (Q:  0.05569) (U: 0.47931) (Q+U:  0.53500) (V:  0.2686) 
d4d5  (761 ) N:   6 (+ 0) (P: 12.22%) (Q:  0.07366) (U: 0.43289) (Q+U:  0.50655) (V:  0.6982) 
e4d3  (783 ) N:   8 (+ 0) (P: 16.80%) (Q:  0.03414) (U: 0.46271) (Q+U:  0.49684) (V:  0.1071) 
c5c6  (973 ) N:  17 (+ 0) (P: 13.87%) (Q:  0.35679) (U: 0.19100) (Q+U:  0.54779) (V:  0.3732)

This patch also happens to avoid @sergiovieri's stalemating capture-and-promote-to-queen with t40 (network 41000 used here with policy-softmax-temp=1; expecting it to find a non-stalemating move):
[board screenshot]

position fen 5n2/4P2k/8/8/8/2K3Q1/8/8 w - - 0 1
go nodes 800

before:
…
e7f8r (1832) N:   0 (+ 0) (P:  0.05%) (Q: -1.19817) (U: 0.04482) (Q+U: -1.15335) (V:  -.----) 
e7f8q (1831) N: 799 (+ 0) (P: 99.90%) (Q:  0.00000) (U: 0.10854) (Q+U:  0.10854) (V:  0.0000) (T) 

after:
…
e7f8q (1831) N:  86 (+ 0) (P: 99.90%) (Q:  0.00000) (U: 0.99810) (Q+U:  0.99810) (V:  0.0000) (T) 
g3g1  (600 ) N: 115 (+ 1) (P:  0.00%) (Q:  0.99785) (U: 0.00000) (Q+U:  0.99785) (V:  0.9981) 
e7f8r (1832) N: 134 (+ 0) (P:  0.05%) (Q:  0.99622) (U: 0.00033) (Q+U:  0.99656) (V:  0.9994) 
e7e8q (1828) N: 244 (+ 0) (P:  0.00%) (Q:  0.99420) (U: 0.00000) (Q+U:  0.99420) (V:  0.9985) 

Bonus: 32930 trying to find this “houdini tactic” from the tactics list (fix #8; expecting the Qxe7 sacrifice):
[board screenshot]

position startpos moves d2d4 e7e6 c2c4 f8b4 c1d2 b4e7 e2e4 d7d5 e4e5 c7c5 d1g4 e7f8 d4c5 h7h5 g4g3 h5h4 g3a3 b8d7 g1f3 f8c5 b2b4 c5b6 d2g5 g8e7 a3b2 h8h5 c4d5 e6d5 f1b5 e8f8 e1g1 d7e5 b2e5 f7f6 e5f4 b6c7 f4e3 f6g5 b1c3 d8d6 b5d3 c7b6 e3e2 h4h3 f1e1 g5g4 f3e5 h5g5 e5g6 g5g6 d3g6 c8d7 g6h5 a8c8 a1c1 c8c4
go nodes 800

before:
…
e2e7  (330 ) N:   0 (+ 0) (P:  0.81%) (Q: -1.44808) (U: 0.70426) (Q+U: -0.74382) (V:  -.----) 
…
c3a4  (483 ) N:  21 (+ 0) (P:  8.25%) (Q: -0.58938) (U: 0.32603) (Q+U: -0.26336) (V: -0.4660) 
e1d1  (100 ) N:  22 (+ 0) (P:  2.83%) (Q: -0.38214) (U: 0.10682) (Q+U: -0.27532) (V: -0.5075) 
g2h3  (375 ) N:  98 (+ 0) (P: 13.42%) (Q: -0.38267) (U: 0.11784) (Q+U: -0.26484) (V: -0.1010) 
e2d2  (311 ) N: 540 (+ 0) (P:  3.21%) (Q: -0.26595) (U: 0.00516) (Q+U: -0.26079) (V: -0.5581) 

after:
…
c3a4  (483 ) N:  16 (+ 0) (P:  8.25%) (Q: -0.46731) (U: 0.42192) (Q+U: -0.04539) (V: -0.4660) 
e1d1  (100 ) N:  21 (+ 0) (P:  2.83%) (Q: -0.24851) (U: 0.11168) (Q+U: -0.13683) (V: -0.5075) 
g2h3  (375 ) N:  38 (+ 0) (P: 13.42%) (Q: -0.37856) (U: 0.29913) (Q+U: -0.07944) (V: -0.1010) 
e2d2  (311 ) N: 200 (+ 0) (P:  3.21%) (Q: -0.18384) (U: 0.01390) (Q+U: -0.16994) (V: -0.5581) 
e2e7  (330 ) N: 437 (+ 1) (P:  0.81%) (Q:  0.76966) (U: 0.00160) (Q+U:  0.77127) (V: -0.2049) 

@mooskagh added the "not for merge" label (Experimental code which is not intended to be merged into master) on Feb 21, 2019
@mooskagh (Member)

That's intended only for testing, not for merge, right?

@sergiovieri

Could you elaborate a bit on what you mean by "wider initially like a0"?

@Mardak (Contributor, Author) commented Feb 22, 2019

This could be for merge if implementing virtual loss and batching more like AlphaZero on TPUs is less desirable. Root FPU=1 more explicitly matches Matthew's expectation that self-play with 800 visits visits every root child, rather than leaving that to the chance of asynchronous parallelism.

@sergiovieri Matthew from DeepMind says that "in very early stages we fill batches with nodes that we otherwise wouldn't have searched" (from #748 (comment)). So in the very first batch it happens to pick the 16 highest-prior moves to evaluate, because a move pending evaluation is assumed to be bad/loss/checkmate and the next pick/simulation skips over it to fill up the batch. Depending on the hardware, latency, and threading, even more moves can get picked that a sequential MCTS would not have.

Specifically for the stalemate situation, AZ would probably have looked at the top 16 prior moves, i.e. the 99.90% move, the 0.05% move, and half of the 0.00% prior moves. With unvisited Q=loss and a prior extremely close to 0% leading to U~=0, a sequential search would stick with the stalemate move because it thinks all other moves are losing; however, because AZ happens to include 15 other moves, it realizes Q=loss is wrong and Q is near-win. Root FPU=1 here makes all root children get visited, allowing search to realize that every move other than the 99.90%-prior one is winning.
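
For reference, the selection rule behind this reasoning is the standard PUCT score, with FPU standing in for the Q of unvisited children (schematic form, not lc0's exact constants):

```latex
\mathrm{score}(a) \;=\; Q(a) \;+\; c_{\mathrm{puct}} \, P(a) \, \frac{\sqrt{N_{\mathrm{parent}}}}{1 + N(a)},
\qquad Q(a) = \mathrm{FPU} \ \text{ if } N(a) = 0.
```

With FPU = -1 and P(a) close to 0, both terms sit near their minimum and the move is effectively never selected; with root FPU = +1, every unvisited root child scores at least 1 until it gets its first visit.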

@oscardssmith (Contributor)

This also seems good for training. Getting at least one visit will partially prevent self-reinforcing policy sharpening (policy shaping the tree, the tree shaping policy).

@alreadydone commented Feb 22, 2019

Instead of setting FPU=1, AZ might have done something similar to LZ Go, namely setting the evaluation of expanding nodes (i.e. nodes whose evaluations/policies aren't ready yet) to a very low value, to avoid selecting them and instead select their siblings:
https://github.com/leela-zero/leela-zero/blob/master/src/UCTNode.cpp#L268

(However, if you're going to expand all root children, then you don't even need policy at the root and can batch the evaluation of the root together with the evaluations of its children, which is handy.)
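
As a rough illustration of that LZ-Go idea (a sketch of the concept only; the field and function names are made up here, and this is not leela-zero's actual code):

```cpp
// During selection, treat a child whose network evaluation is still in flight
// as a near-certain loss, so the search batches up its siblings instead.
struct ChildSketch {
  bool eval_in_flight = false;  // expansion queued, NN result not back yet
  int visits = 0;
  float total_value = 0.0f;     // sum of backed-up values
};

float EffectiveQ(const ChildSketch& child, float fpu) {
  if (child.eval_in_flight) return -1.0f;  // avoid re-selecting the pending node
  if (child.visits == 0) return fpu;       // first-play urgency otherwise
  return child.total_value / static_cast<float>(child.visits);
}
```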

@Mardak (Contributor, Author) commented Feb 22, 2019

Yup, AZ's batching behavior is very similar to root FPU=1 for chess, because its eager batching happens to grab most or all of the ~35 children, whereas for Go the ~250 children probably aren't all visited.

@Mardak (Contributor, Author) commented Feb 22, 2019

I updated the PR after realizing we don't want to affect t40 with a new client that doesn't support the existing self-play behavior.

Edited into the original comment: Added fpu-strategy=root-absolute, which uses absolute FPU at the root and reduction otherwise. Self-play keeps the previous default fpu-strategy=reduction and still sets fpu-reduction=0, so it's unaffected by the code simplification of not checking noise (though this prevents using reduction with noise anywhere except the root -- do we want to keep that option?). To turn on the new behavior for training, the server would set fpu-strategy=root-absolute (with fpu-value=1 fpu-reduction=0 still set on the client).

@Mardak (Contributor, Author) commented Feb 22, 2019

Ah, I guess we probably also want to support FPU=-1 for self-play, similar to t40, except at the root. So many configurations… ;) I guess the cleanest approach is to make a separate param that just does root FPU=1 while allowing absolute and reduction as they are now, keeping the existing noise checks, etc.

@remdu commented Feb 22, 2019

> (with fpu-value=1 fpu-reduction=0 still set on the client).

If we have fpu-value=1 at the root, why not activate fpu-reduction with optimal tuning for the non-root nodes?

@ghost commented Feb 22, 2019

Does this change gain significant Elo over the current search?

@Mardak (Contributor, Author) commented Feb 25, 2019

@mooskagh Updated the approach to add an --early-root-widening flag, which changes FPU behavior at the root only for search (leaving verbose output, temperature, and other GetFPU callers unaffected).

> Does this change gain significant Elo over the current search?

I ran a selfplay tournament with 50105, using mostly training run 2 params plus an additional --player2.no-early-root-widening, and it did gain some Elo but not significantly (?? but it is at nearly 100% likelihood of superiority).

tournamentstatus P1: +301 -261 =266 Win: 52.42% Elo: 16.80 LOS: 95.42% P1-W: +162 -118 =135 P1-B: +139 -143 =131

The training data is likely a bit improved, and here are some interesting games where the new flag allowed player1 to find very-low-prior moves that player2 didn't expect, resulting in quick resignations. The training data would likely push these priors towards 80% instead of the near-0% they had been. Yes, Dirichlet noise helps find these too, so the change here just makes it a bit more consistent.

As usual with training data, the games below are just some that end quickly with wild swings, but the outcome isn't the only interesting result. Even for already-won or already-lost positions, the network learns from MCTS providing better search probabilities to generalize from.


White thought it was winning (so black thought it was losing) and didn't expect black to move the rook for a discovered check that wins a free rook. White resigns without another move.
https://lichess.org/irjPpfMD#60

position startpos moves g1f3 e7e6 d2d4 g8f6 c2c4 b7b6 g2g3 c8b7 f1g2 c7c5 d4d5 e6d5 f3h4 g7g6 b1c3 f8g7 e1g1 e8g8 c1g5 b8a6 c3d5 b7d5 g2d5 a8b8 d1d2 a6b4 d5f3 b6b5 g5f4 b5c4 f4b8 d8b8 a2a3 b4a6 a1c1 d7d5 f3d5 f8d8 e2e4 f6d5 e4d5 a6c7 c1c4 d8d5 d2c2 c7e6 b2b4 e6d4 c2d1 c5b4 a3b4 b8b5 c4c8 g7f8 f1e1 g8g7 g1g2 b5b7 c8c4

30… d5c5 (P:  0.93%) (Q: -0.27458)

[game screenshot]


White thought it was slightly behind and definitely didn't expect a 0.02% prior rook sacrifice to win a queen. White resigns without another move.
https://lichess.org/4aD347R1#52

position startpos moves g1f3 g8f6 d2d4 e7e6 b2b3 f8e7 e2e3 d7d5 c1b2 c7c5 d4c5 e7c5 a2a3 e8g8 b1d2 a7a6 c2c4 c5e7 c4d5 f6d5 f1e2 b8c6 e1g1 e7f6 b2f6 d8f6 d1c1 d5c3 e2d3 f8d8 c1c2 h7h6 b3b4 c3d5 d3h7 g8h8 h7e4 c8d7 d2b3 c6e5 b3c5 d7b5 f1c1 a8c8 f3e5 f6e5 e4d5 d8d5 a3a4 b5c6 c2c3

26… d5d1 (P:  0.02%) (Q:  0.10799)

[game screenshot]


Both sides thought the game was drawn in this equal-material QRPP endgame, but white, searching wide, finds the lowest-prior (0.01%) winning move, and black resigns a couple of moves later.
https://lichess.org/NcYepjgb#99

position startpos moves g1f3 e7e6 d2d4 f8e7 e2e3 g8f6 c2c4 e8g8 f1d3 d7d5 e1g1 c7c5 c4d5 e6d5 d4c5 e7c5 b1c3 b8c6 a2a3 a7a5 b2b3 d5d4 c3e4 f6e4 d3e4 d4e3 c1e3 c5e3 f2e3 d8e7 d1d3 h7h6 f3d4 c8d7 e4d5 c6d4 e3d4 d7c6 d5c6 b7c6 f1c1 e7f6 c1c4 f8e8 a1f1 f6d6 d3f3 d6a3 f3f7 g8h8 c4a4 a3d6 h2h3 e8f8 f7c4 f8f1 c4f1 a8f8 f1c4 d6g3 a4a1 f8f2 c4c6 g3e3 c6d5 f2f8 g1h1 e3d3 h1h2 d3d2 d5e5 f8f2 e5e8 h8h7 e8e4 h7g8 b3b4 d2b4 e4d5 g8h7 a1a5 b4b8 h2g1 b8f4 a5a1 f2d2 g1h1 d2d4 d5b3 d4b4 b3c2 h7g8 a1d1 b4b8 d1a1 f4d4 a1d1 d4b2

50 d1d8 (P:  0.01%) (Q:  0.01082)

[game screenshot]


White initially thought it was losing before searching wide to find this 0.09%-prior rook sacrifice that wins a queen. Black resigns without another move.
https://lichess.org/XP1bn6Pf#55

position startpos moves e2e4 e7e5 b1c3 g8f6 g1h3 f8c5 f1c4 e8g8 e1g1 c7c6 d2d3 b7b5 c4b3 a7a5 a2a3 d7d6 h3g5 c5b6 c3e2 h7h6 g5f3 f8e8 c2c4 b8a6 c4b5 c6b5 c1e3 b6e3 f2e3 a8a7 a1c1 d8b6 d3d4 f6e4 f3d2 e4d2 d1d2 b5b4 b3d5 c8e6 d5e6 f7e6 d2c2 a7c7 c2g6 e8e7 c1e1 b6c6 d4e5 d6e5 e1d1 c7c8 e2g3 c6c2

28 d1d8 (P:  0.09%) (Q: -0.11440)

[game screenshot]

@Mardak changed the title from "Use FPU=1 at root for self-play and matches to explore wider initially like AlphaZero" to "Add EarlyRootWidening to explore wider initially like AlphaZero" on Feb 25, 2019
@Mardak (Contributor, Author) commented Feb 25, 2019

To be clear, in the positions from the previous comment, the resigning side played a blunder because it didn't expect the tactic. The idea is that training improves the priors for the tactical move, so the network learns to avoid putting itself in a bad situation.

The overall behavior is that search goes deep along the lines the network already knows best, and wide at the root to find the network's blind spots. The root is special because that's what is used to train future move probabilities.

@Mardak (Contributor, Author) commented Feb 25, 2019

One potential open question is how early/urgent visiting children should be with this change. Right now I've set it to 1.0f, but U for already-visited winning positions can result in Q+U greater than that of very-low-prior unvisited moves. Making the value 2.0f would make a visit more likely but still not guaranteed; e.g., the stalemate capture-and-promote-to-queen position has Q+U of almost 4:

e7f8q (P: 99.90%) (Q:  0.99146) (D:  0.000) (U: 2.99718) (Q+U:  3.98864)
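
To make the comparison concrete with the numbers above (the 2.0 term is the hypothetical FPU-at-root value being discussed):

```latex
Q + U = 0.99146 + 2.99718 = 3.98864 \;>\; 2.0 + \varepsilon \approx \text{score of an unvisited, near-zero-prior sibling}
```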

@remdu commented Feb 25, 2019

The only thing I don't like about this is that it doesn't really transfer to other games like Go. In Go, exploring every move at the root would be bad; a balance would need to be found, probably some kind of FPU = running parent value. But for chess I don't see any issues.

@Mardak (Contributor, Author) commented Feb 26, 2019

Indeed, visiting everything isn't the Go behavior we noticed in #748. It's just that, as a simplification, AZ's behavior of visiting 26 of the 33 root children for that position is pretty close to visiting all of them, without trying to emulate the TPU behavior more precisely. I suppose a closer approximation would be to always allow some number of root children to be visited, and that could then be reused more directly for Go… but then it's unclear whether this behavior is even desired for Go.

@jkormu commented Feb 27, 2019

This PR makes the search wider at the root, which could interact badly with endgame temp, since any move with visits could be picked by temp. On the other hand, searching wider at the root should help find good moves with low prior.
To test which one matters more, I ran a match with training-like settings and endgame temp 0.45 between lc0 master and lc0 pr750 (EDIT: 800 nodes).
The shared settings:

arg="--backend=cudnn-fp16" arg="--weights=$PWD/weights_run2_32930.pb.gz" arg='--smart-pruning-factor=0.0' arg='--threads=1' arg='--minibatch-size=32' arg='--cpuct=2.5' arg='--cpuct-factor=0.00' arg='--fpu-reduction=0.00' arg='--fpu-value=-1.00' arg='--policy-softmax-temp=1.00' arg='--max-collision-events=1' arg='--max-collision-visits=1' arg="--temperature=1.1" arg="--temp-endgame=0.45" arg="--temp-cutoff-move=16" arg="--temp-visit-offset=-0.25" arg="--noise"

In addition to these settings, master used arg='--fpu-strategy=absolute' and lc0_pr750 used arg='--fpu-strategy=root-absolute'.
Results with a 2-ply repeated opening book show the versions being pretty much equal, or pr750 marginally worse:

   # PLAYER                 :  RATING  ERROR  POINTS  PLAYED   (%)  CFS(%)    W     D    L  D(%)
   1 lc0_master_20190221    :       0      4  1707.5    3372  50.6      84  966  1483  923  44.0
   2 lc0_pr750_20190224     :      -4      4  1664.5    3372  49.4     ---  923  1483  966  44.0

Then, to test whether this is caused by endgame temp, I ran a match with --temp-endgame=0.0 for both engines and the rest of the settings as above:

   # PLAYER                 :  RATING  ERROR  POINTS  PLAYED   (%)  CFS(%)    W     D    L  D(%)
   1 lc0_pr750_20190224     :       9      5  1280.5    2500  51.2      97  614  1333  553  53.3
   2 lc0_master_20190221    :       0      5  1219.5    2500  48.8     ---  553  1333  614  53.3

It indeed seems that pr750 might not play that well together with endgame temp, but the effect is not big.

@Mardak (Contributor, Author) commented Feb 27, 2019

Thanks for running the numbers with 32930. Just making sure: in the latter numbers without endgame temperature, i.e. more match-like settings over 2500 games, a very mature network is able to gain 9 Elo by visiting root children earlier?

There are many ways to adjust temperature depending on how the first set of games was lost. If they were lost purely from 1-visit moves getting picked, then a lower temperature or a more negative visit offset could be used. If they were lost from playing down bad lines because the network overvalues positions that would have been hidden by search not visiting low priors, then it's actually a beneficial outcome for the network to lessen the value of those positions for future networks.

@jkormu commented Feb 27, 2019

Yes, 9±4 Elo, but the setting is still training-like, just with zero endgame temp. Also, my next question was whether the temp effect could be counteracted with visit offset -1, since presumably most of the root blunders get no more than one visit. I have a match running with offset -0.999 (for some reason -1 is not possible).
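
If I understand the visit-offset mechanics correctly (my assumption, not verified against the code), temperature sampling shifts each child's visit count by the offset and clamps at zero before applying the exponent:

```latex
\pi(a) \;\propto\; \max\bigl(0,\; N(a) + \mathrm{offset}\bigr)^{1/T}
```

so with offset = -0.999 and T = 0.45, a one-visit root move gets weight 0.001^{1/0.45} \approx 2 \times 10^{-7}, i.e. it is essentially never picked.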

@oscardssmith (Contributor)

@jkormu, are you sure you aren't testing one version against itself? 9 Elo seems way too small.

@jkormu commented Feb 28, 2019

@oscardssmith Double-checked, and it seems to be master vs pr750. What kind of Elo gains are you expecting with this PR in a training-like setting? Please note that in the above tests master and pr750 had identical settings (including all noise settings), the only difference being --fpu-strategy. The test did not state that dropping the endgame temp from 0.45 to 0.0 gains 9 Elo: in the first match both used 0.45 endgame temp and in the latter both used 0.0.

@oscardssmith (Contributor)

It's not so much that I have specific expectations; I just get suspicious when I see multiple tests within 10 Elo, as I've been burned by this before.

@jkormu commented Feb 28, 2019

Results with --temp-visit-offset=-0.999 and --temp-endgame=0.45 show that an offset of (nearly) -1 can be used to counteract the negative effects of wider root search together with temp:

   # PLAYER                 :  RATING  ERROR  POINTS  PLAYED   (%)  CFS(%)     W     D     L  D(%)
   1 lc0_pr750_20190224     :      10      3  2465.5    4800  51.4     100  1360  2211  1229  46.1
   2 lc0_master_20190221    :       0      3  2334.5    4800  48.6     ---  1229  2211  1360  46.1

The test was with 800 nodes and a repeated 2-move book; both engines shared the following settings:
arg="--backend=cudnn-fp16" arg="--weights=$PWD/weights_run2_32930.pb.gz" arg='--smart-pruning-factor=0.0' arg='--threads=1' arg='--minibatch-size=32' arg='--cpuct=2.5' arg='--cpuct-factor=0.00' arg='--fpu-reduction=0.00' arg='--fpu-value=-1.00' arg='--policy-softmax-temp=1.00' arg='--max-collision-events=1' arg='--max-collision-visits=1' arg="--temperature=1.1" arg="--temp-endgame=0.45" arg="--temp-cutoff-move=16" arg="--temp-visit-offset=-0.999" arg="--noise"

The only differing setting was --fpu-strategy: pr750 used root-absolute and master used absolute.

@Mardak (Contributor, Author) commented Mar 1, 2019

Here's another 95+ LOS/CFS result from a more recent t50 network (50180), but this time with accelerated selfplay settings (lower visits, higher resign, temperature cutoff; all games were different):

tournamentstatus P1: +281 -238 =369 Win: 52.42% Elo: 16.84 LOS: 97.05% P1-W: +153 -115 =175 P1-B: +128 -123 =194

--player2.no-early-root-widening --fpu-strategy=absolute --visits=1000 --cpuct=2.5 --resign-percentage=20 --temperature=1.2 --temp-endgame=0.45 --temp-cutoff-move=16 --temp-visit-offset=-.25 --temp-value-cutoff=1 --minimum-kldgain-per-node=0.000008 --resign-wdlstyle=true --no-share-trees

@Mardak (Contributor, Author) commented Mar 8, 2019

@mooskagh Why is this marked "not for merge"? Are there other plans to implement a more AlphaZero-like search, which happens to improve training of tactics? It seems possible that lc0 has tried to work around its tactical deficiency by additionally diverging from AlphaZero with a 2.2 policy softmax temperature. So is there an explicit decision not to behave more like AlphaZero?

@mooskagh removed the "not for merge" label (Experimental code which is not intended to be merged into master) on Mar 8, 2019
@mooskagh added the "enhancement" label (New feature or request) on Mar 8, 2019
@mooskagh (Member) commented Mar 8, 2019

It didn't have that behaviour switchable off via params; that's why I decided it was just for testing.

@MelleKoning mentioned this pull request on Mar 8, 2019
@Mardak changed the title from "Add EarlyRootWidening to explore wider initially like AlphaZero" to "Add AtRoot versions of FpuStrategy and FpuValue (and convert FpuReduction to FpuValue)." on Mar 8, 2019
@Mardak (Contributor, Author) commented Mar 8, 2019

@mooskagh Updated the PR with your suggestion of FpuStrategyAtRoot and FpuValueAtRoot, and at the same time converted FpuReduction to FpuValue. I made "same" ignore FpuValueAtRoot and documented the help text accordingly, to avoid the confusion of "same" possibly meaning "if the strategy is reduction, reduce the eval by --fpu-value for most nodes but reduce root children by --fpu-value-at-root instead."

This maintains the default selfplay behavior of using FPU reduction 0 and, for plain defaults, keeps the default FPU reduction of 1.2 while adding FPU root absolute 1. This does change the behavior of specifying only --fpu-strategy=absolute: --fpu-value previously defaulted to -1 and now defaults to 1.2 (for the default reduction strategy), so both active training runs should pass an explicit --fpu-value=-1 to keep FPU absolute -1.

@Tilps If the server wants to try training with fpu reduction of 0.5 while leaving root children unchanged, it should set --fpu-value=0.5 --fpu-strategy-at-root=reduction --fpu-value-at-root=0
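
Plugging those flags into the earlier sketch (illustrative names again): non-root unvisited children would get FPU ≈ parent_q - 0.5, while root children would get FPU ≈ parent_q - 0 = parent_q, i.e. the root keeps an unreduced first-play urgency while the rest of the tree is reduced by 0.5.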

@Mardak (Contributor, Author) commented Mar 8, 2019

@Tilps The server should probably set --fpu-value=-1 in addition to the existing --fpu-strategy=absolute before many people use a version with this.

@Tilps (Contributor) commented Mar 9, 2019

--fpu-value=-1.0 added to training parameters.

@mooskagh merged commit 45e353b into LeelaChessZero:master on Mar 9, 2019
@ASilver commented Mar 9, 2019

@Mardak So are you saying that in your preliminary tests, FPUR at 0.5 in training conditions actually increases exploration and strength?

Labels: enhancement (New feature or request)
Linked issues: AlphaZero searches much wider than lc0 at low visits · Improve training data for learning tactics