Add option for doing kldgain thresholding rather than absolute visit limiting. #721

Merged
merged 5 commits into from
Feb 10, 2019

@Tilps Tilps commented Feb 8, 2019

Primarily for issue #684, but it could also be interesting to use a very small value in real games to 'give up' on search and save time for other moves where search is more useful, in cases where there are equal options and the existing pruning logic therefore can't fire.

Looking for candidate values for training, I tested an averaging interval of 100 and a threshold of 5e-6. With a current T40 net the average is about 900 nodes, versus about 1600 for a very early net (although I only gathered a very small amount of data to begin with). That probably makes sense: as nets mature, more of the time they already know the right move, so search converges quickly in agreement with the existing policy. But there are still outliers that get a lot more nodes.

I have not investigated whether the threshold is actually 'good', and I suspect it will be hard to measure usefully without trying it in RL.
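
For readers who want the stopping rule spelled out, here is a minimal Python sketch of the idea described above: every 100 nodes, compare the root's child visit distribution with the previous snapshot using KL(old||new), divide by the number of nodes added, and stop once that falls below the threshold. The function names and the list-of-counts representation are illustrative assumptions; this is not the code in src/mcts/search.cc.

```python
import math

def kld_gain_per_node(old_visits, new_visits):
    """Average KL(old || new) per node added between two snapshots.

    old_visits / new_visits are per-edge visit counts at the root
    (hypothetical representation, chosen for this sketch).
    """
    old_total = sum(old_visits)
    new_total = sum(new_visits)
    added = new_total - old_total
    if old_total == 0 or added <= 0:
        return math.inf  # nothing to compare yet, so never stop here
    kld = 0.0
    for old_n, new_n in zip(old_visits, new_visits):
        if old_n == 0:
            continue  # a zero-probability term contributes nothing to KL(old || new)
        p = old_n / old_total
        q = new_n / new_total  # new_n >= old_n > 0 because visits only grow
        kld += p * math.log(p / q)
    return kld / added

def should_stop(old_visits, new_visits, threshold=5e-6):
    """True when the last interval changed the distribution too little."""
    return kld_gain_per_node(old_visits, new_visits) < threshold

# Distribution unchanged over the interval: gain is zero, so search would stop.
print(should_stop([60, 40], [120, 80]))
# Distribution shifted strongly towards the second move: keep searching.
print(should_stop([60, 40], [60, 140]))
```

A per-node average is used so that the threshold does not depend on how many nodes happen to fall in one interval.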

remdu commented Feb 8, 2019

Might be better than current early stop. But using 1600 visits for early nets seems like a waste of computation to me.

Tilps commented Feb 8, 2019

I ran a quick test of game strength under training conditions (with resign disabled):
lc0.exe selfplay --player1.visits=10000 --player2.visits=1000 --player1.minimum-kldgain-per-node=0.000005 --cpuct=2.5 --temperature=1.1 --temp-endgame=0.45 --temp-cutoff-move=16 --temp-visit-offset=-0.25 --fpu-strategy=absolute --no-share-trees --backend-opts=cudnn-fp16

Net used was 40740.

Player 1 averaged 766 visits over the last 2500 moves played, compared to 996 for player 2 (below 1k because of early outs in positions with one legal move). The maximum for player 1 was 5300 visits; the minimum (excluding forced moves) was 200.

The tournament result was:
tournamentstatus win 111 71 lose 17 35 draw 62 81 (or about +120 elo for player1)
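
As a sanity check on the Elo figure, here is a small Python sketch using the standard logistic Elo formula. It assumes the two numbers after each of win, lose and draw in the tournamentstatus line are player 1's results with each colour and can simply be summed; the function name is illustrative.

```python
import math

def elo_diff(wins, losses, draws):
    """Elo difference implied by a match score, using the logistic model."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    return 400 * math.log10(score / (1 - score))

# Totals taken from the tournamentstatus line (assumed to be per-colour counts).
wins = 111 + 71
losses = 17 + 35
draws = 62 + 81

print(f"{elo_diff(wins, losses, draws):+.0f} Elo for player1")
```

This lands in the same range as the rough estimate quoted above.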

Tilps commented Feb 9, 2019

To capture some discussion points from dev on Discord:
My formula is KL(old||new), not KL(new||old). I tested both and didn't see much difference (apart from an implementation difference that made KL(new||old) infinite whenever an edge received its first visit), but reversing it might be fine, so long as we keep the current if statement to avoid infinities.

An alternative approach that seems 'more correct' would be to compute KL divergence or cross entropy between the current distribution and the policy prior of the root node. But that would require capturing the policy prior before noise is applied, and such an implementation might be less sensitive to flux when the policy is oscillating. That is, the average information gain per node 'relative to the start point' can be decreasing while the average information gain over the last 100 nodes is still high, if search is moving back towards the original policy. And if you subtract two cross entropies relative to the original policy to try to avoid that, you probably end up fairly close to the KLD of the two samples that we compute now, while still needing to solve the problem of sourcing the original policy.
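
A toy Python illustration of the oscillation concern, with made-up two-move distributions. The direction KL(current||prior) is an assumption, since the comment leaves it open. When search drifts away from the prior and then back, the divergence from the prior shrinks to zero even though the distribution changed a lot within the window.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; zero-probability terms of p are skipped."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical numbers, chosen only to illustrate the point.
prior = [0.5, 0.5]     # root policy prior before noise
earlier = [0.7, 0.3]   # visit distribution one interval ago
current = [0.5, 0.5]   # visit distribution now, back at the prior

drift_earlier = kl(earlier, prior)   # distance from the start point, one interval ago
drift_current = kl(current, prior)   # distance from the start point, now
window_gain = kl(earlier, current)   # KL(old || new) over the last interval

print(f"relative to prior: {drift_earlier:.4f} -> {drift_current:.4f}")
print(f"over the window:   {window_gain:.4f}")
```

A prior-anchored measure would report no remaining gain here, while the windowed KL(old||new) still sees a large change.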

Tilps commented Feb 9, 2019

Based on results I think this is probably a decent candidate for a reinforcement learning test run as part of T50.

@Tilps Tilps requested a review from mooskagh February 9, 2019 00:10
ghost commented Feb 9, 2019

KL(new||old) seems preferable to KL(old||new) on philosophical grounds. If it doesn't make a difference, then I think for clarity's sake we should prefer the first. (Of course, if testing shows the latter is noticeably stronger, then there should be no objection to using it.)

@Tilps Tilps changed the title from "Add option for doing kdgain thresholding rather than absolute visit limiting." to "Add option for doing kldgain thresholding rather than absolute visit limiting." on Feb 9, 2019
Tilps commented Feb 9, 2019

KL(new||old), using the current if statement to avoid infinities, sometimes outputs a negative value, which is just as problematic as infinity. So I think I'll have to stick with the current logic, unless we want to never terminate when an edge has received its first visit in the last 100 nodes.
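
A small Python sketch of why the direction matters, assuming the if statement skips any edge that had zero visits in the old snapshot (the helper name and the counts are made up). For KL(old||new) the skipped terms are exactly zero, so nothing is lost. For KL(new||old) the skipped term is the one that would have been infinite, and the remaining partial sum can go negative.

```python
import math

def kl_skipping_unvisited(p_counts, q_counts, old_counts):
    """Sum p*log(p/q), skipping edges with no visits in the old snapshot."""
    p_total, q_total = sum(p_counts), sum(q_counts)
    total = 0.0
    for p_n, q_n, old_n in zip(p_counts, q_counts, old_counts):
        if old_n == 0:
            continue  # the guard that avoids log(0) and division by zero
        p, q = p_n / p_total, q_n / q_total
        total += p * math.log(p / q)
    return total

# Hypothetical snapshots: the second edge gets its first visits in this interval.
old = [100, 0]
new = [150, 50]

old_new = kl_skipping_unvisited(old, new, old)  # KL(old || new), exact
new_old = kl_skipping_unvisited(new, old, old)  # KL(new || old), truncated

print(f"KL(old||new) = {old_new:.4f}")
print(f"KL(new||old), truncated = {new_old:.4f}")
```

With these counts the first value is positive and the second is negative, which matches the behaviour described above.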

Tilps commented Feb 10, 2019

Also fixed some more cases where it still said kdgain instead of kldgain.

@Tilps Tilps merged commit 7e9190a into LeelaChessZero:master Feb 10, 2019
Mardak commented Feb 10, 2019

I was checking the behavior with positions from #8 to see how kld gain changes. For reference, here's the behavior at 10k visits with -w 40740 --policy-softmax-temp=1 --minibatch-size=1 --smart-pruning-factor=0 --cpuct=2.5 --fpu-strategy=absolute:

40740          sctr info string a4h4  (666 ) N:       0 (+ 0) (P:  0.28%) (Q: -1.00000) (U: 0.93832) (Q+U: -0.06168) (V:  -.----) 
40740          wasp info string e6e3  (560 ) N:       0 (+ 0) (P:  0.26%) (Q: -1.00000) (U: 0.87207) (Q+U: -0.12793) (V:  -.----) 
40740       exchess info string d5f6  (751 ) N:    9706 (+ 0) (P:  6.35%) (Q:  0.33459) (U: 0.00218) (Q+U:  0.33677) (V:  0.1223) 
40740 hakkapeliitta info string f3d1  (1321) N:    8371 (+ 1) (P: 39.26%) (Q: -0.14551) (U: 0.01558) (Q+U: -0.12993) (V: -0.3738) 
40740           ice info string f3c6  (592 ) N:    9606 (+ 1) (P:  2.13%) (Q:  0.52777) (U: 0.00074) (Q+U:  0.52851) (V:  0.7914) 
40740        bobcat info string e3h6  (561 ) N:    7740 (+ 1) (P: 13.60%) (Q:  0.37149) (U: 0.00584) (Q+U:  0.37733) (V:  0.7038) 
40740       houdini info string e2e7  (330 ) N:    9479 (+ 1) (P:  1.18%) (Q:  0.94177) (U: 0.00041) (Q+U:  0.94219) (V:  0.2905) 
40740          naum info string h5d5  (882 ) N:    9711 (+ 1) (P:  1.55%) (Q:  0.09490) (U: 0.00053) (Q+U:  0.09543) (V: -0.5509) 
40740       scorpio info string e1e7  (120 ) N:    9227 (+ 1) (P:  2.93%) (Q:  0.30268) (U: 0.00106) (Q+U:  0.30374) (V:  0.3373) 
40740     protector info string d4f5  (763 ) N:       0 (+ 0) (P:  0.04%) (Q: -1.00000) (U: 0.13385) (Q+U: -0.86615) (V:  -.----) 
40740       vajolet info string f6g4  (590 ) N:     157 (+ 0) (P:  1.42%) (Q: -0.45042) (U: 0.02986) (Q+U: -0.42056) (V: -0.3061) 
40740         cheng info string h2f4  (1582) N:       0 (+ 0) (P:  0.04%) (Q: -1.00000) (U: 0.13889) (Q+U: -0.86111) (V:  -.----) 

Specifically for the "houdini" position, here's the kld gain for each 100 visits and move stats for the correct move:

position startpos moves d2d4 e7e6 c2c4 f8b4 c1d2 b4e7 e2e4 d7d5 e4e5 c7c5 d1g4 e7f8 d4c5 h7h5 g4g3 h5h4 g3a3 b8d7 g1f3 f8c5 b2b4 c5b6 d2g5 g8e7 a3b2 h8h5 c4d5 e6d5 f1b5 e8f8 e1g1 d7e5 b2e5 f7f6 e5f4 b6c7 f4e3 f6g5 b1c3 d8d6 b5d3 c7b6 e3e2 h4h3 f1e1 g5g4 f3e5 h5g5 e5g6 g5g6 d3g6 c8d7 g6h5 a8c8 a1c1 c8c4

Visits:  200 KLDGain: 0.00146723  e2e7  (330 ) N:       0 (+ 0) (P:  1.18%) (Q: -1.00000) (U: 0.42047) (Q+U: -0.57953) (V:  -.----)
Visits:  300 KLDGain: 0.000494106 e2e7  (330 ) N:       0 (+ 0) (P:  1.18%) (Q: -1.00000) (U: 0.51733) (Q+U: -0.48267) (V:  -.----)
Visits:  400 KLDGain: 0.000387563 e2e7  (330 ) N:       0 (+ 0) (P:  1.18%) (Q: -1.00000) (U: 0.59983) (Q+U: -0.40017) (V:  -.----)
Visits:  500 KLDGain: 0.000277188 e2e7  (330 ) N:       0 (+ 0) (P:  1.18%) (Q: -1.00000) (U: 0.67327) (Q+U: -0.32673) (V:  -.----)
Visits:  600 KLDGain: 0.00161057  e2e7  (330 ) N:      89 (+ 1) (P:  1.18%) (Q:  0.85793) (U: 0.00814) (Q+U:  0.86607) (V:  0.2905)
Visits:  700 KLDGain: 0.000424912 e2e7  (330 ) N:     189 (+ 1) (P:  1.18%) (Q:  0.90037) (U: 0.00420) (Q+U:  0.90457) (V:  0.2905)
Visits:  800 KLDGain: 0.000188827 e2e7  (330 ) N:     289 (+ 1) (P:  1.18%) (Q:  0.90552) (U: 0.00296) (Q+U:  0.90848) (V:  0.2905)
Visits:  900 KLDGain: 0.000104413 e2e7  (330 ) N:     389 (+ 1) (P:  1.18%) (Q:  0.91752) (U: 0.00234) (Q+U:  0.91987) (V:  0.2905)
Visits: 1000 KLDGain: 6.47659e-05 e2e7  (330 ) N:     489 (+ 1) (P:  1.18%) (Q:  0.92030) (U: 0.00198) (Q+U:  0.92227) (V:  0.2905)
Visits: 1100 KLDGain: 4.32495e-05 e2e7  (330 ) N:     589 (+ 1) (P:  1.18%) (Q:  0.92291) (U: 0.00173) (Q+U:  0.92464) (V:  0.2905)
Visits: 1200 KLDGain: 3.04345e-05 e2e7  (330 ) N:     689 (+ 1) (P:  1.18%) (Q:  0.92888) (U: 0.00155) (Q+U:  0.93043) (V:  0.2905)
Visits: 1300 KLDGain: 2.228e-05   e2e7  (330 ) N:     789 (+ 1) (P:  1.18%) (Q:  0.93212) (U: 0.00141) (Q+U:  0.93353) (V:  0.2905)
Visits: 1400 KLDGain: 1.68252e-05 e2e7  (330 ) N:     889 (+ 1) (P:  1.18%) (Q:  0.93068) (U: 0.00131) (Q+U:  0.93199) (V:  0.2905)
Visits: 1500 KLDGain: 1.30298e-05 e2e7  (330 ) N:     989 (+ 1) (P:  1.18%) (Q:  0.93214) (U: 0.00122) (Q+U:  0.93336) (V:  0.2905)
Visits: 1600 KLDGain: 1.03034e-05 e2e7  (330 ) N:    1089 (+ 1) (P:  1.18%) (Q:  0.93369) (U: 0.00115) (Q+U:  0.93484) (V:  0.2905)
Visits: 1700 KLDGain: 8.2922e-06  e2e7  (330 ) N:    1189 (+ 1) (P:  1.18%) (Q:  0.93550) (U: 0.00109) (Q+U:  0.93659) (V:  0.2905)
Visits: 1800 KLDGain: 6.77497e-06 e2e7  (330 ) N:    1289 (+ 1) (P:  1.18%) (Q:  0.93501) (U: 0.00104) (Q+U:  0.93605) (V:  0.2905)
Visits: 1900 KLDGain: 5.60818e-06 e2e7  (330 ) N:    1389 (+ 1) (P:  1.18%) (Q:  0.93423) (U: 0.00099) (Q+U:  0.93522) (V:  0.2905)
Visits: 2000 KLDGain: 4.69577e-06 e2e7  (330 ) N:    1489 (+ 1) (P:  1.18%) (Q:  0.93514) (U: 0.00095) (Q+U:  0.93609) (V:  0.2905)
Visits: 2100 KLDGain: 3.97177e-06 e2e7  (330 ) N:    1589 (+ 1) (P:  1.18%) (Q:  0.93523) (U: 0.00092) (Q+U:  0.93615) (V:  0.2905)
Visits: 2200 KLDGain: 3.38978e-06 e2e7  (330 ) N:    1689 (+ 1) (P:  1.18%) (Q:  0.93502) (U: 0.00089) (Q+U:  0.93591) (V:  0.2905)
Visits: 2300 KLDGain: 2.91652e-06 e2e7  (330 ) N:    1789 (+ 1) (P:  1.18%) (Q:  0.93284) (U: 0.00086) (Q+U:  0.93370) (V:  0.2905)
Visits: 2400 KLDGain: 2.52765e-06 e2e7  (330 ) N:    1889 (+ 1) (P:  1.18%) (Q:  0.93440) (U: 0.00083) (Q+U:  0.93523) (V:  0.2905)
Visits: 2500 KLDGain: 2.20513e-06 e2e7  (330 ) N:    1989 (+ 1) (P:  1.18%) (Q:  0.93451) (U: 0.00081) (Q+U:  0.93532) (V:  0.2905)

So with the suggested minimum-kldgain-per-node=0.000005, self-play would stop at 2000 visits and the training target for the tactic would be 74.5% of visits, whereas normally, with 800 visits, it would be only 36.1%.
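
The stopping point and visit shares can be re-derived from the log above with a short Python sketch. The rows are copied from the "houdini" output (visits, KLDGain, N for e2e7). Using total visits as the denominator for the share is an assumption about how the percentages were computed.

```python
THRESHOLD = 5e-6  # the suggested minimum-kldgain-per-node

# (visits, kld gain, N for e2e7), copied from the log above.
rows = [
    (200, 0.00146723, 0), (300, 0.000494106, 0), (400, 0.000387563, 0),
    (500, 0.000277188, 0), (600, 0.00161057, 89), (700, 0.000424912, 189),
    (800, 0.000188827, 289), (900, 0.000104413, 389), (1000, 6.47659e-05, 489),
    (1100, 4.32495e-05, 589), (1200, 3.04345e-05, 689), (1300, 2.228e-05, 789),
    (1400, 1.68252e-05, 889), (1500, 1.30298e-05, 989), (1600, 1.03034e-05, 1089),
    (1700, 8.2922e-06, 1189), (1800, 6.77497e-06, 1289), (1900, 5.60818e-06, 1389),
    (2000, 4.69577e-06, 1489), (2100, 3.97177e-06, 1589), (2200, 3.38978e-06, 1689),
    (2300, 2.91652e-06, 1789), (2400, 2.52765e-06, 1889), (2500, 2.20513e-06, 1989),
]

def first_stop(samples, threshold):
    """Return (visits, move_visits) at the first sample below the threshold."""
    for visits, gain, move_visits in samples:
        if gain < threshold:
            return visits, move_visits
    return None

stop_visits, stop_n = first_stop(rows, THRESHOLD)
share_at_stop = 100 * stop_n / stop_visits

# Share for a fixed 800-visit search, looked up from the same log.
n_at_800 = next(n for visits, _, n in rows if visits == 800)
share_at_800 = 100 * n_at_800 / 800

print(f"stops at {stop_visits} visits, share {share_at_stop:.2f}%")
print(f"fixed 800 visits, share {share_at_800:.2f}%")
```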

A side note about the set of tactical positions above: T40 so far seems to have learned that some tactics should sometimes be played, e.g. "houdini". For other positions, e.g. "sctr", it has extremely low priors (below 0.3%), meaning it thinks the move should basically never be played. Those positions need noise to bump up the prior before search can find the correct move. Hopefully, with the changes here, a lucky noise draw on such a position leads to good training data that moves it from the "never" to the "sometimes" level.

Here's the same per-100-visit analysis for "sctr", except with a favorable-enough --noise:

position startpos moves d2d4 d7d5 c1f4 g7g6 e2e3 g8f6 c2c4 c7c5 d4c5 f8g7 b1c3 d8a5 c4d5 f6d5 d1d5 g7c3 b2c3 a5c3 e1e2 c3a1 f4e5 a1b1 e5h8 c8e6 d5d3 b1a2 e2f3 f7f6 h8g7 b8d7 f3g3 a8c8 c5c6 c8c6 d3d4 c6d6 d4b4 d6b6 b4h4 d7c5 h2h3 b6b2 g1e2 a2d5 g3h2 d5e5 e2g3 h7h5 h4d4 e5d4 e3d4 c5b3 g7h6 h5h4 g3e4 g6g5 f1d3 b3d4 h1a1 a7a6 e4c5 b2f2 d3e4 e6f5 e4b7 f2c2 a1a4 d4e2 c5e4 f5e4 b7e4 c2c1 e4d3 e2f4 d3a6 f4h5

Visits:  200 KLDGain: 0.0056011   a4h4  (666 ) N:      82 (+ 0) (P:  3.41%) (Q:  0.76386) (U: 0.01462) (Q+U:  0.77848) (V:  0.4937)
Visits:  300 KLDGain: 0.000786082 a4h4  (666 ) N:     182 (+ 2) (P:  3.41%) (Q:  0.74512) (U: 0.00807) (Q+U:  0.75319) (V:  0.4937)
Visits:  400 KLDGain: 0.000219698 a4h4  (666 ) N:     282 (+ 1) (P:  3.41%) (Q:  0.72188) (U: 0.00610) (Q+U:  0.72798) (V:  0.4937)
Visits:  500 KLDGain: 9.13135e-05 a4h4  (666 ) N:     382 (+ 1) (P:  3.41%) (Q:  0.75494) (U: 0.00506) (Q+U:  0.76000) (V:  0.4937)
Visits:  600 KLDGain: 4.65154e-05 a4h4  (666 ) N:     482 (+ 1) (P:  3.41%) (Q:  0.73736) (U: 0.00441) (Q+U:  0.74177) (V:  0.4937)
Visits:  700 KLDGain: 5.45679e-05 a4h4  (666 ) N:     580 (+14) (P:  3.41%) (Q:  0.69817) (U: 0.00389) (Q+U:  0.70207) (V:  0.4937)
Visits:  800 KLDGain: 4.16925e-05 a4h4  (666 ) N:     678 (+ 1) (P:  3.41%) (Q:  0.68492) (U: 0.00366) (Q+U:  0.68857) (V:  0.4937)
Visits:  900 KLDGain: 2.22102e-05 a4h4  (666 ) N:     776 (+ 1) (P:  3.41%) (Q:  0.68404) (U: 0.00340) (Q+U:  0.68744) (V:  0.4937)
Visits: 1000 KLDGain: 9.26349e-06 a4h4  (666 ) N:     875 (+ 2) (P:  3.41%) (Q:  0.70031) (U: 0.00319) (Q+U:  0.70350) (V:  0.4937)
Visits: 1100 KLDGain: 6.19508e-06 a4h4  (666 ) N:     975 (+ 1) (P:  3.41%) (Q:  0.70712) (U: 0.00302) (Q+U:  0.71013) (V:  0.4937)
Visits: 1200 KLDGain: 2.11735e-05 a4h4  (666 ) N:    1073 (+ 0) (P:  3.41%) (Q:  0.71621) (U: 0.00288) (Q+U:  0.71909) (V:  0.4937)
Visits: 1300 KLDGain: 3.56555e-06 a4h4  (666 ) N:    1172 (+ 1) (P:  3.41%) (Q:  0.72742) (U: 0.00275) (Q+U:  0.73017) (V:  0.4937)
Visits: 1400 KLDGain: 4.60444e-06 a4h4  (666 ) N:    1270 (+ 0) (P:  3.41%) (Q:  0.73108) (U: 0.00264) (Q+U:  0.73372) (V:  0.4937)
Visits: 1500 KLDGain: 2.30266e-06 a4h4  (666 ) N:    1369 (+ 1) (P:  3.41%) (Q:  0.73558) (U: 0.00255) (Q+U:  0.73813) (V:  0.4937)
Visits: 1600 KLDGain: 1.93057e-06 a4h4  (666 ) N:    1469 (+ 1) (P:  3.41%) (Q:  0.73917) (U: 0.00246) (Q+U:  0.74163) (V:  0.4937)
Visits: 1700 KLDGain: 1.52269e-06 a4h4  (666 ) N:    1567 (+ 1) (P:  3.41%) (Q:  0.73915) (U: 0.00238) (Q+U:  0.74153) (V:  0.4937)
Visits: 1800 KLDGain: 1.83831e-06 a4h4  (666 ) N:    1665 (+ 0) (P:  3.41%) (Q:  0.73990) (U: 0.00232) (Q+U:  0.74222) (V:  0.4937)
Visits: 1900 KLDGain: 1.60121e-06 a4h4  (666 ) N:    1764 (+ 0) (P:  3.41%) (Q:  0.73988) (U: 0.00226) (Q+U:  0.74214) (V:  0.4937)
Visits: 2000 KLDGain: 2.1454e-06  a4h4  (666 ) N:    1862 (+ 1) (P:  3.41%) (Q:  0.73207) (U: 0.00220) (Q+U:  0.73427) (V:  0.4937)
Visits: 2100 KLDGain: 8.65633e-07 a4h4  (666 ) N:    1961 (+ 1) (P:  3.41%) (Q:  0.73374) (U: 0.00215) (Q+U:  0.73589) (V:  0.4937)
Visits: 2200 KLDGain: 7.65398e-07 a4h4  (666 ) N:    2060 (+ 1) (P:  3.41%) (Q:  0.73353) (U: 0.00210) (Q+U:  0.73562) (V:  0.4937)
Visits: 2300 KLDGain: 7.09194e-07 a4h4  (666 ) N:    2159 (+ 1) (P:  3.41%) (Q:  0.72318) (U: 0.00205) (Q+U:  0.72524) (V:  0.4937)
Visits: 2400 KLDGain: 1.99905e-06 a4h4  (666 ) N:    2257 (+ 1) (P:  3.41%) (Q:  0.72279) (U: 0.00201) (Q+U:  0.72480) (V:  0.4937)
Visits: 2500 KLDGain: 9.32765e-07 a4h4  (666 ) N:    2355 (+ 1) (P:  3.41%) (Q:  0.72424) (U: 0.00198) (Q+U:  0.72622) (V:  0.4937)

And indeed, noise bumping the prior from P: 0.28% to P: 3.41% finds the correct move even with just 200 visits. (In fact, a single visit to this move shows the network knows it's a winning position, but search avoids it because of the extremely low prior, similar to being blind to a mate-in-1.) The kld gain is not monotonically decreasing here, with a slight bump at 1200 visits, but search would stop at 1300 visits with 90.2% training data.

So conditionally allowing more visits does seem to lead to higher visit percentage of the single "correct" move in these tactical positions.
