Add option for doing kldgain thresholding rather than absolute visit limiting. #721
Might be better than the current early stop. But using 1600 visits for early nets seems like a waste of computation to me.
I ran a quick test of training-condition game strength (with resign disabled). The net used was 40740. Player 1's average visits over the last 2500 moves played was 766, compared to 996 for player 2 (below 1k due to early outs in one-legal-move positions). Maximum player 1 visits was 5300, minimum (excluding forced moves) 200. Tournament results were:
To capture some discussion points from the dev channel on Discord: an alternative approach which seems 'more correct' would be to compute the KL or cross entropy of the current distribution against the policy prior of the root node. But that would require capturing the policy prior before noise is applied, and such an implementation might be less sensitive to flux when the policy is oscillating: the average information gain per node 'relative to the start point' can be decreasing while the average information gain over the last 100 nodes is still high, if the current distribution is moving back towards the original policy. And if you subtract two cross entropies relative to the original policy to try to avoid that, you probably end up fairly close to the KLD of the two samples, just like we currently compute (but still needing to solve the issue of sourcing the original policy).
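For concreteness (notation mine, not from the discussion): the alternative would track divergence from the pre-noise policy prior $\pi$, whereas the approach here roughly compares two root visit-distribution snapshots taken 100 visits apart,

$$
\mathrm{KL}(p_t \,\|\, \pi) = \sum_a p_t(a)\,\log\frac{p_t(a)}{\pi(a)}
\qquad\text{vs.}\qquad
\mathrm{KL}(p_{t-100} \,\|\, p_t) = \sum_a p_{t-100}(a)\,\log\frac{p_{t-100}(a)}{p_t(a)},
$$

where $p_t$ is the normalized root visit distribution after $t$ visits. (The direction of the second divergence is discussed further down the thread.)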
Based on these results I think this is probably a decent candidate for a reinforcement learning test run as part of T50.
KL(new||old) seems preferable to KL(old||new) on philosophical grounds. If it doesn't make a difference, then I think for clarity's sake we should prefer the first. (Of course, if testing shows the latter is noticeably stronger, then there should be no objection to using it.)
KL(new||old), using the current if statement to avoid infinities, sometimes outputs a negative value, which is just as problematic as infinity. So I think I'll have to stick with the current logic unless we want to never terminate whenever an edge has gotten its first visit in the last 100 nodes.
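To make the point concrete, here is a minimal sketch (my code, not the actual implementation in this PR) of the KL(old||new) computation over root visit-count snapshots taken an interval apart, with the zero-count guard, plus a note on why the reversed direction can go negative when the same guard is applied:

```cpp
// Illustrative sketch only; names and structure are mine, not lc0's.
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

// Computes KL(old||new) between two snapshots of root child visit counts.
double KldGain(const std::vector<uint32_t>& old_counts,
               const std::vector<uint32_t>& new_counts) {
  double old_total = 0.0, new_total = 0.0;
  for (auto n : old_counts) old_total += n;
  for (auto n : new_counts) new_total += n;
  if (old_total == 0 || new_total == 0)
    return std::numeric_limits<double>::infinity();

  double kld = 0.0;
  for (size_t i = 0; i < old_counts.size(); ++i) {
    // Moves with no visits in the older snapshot contribute 0 to
    // KL(old||new) (0 * log 0 -> 0), so skipping them loses nothing.
    // Because visit counts only grow, new_counts[i] > 0 whenever
    // old_counts[i] > 0, so no remaining term can blow up.
    if (old_counts[i] == 0) continue;
    const double p_old = old_counts[i] / old_total;
    const double p_new = new_counts[i] / new_total;
    kld += p_old * std::log(p_old / p_new);
  }
  // Applying the same skip to the reversed direction, KL(new||old),
  // restricts the sum to a strict sub-support of p_new, and a truncated
  // KL is no longer guaranteed to be non-negative - hence the negative
  // values mentioned above.
  return kld;
}
```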
Also fixed some more cases where it still said kdgain instead of kldgain.
I was checking the behavior with positions from #8 to see how kld gain changes. For reference, here's the 10k visits behavior for
Specifically for the "houdini" position, here's the kld gain for each 100 visits and move stats for the correct move:
So with the suggested

A side note about the set of tactical positions above: it looks like T40 so far has learned that some tactics should sometimes be played, e.g. "houdini", whereas for other positions, e.g. "sctr", it has extremely low priors (less than 0.3%), thinking the move should basically never be played. So it'll need noise to bump up the priors to find the correct move, and hopefully, with the changes here and a lucky-enough noise draw, this position produces good training data that moves it from the "never" to the "sometimes" level. Here's the same 100-visit analysis as above for "sctr", except with a favorable-enough --noise:
And indeed some noise, bumping from P: 0.28% to P: 3.41%, finds the correct move even with just 200 visits. (In fact, a single visit to this move shows the network knows it's a winning position, but search avoids it because of the extremely low prior, similar to being blind to a mate-in-1.) The kld gain is not continuously decreasing here, with a slight bump at 1200 visits, but it would stop at 1300 visits with 90.2% training data. So conditionally allowing more visits does seem to lead to a higher visit percentage of the single "correct" move in these tactical positions.
Primarily for Issue #684, but it could also be interesting to use a very small value for real games to 'give up' on the current search and save time for other moves where search is more useful, in cases where equal options mean the existing pruning logic can't fire.
Looking for candidate values for training, I have tested an averaging interval of 100 and a threshold of 5e-6. With a current T40 net the average node count is about 900, but about 1600 for a very early net (although I only gathered a very small amount of data to start). That probably makes sense: as nets mature, more of the time they already know the right search, so it quickly converges in agreement with the existing policy. But there are still outliers which get a lot more nodes.
I have not done any investigation of whether the threshold is actually 'good', but I doubt it can be measured usefully without trying it in RL.
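For reference, a minimal sketch of how the two knobs above (interval 100, threshold 5e-6) could gate the stop decision; the names, and the assumption that the threshold is an average gain per added node, are mine rather than taken from the PR:

```cpp
// Hypothetical wiring of the averaging interval and threshold discussed
// above; reuses the KldGain() helper sketched earlier in the thread.
constexpr int kAverageInterval = 100;        // visits between KLD checks
constexpr double kMinKldGainPerNode = 5e-6;  // candidate training threshold

bool ShouldStopSearch(const std::vector<uint32_t>& prev_counts,
                      const std::vector<uint32_t>& curr_counts) {
  // Average information gain per added node since the last check; stop
  // once it falls below the threshold.
  const double gain_per_node =
      KldGain(prev_counts, curr_counts) / kAverageInterval;
  return gain_per_node < kMinKldGainPerNode;
}
```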