Improve training data for learning tactics #8
I think this is probably good enough. As it moves towards 16.3% it will accelerate and move up even faster. Also, generally I think we should not make any of these sorts of changes that try to improve on the paper until after we fix things that are probably wrong, such as rule50.
Yes, in fact, from all the runs, once the prior for this particular move reaches P: 1.28%, it will start driving at least 100 of 800 visits towards it. Similarly, once it gets to P: 2.25%, over 700 visits will go to it, for nearly 90% average tactic training. I.e., the networks are not yet in a virtuous cycle of self-learning for this tactic. However, if you look at the data showing the priors for this move across the various network ids, the prior has stayed around 0.3%. That same data shows that nearly all those networks would have put over 700 visits into the move if only search had initially given it 2 visits. That means even with the current "16.3% average tactic training," there is far more training data driving the prior towards 0.3% than above it. So across 250 network generations, the existing "16.3%" noise has been unable to get it to learn this tactic when 2 visits would have. The new network prior approaches
@ASilver requested running with 3.1 PUCT, and I see that the latest lc0 master 2321011 uses that and gets 24.9%. The earlier runs were against then-next 5054269 with 16.3% average tactic training. Here's a graph of testing various PUCT at noised root:
diff --git a/src/mcts/search.cc b/src/mcts/search.cc
--- a/src/mcts/search.cc
+++ b/src/mcts/search.cc
@@ -677 +677 @@ std::pair<Node*, bool> Search::PickNodeToExtend(Node* node,
- float factor = kCpuct * std::sqrt(std::max(node->GetChildrenVisits(), 1u));
+ float factor = (is_root_node && kNoise ? 3.5f : kCpuct) * std::sqrt(std::max(node->GetChildrenVisits(), 1u));
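For context, here is a minimal sketch of the PUCT child-selection score that the patch above modifies; the struct and names are illustrative, not lc0's actual classes, and only the idea of swapping in a larger root multiplier when noise is on is taken from the diff.

#include <algorithm>
#include <cmath>

struct Child {
  float prior;   // P from the policy head
  float q;       // average value of this child's subtree
  int visits;    // N
};

// PUCT score used to pick the next child to visit. With the patch above, the
// root uses a larger multiplier when Dirichlet noise is on, so the noised
// priors actually translate into extra root visits.
float PuctScore(const Child& c, int parent_visits, float cpuct,
                bool is_root, bool noise_enabled) {
  const float effective_cpuct = (is_root && noise_enabled) ? 3.5f : cpuct;
  const float factor =
      effective_cpuct * std::sqrt(static_cast<float>(std::max(parent_visits, 1)));
  const float u = factor * c.prior / (1 + c.visits);
  return c.q + u;
}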
Here's some more analysis on other board states from glinscott/leela-chess#698 (comment):
SCTR vs id359 game 1 (this is from OP)
id359 vs Wasp game 3
id351 vs EXchess game 1
id351 vs Hakkapeliitta game 1
iCE vs id351 game 1
id351 vs Bobcat game 2
Here's the "average tactic training" for various engine configurations and board states:
These are all tested with current master 2321011, where the default is PUCT 3.1, α 0.3, ε 0.25, no twice visits. The patches for root PUCT and twice visits are in earlier comments, and the noise changes just adjust those parameters. I used the networks listed (id351 or id359) because, notably, the latest id369 has already learned the tactic from the Hakkapeliitta game, where with PUCT 1.2 (lczero 0.6) an average tactic training of 57.6% was enough to learn it. Here are the priors for each of the expected moves:
So it looks like noise is indeed working, and with the PUCT change to 3.1, there will be less need for additional changes to improve training tactics.
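(For reference, the α and ε mentioned above are the Dirichlet noise parameters mixed into the root priors each self-play game. A rough sketch of that mixing, with simplified types rather than lc0's actual code:)

#include <random>
#include <vector>

// Mix Dirichlet(alpha) noise into the root priors: P' = (1 - eps) * P + eps * d,
// where d is a sample from a symmetric Dirichlet distribution (drawn here as
// normalized Gamma samples). With alpha 0.3 and eps 0.25, roughly a quarter of
// the root policy mass is redistributed randomly each self-play game.
void ApplyDirichletNoise(std::vector<float>* priors, float alpha, float eps,
                         std::mt19937* rng) {
  std::gamma_distribution<float> gamma(alpha, 1.0f);
  std::vector<float> noise(priors->size());
  float total = 0.0f;
  for (auto& n : noise) {
    n = gamma(*rng);
    total += n;
  }
  if (total <= 0.0f) return;
  for (size_t i = 0; i < priors->size(); ++i) {
    (*priors)[i] = (1.0f - eps) * (*priors)[i] + eps * noise[i] / total;
  }
}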
I analyzed all the CCLS id359 games to find low-prior moves that were played, to see if the same network would have found them with noise. Here's a first set of moves the other AI played that the network didn't really consider at all. Not all have major swings in win rate or even change the outcome, but at least in this first one against Houdini, lczero thought it had a 63% win rate, while after the 0.15% prior move it's actually the opponent with a 75% win rate. For each unexpectedly played move: a screenshot, the uci position, what id359 thought of it, and the top alternate moves when forced to explore at least 10 of 800 visits:
Houdini Round 8 29. Qxe7+ 0.15%
Naum Round 3 33. …Qxd5 0.05%
Scorpio Round 2 22. Rexe7 0.07%
Protector Round 6 13. Nf5+ 0.10%
Vajolet Round 5 23. …Nxg4 0.17%
Cheng Round 1 24. …Bf4+ 0.01%
And the same analysis as before, with 50 noised games from the above board states to calculate the average training:
I rebased @ASilver's params from #46 onto 2321011, where I did the earlier tests. Analyzing the same 12 games from earlier with the same networks, each with 50 runs of noise, the average training goes up quite a bit. This is most likely from softmax, as it significantly increases the policy in each of these cases where the usual priors are much lower. I've included the training numbers and move prior for the default and the adjusted params:
As @killerducky pointed out earlier, we probably shouldn't touch the training until other things are fixed, so these numbers are at least reassuring: had the network never become so biased against these moves to begin with, it would pretty naturally find the correct moves with the default noise settings. In terms of network progression, even with a clean start, priors could end up very low like these 0.0x% numbers because the value head hadn't learned to favor a position, so training search would give fewer visits. But I suppose that could be revisited later if training seems to be stuck again and failing to generate useful data.
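(The softmax discussed above is the policy softmax temperature. A rough sketch of its effect, with illustrative names and assuming raw policy logits as input; not lc0's actual implementation:)

#include <algorithm>
#include <cmath>
#include <vector>

// Policy softmax with a temperature: a temperature above 1 flattens the
// distribution, lifting very low priors (the 0.0x% moves above) at the cost
// of the top move's share.
std::vector<float> SoftmaxWithTemperature(const std::vector<float>& logits,
                                          float temperature) {
  std::vector<float> probs(logits.size());
  if (logits.empty()) return probs;
  const float max_logit = *std::max_element(logits.begin(), logits.end());
  float total = 0.0f;
  for (size_t i = 0; i < logits.size(); ++i) {
    probs[i] = std::exp((logits[i] - max_logit) / temperature);
    total += probs[i];
  }
  for (auto& p : probs) p /= total;
  return probs;
}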
There was a request to check the latest lc0 test id14 4ce96dba to see what it thought of each of these games. It looks like the network already avoids most of these moves except in a couple of games, and has trouble generating training data that would increase the prior for the move. Below are the average training using default noise as well as the priors for the move:
Here's a check to see if the network would have found the move if forced to explore at least 20 visits out of 2000:
So the network, with its value and policy, could generate valuable training data in almost all cases if search were biased the right way. It's unclear if this is a limitation of the network's size. I'm not sure whether it's concerning yet that the network generates considerably less average training compared to the previous comment with default training.
Here's a look at the prior progression for moves from Scorpio vs id359 Round 2 from #8 (comment) using the lc0 test networks:
For reference, here's the board and the moves id359 would have found when forced:
I analyzed id396 from @scs-ben compared to id395 for these positions, and some have significantly better average tactics training even though the same training data resulted in similar priors for the correct move in both networks. The main difference is that the value evaluation for the expected move is generally favorable, which drives more visits during search. In the Scorpio game from the previous comment, the new network believes the move is winning (V: 3.05%) where the previous network thinks it's losing (V: -38.69%), so even though the priors are very low, around 0.2% for both, id396's value results in 60.3% training, up from 18.0%! @killerducky, is this change in value expected from 50-normalization? It should definitely help generate better training data in some cases! 👍
For reference, in the Scorpio game here is id396's V for each move, showing two moves with positive V:
Whereas with id395, only one move:
I wouldn't say it's directly expected. But r50 was breaking the net, so with it fixed, hopefully the net will fix other things too, or we will find the next problem.
There looks to be quite a bit of difference between id401 and id402. I'm surprised at how much the value can change in just one network.
Rerunning the original "SCTR" position with 11089 at varying visit counts (no noise, no softmax, no aversion):
Those would move the estimated average policy training from the existing 0.62% to: 0%, 0%, 28%, 64%.
And with the "Wasp" position:
Similarly, the 1.15% prior increases towards: 18%, 59%, 79%, 89%. For reference, here are the other top visited moves at 6400:
At least for these tactical positions, where the other moves would be significantly worse than the one correct play, increasing visits allows MCTS to eventually give enough visits to the higher-prior moves and then find the hidden tactics. So instead of adjusting noise in various ways, simply doubling visits should lead to significantly more visits to the correct move and consequently rapidly increase the prior training above the noise threshold. (Increasing visits improves the policy head while keeping the existing noise settings, and it also improves the value head while keeping the existing temperature, without needing #237.)
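As a toy illustration of why a bigger visit budget helps (the numbers here are made up, not taken from the runs above): a greedy PUCT allocation between a high-prior move and a low-prior tactic whose value only becomes apparent after a few visits.

#include <algorithm>
#include <cmath>
#include <cstdio>

// Toy greedy PUCT allocation between two root moves. The "tactic" has a tiny
// prior and looks losing until its refutation has been explored (modelled as
// needing a handful of visits), after which its value jumps. With a small
// visit budget the tactic is starved; with a larger budget the exploration
// term eventually forces enough visits to reveal the win, and from then on
// the tactic soaks up most of the remaining budget.
int main() {
  const float cpuct = 3.1f;
  const int reveal_after = 5;  // visits needed before the tactic looks winning
  const int budgets[] = {800, 1600, 3200, 6400};
  for (int budget : budgets) {
    int popular_visits = 0, tactic_visits = 0;
    for (int n = 0; n < budget; ++n) {
      const float factor =
          cpuct * std::sqrt(static_cast<float>(std::max(n, 1)));
      const float popular_score =
          -0.10f + factor * 0.60f / (1 + popular_visits);
      const float tactic_value =
          tactic_visits >= reveal_after ? 0.40f : -0.30f;
      const float tactic_score =
          tactic_value + factor * 0.01f / (1 + tactic_visits);
      (tactic_score > popular_score ? tactic_visits : popular_visits)++;
    }
    std::printf("budget %5d: tactic visits %5d (%4.1f%%)\n", budget,
                tactic_visits, 100.0f * tactic_visits / budget);
  }
  return 0;
}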
I reran the positions with 11089, and things definitely seem better than before, finding 6 of 12 correct tactical moves with self-play settings and 800 visits.
If using the default match settings for cpuct and softmax, 11089 finds all except one:
And here's the result with latest test20:
Interesting to see how different the initial network V can be from the searched Q in these positions.
Porting to lc0 of lczero issues glinscott/leela-chess#698 and glinscott/leela-chess#699 using the same game for analysis:
CCLS SCTR vs id359 game 1
Trying to find Rxh4: https://clips.twitch.tv/NimbleLazyNewtPRChase
Here's the history of networks from 364 going back 10 at a time and what they thought of the winning move Rxh4 / a4h4 (focus on V and P for now):
Generally, the prior for this winning move is very low, under 1%, and the value is also unfavorable for white, so search will normally avoid it. This makes tactics tricky to learn when playing an initially bad-looking move opens up a better outcome.
That's where noise comes in to trick search into visiting it more. Here are 50 runs of
./lc0 --weights=id359 --verbose-move-stats --noise --no-smart-pruning
with go nodes 800 from the above position startpos …:
Here, 13 of 50 games would have produced valuable training data, so noise is indeed working, but the majority of games train to avoid the correct move. Averaging this training data for the move across the 50 games should move P towards 16.3% (= 6523 / ~800 / 50). But combined with training data from other games, the networks have learned to keep avoiding this move.
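(For clarity, the "average tactic training" used throughout this issue is just the move's share of search visits averaged over the noised runs; a trivial sketch with illustrative names:)

#include <vector>

// A game's policy training target for a move is visits_to_move / total_visits,
// so averaging it over the noised runs estimates where training will pull the
// prior. For the runs above: 6523 total visits to Rxh4 over 50 games of ~800
// nodes is about 16.3%.
float AverageTacticTraining(const std::vector<int>& visits_to_move,
                            int nodes_per_game) {
  long total = 0;
  for (int v : visits_to_move) total += v;
  return static_cast<float>(total) /
         (static_cast<float>(nodes_per_game) * visits_to_move.size());
}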
As in the other issue: the premise is that when a self-play game ends up in a learnable board state, it seems unfortunate that more often than not it misses the opportunity to generate valuable training data for the correct move. Clearly, AZ's numbers are good enough to eventually generate strong networks, but perhaps training search could be better optimized?
I've rerun the analysis with lc0, with 50 games per configuration from the above board state, to measure the average training data for the expected tactic:
Testing patches for visit twice and negative fpu
I only ran one configuration of "visit each root move twice", as even with the default search parameters it generally searches much deeper after being nudged over by the forced breadth exploration. This is true across all the previously listed networks from id364 back to id124, and the outputs with high Ns are the "visit twice" runs.
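A sketch of the "visit each root move twice" idea, illustrative rather than the actual patch: any root child below a minimum visit count is expanded before the usual PUCT argmax.

#include <vector>

struct RootChild {
  int visits;
  float puct_score;  // Q + U, computed elsewhere
};

// Forced breadth at the root: children below the minimum visit count are
// picked first, so every root move gets a minimal look regardless of prior.
int PickRootChild(const std::vector<RootChild>& children, int min_visits = 2) {
  for (size_t i = 0; i < children.size(); ++i) {
    if (children[i].visits < min_visits) return static_cast<int>(i);
  }
  // Otherwise fall back to the usual argmax over PUCT scores.
  int best = 0;
  for (size_t i = 1; i < children.size(); ++i) {
    if (children[i].puct_score > children[best].puct_score)
      best = static_cast<int>(i);
  }
  return best;
}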
Is there an appropriate level of average tactic training? It looks like the current 16.3% is too low to outweigh the other training data. A related question is how often self-play games get into learnable states, but I don't have a good way to answer that.