
Don't use losing positions to train policy #703

Open
trebe opened this issue Jun 14, 2018 · 6 comments


trebe commented Jun 14, 2018

Suggestion

When extracting training data for policy training, skip lost positions.
A lost position here means a position from a training game whose actual outcome was a loss for the side to move.

Based on chat discussions I know this will be controversial, so I will do my best to motivate it with some intuition.

Motivation

The purpose of the policy network is to direct the MCTS search during training-game generation and match play.
The question we are effectively asking the neural network to learn to answer is: "What would likely be the policy result of an 800-node search from here?"
The real question we want answered is: "Supposing there is a way to win or draw this position, what would be the best way to do that?" This is a conditional question, hence conditional training is called for.

By including lost positions (positions that are part of actually lost games) in the training, we are teaching the opposite. Those positions teach the network: yes, I know we will lose, but here is the best way to lose.
The thought is perhaps that by choosing the "best losing move" we will have a higher chance of actually winning instead, but there is no reason to believe that. We are just reinforcing the moves that led to the actual loss. If we are really losing anyway, we don't care about the moves. Only when there is a real chance of winning or drawing should we care, and those real chances show up as actual games where some "miracle" happened.

Imagine a seemingly lost position. What we want the network to learn is: "Among all similar positions where some miracle occurred and we managed to draw, what move led to that miracle?"

But policy is not only useful for the winning side

One objection I heard in the chat was this (reproduced to the best of my memory): "Policy is not only used by the winning side. We also want to guess what our losing opponent will play."
No, not really. We want to know what their best chance of managing a draw would be: if there is any way my opponent can avoid losing, what is it? That is what we want the search to find. Finding that "any way, if it exists" move is exactly what conditional training on only the won/drawn positions gives us. In a winning position, we want to assume an opponent that is still desperately trying, not one that accepts its losing fate.

Even if some positions are so hopeless that they end up with zero policy training applied to them, it doesn't matter. Why waste network capacity and training time learning something that doesn't matter? Why learn the best way to lose?

Implementation / testing

My idea could easily be tested by training a network in parallel to the main network, either from bootstrap or starting from a copy of the latest network.

Value output

The value output is not covered by this proposal. Lost positions should of course still be trained as lost, i.e. with a target value of 0.
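
To make the proposal concrete, here is a minimal sketch of the filtering step, assuming a hypothetical loader that yields (position, policy_target, outcome) tuples with the outcome from the side to move's perspective (1 = win, 0.5 = draw, 0 = loss). The names and data layout are illustrative, not the actual lczero training pipeline.

```python
# Minimal sketch of the proposed filtering, not the actual lczero pipeline.
# Assumes each training example is (position, policy_target, outcome), with
# outcome from the side to move's perspective: 1.0 = win, 0.5 = draw, 0.0 = loss.

def split_training_examples(examples):
    """Keep policy targets only for won/drawn positions; keep value targets for all."""
    policy_examples = []  # used for the policy loss
    value_examples = []   # used for the value loss
    for position, policy_target, outcome in examples:
        value_examples.append((position, outcome))           # lost positions still train value toward 0
        if outcome > 0.0:                                     # win or draw for the side to move
            policy_examples.append((position, policy_target))
        # outcome == 0.0 (loss): skip the policy target entirely
    return policy_examples, value_examples
```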


Dorus commented Jun 14, 2018

Well, the first thing wrong with your reasoning is that

"What would likely be the policy result of an 800 node search from here".

and

"Suppose there is a way to win or draw this position, what would be the best way to do that"

are not mutually exclusive. In fact, both of these statements are true, because those 800 nodes will be searching for that magical draw or win in an otherwise lost position.

You seem to assume that the network only learns what it can find in 800 nodes. However, remember the value network is run at those 800 nodes too, and that value output is itself based on 800 nodes at that position. Those 800 nodes are in turn based on the value output of the next 800 nodes. Of course accuracy goes down, as some positions might not be part of the training data, or were part of it a long time ago, or the network might have forgotten about them. Anyway, that's all solved with more training (or moar layers).

Secondly: it's important the network has a strong opponent to play against. If you watch the mate sheet, you can see LCZ was winning against random moves on certain checkmates for a long time, but it only learned to defeat SF in those positions after the winrate against itself went down. The winrate against itself going down can only mean it learned to defend better, and only once it learned to defend better was the MCTS able to identify positions where a "magic draw" (or win) was possible, and then avoid them.

If the policy does not know what to do on the defensive side when losing, the search quickly becomes very, very wide, because it has to search all defensive moves instead of just the most promising ones.

In a winning position, we want to assume an opponent still desperately trying, not an opponent that accepts its losing fate.

The only way to get an opponent that keeps trying new tricks is to train on the losing positions. Only when trained will it look for that one move with a 0.2% winrate instead of 0.1%.

The only position that you do not need to train on is one where the winning side knows it has won and the losing side knows it has lost. This is why turning on resign is a good idea.


trebe commented Jun 14, 2018

I am well aware of the recursive aspect of letting the network learn what an 800-node search would find :)

Because those 800 nodes will be searching for that magical draw or win in an otherwise lost position

Yes, so let's use the result whenever it finds such a magical draw or win, and not reinforce the losses.

If it never finds it, then the position is lost, and reinforcing the loss will not make it any better at finding the draw later. Draws in similar positions, and temperature in training games, could help, though.

The winrate against itself going down can only mean it learned to defend better

What does it mean to defend better? It means that it more often got lucky and found a draw. Great, then use that draw for training the policy. But not until there actually is a draw.

important the network has a strong opponent to play against

Exactly. So train that strong opponent from the strong demonstrations, namely the drawn games. Don't reinforce the losing moves.

one move that has 0.2% winrate instead of 0.1%

Great. If that really translates into actually winning 0.2% of the time, as opposed to being fantasy, then use those winning games. Train on what the moves were when we won, not when we lost.

I hope someone has the cycles and motivation to just try the idea. No client changes are needed, just parallel training, and if performance goes up, as I expect, we're good.


Dorus commented Jun 14, 2018

Because of the recursive aspect, the fact that a game is lost after 150 moves does not mean move 75 was wrong. With your proposal you toss away half the training data. While the value network trains towards the final game result, the policy actually trains towards the best value output, which means it can still learn good moves from lost self-play games.

Self-play games are played with temperature and noise, so there will be quite a few of them where the "wrong" side actually won, yet those games are still very valuable to the training set.

In fact, in lost positions you more likely than not toss away 98% of the training data, and train only on the freak wins produced by noise and temperature. Those are not necessarily the games that discovered new concepts.


trebe commented Jun 14, 2018

Right, there is a noisy aspect to the final outcome.
But that should help us: it will promote some moves out of lost positions, in proportion to how close those moves came to turning the game around.
What you call "freak wins" are the games where a move gave a drawing chance in a lost position. Those are fine to keep for training.

in lost positions, you more likely than not toss away 98% of the training data

which is fine. It's better than reinforcing the moves that demonstrably lost. And in truly lost positions there is nothing to learn; just resign.


trebe commented Jun 14, 2018

I think training is not the bottleneck here; rather, game generation is, right?

So why not just fork a network, and train it in parallel.

Maybe I'll just learn how to run the training pipeline :)


trebe commented Jun 14, 2018

Another way to express my proposal is in terms of per-outcome learning weights for the policy loss in SGD:
three different weights depending on the game outcome (loss, draw, win).
Currently those are (1, 1, 1).
My proposal is (0, 1, 1).
And actually, I like (0, 0.5, 1) better, but that would have complicated my initial proposal.

I get your point about learning from games that, although still lost, temporarily reached a higher value score during the game (until the score went to 0 because of the checkmate). One might imagine that this leads to better defense in lost positions (although my main point is that I believe it does not).
But a compromise that accommodates both viewpoints would be to set the weights to (0.5, 1, 1.5).
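
For concreteness, here is a sketch of what such outcome-dependent weighting of the policy loss could look like, written in PyTorch style. The weight table, tensor shapes, and function names are illustrative assumptions, not the actual training code.

```python
import torch

# Illustrative sketch only, not the actual lczero training code.
# Per-outcome weights for the policy loss, keyed by game outcome
# from the side to move's perspective: 0.0 = loss, 0.5 = draw, 1.0 = win.
OUTCOME_WEIGHTS = {0.0: 0.0, 0.5: 1.0, 1.0: 1.0}   # or {0.0: 0.5, 0.5: 1.0, 1.0: 1.5}

def weighted_policy_loss(policy_logits, policy_targets, outcomes):
    """Per-example policy cross-entropy, weighted by each example's game outcome."""
    log_probs = torch.log_softmax(policy_logits, dim=1)
    per_example = -(policy_targets * log_probs).sum(dim=1)      # standard policy cross-entropy
    weights = torch.tensor([OUTCOME_WEIGHTS[float(o)] for o in outcomes],
                           dtype=per_example.dtype)
    # Normalize by the total weight so the loss scale stays comparable across batches.
    return (weights * per_example).sum() / weights.sum().clamp(min=1e-8)
```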
