Suppose worker 1 has to sample 500 steps. If the game ends prematurely at step 250, the worker restarts the game and keeps sampling for another 250 steps, so the trajectory is s1, s2, ..., s250, s1', s2', ..., s250'.
The wrong implementation forgets to reset the GAE accumulator to zero when computing the GAE of s250, so the returns of the new episode leak across the boundary and the GAE becomes larger than it should be. The advantage of s250 therefore keeps growing, which makes the network believe it should output a250 whenever it sees s250. (This is not true: taking a250 at s250 is what makes you die.)
As a result, the critic loss diverges (the advantage keeps growing and the network cannot predict it correctly), the policy gets stuck on the action that makes you die, and the agent does not learn anything.
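To make the effect concrete, here is a minimal sketch of the GAE recursion with and without the reset; the function, variable names, and hyperparameters below are illustrative assumptions, not code from the repository:

```python
import numpy as np

def compute_gae(rewards, values, next_value, dones,
                gamma=0.9, lam=1.0, reset_on_done=True):
    """Compute GAE backwards over a rollout; dones[t] is 1 when the episode ends at step t."""
    gae = 0.0
    advantages = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        mask = 1.0 - dones[t]
        # TD error; the next state's value is already masked out at episode ends
        delta = rewards[t] + gamma * next_value * mask - values[t]
        # The buggy version omits the mask on the accumulator, so the GAE of the
        # second episode leaks into the advantage of s250 (the terminal step).
        gae = delta + gamma * lam * (mask if reset_on_done else 1.0) * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages
```

With `reset_on_done=False`, the advantage at step 250 absorbs the discounted GAE of the entire second episode, which is exactly the runaway advantage described above.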
https://github.com/uvipen/Super-mario-bros-PPO-pytorch/blob/ab4248d715346c6adc33c2157455e2b98c130bcc/train.py#L119
It should reset `gae` to zero at episode boundaries.
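A sketch of the corrected line, assuming the loop at the linked location uses the usual `gae = gae * opt.gamma * opt.tau` update (the names `gae`, `done`, `opt.gamma`, `opt.tau` are assumed to match the surrounding code in train.py, not verified here):

```python
# zero out the accumulated GAE at episode boundaries so the next episode's
# advantage does not leak into the terminal step of the previous one
gae = gae * opt.gamma * opt.tau * (1 - done)
```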