Suppose worker 1 has to sample 500 steps. If the game ends prematurely at step 250, the worker restarts the game and keeps sampling for another 250 steps, so the trajectory is s1, s2, ..., s250, s1', s2', ..., s250'.
The wrong implementation forgets to reset the GAE accumulator to zero when computing the GAE of s250, so the returns of the new episode leak across the boundary and the GAE becomes larger than it should be. The advantage of s250 therefore keeps growing, which makes the network believe it should output a250 whenever it sees s250. (This is not true: taking a250 at s250 is what makes you die.)
As a result, the critic loss diverges (the advantage keeps growing and the network cannot predict it correctly), the policy gets stuck on the action that makes you die, and the agent does not learn anything.
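To make the effect concrete, here is a minimal sketch of the GAE recursion with and without the reset; the function, variable names, and hyperparameters below are illustrative assumptions, not code from the repository:

```python
import numpy as np

def compute_gae(rewards, values, next_value, dones,
                gamma=0.9, lam=1.0, reset_on_done=True):
    """Compute GAE backwards over a rollout; dones[t] is 1 when the episode ends at step t."""
    gae = 0.0
    advantages = np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        mask = 1.0 - dones[t]
        # TD error; the next state's value is already masked out at episode ends
        delta = rewards[t] + gamma * next_value * mask - values[t]
        # The buggy version omits the mask on the accumulator, so the GAE of the
        # second episode leaks into the advantage of s250 (the terminal step).
        gae = delta + gamma * lam * (mask if reset_on_done else 1.0) * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages
```

With `reset_on_done=False`, the advantage at step 250 absorbs the discounted GAE of the entire second episode, which is exactly the runaway advantage described above.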
https://github.com/uvipen/Super-mario-bros-PPO-pytorch/blob/ab4248d715346c6adc33c2157455e2b98c130bcc/train.py#L119
It should reset `gae` to zero at episode boundaries.
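A sketch of the corrected line, assuming the loop at the linked location uses the usual `gae = gae * opt.gamma * opt.tau` update (the names `gae`, `done`, `opt.gamma`, `opt.tau` are assumed to match the surrounding code in train.py, not verified here):

```python
# zero out the accumulated GAE at episode boundaries so the next episode's
# advantage does not leak into the terminal step of the previous one
gae = gae * opt.gamma * opt.tau * (1 - done)
```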