Chapter 8: Training loop and min_progress #34

Open
frank-roesler opened this issue Dec 8, 2021 · 1 comment

Comments

@frank-roesler

Unless I'm mistaken, there is something odd about the main training loop (Listing 8.13) for the Super Mario game in Chapter 8. The way that the current x-position is checked against the min_progress parameter makes no sense to me.
More precisely: in line 23 of the main training loop, the environment step is taken (repeated 6 times) and last_x_pos is set to the current x-position:

```python
state2, e_reward_, done, info = env.step(action)
last_x_pos = info['x_pos']
```

In the lines of code that follow, neither last_x_pos nor info['x_pos'] is changed. Then, in line 33, the two are compared to one another:

```python
if episode_length > params['max_episode_len']:
    if (info['x_pos'] - last_x_pos) < params['min_progress']:
        done = True
    else:
        last_x_pos = info['x_pos']
```

Isn't info['x_pos'] - last_x_pos always going to be zero here? This would always reset the environment as soon as episode_length > params['max_episode_len'].
What is the min_progress parameter meant to capture intuitively? The progress from the beginning to the end of one episode? The progress from step 0 to max_episode_len? Or the progress relative to some checkpoint within a certain number of steps? If so, how are these checkpoints chosen?
This hasn't become clear to me yet, either from the book or from the code.
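
For what it's worth, below is a minimal, self-contained sketch of one possible reading of the check, in which last_x_pos acts as a checkpoint that is only updated when the progress test passes, so min_progress would mean "advance at least this far every max_episode_len steps". The parameter values and the helper function are illustrative assumptions on my part, not taken from the book's listing.

```python
params = {'max_episode_len': 100, 'min_progress': 15}   # illustrative values

def steps_until_forced_reset(x_positions):
    """Return the step index at which the episode would be force-ended for
    lack of progress, or None if the position stream ends first."""
    last_x_pos = x_positions[0]        # checkpoint: x-position at the last check
    episode_length = 0
    for t, x_pos in enumerate(x_positions):
        episode_length += 1
        if episode_length > params['max_episode_len']:
            if (x_pos - last_x_pos) < params['min_progress']:
                return t               # done = True: too little progress since the checkpoint
            last_x_pos = x_pos         # enough progress: record a new checkpoint
            episode_length = 0         # and restart the step counter
    return None

# An agent that walks right and then stalls at x = 300 is reset only after it
# stops making progress, rather than after a fixed number of steps.
stalled_run = [min(3 * t, 300) for t in range(500)]
print(steps_until_forced_reset(stalled_run))   # 201 with these values
```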

@frank-roesler
Author

Addendum:
This also explains why in Figure 8.19 of the book the training time for each episode is always exactly the same (i.e., the horizontal distance between consecutive peaks is identical): the training loop always runs for params['max_episode_len'] steps and then resets.
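
A small self-contained check of this observation (with illustrative parameter values, not the book's hyperparameters): because last_x_pos is overwritten on every step, the progress difference is always zero, and the episode is force-ended on the first step after max_episode_len no matter how far the agent actually moves.

```python
params = {'max_episode_len': 100, 'min_progress': 15}   # illustrative values

episode_length = 0
x_pos = 0
done = False
while not done:
    episode_length += 1
    x_pos += 3                 # the agent advances steadily the whole time
    last_x_pos = x_pos         # overwritten every step, as in the quoted listing
    if episode_length > params['max_episode_len']:
        if (x_pos - last_x_pos) < params['min_progress']:   # always 0 < 15
            done = True

print(episode_length)          # prints 101, i.e. max_episode_len + 1, every time
```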
