Not really an issue, more a question #10

ghost opened this issue Jun 25, 2021 · 4 comments

ghost commented Jun 25, 2021

Hi,
I want to reuse your MiniGrid experiment as a benchmark for my paper on RL generalisation ... it fits nicely, but I am not clear on how to replicate the experiment that generates the orange line in your paper. Can you provide some insight?
Are you running the training on 2,000,000 environments to generate the chart?
Thanks a lot in advance.

ghost commented Jun 25, 2021

Just to be more precise, I would like to train your agent on 1000 random environments and test it on 1000 other environments to get the generalisation percentage on those test environments ... I'm not sure how I can do that with the code provided ... thanks

maximilianigl commented Jun 27, 2021

Hi, thanks for your interest!
We only have an explicit train/test split for the Coinrun environment. For MiniGrid, we randomly sample from all possible layouts during training. This doesn't allow us to explicitly measure the generalisation gap, but the performance of the agents (and their learning speed) still correlates with how well they generalise, as the number of possible layouts is so large that they rarely see the same layout twice. So Figure 2 just shows the normal training performance we usually report in RL.
Note that there's a lot of variation in the results, which is why ours are averaged over 30 random seeds.
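As an illustration of that seed averaging, here is a minimal sketch that aggregates per-seed learning curves into a mean curve with a standard-deviation band. The file names and array shapes are assumptions for illustration, not the repo's actual logging format:

```python
import numpy as np
import matplotlib.pyplot as plt

NUM_SEEDS = 30

# Hypothetical log files: one array of per-update average returns per seed.
curves = np.stack([np.load(f"returns_seed_{i}.npy") for i in range(NUM_SEEDS)])
# curves has shape (NUM_SEEDS, num_updates)

mean = curves.mean(axis=0)   # average over the 30 seeds
std = curves.std(axis=0)     # spread across seeds
updates = np.arange(len(mean))

plt.plot(updates, mean, label="mean over 30 seeds")
plt.fill_between(updates, mean - std, mean + std, alpha=0.3)
plt.xlabel("update")
plt.ylabel("average return")
plt.legend()
plt.show()
```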

ghost commented Jun 29, 2021

Sure, so if I understood it well, you run iterations where you train on 3 randomly chosen environments and then test on another one, also randomly chosen, right? And the results are computed every 30 tests as an average of the reward over these 30 test environments ...

maximilianigl commented Jul 3, 2021

For MiniGrid we're using the usual PPO setup (see the repo for the hyperparameters):

  • We're running 16 environments in parallel (--procs). Each environment is run entirely independently from the others, i.e. when we reach the end of an episode in one environment, we randomly sample a new layout for that environment and continue rollouts there (see the sketch after this list). The layout sampling is not restricted in any way, so eventually we should see every possible layout during training. Generalisation is only important because there are so many of them; not sure how many exactly, but your estimate of 2M could be correct, though it might be a bit less.
  • There's no explicit train/test split, i.e. we report the performance on each environment and also use that environment to subsequently train, like we often do in RL. The only major difference to most RL setups is in the randomness of layout generation inside the environment. From an algorithm/training perspective, we just run standard PPO.
  • In case you were referring to the N3r suffix in the environment name with 'train on 3 environments': The N3r only means that layouts are randomly generated with up to 3 rooms.
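For concreteness, a minimal sketch of what that independent layout resampling looks like with gym-minigrid. The environment id, the step budget, and the random-action "policy" are placeholders, not the repo's actual PPO training code (and the paper's N3r MultiRoom variant is registered by the repo itself, so its id may differ):

```python
import gym
import gym_minigrid  # noqa: F401  (registers the MiniGrid environments)

NUM_PROCS = 16  # corresponds to --procs in the training script

# Illustrative environment id; each reset() draws a fresh random layout.
envs = [gym.make("MiniGrid-MultiRoom-N6-v0") for _ in range(NUM_PROCS)]
obs = [env.reset() for env in envs]

for _ in range(1000):
    for i, env in enumerate(envs):
        action = env.action_space.sample()  # stand-in for the PPO policy
        obs[i], reward, done, info = env.step(action)
        if done:
            # Episodes end independently per environment; resetting samples a
            # brand-new layout, so there is no fixed train/test split over layouts.
            obs[i] = env.reset()
```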

Not sure if that helps, please let me know if not - I feel like we might be talking past each other :).
