🐛 Bug
I've adapted the environment from this blog post; the exact code of the env is shown below. The original implementation was a recurrent A3C agent in TF1 (it was written a while ago).
It's a very simple "contextual bandit" environment: for each episode, two random colors are given as the observation, and their order is randomly flipped at each timestep. There are two actions, each corresponding to a particular color. The reward is associated with one of the colors, so the agent has to figure out which color leads to reward and select the action corresponding to that color's current position. Upon environment reset, the task structure stays the same but new pixel colors are chosen.
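Concretely, the task looks roughly like this (a minimal Gymnasium sketch of the behaviour described above; the class name, episode length, and deterministic reward are illustrative assumptions, not the exact env code):

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class ColorBanditEnv(gym.Env):
    """Illustrative sketch of the task described above (not the exact env code).

    Each episode, two random colors are drawn; the observation shows both,
    in an order that is shuffled every step. One color is rewarding, and the
    agent must pick the action matching that color's current position.
    """

    def __init__(self, episode_len=100):
        super().__init__()
        self.episode_len = episode_len
        # Observation: the two colors, concatenated as RGB values in [0, 1].
        self.observation_space = spaces.Box(low=0.0, high=1.0, shape=(6,), dtype=np.float32)
        # Action: pick the first (0) or second (1) color slot.
        self.action_space = spaces.Discrete(2)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        # New colors each episode; task structure (one rewarding color) is unchanged.
        self.colors = self.np_random.random((2, 3)).astype(np.float32)
        self.rewarded_color = self.np_random.integers(2)  # index into self.colors
        self.t = 0
        return self._obs(), {}

    def _obs(self):
        # Randomly flip the order of the two colors at every timestep.
        self.order = self.np_random.permutation(2)
        return np.concatenate([self.colors[i] for i in self.order])

    def step(self, action):
        # Reward 1 if the chosen slot currently holds the rewarded color
        # (self.order is the ordering of the observation the agent just saw).
        reward = float(self.order[action] == self.rewarded_color)
        self.t += 1
        truncated = self.t >= self.episode_len
        return self._obs(), reward, False, truncated, {}
```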
I figured SB3 Recurrent PPO should be able to solve this environment pretty easily. There is prior work showing that a recurrent policy network trained with A2C can solve a more complex 3D version of this task.
I have tried training for 1e6-1e8 timesteps and optimizing hyperparameters with Optuna (my ranges can be found below). I'm wondering whether this task uncovers a hidden issue with Recurrent PPO in SB3, or whether it's just a deceptively difficult task. I have yet to run this environment with other packages to see if the problem is specific to Recurrent PPO.
Any thoughts/insights?
Hyperparameter ranges:
Code example
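Below is a minimal sketch of how I'm running Recurrent PPO on the environment (representative only; the hyperparameter values shown are placeholders rather than my Optuna-tuned settings):

```python
from sb3_contrib import RecurrentPPO
from stable_baselines3.common.env_util import make_vec_env

# ColorBanditEnv is the sketch from above; the hyperparameters here are
# placeholders, not the tuned values from the Optuna search.
env = make_vec_env(ColorBanditEnv, n_envs=8)

model = RecurrentPPO(
    "MlpLstmPolicy",
    env,
    n_steps=128,
    learning_rate=3e-4,
    gamma=0.9,
    ent_coef=0.01,
    verbose=1,
)
model.learn(total_timesteps=1_000_000)
```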
Relevant log output / Error message
No response
System Info
No response
Checklist