lstm+ppo cannot converge in Pendulum-v0 environment #2

Open
1900360 opened this issue Oct 2, 2022 · 7 comments

Comments

@1900360

1900360 commented Oct 2, 2022

Hi @nslyubaykin
LSTM+PPO cannot converge in the Pendulum-v0 environment. I don't know whether there is a setting error in my code; could you check it for a moment?
The reward curve is shown below:
[image: reward curve]
lstm_parallel_ppo.txt

@nslyubaykin
Owner

Hi @1900360!

Did the parameters mentioned in this issue help for that task?

@1900360
Author

1900360 commented Oct 5, 2022

Hi @nslyubaykin
Sure, here are the parameters:

actor = PPO(
    policy_net=NormalLSTM(obs_dim, acs_dim, nlayers_lstm=2,
                          seq_len=1+n_lags,
                          nunits_lstm=32, nunits_dense=8,
                          out_activation=torch.nn.Identity(),
                          init_log_std=-1.5),
    device=torch.device('cpu'),
    learning_rate=1e-4,
    n_epochs_per_update=50,
    batch_size=5000,
    target_kl=0.2,
    eps=0.2,
    gamma=0.9,
    obs_nlags=n_lags,
    obs_expand_axis=0,
    obs_concat_axis=0,
    obs_padding='zeros',
    standardize_advantages=True,
    weight_decay=0.0
)

critic = GAE(
    critic_net=VLSTM(obs_dim, nlayers_lstm=2,
                     seq_len=1+n_lags,
                     nunits_lstm=32, nunits_dense=8),
    device=torch.device('cpu'),
    learning_rate=1e-4,
    batch_size=5000,
    gamma=0.9,
    gae_lambda=0.95,
    n_target_updates=1,
    n_steps_per_update=50,
    obs_nlags=n_lags,
    obs_expand_axis=0,
    obs_concat_axis=0,
    obs_padding='zeros'
)

parallel_sampler = ParallelSampler(env=envs,
                                   obs_nlags=n_lags,
                                   obs_expand_axis=0,
                                   obs_concat_axis=0,
                                   gpus_share=0,
                                   obs_padding='zeros')

for step in tqdm(range(n_steps)):
    actor.set_critic(None)
    actor.set_device(torch.device('cpu'))
    train_batch = parallel_sampler.sample(actor=actor,
                                          n_transitions=5000,
                                          max_path_length=None,
                                          reset_when_not_done=False,
                                          train_sampling=True)
    actor.set_device(torch.device('cpu'))
    actor.set_critic(critic)
    critic_logs = critic.update(train_batch)
    actor_logs = actor.update(train_batch)

The training process is too slow for such a simple gym environment; is there room for improvement?

@1900360
Author

1900360 commented Oct 6, 2022

Waiting for your answer, I'm very interested in this :D

@nslyubaykin
Owner

Hi @1900360!

I am not sure I understand correctly what you mean by slow training. Is the convergence itself slower, or is it computationally slower? And the second question: what do you mean by room for improvement?

@1900360
Author

1900360 commented Oct 6, 2022

Hi @nslyubaykin!
Sorry for my unclear statement. By 'slow training' I mean not only convergence but also the computing resources, which are mainly spent on these steps:

critic_logs = critic.update(train_batch)
actor_logs = actor.update(train_batch)
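
To attribute the cost between the two calls, here is a quick timing sketch that uses only the names already defined above (time is from the standard library):

import time

t0 = time.perf_counter()
critic_logs = critic.update(train_batch)  # critic (GAE) fit on the sampled batch
t1 = time.perf_counter()
actor_logs = actor.update(train_batch)    # PPO policy update on the same batch
t2 = time.perf_counter()
print(f'critic.update: {t1 - t0:.2f}s, actor.update: {t2 - t1:.2f}s')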

I have used the parameters from parallel_ppo, but I still get this reward curve, so I wonder whether my parameter settings are correct:
[image: reward curve]

@1900360
Author

1900360 commented Oct 8, 2022

Do you have any ideas? I haven't figured anything out, since I'm new to DRL :)

@nslyubaykin
Owner

Hi @1900360!

The reason for the slower computation is that your policy is now dealing with larger observations (obs_dim*n_lags) and, other things being equal, has more parameters (the new architecture may also affect this). Plus, there is some minor computational overhead from creating and processing lags.

Regarding training divergence, one option is that you just need to find the right set of hyper-parameters for this new architecture (which can be found only by trial and error). The other option is that performance in this environment is simply harmed by introducing lags. In my experience, when an environment is already fully observable, adding lags may hurt performance by introducing redundant information into the observation.
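
For example, here is a minimal sketch of that second option, reusing the config from your snippet above (the one assumption here is that n_lags = 0 is a valid setting that disables lagging entirely):

# Sketch: Pendulum-v0 is fully observable, so try training without lags.
# Assumes n_lags=0 is accepted; seq_len = 1 + n_lags then collapses to 1
# and obs_nlags=0 feeds only the current observation to the networks.
n_lags = 0

actor = PPO(
    policy_net=NormalLSTM(obs_dim, acs_dim, nlayers_lstm=2,
                          seq_len=1+n_lags,  # single-step sequences
                          nunits_lstm=32, nunits_dense=8,
                          out_activation=torch.nn.Identity(),
                          init_log_std=-1.5),
    device=torch.device('cpu'),
    learning_rate=1e-4,
    n_epochs_per_update=50,
    batch_size=5000,
    target_kl=0.2,
    eps=0.2,
    gamma=0.9,
    obs_nlags=n_lags,  # no lagged observations are concatenated
    obs_expand_axis=0,
    obs_concat_axis=0,
    obs_padding='zeros',
    standardize_advantages=True,
    weight_decay=0.0
)

# The critic and parallel_sampler configs stay exactly as above, since
# they derive seq_len and obs_nlags from the same n_lags variable.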

Also, using

out_activation=torch.nn.Tanh()
acs_scale=2

with NormalLSTM is preferable for that task.
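
Applied to the policy net from the config above, that change would look like this (a sketch; Pendulum-v0 actions lie in [-2, 2], so a Tanh output scaled by 2 covers the full action range):

policy_net=NormalLSTM(obs_dim, acs_dim, nlayers_lstm=2,
                      seq_len=1+n_lags,
                      nunits_lstm=32, nunits_dense=8,
                      out_activation=torch.nn.Tanh(),  # squash mean actions to [-1, 1]
                      acs_scale=2,                     # rescale to Pendulum's [-2, 2]
                      init_log_std=-1.5)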
