lstm+ppo cannot converge in Pendulum-v0 environment #2

Open
1900360 opened this issue Oct 2, 2022 · 7 comments

Comments

@1900360

1900360 commented Oct 2, 2022

Hi @nslyubaykin
LSTM+PPO cannot converge in the Pendulum-v0 environment. I don't know whether there is a setting error in my code; could you check it for a moment?
The reward curve is shown below:
[image: reward curve]
lstm_parallel_ppo.txt

@nslyubaykin
Owner

Hi @1900360!

Did the parameters mentioned in this issue help for that task?

@1900360
Author

1900360 commented Oct 5, 2022

Hi @nslyubaykin
Sure, here are the parameters:

actor = PPO(
    policy_net=NormalLSTM(obs_dim, acs_dim, nlayers_lstm=2,
                          seq_len=1+n_lags,
                          nunits_lstm=32, nunits_dense=8,
                          out_activation=torch.nn.Identity(),
                          init_log_std=-1.5),
    device=torch.device('cpu'),
    learning_rate=1e-4,
    n_epochs_per_update=50,
    batch_size=5000,
    target_kl=0.2,
    eps=0.2,
    gamma=0.9,
    obs_nlags=n_lags,
    obs_expand_axis=0,
    obs_concat_axis=0,
    obs_padding='zeros',
    standardize_advantages=True,
    weight_decay=0.0
)

critic = GAE(
    critic_net=VLSTM(obs_dim, nlayers_lstm=2,
                     seq_len=1+n_lags,
                     nunits_lstm=32, nunits_dense=8),
    device=torch.device('cpu'),
    learning_rate=1e-4,
    batch_size=5000,
    gamma=0.9,
    gae_lambda=0.95,
    n_target_updates=1,
    n_steps_per_update=50,
    obs_nlags=n_lags,
    obs_expand_axis=0,
    obs_concat_axis=0,
    obs_padding='zeros'
)

parallel_sampler = ParallelSampler(env=envs,
                                   obs_nlags=n_lags,
                                   obs_expand_axis=0,
                                   obs_concat_axis=0,
                                   gpus_share=0,
                                   obs_padding='zeros')

for step in tqdm(range(n_steps)):
    actor.set_critic(None)
    actor.set_device(torch.device('cpu'))
    train_batch = parallel_sampler.sample(actor=actor,
                                          n_transitions=5000,
                                          max_path_length=None,
                                          reset_when_not_done=False,
                                          train_sampling=True)
    actor.set_device(torch.device('cpu'))
    actor.set_critic(critic)
    critic_logs = critic.update(train_batch)
    actor_logs = actor.update(train_batch)

The training process is too slow for such a simple gym environment; is there room for improvement?

@1900360
Author

1900360 commented Oct 6, 2022

Waiting for your answer, I'm very interested in this :D

@nslyubaykin
Owner

Hi @1900360!

I am not sure I understand correctly what you mean by slow training. Is the convergence itself slower, or is it computationally slower? And the second question: what do you mean by room for improvement?

@1900360
Author

1900360 commented Oct 6, 2022

Hi @nslyubaykin!
Sorry for my unclear statement. By 'slow training' I mean not only convergence but also the computing resources, which are mainly spent on these steps:

critic_logs = critic.update(train_batch)
actor_logs = actor.update(train_batch)
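
To attribute the cost between the two calls, here is a quick timing sketch that uses only the names already defined above (time is from the standard library):

import time

t0 = time.perf_counter()
critic_logs = critic.update(train_batch)  # critic (GAE) fit on the sampled batch
t1 = time.perf_counter()
actor_logs = actor.update(train_batch)    # PPO policy update on the same batch
t2 = time.perf_counter()
print(f'critic.update: {t1 - t0:.2f}s, actor.update: {t2 - t1:.2f}s')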

I have used the parameters from parallel_ppo, but I still get this reward curve, so I wonder whether my parameter settings are correct:
[image: reward curve]

@1900360
Author

1900360 commented Oct 8, 2022

Do you have any ideas? I haven't figured anything out, since I'm new to DRL :)

@nslyubaykin
Owner

Hi @1900360!

The reason for the slower computation is that your policy is now dealing with larger observations (obs_dim*n_lags) and, other things being equal, has more parameters (the new architecture may also affect this). Plus, there is some minor computational overhead from creating and processing lags.

Regarding training divergence, one option is that you just need to find the right set of hyper-parameters for this new architecture (which can be found only by trial and error). The other option is that performance in this environment is simply harmed by introducing lags. In my experience, when an environment is already fully observable, adding lags may hurt performance by introducing redundant information into the observation.
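
For example, here is a minimal sketch of that second option, reusing the config from your snippet above (the one assumption here is that n_lags = 0 is a valid setting that disables lagging entirely):

# Sketch: Pendulum-v0 is fully observable, so try training without lags.
# Assumes n_lags=0 is accepted; seq_len = 1 + n_lags then collapses to 1
# and obs_nlags=0 feeds only the current observation to the networks.
n_lags = 0

actor = PPO(
    policy_net=NormalLSTM(obs_dim, acs_dim, nlayers_lstm=2,
                          seq_len=1+n_lags,  # single-step sequences
                          nunits_lstm=32, nunits_dense=8,
                          out_activation=torch.nn.Identity(),
                          init_log_std=-1.5),
    device=torch.device('cpu'),
    learning_rate=1e-4,
    n_epochs_per_update=50,
    batch_size=5000,
    target_kl=0.2,
    eps=0.2,
    gamma=0.9,
    obs_nlags=n_lags,  # no lagged observations are concatenated
    obs_expand_axis=0,
    obs_concat_axis=0,
    obs_padding='zeros',
    standardize_advantages=True,
    weight_decay=0.0
)

# The critic and parallel_sampler configs stay exactly as above, since
# they derive seq_len and obs_nlags from the same n_lags variable.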

Also, using

out_activation=torch.nn.Tanh()
acs_scale=2

with NormalLSTM is preferable for that task.
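
Applied to the policy net from the config above, that change would look like this (a sketch; Pendulum-v0 actions lie in [-2, 2], so a Tanh output scaled by 2 covers the full action range):

policy_net=NormalLSTM(obs_dim, acs_dim, nlayers_lstm=2,
                      seq_len=1+n_lags,
                      nunits_lstm=32, nunits_dense=8,
                      out_activation=torch.nn.Tanh(),  # squash mean actions to [-1, 1]
                      acs_scale=2,                     # rescale to Pendulum's [-2, 2]
                      init_log_std=-1.5)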
