[Feature Request] Implement Recurrent SAC #201
Comments
Hello, make sure to read the contributing guide carefully. For benchmarking, it would be best to use the "NoVel" envs that are available in the RL Zoo (see https://wandb.ai/sb3/no-vel-envs/reports/PPO-vs-RecurrentPPO-aka-PPO-LSTM-on-environments-with-masked-velocity-SB3-Contrib---VmlldzoxOTI4NjE4).
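For context, the "NoVel" environments simply mask the velocity components of the observation, which makes the tasks partially observable and therefore a good benchmark for recurrent policies. A minimal sketch of that idea follows; the RL Zoo ships its own wrappers and env registrations, so the `MaskVelocityWrapper` name and the indices below are only illustrative assumptions:

```python
import gymnasium as gym
import numpy as np


class MaskVelocityWrapper(gym.ObservationWrapper):
    """Zero out the velocity entries of the observation (illustrative sketch)."""

    def __init__(self, env: gym.Env, velocity_indices):
        super().__init__(env)
        self.velocity_indices = np.asarray(velocity_indices)

    def observation(self, obs):
        obs = np.array(obs, copy=True)
        obs[self.velocity_indices] = 0.0
        return obs


# Example: Pendulum-v1 observations are [cos(theta), sin(theta), theta_dot],
# so masking index 2 hides the angular velocity.
env = MaskVelocityWrapper(gym.make("Pendulum-v1"), velocity_indices=[2])
```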
Thanks for the references. I will check them out and come back.
Just a quick update: I plan to do this by the end of 2023 when I have some free time. Currently I have three higher-priority projects.
Status update:

I've got these results on
It took about 20 hours to compute per run. Perhaps now this
Hello, did you also manage to solve the mountain car problem?
I believe so, yes. Let me render the env to verify, since the rewards are not the same for MountainCarContinuousNoVel-v0 (continuous action space) and MountainCar-v0 (discrete action space).
Loosely speaking, here they are:

As you can see, RSAC_S shares the RNN state between the actor and the critic, but only the actor can change the RNN state, whereas in RSAC the actor and critics have their own RNN states.
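To make the difference concrete, here is a minimal sketch of the two layouts; the class and method names are my own illustration (plain PyTorch, not the actual SB3-Contrib code):

```python
import torch as th
import torch.nn as nn


class SharedStateRSAC(nn.Module):
    """RSAC_S layout: one LSTM (and hidden state) shared by actor and critic,
    but only the actor's forward pass is allowed to advance that state."""

    def __init__(self, obs_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.actor_head = nn.Linear(hidden, action_dim)
        self.critic_head = nn.Linear(hidden + action_dim, 1)

    def actor(self, obs, state):
        features, new_state = self.lstm(obs, state)
        return self.actor_head(features), new_state  # the shared state is updated here

    def critic(self, obs, action, state):
        # The critic reads the shared state but discards the updated one
        features, _ = self.lstm(obs, state)
        return self.critic_head(th.cat([features, action], dim=-1))


class SeparateStateRSAC(nn.Module):
    """RSAC layout: actor and critic each own an LSTM and its hidden state."""

    def __init__(self, obs_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.actor_lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.critic_lstm = nn.LSTM(obs_dim + action_dim, hidden, batch_first=True)
        self.actor_head = nn.Linear(hidden, action_dim)
        self.critic_head = nn.Linear(hidden, 1)

    def actor(self, obs, actor_state):
        features, new_actor_state = self.actor_lstm(obs, actor_state)
        return self.actor_head(features), new_actor_state

    def critic(self, obs, action, critic_state):
        features, new_critic_state = self.critic_lstm(th.cat([obs, action], dim=-1), critic_state)
        return self.critic_head(features), new_critic_state
```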
Thanks, that's similar to what is implemented for PPO: stable-baselines3-contrib/sb3_contrib/common/recurrent/policies.py, lines 238 to 247 (at commit 588c6bd).
Update: I just rendered
I can help you with that: the continuous version has a deceptive reward and needs quite some exploration noise. EDIT: working hyperparameters: https://github.com/DLR-RM/rl-baselines3-zoo/blob/8cecab429726d7e6aaebd261d26ed8fc23b7d948/hyperparams/sac.yml#L2 (note: the gSDE exploration is important there, otherwise a high OU noise would work too).
Thanks, I'll check those hyperparameters.
Indeed, having use_sde=True seems to help solve it.

Edit: I also tried nearby hyperparameters, and indeed the gSDE contribution seems to be non-negligible.
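For anyone reproducing this, enabling gSDE in SB3's SAC is a one-line change; the values below are only illustrative, the exact tuned hyperparameters are in the sac.yml linked above:

```python
from stable_baselines3 import SAC

# Illustrative only: see the linked sac.yml in the RL Zoo for the tuned hyperparameters.
model = SAC(
    "MlpPolicy",
    "MountainCarContinuous-v0",
    use_sde=True,        # generalized state-dependent exploration (gSDE)
    sde_sample_freq=-1,  # sample the exploration noise matrix only at the start of a rollout
    verbose=1,
)
model.learn(total_timesteps=50_000)
```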
What matters is the consistent exploration. To solve this task, you need to build up momentum; having a bang-bang-like strategy is one way (this is discussed a bit more in the first version of the paper: https://arxiv.org/pdf/2005.05719v1.pdf).

I did a full hyperparameter search, and with gSDE many configurations work (more than half of those tested): https://github.com/DLR-RM/rl-baselines3-zoo/blob/sde/logs/report_sde_MountainCarContinuous-v0_500-trials-50000-tpe-median_1581693633.csv
I am currently checking the two strategies for RNN state initialization proposed in the R2D2 paper (stored state and burn-in).
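For reference, a minimal sketch of how those two strategies combine at training time; the function and its arguments are my own illustration and assume a torch.nn.LSTM with batch_first=True:

```python
import torch as th
import torch.nn as nn


def unroll_with_burn_in(lstm: nn.LSTM, obs_seq: th.Tensor, stored_state, burn_in: int):
    """Start from the hidden state stored at collection time ("stored state"),
    unroll the first `burn_in` steps without gradients to refresh it,
    and backpropagate only through the remaining steps (illustrative sketch)."""
    with th.no_grad():
        _, refreshed_state = lstm(obs_seq[:, :burn_in], stored_state)
    # Gradients flow only through the post-burn-in part of the sequence
    features, final_state = lstm(obs_seq[:, burn_in:], refreshed_state)
    return features, final_state
```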
So far I've got this: a recurrent replay buffer with overlapping chunks supporting the SB3 interface. I also wrote a specification (test) to reduce future surprises: https://gist.github.com/masterdezign/47b3c6172dd1624bb9a7ef23cbc79c8c The limitation is
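To illustrate what "overlapping chunks" means here (the linked gist is the authoritative specification; the helper below is only a sketch of the indexing idea):

```python
import numpy as np


def chunk_starts(episode_len: int, chunk_len: int, overlap: int) -> np.ndarray:
    """Start indices of fixed-length chunks that overlap by `overlap` steps,
    as one might lay them out in a recurrent replay buffer (illustrative)."""
    stride = chunk_len - overlap
    return np.arange(0, max(episode_len - chunk_len, 0) + 1, stride)


# An episode of 100 steps, chunks of 20 steps overlapping by 10:
print(chunk_starts(100, 20, 10))  # [ 0 10 20 30 40 50 60 70 80]
```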
Hi! I didn't obtain good results, and then I had to put the project on hold. I plan to resume work on it starting tomorrow.
🚀 Feature
Hi!
I would like to implement a recurrent soft actor-critic. Is it a sensible contribution?
Motivation
I actually need this algorithm in my projects.
Pitch
The sb3 ecosystem would benefit from yet another algorithm. As a new contributor, I might need a little guidance though.
Alternatives
An alternative would be another off-policy algorithm using LSTM.
Additional context
No response
Checklist