This is the second project of the Deep Reinforcement Learning course at Udacity. The objective of the project is to keep a double-jointed arm at the goal position for as long as possible.
For this project we worked with the second version of the Reacher environment provided by Unity, which contains 20 identical agents, each with its own copy of the environment. In this setting, the agents must obtain an average score of +30 (over 100 consecutive episodes, averaged over all agents).
In other words, the environment is considered solved when the average (over 100 episodes) of those average scores is at least +30.
At the heart of this approach lies an actor-critic method. Policy-gradient methods like REINFORCE use Monte-Carlo estimates of the return and as a result exhibit high variance. Value-based approaches using Temporal-Difference (TD) estimates display low variance, at the cost of some bias. Actor-critic methods combine the two approaches and extract the best of both worlds: the critic's TD estimate makes training more stable than pure policy-gradient methods, while fewer samples are needed than with Monte-Carlo estimates.
An actor-critic method contains two neural networks, one for the actor and one for the critic. The actor's role is to update the policy, which is then evaluated by the critic; the critic's feedback in turn trains the actor towards a good policy.
To update the policy in vanilla policy-gradient methods we do the following:
- Accumulate rewards over the episode.
- Average them to estimate the expected return.
- Calculate the gradient.
- Perform a gradient step (ascent on the expected return, or descent on its negative) to update the policy.
In actor-critic methods we instead use the value provided by the critic to update the actor's policy, as in the sketch below.
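A minimal PyTorch sketch of the difference (the tensor names here are illustrative, not taken from the project code): the vanilla policy-gradient loss weights each log-probability by the sampled Monte-Carlo return, while the actor-critic loss weights it by the critic's estimate.

```python
import torch

def vanilla_pg_loss(log_probs, returns):
    # REINFORCE: weight each log-probability by the sampled Monte-Carlo return.
    return -(log_probs * returns).mean()

def actor_critic_loss(log_probs, values):
    # Actor-critic: the critic's value estimate replaces the Monte-Carlo return.
    # detach() keeps the actor update from back-propagating into the critic.
    return -(log_probs * values.detach()).mean()

# Toy usage with random numbers, just to show the shapes involved.
log_probs = torch.randn(5)   # log pi(a_t | s_t) for 5 sampled steps
returns = torch.randn(5)     # Monte-Carlo returns G_t
values = torch.randn(5)      # critic estimates for the same steps
print(vanilla_pg_loss(log_probs, returns), actor_critic_loss(log_probs, values))
```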
Deep Deterministic Policy Gradient (DDPG) is the variant of the actor-critic method we used to solve the above environment. Here the actor outputs a deterministic policy, which is evaluated by the critic. Note that some actor-critic variants use stochastic policies instead. We update the critic using the TD error, and the actor learns using the following deterministic policy gradient:
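```math
\nabla_{\theta^{\mu}} J \approx \mathbb{E}_{s}\left[ \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{a=\mu(s \mid \theta^{\mu})} \; \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \right]
```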
In the above equation the actor is represented by the parametrized function $\mu(s \mid \theta^{\mu})$ and $Q(s, a \mid \theta^{Q})$ is the critic.
The above update step is run for each agent (20 in this case) at regular intervals; a rough sketch of this loop is given below. A few more techniques were also incorporated to stabilize the training and are described next.
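The sketch below assumes a hypothetical `agent` object with `act`, `buffer` and `learn` members and a simplified environment wrapper; the actual project code may organize this differently, and the constants are illustrative.

```python
MAX_STEPS = 1000      # maximum environment steps per episode (illustrative)
UPDATE_EVERY = 20     # environment steps between learning phases (illustrative)
NUM_UPDATES = 10      # update steps per learning phase (illustrative)

def run_episode(env, agent):
    states = env.reset()                                  # one state per agent (20 here)
    scores = [0.0] * len(states)
    for t in range(1, MAX_STEPS + 1):
        actions = agent.act(states)                       # actor picks one action per agent
        next_states, rewards, dones = env.step(actions)
        # store every agent's experience tuple in the shared replay buffer
        for exp in zip(states, actions, rewards, next_states, dones):
            agent.buffer.add(*exp)
        scores = [s + r for s, r in zip(scores, rewards)]
        # at regular intervals, run several update steps using samples from the buffer
        if t % UPDATE_EVERY == 0:
            for _ in range(NUM_UPDATES):
                agent.learn()
        states = next_states
        if any(dones):
            break
    return scores
```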
The use of fixed target networks was introduced with Deep Q-Networks (DQN) and proved to be quite useful for learning and updating weights. Both the actor and the critic have two neural networks (a local and a target). As with DQN, we freeze the target network and learn the weights of the local model.
In DQN, we update the target weights after every $C$ steps. In DDPG we update at every learning step, but in a soft manner, i.e.

```math
w_t \leftarrow (1 - \tau)\, w_t + \tau\, w_l
```

where $w_t$ is a weight of the target network, $w_l$ is the corresponding weight of the local network, and $\tau = 0.001$ (so roughly 99.9% of the target weight is retained and only 0.1% comes from the local weight). This way most of the target network weights are retained and the target moves slowly.
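A short PyTorch-style sketch of this soft update (the helper name is illustrative, though the pattern is common in DDPG implementations):

```python
def soft_update(local_model, target_model, tau=1e-3):
    """Blend the local network's weights into the target network:
    target <- (1 - tau) * target + tau * local
    """
    for target_param, local_param in zip(target_model.parameters(),
                                         local_model.parameters()):
        target_param.data.copy_((1.0 - tau) * target_param.data
                                + tau * local_param.data)
```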
In most RL techniques, adjacent states (or experience tuples) are strongly correlated, so our model might get biased towards a certain behaviour (or path) and never learn other paths. Hence it is important that our samples are approximately independent. To achieve this, we maintain a replay buffer of experience tuples. After a fixed number of steps we randomly sample a batch of experiences from this buffer, calculate the loss and update the parameters. This breaks the correlations between adjacent tuples and stabilizes learning. At the same time, it lets us re-use experience instead of running through the environment again.
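A minimal replay-buffer sketch along these lines (illustrative, not the exact project implementation; the sizes follow the hyperparameter table below):

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience",
                        ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Fixed-size buffer that stores experience tuples and samples them uniformly."""

    def __init__(self, buffer_size=int(1e6), batch_size=1024, seed=0):
        self.memory = deque(maxlen=buffer_size)   # oldest experiences are discarded first
        self.batch_size = batch_size
        random.seed(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # uniform random sampling breaks the correlation between adjacent tuples
        return random.sample(self.memory, k=self.batch_size)

    def __len__(self):
        return len(self.memory)
```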
I used PyTorch, an open-source deep learning framework by Facebook, to implement the neural networks, and the Udacity AWS workspace for GPU training. Refer to the README for other runtime dependencies.
The Slack channel suggested a few ideas for improving the model, one of which was to use Batch Normalization. I applied it to the first layer of both the actor and the critic, which enhanced the learning process. Leaky ReLU (with a slope of 0.01) also proved to work much better than ReLU.
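As an illustration, an actor network along these lines might look like the following PyTorch sketch (layer sizes follow the hyperparameter table below; the exact architecture may differ from the project code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Maps states to deterministic actions in [-1, 1]."""

    def __init__(self, state_size, action_size, fc1_units=256, fc2_units=128):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.bn1 = nn.BatchNorm1d(fc1_units)      # batch normalization on the first layer
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.fc3 = nn.Linear(fc2_units, action_size)

    def forward(self, state):
        x = F.leaky_relu(self.bn1(self.fc1(state)), negative_slope=0.01)
        x = F.leaky_relu(self.fc2(x), negative_slope=0.01)
        return torch.tanh(self.fc3(x))            # actions are bounded in [-1, 1]
```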
Note: workspace_utils, provided in the Udacity forums, helped me keep the workspace awake during training.
Following is the list of hyperparameters that we used during training:
Parameter | Value |
---|---|
Max Episodes | 1000 |
First Hidden Layer | 256 |
Second Hidden Layer | 128 |
Third Hidden Layer (Critic) | 128 |
Leaky ReLU (slope) | 0.01 |
Replay Buffer Size | 1000000 |
Mini Batch Size | 1024 |
Discount Rate | 0.99 |
Tau (Soft update parameter) | 0.001 |
Learning Rate (Actor) | 0.0001 |
Learning Rate (Critic) | 0.0003 |
Weight Decay | 0.0001 |
You can find the model weights for the actor and critic in the Results folder.
The environment was solved in 318 episodes with an average score of 30.07. Learning was slow in the beginning and picked up during the later stages. However, the DDPG updates were much smoother than the earlier approaches we've worked with, as is evident from the graph below.
Batch Normalization could be one of the reasons for the smoothness of DDPG. Increasing the epoch size for each episode didn't alter the performance much, and altering the learning rates didn't influence training much either.
- Using Prioritized Experience Replay (paper) has generally been shown to be quite useful, so we could give it a try.
- Other algorithms like TRPO, PPO, A3C and A2C that were discussed in the course could potentially lead to better results as well.
- We could try the Q-Prop algorithm, which combines both off-policy and on-policy learning.
- Clipping gradients and (hard) initializing the critic's local and target networks with the same set of weights might yield better performance.
- Regularization and optimization techniques like dropout, early stopping and warm restarts could reduce the training time.