How "trajectory divergence term" is calculated in compute_cost function in algorithm_traj_opt.py #98

Open
wangsd01 opened this issue Nov 20, 2017 · 2 comments


wangsd01 commented Nov 20, 2017

Could you point me to a reference for this part of the code? Thank you!

def compute_costs(self, m, eta, augment=True):
    """ Compute cost estimates used in the LQR backward pass. """
    traj_info, traj_distr = self.cur[m].traj_info, self.cur[m].traj_distr
    if not augment:  # Whether to augment cost with term to penalize KL
        return traj_info.Cm, traj_info.cv

    multiplier = self._hyperparams['max_ent_traj']
    fCm, fcv = traj_info.Cm / (eta + multiplier), traj_info.cv / (eta + multiplier)
    K, ipc, k = traj_distr.K, traj_distr.inv_pol_covar, traj_distr.k

    # Add in the trajectory divergence term.
    for t in range(self.T - 1, -1, -1):
        fCm[t, :, :] += eta / (eta + multiplier) * np.vstack([
            np.hstack([
                K[t, :, :].T.dot(ipc[t, :, :]).dot(K[t, :, :]),
                -K[t, :, :].T.dot(ipc[t, :, :])
            ]),
            np.hstack([
                -ipc[t, :, :].dot(K[t, :, :]), ipc[t, :, :]
            ])
        ])
        fcv[t, :] += eta / (eta + multiplier) * np.hstack([
            K[t, :, :].T.dot(ipc[t, :, :]).dot(k[t, :]),
            -ipc[t, :, :].dot(k[t, :])
        ])

    return fCm, fcv

wangsd01 (Author) commented

This part adds the divergence between the predicted trajectory and the sampled trajectory as an additional cost,
i.e. (Kx + k - u).T * inverse_policy_variance_matrix * (Kx + k - u), where
u is the sampled action from the data and
Kx + k is the predicted action from the global policy network.
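
As a quick numerical sanity check of this reading, the sketch below compares the blocks added to `fCm[t]` and `fcv[t]` against the quadratic form directly. It is not code from the repository: the dimensions `dX` and `dU`, the random values, and the helper names `quadratic_form` and `cost_from_blocks` are made up for illustration, the `eta / (eta + multiplier)` scaling is left out, and it assumes the cost convention `0.5 * z.T Cm z + z.T cv + const` with `z = [x; u]`. Under that assumption the blocks correspond to one half of the quadratic form above.

```python
import numpy as np

rng = np.random.default_rng(0)
dX, dU = 3, 2  # hypothetical state and action dimensions

K = rng.standard_normal((dU, dX))    # feedback gain
k = rng.standard_normal(dU)          # feedforward term
A = rng.standard_normal((dU, dU))
ipc = A.dot(A.T) + dU * np.eye(dU)   # symmetric positive-definite inverse covariance

# Blocks added to fCm[t] and fcv[t] in compute_costs (eta scaling omitted).
Cm = np.vstack([
    np.hstack([K.T.dot(ipc).dot(K), -K.T.dot(ipc)]),
    np.hstack([-ipc.dot(K), ipc]),
])
cv = np.hstack([K.T.dot(ipc).dot(k), -ipc.dot(k)])


def quadratic_form(x, u):
    """0.5 * (Kx + k - u)^T ipc (Kx + k - u)."""
    d = K.dot(x) + k - u
    return 0.5 * d.dot(ipc).dot(d)


def cost_from_blocks(x, u):
    """0.5 * z^T Cm z + z^T cv + const, with z = [x; u] (assumed convention)."""
    z = np.concatenate([x, u])
    const = 0.5 * k.dot(ipc).dot(k)  # term independent of x and u
    return 0.5 * z.dot(Cm).dot(z) + z.dot(cv) + const


x, u = rng.standard_normal(dX), rng.standard_normal(dU)
difference = abs(quadratic_form(x, u) - cost_from_blocks(x, u))
print("absolute difference:", difference)
```

If the two expressions agree up to rounding error for arbitrary `x` and `u`, the matrix block holds the second-order part of the penalty and the vector block holds its first-order part evaluated at `x = 0, u = 0`.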

WilsonWangTHU commented

@wangsd01 Hi, I am also looking at these lines. Have you solved the problem?

I am not 100% sure what's happening, but one thing that looks especially suspicious to me is that the derivative with respect to u is Cov^{-1}.dot(k_old).

In the code repo, the forward pass uses u = Kx + k rather than u = K(x - x_old) + k + u_old.
So if we actually take the derivative of the KL penalty with respect to u, I would expect something like
Cov^{-1}.dot(u_new - u_old) = Cov^{-1}.dot(K_new.dot(x) - K_old.dot(x) + k_new - k_old) != Cov^{-1}.dot(k_old).

Not sure if I missed anything. It would be great if you could help :( @cbfinn
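
One possible way to test this concern numerically is sketched below. It is only a sketch under stated assumptions, not a confirmed answer from the maintainers: it assumes the penalty is `0.5 * (u - K_old x - k_old)^T Cov^{-1} (u - K_old x - k_old)`, that the cost is stored as `0.5 * z.T Cm z + z.T cv` with `z = [x; u]`, and that `fcv` therefore holds only the constant part of the gradient (its value at `x = 0, u = 0`). The dimensions, random values, and the helper `kl_penalty` are hypothetical. Under those assumptions, the full gradient with respect to u is assembled from the u-rows of both the matrix and the vector term, and it can be compared against a finite-difference gradient.

```python
import numpy as np

rng = np.random.default_rng(1)
dX, dU = 3, 2  # hypothetical state and action dimensions

K_old = rng.standard_normal((dU, dX))  # previous feedback gain
k_old = rng.standard_normal(dU)        # previous feedforward term
A = rng.standard_normal((dU, dU))
ipc = A.dot(A.T) + dU * np.eye(dU)     # stands in for Cov^{-1}


def kl_penalty(x, u):
    """Assumed penalty: 0.5 * (u - K_old x - k_old)^T ipc (u - K_old x - k_old)."""
    d = u - K_old.dot(x) - k_old
    return 0.5 * d.dot(ipc).dot(d)


# u-rows of the blocks added in compute_costs (eta scaling omitted).
Cm_ux = -ipc.dot(K_old)
Cm_uu = ipc
cv_u = -ipc.dot(k_old)

x, u = rng.standard_normal(dX), rng.standard_normal(dU)

# Gradient w.r.t. u assembled from the quadratic and linear blocks.
grad_from_blocks = Cm_uu.dot(u) + Cm_ux.dot(x) + cv_u

# Central finite-difference gradient of the penalty w.r.t. u.
eps = 1e-6
grad_numeric = np.zeros(dU)
for i in range(dU):
    e = np.zeros(dU)
    e[i] = eps
    grad_numeric[i] = (kl_penalty(x, u + e) - kl_penalty(x, u - e)) / (2 * eps)

max_abs_diff = np.max(np.abs(grad_from_blocks - grad_numeric))
print("max abs difference:", max_abs_diff)
```

If the two gradients agree, then under these assumptions the `-ipc.dot(k)` entry is not the whole derivative with respect to u; the terms that depend on x and u enter through the matrix block, giving `Cov^{-1}.dot(u - K_old.dot(x) - k_old)` overall.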
