Implementation of advantage function #4476
Unanswered
gauss-clb
asked this question in
Community | Q&A
Replies: 0 comments
https://github.com/hpcaitech/ColossalAI/blob/main/applications/Chat/coati/experience_maker/naive.py#L52

Why does `value` use only the prompt part (https://github.com/hpcaitech/ColossalAI/blob/main/applications/Chat/coati/models/base/critic.py#L49), while `r` uses prompt + response?

Why is `reward = r - self.kl_coef * kl_divergence(action_log_probs, base_action_log_probs)`? Is there any theory to support it?
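For concreteness, the reward expression in question can be sketched as below. This is a pure-Python illustration under stated assumptions, not the coati code: `approx_kl` and `kl_penalized_reward` are hypothetical helper names, and the KL term is assumed to be the masked mean of the per-token log-probability difference between the actor and the base model over the response tokens. The actual `kl_divergence` in the repository may use a different estimator.

```python
from typing import Sequence


def approx_kl(
    log_probs: Sequence[float],
    base_log_probs: Sequence[float],
    action_mask: Sequence[int],
) -> float:
    """Masked mean of (log_probs - base_log_probs).

    Assumption: this simple estimator stands in for the repository's
    kl_divergence; only positions with mask == 1 (response tokens) count.
    """
    total = sum(
        (lp - blp) * m
        for lp, blp, m in zip(log_probs, base_log_probs, action_mask)
    )
    count = sum(action_mask)
    return total / count if count else 0.0


def kl_penalized_reward(
    r: float,
    kl_coef: float,
    log_probs: Sequence[float],
    base_log_probs: Sequence[float],
    action_mask: Sequence[int],
) -> float:
    """Return r - kl_coef * KL estimate, mirroring the expression in the question."""
    if kl_coef <= 0.0:
        # No penalty requested: the reward-model score is used unchanged.
        return r
    return r - kl_coef * approx_kl(log_probs, base_log_probs, action_mask)


# Hypothetical numbers: one prompt token (masked out) and two response tokens.
reward = kl_penalized_reward(
    r=2.0,
    kl_coef=0.1,
    log_probs=[-9.0, -1.0, -0.5],
    base_log_probs=[0.0, -1.5, -1.0],
    action_mask=[0, 1, 1],
)
```

In this sketch the penalty grows as the actor's log-probabilities on the response drift away from the base model's, so a larger `kl_coef` lowers the reward more for the same drift.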