“When putting reinforcement learning in the realm of large language models, the environment distribution and the output distribution of the policy model π RL(y|x) are identical. It means that the distribution of the environment shifts as π RL(y|x) is optimized.” I don't quite understand this passage. In RLHF, the SFT model is the agent, so shouldn't the environment refer to the reward model? Here, "environment distribution" seems to mean the distribution of the responses generated by the SFT model (if I have understood correctly). Shouldn't that be called the action distribution instead?