
Utilization of negative samples #2

Open
HillZhang1999 opened this issue Jun 25, 2024 · 1 comment
Labels: about dataset (datasets of PRM and policy model)

Comments

@HillZhang1999

Dear authors:
First of all, I appreciate your engaging and informative work! I have a question regarding your research: I noticed that you utilize only positive samples for SFT when enhancing the policy models. Have you considered also incorporating negative samples, for example through methods such as DPO?
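For context, this is the setup the question refers to: SFT trains the policy with next-token cross-entropy over positive (correct) traces only, so negative samples never contribute a gradient. A minimal sketch under the usual causal-LM conventions (the function name and the -100 ignore-index are illustrative, not the repository's actual training code):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Plain SFT objective: next-token cross-entropy over positive
    (correct) solution traces only; negative samples are simply unused.

    logits: (batch, seq, vocab); labels: (batch, seq), with -100 marking
    positions to ignore (e.g., prompt tokens).
    """
    # Shift by one so the model predicts token t+1 from tokens up to t.
    shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    shift_labels = labels[:, 1:].reshape(-1)
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
```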

@zhangdan0602
Collaborator

Thank you for your question! We did in fact reproduce the Self-Rewarding baseline, which runs DPO on negative samples drawn from the LLM's own judgments.
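For readers unfamiliar with how DPO brings negative samples into training: it contrasts a chosen (positive) and a rejected (negative) response against a frozen reference model. A minimal sketch of the standard DPO objective (Rafailov et al., 2023), assuming the summed per-token log-probabilities are already computed; the function name and the beta value are illustrative, not taken from this repository:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss. Each input is a 1-D tensor of summed per-token
    log-probs of a response under the policy / frozen reference model;
    "chosen" is the positive sample, "rejected" the negative one."""
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # The negative sample enters through the rejected term: the objective
    # widens the implicit reward margin between chosen and rejected.
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()
```

Unlike the SFT loss above, the rejected response contributes a gradient that pushes its likelihood down relative to the reference model, which is how negative samples are utilized.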

zhangdan0602 added the about dataset label on Dec 25, 2024