[Examples] Boilerplate code for multi-turn reward for RLHF #2467
base: main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/2467
Note: Links to docs will display an error until the docs builds have been completed. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@vmoens I was wondering when this PR could be merged. Please let me know if there is any gap. Thanks
Hi @rghosh08, the current script doesn't integrate any component of the library and is therefore of limited value within torchrl. Thanks again for collaborating.
Thanks @vmoens for your feedback. I will come up with an integration. Appreciate your guidance.
@vmoens sorry for the long hiatus. Could you please check the updated code and advise whether it is on the right track? Thanks. I have used the following artifacts within PyTorch/rl: `from torchrl.envs import EnvBase`.
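For reference, here is a minimal sketch of what a simulated multi-turn dialogue environment built on torchrl's `EnvBase` could look like. The class name, dimensions, reward logic, and spec classes below are illustrative assumptions, not the PR's actual code, and the spec class names vary between torchrl versions:

```python
from typing import Optional

import torch
from tensordict import TensorDict
from torchrl.data import (
    CompositeSpec,
    DiscreteTensorSpec,
    UnboundedContinuousTensorSpec,
)
from torchrl.envs import EnvBase


class MultiTurnDialogueEnv(EnvBase):
    """Toy multi-turn environment: each step is one dialogue turn and the
    reward stands in for a per-turn human preference score."""

    def __init__(self, obs_dim: int = 8, n_actions: int = 4, max_turns: int = 5):
        super().__init__(device="cpu", batch_size=[])
        self.obs_dim, self.n_actions, self.max_turns = obs_dim, n_actions, max_turns
        self.observation_spec = CompositeSpec(
            observation=UnboundedContinuousTensorSpec(shape=(obs_dim,))
        )
        self.action_spec = DiscreteTensorSpec(n_actions, shape=(1,), dtype=torch.int64)
        self.reward_spec = UnboundedContinuousTensorSpec(shape=(1,))

    def _reset(self, tensordict: Optional[TensorDict] = None) -> TensorDict:
        self._turn = 0
        obs = torch.randn(self.obs_dim)  # stand-in for an encoded dialogue state
        return TensorDict({"observation": obs}, batch_size=[])

    def _step(self, tensordict: TensorDict) -> TensorDict:
        self._turn += 1
        action = tensordict["action"]
        # Simulated human feedback for this turn: a noisy score of the chosen action.
        reward = (action.float() / self.n_actions + 0.1 * torch.randn(1)).reshape(1)
        done = torch.tensor([self._turn >= self.max_turns], dtype=torch.bool)
        next_obs = torch.randn(self.obs_dim)
        return TensorDict(
            {"observation": next_obs, "reward": reward, "done": done}, batch_size=[]
        )

    def _set_seed(self, seed: Optional[int]) -> None:
        torch.manual_seed(seed)
```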
Hi @vmoens, do you have any feedback on my latest commit?
Description
This PR addresses: [Feature Request] multi-turn reward for RLHF #2271
This PR implements the reward system for multi-turn reinforcement learning from human feedback (RLHF), following the guidelines outlined in the paper Multi-turn Reinforcement Learning from Preference Human Feedback. The key changes involve creating a simulated multi-turn dialogue environment where human feedback (rewards) is used to guide policy learning. The implemented policy is trained using policy gradient methods, updating based on human feedback provided at each turn.
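To make the described training scheme concrete, below is a minimal, self-contained sketch (not the PR's implementation): a categorical policy is updated with a vanilla policy-gradient (REINFORCE) step, with a simulated per-turn human-feedback score standing in for real preference data. All names, sizes, and the feedback function are hypothetical.

```python
import torch
from torch import nn
from torch.distributions import Categorical

obs_dim, n_actions, max_turns, gamma = 8, 4, 5, 0.99

# Small categorical policy over "utterance" actions (purely illustrative sizes).
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optim = torch.optim.Adam(policy.parameters(), lr=1e-3)


def simulated_human_feedback(action: torch.Tensor) -> torch.Tensor:
    # Stand-in for the per-turn human preference signal described above.
    return action.float() / n_actions + 0.1 * torch.randn(())


for episode in range(200):
    obs = torch.randn(obs_dim)  # stand-in for an encoded dialogue state
    log_probs, rewards = [], []
    for turn in range(max_turns):
        dist = Categorical(logits=policy(obs))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        rewards.append(simulated_human_feedback(action))
        obs = torch.randn(obs_dim)  # next (simulated) dialogue state
    # Discounted return from each turn onward, then a vanilla policy-gradient step.
    returns, g = [], torch.tensor(0.0)
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    loss = -(torch.stack(log_probs) * torch.stack(returns)).sum()
    optim.zero_grad()
    loss.backward()
    optim.step()
```

In the actual example script, the simulated feedback function would be replaced by whatever per-turn preference signal the PR implements; the sketch only mirrors the per-turn-reward structure described above.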
Changes include:
Motivation and Context
This change is necessary to replicate the reward structure proposed in the referenced paper, implementing multi-turn RLHF in a way that closely follows the described methodology. It introduces the simulation of human preferences, which plays a key role in the learning process. This change also resolves issue #2271, which proposed adding this reward mechanism to the project.
Closes [Feature Request] multi-turn reward for RLHF #2271
I have raised an issue to propose this change (required for new features and bug fixes)
Types of changes
What types of changes does your code introduce? Remove all that do not apply:
Checklist
Go over all the following points, and put an `x` in all the boxes that apply.