[Examples] Boilerplate code for multi-turn reward for RLHF #2467
base: main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/2467
Note: Links to docs will display an error until the docs builds have been completed. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@vmoens I was wondering when this PR could be merged. Please let me know if there is any gap. Thanks
Hi @rghosh08, the current script doesn't integrate any component of the library and is therefore of limited value within torchrl. Thanks again for collaborating.
Thanks @vmoens for your feedback. I will come up with an integration. Appreciate your guidance.
@vmoens sorry for the long hiatus. Could you please check the updated code and advise whether it is on the right track? Thanks. I have used the following artifacts within PyTorch/rl: `from torchrl.envs import EnvBase`.
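For reference, here is a minimal sketch of what a simulated multi-turn dialogue environment built on torchrl's `EnvBase` could look like. The class name, dimensions, reward logic, and spec classes below are illustrative assumptions, not the PR's actual code, and the spec class names vary between torchrl versions:

```python
from typing import Optional

import torch
from tensordict import TensorDict
from torchrl.data import (
    CompositeSpec,
    DiscreteTensorSpec,
    UnboundedContinuousTensorSpec,
)
from torchrl.envs import EnvBase


class MultiTurnDialogueEnv(EnvBase):
    """Toy multi-turn environment: each step is one dialogue turn and the
    reward stands in for a per-turn human preference score."""

    def __init__(self, obs_dim: int = 8, n_actions: int = 4, max_turns: int = 5):
        super().__init__(device="cpu", batch_size=[])
        self.obs_dim, self.n_actions, self.max_turns = obs_dim, n_actions, max_turns
        self.observation_spec = CompositeSpec(
            observation=UnboundedContinuousTensorSpec(shape=(obs_dim,))
        )
        self.action_spec = DiscreteTensorSpec(n_actions, shape=(1,), dtype=torch.int64)
        self.reward_spec = UnboundedContinuousTensorSpec(shape=(1,))

    def _reset(self, tensordict: Optional[TensorDict] = None) -> TensorDict:
        self._turn = 0
        obs = torch.randn(self.obs_dim)  # stand-in for an encoded dialogue state
        return TensorDict({"observation": obs}, batch_size=[])

    def _step(self, tensordict: TensorDict) -> TensorDict:
        self._turn += 1
        action = tensordict["action"]
        # Simulated human feedback for this turn: a noisy score of the chosen action.
        reward = (action.float() / self.n_actions + 0.1 * torch.randn(1)).reshape(1)
        done = torch.tensor([self._turn >= self.max_turns], dtype=torch.bool)
        next_obs = torch.randn(self.obs_dim)
        return TensorDict(
            {"observation": next_obs, "reward": reward, "done": done}, batch_size=[]
        )

    def _set_seed(self, seed: Optional[int]) -> None:
        torch.manual_seed(seed)
```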
Hi @vmoens, do you have any feedback on my latest commit?
Description
This PR addresses: [Feature Request] multi-turn reward for RLHF #2271
This PR implements the reward system for multi-turn reinforcement learning from human feedback (RLHF), following the guidelines outlined in the paper Multi-turn Reinforcement Learning from Preference Human Feedback. The key changes involve creating a simulated multi-turn dialogue environment where human feedback (rewards) is used to guide policy learning. The implemented policy is trained using policy gradient methods, updating based on human feedback provided at each turn.
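To make the described training scheme concrete, below is a minimal, self-contained sketch (not the PR's implementation): a categorical policy is updated with a vanilla policy-gradient (REINFORCE) step, with a simulated per-turn human-feedback score standing in for real preference data. All names, sizes, and the feedback function are hypothetical.

```python
import torch
from torch import nn
from torch.distributions import Categorical

obs_dim, n_actions, max_turns, gamma = 8, 4, 5, 0.99

# Small categorical policy over "utterance" actions (purely illustrative sizes).
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optim = torch.optim.Adam(policy.parameters(), lr=1e-3)


def simulated_human_feedback(action: torch.Tensor) -> torch.Tensor:
    # Stand-in for the per-turn human preference signal described above.
    return action.float() / n_actions + 0.1 * torch.randn(())


for episode in range(200):
    obs = torch.randn(obs_dim)  # stand-in for an encoded dialogue state
    log_probs, rewards = [], []
    for turn in range(max_turns):
        dist = Categorical(logits=policy(obs))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        rewards.append(simulated_human_feedback(action))
        obs = torch.randn(obs_dim)  # next (simulated) dialogue state
    # Discounted return from each turn onward, then a vanilla policy-gradient step.
    returns, g = [], torch.tensor(0.0)
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    loss = -(torch.stack(log_probs) * torch.stack(returns)).sum()
    optim.zero_grad()
    loss.backward()
    optim.step()
```

In the actual example script, the simulated feedback function would be replaced by whatever per-turn preference signal the PR implements; the sketch only mirrors the per-turn-reward structure described above.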
Changes include:
Motivation and Context
This change is necessary to replicate the reward structure proposed in the referenced paper, implementing multi-turn RLHF in a way that closely follows the described methodology. It introduces the simulation of human preferences, which plays a key role in the learning process. This change also resolves issue #2271, which proposed adding this reward mechanism to the project.
Closes [Feature Request] multi-turn reward for RLHF #2271
I have raised an issue to propose this change (required for new features and bug fixes)
Types of changes
What types of changes does your code introduce? Remove all that do not apply:
Checklist
Go over all the following points, and put an `x` in all the boxes that apply.