
feat: Self-Rewarding Algorithm with TRT Support #321

Open · wants to merge 314 commits into base: main

Conversation

trias702 (Collaborator)

What does this PR do?

Adds support for the Self-Rewarding and Meta-Rewarding algorithms from the following two papers:

https://arxiv.org/abs/2401.10020
https://arxiv.org/abs/2407.19594

Changelog

  • Please update CHANGELOG.md under the next version with the high-level changes in this PR.

Usage

Please see the new tutorial document at: docs/user-guide/self_rewarding.rst

Before your PR is "Ready for review"

Pre checks:

Checklist when contributing a new algorithm

  • Does the trainer resume and restore all model states?
  • Does the trainer support all parallelism techniques (PP, TP, DP)?
  • Does the trainer support max_steps=-1 and validation?
  • Does the trainer only call APIs defined in alignable_interface.py?
  • Does the trainer have proper logging?

Additional Information

  • Related to # (issue)

jiemingz and others added 30 commits April 17, 2024 11:38
jgerh (Collaborator) commented Jan 21, 2025

Please let me know when all copyedits to the CHANGELOG.md and docs/user-guide/self_rewarding.rst files made in the November 26, 2024 review have been addressed and the threads resolved. I can then verify the changes and approve. Let me know if you have any questions about the Tech Pubs process for reviewing PRs.

odelalleau (Collaborator) left a comment:

WIP review

nemo_aligner/data/nlp/builders.py (thread resolved)
odelalleau (Collaborator) left a comment:

Follow-up on TruncatedGPTSFTChatDataset (needs some scrutiny since it's not in experimental)

nemo_aligner/data/nlp/datasets.py (thread resolved)
nemo_aligner/data/nlp/datasets.py (thread resolved)
nemo_aligner/data/nlp/datasets.py (outdated, resolved)
nemo_aligner/data/nlp/datasets.py (thread resolved)
nemo_aligner/data/nlp/builders.py (outdated, resolved)
CHANGELOG.md (outdated, resolved)
CHANGELOG.md (outdated, resolved)
nemo_aligner/data/nlp/builders.py (thread resolved)
odelalleau (Collaborator) left a comment:

A few comments, mostly related to the limit_train_batches change.

docs/user-guide-experimental/generation.rst (outdated, resolved)
nemo_aligner/data/nlp/builders.py (outdated, resolved)
nemo_aligner/data/nlp/builders.py (outdated, resolved)
nemo_aligner/data/nlp/builders.py (outdated, resolved)
examples/nlp/gpt/train_gpt_dpo.py (outdated, resolved)
jgerh (Collaborator) left a comment:

Completed the tech pubs review of docs/user-guide-experimental/generation.rst and docs/user-guide-experimental/self_rewarding.rst and provided copyedits.


All algorithms in NeMo Aligner are compatible with any GPT-based model from Megatron Core (i.e., those with mcore_gpt=True in the configuration). For this tutorial, we will demonstrate the generation pipeline using a 2B GPT model with 4096 sequence length <https://huggingface.co/nvidia/GPT-2B-001>__. This tutorial is also applicable to other GPT models, such as Llama models, regardless of their size.

Obtaining a pretrained model

Fix capitalization and change the procedural heading to an imperative verb:

Obtain a Pretrained Model

@@ -0,0 +1,235 @@
.. include:: /content/nemo.rsts

Model Generation with Data Parallelism and TRT

Revise the heading for SEO:

Model Generation with Data Parallelism and TensorRT (TRT)

Comment on lines +6 to +8
The NeMo framework supports efficient model generation via the NeMo Aligner codebase.

All algorithms in NeMo Aligner are compatible with any GPT-based model from Megatron Core (i.e., those with mcore_gpt=True in the configuration). For this tutorial, we will demonstrate the generation pipeline using a 2B GPT model with 4096 sequence length <https://huggingface.co/nvidia/GPT-2B-001>__. This tutorial is also applicable to other GPT models, such as Llama models, regardless of their size.

Suggest a revision to the introductory text, including the purpose and other copyedits:

This tutorial demonstrates efficient model generation using NeMo Framework and the NeMo-Aligner codebase. It shows how to set up a 2B GPT model with a sequence length of 4096, available on Hugging Face <https://huggingface.co/nvidia/GPT-2B-001>__, and applies to other models like Llama.

The tutorial covers obtaining and preparing a pretrained model, configuring parameters, and running the generation process. It highlights using aligned models for better outputs and provides steps for terminal and Slurm execution, ensuring efficient data parallelism and handling TransformerEngine issues. All NeMo-Aligner algorithms work with any GPT-based model from Megatron Core.


Obtaining a pretrained model
############################
To start, we must first get an aligned model to generate responses from. There are 2 models we recommend to get started. The rest of the tutorial will work with either model, but for demonstration purposes we will use the smaller 2B model.

Suggested revision and grammar fix:

To get started, we need an aligned model for generating responses. We recommend two models: 2B GPT and LLaMa2 7B. While the tutorial works with either, we will use the smaller 2B model for demonstration purposes.

.. tab-item:: 2B GPT
:sync: key1

#. Get the 2B checkpoint via ``wget https://huggingface.co/nvidia/GPT-2B-001/resolve/main/GPT-2B-001_bf16_tp1.nemo``

Add a period:

#. Get the 2B checkpoint via wget https://huggingface.co/nvidia/GPT-2B-001/resolve/main/GPT-2B-001_bf16_tp1.nemo.

Comment on lines +188 to +197
- chosen_lengths: average token length of chosen responses (average taken across GBS)
- reject_lengths: as above but for rejected responses
- chosen_generated_rewards: the average reward (across GBS) generated by the LLM-as-a-judge for chosen responses
- rejected_generated_rewards: as above but for rejected responses
- rewards_chosen_mean: see below for a definition of what reward means in this context
- rewards_rejected_mean: as above but for rejected responses
- bad_samples_per_GBS: the percentage of samples in a GBS which are excluded from training because of bad output from the LLM-as-a-judge (could be caused by parse errors, or all responses being judge with the same score, etc)
- bad_ends_per_GBS: only valid if using TRT, this tracks the percentage of each GBS where TRT generates incorrect stop tokens (should be really low, < 1%)
- preference_loss: the raw DPO variant loss
- sft_loss: if adding an SFT loss (categorical cross-entropy loss) for the chosen response, then you can see that raw loss here

Fix capitalization and punctuation:

  • chosen_lengths: Average token length of chosen responses (average taken across GBS).
  • reject_lengths: Same as above, but for rejected responses.
  • chosen_generated_rewards: The average reward (across GBS) generated by the LLM-as-a-judge for chosen responses.
  • rejected_generated_rewards: Same as above, but for rejected responses.
  • rewards_chosen_mean: See below for a definition of what "reward" means in this context.
  • rewards_rejected_mean: Same as above, but for rejected responses.
  • bad_samples_per_GBS: The percentage of samples in a GBS that are excluded from training due to bad output from the LLM-as-a-judge (could be caused by parse errors, all responses being judged with the same score, etc.).
  • bad_ends_per_GBS: Only valid if using TRT, this tracks the percentage of each GBS where TRT generates incorrect stop tokens (should be really low, < 1%).
  • preference_loss: The raw DPO variant loss.
  • sft_loss: If adding an SFT loss (categorical cross-entropy loss) for the chosen response, you can see that raw loss here.

Comment on lines +208 to +214
* global_batch_size: we recommend using 64, and going up to 128 only for large models (70B+) that are also training with large datasets
* iterations/epochs: the original paper uses 3 iterations with 1 epoch per iteration, and we find this to be sufficient for most use cases
* learning rate: for SFT/aligned models, we recommend a smaller LR, between 3e-7 and 1e-7. If training a foundational model, then something between 3e-6 to 9e-7.
* ref_policy_kl_penalty: we did not see large changes from perturbations to this value; we recommend 0.1 - 0.001
* length_control: depends very much on model size and data, but we found good results with [0,0,0.1]
* use_meta_judge: we have found stronger results when settings this to true, which is in line with the paper's results
* meta_judge_pcnt: we recommend you do not set this higher than 0.15 (15%). Any higher, and we have observed that the llm-as-a-judge model starts to output identical scores for every response (always a 5)

Revise, and fix punctuation and capitalization:

  • global_batch_size: We recommend using 64 and going up to 128 only for large models (70B+) that are also training with large datasets.
  • iterations/epochs: The original paper uses 3 iterations with 1 epoch per iteration. We find this to be sufficient for most use cases.
  • learning rate: For SFT/aligned models, we recommend a smaller LR, between 3e-7 and 1e-7. If training a foundational model, then something between 3e-6 and 9e-7.
  • ref_policy_kl_penalty: We did not see large changes from perturbations to this value; we recommend 0.1 to 0.001.
  • length_control: Depends very much on model size and data, but we found good results with [0,0,0.1].
  • use_meta_judge: We have found stronger results when setting this to true, which is in line with the paper's results.
  • meta_judge_pcnt: We recommend you do not set this higher than 0.15 (15%). Any higher, and we have observed that the LLM-as-a-judge model starts to output identical scores for every response (always a 5).

All metrics will be grouped by either ``train/`` or ``val/`` in WandB, representing whether that metric is from the training or validation set, respectively.
You can also see a table which will print out the prompt, chosen response, and rejected response for each validation step. This allows you to keep track of response quality and hallucinations.

When it comes to ideal hyperparameters for Self-Rewarding training, much will depend on the characteristics of your SFT (or base/foundational) model and your training data, so there is no one-size-fits-all parameter set which will work in all cases.

Fix capitalization:

When it comes to ideal hyperparameters for self-rewarding training, much will depend on the characteristics of your SFT (or base/foundational) model and your training data, so there is no one-size-fits-all parameter set which will work in all cases.

You can also see a table which will print out the prompt, chosen response, and rejected response for each validation step. This allows you to keep track of response quality and hallucinations.

When it comes to ideal hyperparameters for Self-Rewarding training, much will depend on the characteristics of your SFT (or base/foundational) model and your training data, so there is no one-size-fits-all parameter set which will work in all cases.
Additionally, Self-Rewarding (with or without meta) is a complex algorithm with a lot of moving pieces and a lot of parameters, so finding what works well for your model and data can be difficult.

Fix capitalization:

Additionally, self-rewarding training (with or without meta) is a complex algorithm with a lot of moving pieces and a lot of parameters, so finding what works well for your model and data can be difficult.


When it comes to ideal hyperparameters for Self-Rewarding training, much will depend on the characteristics of your SFT (or base/foundational) model and your training data, so there is no one-size-fits-all parameter set which will work in all cases.
Additionally, Self-Rewarding (with or without meta) is a complex algorithm with a lot of moving pieces and a lot of parameters, so finding what works well for your model and data can be difficult.
Below are some of observations from the Nvidia Alignment team as to what parameters we have seen work well:

Fix capitalization:

Below are some of observations from the NVIDIA Alignment team as to what parameters we have seen work well:

jgerh (Collaborator) commented Jan 29, 2025

Completed the tech pubs review of CHANGELOG.md. See attached file for extensive edits.
CHANGELOG_edits.md

trias702 and others added 2 commits January 30, 2025 20:14
odelalleau (Collaborator) left a comment:

Asking about the new verifier code that crept in within deb3604

nemo_aligner/utils/verifiers/code_verifier.py (outdated, resolved)
odelalleau (Collaborator) left a comment:

A few more comments / questions

if hasattr(self.indexed_dataset, "select"):
self.indexed_dataset = self.indexed_dataset.select(good_idxes)
else:
self.indexed_dataset = [x for i, x in enumerate(self.indexed_dataset) if i in good_idxes]

I'm concerned with blowing up memory here, in case of large datasets (from what I can tell, indexed_dataset doesn't necessarily load everything in memory). It seems safer to me to implement a tiny wrapper on top of the original self.indexed_dataset to give access only to the samples in good_idxes (i.e. essentially implement .select() if it's not available).
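
A minimal sketch of the kind of wrapper suggested here; the class name and its drop-in usage are illustrative assumptions, not code from this PR:

    class _SelectedDataset:
        """Lazily exposes only the samples at good_idxes of an underlying dataset."""

        def __init__(self, dataset, good_idxes):
            self._dataset = dataset
            self._good_idxes = list(good_idxes)

        def __len__(self):
            return len(self._good_idxes)

        def __getitem__(self, idx):
            # Map the filtered index back to the original index so the underlying
            # dataset keeps whatever lazy loading it already does.
            return self._dataset[self._good_idxes[idx]]

    # Hypothetical drop-in replacement for the else branch quoted above:
    # self.indexed_dataset = _SelectedDataset(self.indexed_dataset, good_idxes)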

trias702 (Author) replied:

The way the code works now, the only way to get a self.indexed_dataset which isn't a PyArrow datasets object is by loading the JSONL manually, and for that there's no way I know of to selectively load certain indexes without blowing up memory.

Comment on lines +1029 to +1040
# unfortunately, GPTSFTDataset uses datasets.load_dataset for loading JSONL files, which causes massive issues if your JSONL file
# has irregular/jagged fields, for example when including metadata for the code/math Verifier. HF's datasets library can't parse
# these fields because it expands everything to a Pandas table under the bonnet, which means it needs an immutable, old-fashioned
# SQL-like schema which is consistent for every sample, and having dynamic payload for a field per sample will cause this to break
# Hence, we need a hack-around that allows us to load data which contains Verifier metadata, and the only way to do that is to load
# the data using old-fashioned json.loads(), which is what we do here when we detect that the datasets.load_datasets method failed
def _load_dataset(self):
try:
super()._load_dataset()
except:
with open(self.file_path, "r", encoding="utf_8") as fr:
self.indexed_dataset = [json.loads(line) for line in fr]

Trying to understand exactly when this problem occurs -- shouldn't hf_dataset be set to False when using some "irregular" JSONL file? And if that's the case, wouldn't it solve this problem?

trias702 (Author) replied:

It's rather difficult to describe with text alone, but I can show how and when the issue occurs. It's an issue outside of NeMo and Aligner altogether; you can replicate the problem with just Hugging Face's datasets library and my JSONL file, which has the verifier fields added in.
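
A minimal sketch of the kind of failure being described, assuming the verifier metadata field mixes types across rows; the file name and field names here are made up for illustration:

    import json

    from datasets import load_dataset  # Hugging Face datasets

    # Two rows whose "verifier_metadata" field has a different type per line.
    rows = [
        {"text": "prompt A", "verifier_metadata": {"tests": ["assert f(1) == 2"]}},
        {"text": "prompt B", "verifier_metadata": "n/a"},
    ]
    with open("/tmp/jagged.jsonl", "w", encoding="utf_8") as fw:
        fw.writelines(json.dumps(r) + "\n" for r in rows)

    try:
        ds = load_dataset("json", data_files="/tmp/jagged.jsonl")
    except Exception as err:
        # Arrow cannot unify dict vs. string for the same column, so loading fails.
        print(f"datasets.load_dataset failed: {err}")
        # Fallback mirroring the PR's _load_dataset work-around:
        with open("/tmp/jagged.jsonl", "r", encoding="utf_8") as fr:
            indexed_dataset = [json.loads(line) for line in fr]
        print(f"loaded {len(indexed_dataset)} samples via json.loads")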

# to loading a jsonl manually. In that case, we need to ensure each sample sent to _process_example is immutable or else
# you will have corrupted text, see here: https://github.com/NVIDIA/NeMo/blob/25b7d3d7f217a11f85ed23b5917b7b840f330e2f/nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_chat_dataset.py#L204
def _process_example(self, example):
if hasattr(self.indexed_dataset, "select"):
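
A minimal sketch of the immutability guard the code comment above describes; the base class comes from the linked NeMo module, while the subclass name and exact condition are assumptions rather than the PR's actual implementation:

    import copy

    from nemo.collections.nlp.data.language_modeling.megatron.gpt_sft_chat_dataset import GPTSFTChatDataset


    class TruncatedGPTSFTChatDataset(GPTSFTChatDataset):
        def _process_example(self, example):
            if hasattr(self.indexed_dataset, "select"):
                # HF/PyArrow-backed datasets hand out fresh objects on each access.
                return super()._process_example(example)
            # Manual json.loads path: the same dict object is returned on every
            # access, so deep-copy before processing to keep the stored sample intact.
            return super()._process_example(copy.deepcopy(example))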

Just taking a note to revisit this, depending on the resolution of my other comments in this file.

odelalleau (Collaborator) left a comment:

Another question related to the change for prepare_for_inference in 7178c88

Comment on lines 1354 to 1356
self.model.finish_inference()
if self.use_trtllm_generation:
self.trtllm_generate.free()

I struggle to parse this function so I may be wrong, but it seems to me that we don't have a 1:1 mapping between self.model.prepare_for_inference() and self.model.finish_inference(), because this block is under a for batch in divide_chunks(final_buffer, original_gbs_size): loop, so we may end up running it multiple times.

I'm also a bit suspicious of the fact that we call get_rewards_meta() after this block (because that function doesn't set prepare_for_inference=True when it calls get_generations()).
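
A structural sketch of the 1:1 pairing being asked for; divide_chunks, final_buffer, and the model/TRT attributes come from the quoted snippet and the comment above, while the method name and everything else are assumed for illustration:

    def _generate_and_judge(self, final_buffer, original_gbs_size):
        # Enter inference mode once, before iterating over the chunks.
        self.model.prepare_for_inference()
        try:
            for batch in divide_chunks(final_buffer, original_gbs_size):
                # Generate and judge each chunk here without re-entering or
                # tearing down inference mode inside the loop.
                ...
        finally:
            # Tear down exactly once, mirroring the single prepare call above.
            self.model.finish_inference()
            if self.use_trtllm_generation:
                self.trtllm_generate.free()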

trias702 (Author) replied:

This was a good catch, thanks. I believe I have fixed it now; please have a look.

trias702 and others added 2 commits February 7, 2025 12:50
Labels: Algorithms, CI, documentation (Improvements or additions to documentation), Run CICD (Set + un-set to retrigger), Utils
Projects: None yet
7 participants