feat: Self-Rewarding Algorithm with TRT Support #321
base: main
Conversation
Please let me know when all copyedits to the CHANGELOG.md and …
WIP review
nemo_aligner/experimental/self_rewarding/train_gpt_self_rewarding.py
Follow-up on TruncatedGPTSFTChatDataset (needs some scrutiny since it's not in experimental).
A few comments, mostly related to the limit_train_batches change.
Completed the tech pubs review of docs/user-guide-experimental/generation.rst and docs/user-guide-experimental/self_rewarding.rst and provided copyedits.
All algorithms in NeMo Aligner are compatible with any GPT-based model from Megatron Core (i.e., those with mcore_gpt=True in the configuration). For this tutorial, we will demonstrate the generation pipeline using a `2B GPT model with 4096 sequence length <https://huggingface.co/nvidia/GPT-2B-001>`__. This tutorial is also applicable to other GPT models, such as Llama models, regardless of their size.

Obtaining a pretrained model
fix capitalization and change procedural heading to an imperative verb
Obtain a Pretrained Model
@@ -0,0 +1,235 @@
.. include:: /content/nemo.rsts

Model Generation with Data Parallelism and TRT
revise heading for SEO
Model Generation with Data Parallelism and TensorRT (TRT)
The NeMo framework supports efficient model generation via the NeMo Aligner codebase.

All algorithms in NeMo Aligner are compatible with any GPT-based model from Megatron Core (i.e., those with mcore_gpt=True in the configuration). For this tutorial, we will demonstrate the generation pipeline using a `2B GPT model with 4096 sequence length <https://huggingface.co/nvidia/GPT-2B-001>`__. This tutorial is also applicable to other GPT models, such as Llama models, regardless of their size.
suggest a revision to the introductory text, including the purpose and other copyedits
This tutorial demonstrates efficient model generation using NeMo Framework and the NeMo-Aligner codebase. It shows how to set up a 2B GPT model with a sequence length of 4096, available on `Hugging Face <https://huggingface.co/nvidia/GPT-2B-001>`__, and applies to other models like Llama.
The tutorial covers obtaining and preparing a pretrained model, configuring parameters, and running the generation process. It highlights using aligned models for better outputs and provides steps for terminal and Slurm execution, ensuring efficient data parallelism and handling TransformerEngine issues. All NeMo-Aligner algorithms work with any GPT-based model from Megatron Core.
Obtaining a pretrained model
############################
To start, we must first get an aligned model to generate responses from. There are 2 models we recommend to get started. The rest of the tutorial will work with either model, but for demonstration purposes we will use the smaller 2B model.
suggest a revision and fix grammar
To get started, we need an aligned model for generating responses. We recommend two models: 2B GPT and LLaMa2 7B. While the tutorial works with either, we will use the smaller 2B model for demonstration purposes.
.. tab-item:: 2B GPT
   :sync: key1

#. Get the 2B checkpoint via ``wget https://huggingface.co/nvidia/GPT-2B-001/resolve/main/GPT-2B-001_bf16_tp1.nemo``
add a period
#. Get the 2B checkpoint via ``wget https://huggingface.co/nvidia/GPT-2B-001/resolve/main/GPT-2B-001_bf16_tp1.nemo``.
- chosen_lengths: average token length of chosen responses (average taken across GBS)
- reject_lengths: as above but for rejected responses
- chosen_generated_rewards: the average reward (across GBS) generated by the LLM-as-a-judge for chosen responses
- rejected_generated_rewards: as above but for rejected responses
- rewards_chosen_mean: see below for a definition of what reward means in this context
- rewards_rejected_mean: as above but for rejected responses
- bad_samples_per_GBS: the percentage of samples in a GBS which are excluded from training because of bad output from the LLM-as-a-judge (could be caused by parse errors, or all responses being judge with the same score, etc)
- bad_ends_per_GBS: only valid if using TRT, this tracks the percentage of each GBS where TRT generates incorrect stop tokens (should be really low, < 1%)
- preference_loss: the raw DPO variant loss
- sft_loss: if adding an SFT loss (categorical cross-entropy loss) for the chosen response, then you can see that raw loss here
fix capitalization and punctuation
- chosen_lengths: Average token length of chosen responses (average taken across GBS).
- reject_lengths: Same as above, but for rejected responses.
- chosen_generated_rewards: The average reward (across GBS) generated by the LLM-as-a-judge for chosen responses.
- rejected_generated_rewards: Same as above, but for rejected responses.
- rewards_chosen_mean: See below for a definition of what "reward" means in this context.
- rewards_rejected_mean: Same as above, but for rejected responses.
- bad_samples_per_GBS: The percentage of samples in a GBS that are excluded from training due to bad output from the LLM-as-a-judge (could be caused by parse errors, all responses being judged with the same score, etc.).
- bad_ends_per_GBS: Only valid if using TRT, this tracks the percentage of each GBS where TRT generates incorrect stop tokens (should be really low, < 1%).
- preference_loss: The raw DPO variant loss.
- sft_loss: If adding an SFT loss (categorical cross-entropy loss) for the chosen response, you can see that raw loss here.
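For orientation, the sketch below shows how a couple of these per-GBS metrics could be computed from a batch of judged samples. It is illustrative only: the field names (chosen_tokens, chosen_judge_score, parse_ok, all_scores_equal) are assumptions, not NeMo-Aligner's actual data structures.

# Hypothetical per-GBS metric computation; field names are illustrative only.
from typing import Dict, List

def gbs_metrics(samples: List[Dict]) -> Dict[str, float]:
    n = len(samples)
    # Average token length of chosen responses across the global batch.
    chosen_lengths = sum(len(s["chosen_tokens"]) for s in samples) / n
    # Average LLM-as-a-judge reward for the chosen responses.
    chosen_generated_rewards = sum(s["chosen_judge_score"] for s in samples) / n
    # A sample is "bad" if the judge output could not be parsed, or if every
    # response received an identical score (no usable preference signal).
    bad = sum(1 for s in samples if (not s["parse_ok"]) or s["all_scores_equal"])
    return {
        "chosen_lengths": chosen_lengths,
        "chosen_generated_rewards": chosen_generated_rewards,
        "bad_samples_per_GBS": 100.0 * bad / n,  # reported as a percentage
    }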
* global_batch_size: we recommend using 64, and going up to 128 only for large models (70B+) that are also training with large datasets
* iterations/epochs: the original paper uses 3 iterations with 1 epoch per iteration, and we find this to be sufficient for most use cases
* learning rate: for SFT/aligned models, we recommend a smaller LR, between 3e-7 and 1e-7. If training a foundational model, then something between 3e-6 to 9e-7.
* ref_policy_kl_penalty: we did not see large changes from perturbations to this value; we recommend 0.1 - 0.001
* length_control: depends very much on model size and data, but we found good results with [0,0,0.1]
* use_meta_judge: we have found stronger results when settings this to true, which is in line with the paper's results
* meta_judge_pcnt: we recommend you do not set this higher than 0.15 (15%). Any higher, and we have observed that the llm-as-a-judge model starts to output identical scores for every response (always a 5)
revise, fix punctuation, fix capitalization
- global_batch_size: We recommend using 64 and going up to 128 only for large models (70B+) that are also training with large datasets.
- iterations/epochs: The original paper uses 3 iterations with 1 epoch per iteration. We find this to be sufficient for most use cases.
- learning rate: For SFT/aligned models, we recommend a smaller LR, between 3e-7 and 1e-7. If training a foundational model, then something between 3e-6 and 9e-7.
- ref_policy_kl_penalty: We did not see large changes from perturbations to this value; we recommend 0.1 to 0.001.
- length_control: Depends very much on model size and data, but we found good results with [0,0,0.1].
- use_meta_judge: We have found stronger results when setting this to true, which is in line with the paper's results.
- meta_judge_pcnt: We recommend you do not set this higher than 0.15 (15%). Any higher, and we have observed that the LLM-as-a-judge model starts to output identical scores for every response (always a 5).
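Collected in one place, the recommendations above might look like the starting point below. The key names and nesting are assumptions made for illustration; check them against the shipped gpt_self_rewarding config before use.

# Hypothetical starting-point values based on the recommendations above.
# Key names are illustrative, not the literal config schema.
recommended_overrides = {
    "model.global_batch_size": 64,                         # up to 128 only for 70B+ models with large datasets
    "trainer.self_rewarding.max_iterations": 3,
    "trainer.self_rewarding.max_epochs": 1,                # 1 epoch per iteration, as in the paper
    "model.optim.lr": 3e-7,                                # 3e-7 to 1e-7 for SFT/aligned models
    "model.self_rewarding.ref_policy_kl_penalty": 0.1,     # anywhere from 0.1 to 0.001
    "model.self_rewarding.length_control": [0, 0, 0.1],
    "model.self_rewarding.use_meta_judge": True,
    "model.self_rewarding.meta_judge_pcnt": 0.15,          # do not exceed 0.15 (15%)
}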
All metrics will be grouped by either ``train/`` or ``val/`` in WandB, representing whether that metric is from the training or validation set, respectively.
You can also see a table which will print out the prompt, chosen response, and rejected response for each validation step. This allows you to keep track of response quality and hallucinations.

When it comes to ideal hyperparameters for Self-Rewarding training, much will depend on the characteristics of your SFT (or base/foundational) model and your training data, so there is no one-size-fits-all parameter set which will work in all cases. |
fix capitalization
When it comes to ideal hyperparameters for self-rewarding training, much will depend on the characteristics of your SFT (or base/foundational) model and your training data, so there is no one-size-fits-all parameter set which will work in all cases.
You can also see a table which will print out the prompt, chosen response, and rejected response for each validation step. This allows you to keep track of response quality and hallucinations.

When it comes to ideal hyperparameters for Self-Rewarding training, much will depend on the characteristics of your SFT (or base/foundational) model and your training data, so there is no one-size-fits-all parameter set which will work in all cases.
Additionally, Self-Rewarding (with or without meta) is a complex algorithm with a lot of moving pieces and a lot of parameters, so finding what works well for your model and data can be difficult. |
fix capitalization
Additionally, self-rewarding training (with or without meta) is a complex algorithm with a lot of moving pieces and a lot of parameters, so finding what works well for your model and data can be difficult.
When it comes to ideal hyperparameters for Self-Rewarding training, much will depend on the characteristics of your SFT (or base/foundational) model and your training data, so there is no one-size-fits-all parameter set which will work in all cases.
Additionally, Self-Rewarding (with or without meta) is a complex algorithm with a lot of moving pieces and a lot of parameters, so finding what works well for your model and data can be difficult.
Below are some of observations from the Nvidia Alignment team as to what parameters we have seen work well: |
fix capitalization
Below are some of the observations from the NVIDIA Alignment team as to what parameters we have seen work well:
Completed the tech pubs review of CHANGELOG.md. See attached file for extensive edits.
Asking about the new verifier code that crept in within deb3604
A few more comments / questions
if hasattr(self.indexed_dataset, "select"):
    self.indexed_dataset = self.indexed_dataset.select(good_idxes)
else:
    self.indexed_dataset = [x for i, x in enumerate(self.indexed_dataset) if i in good_idxes]
I'm concerned with blowing up memory here, in case of large datasets (from what I can tell, indexed_dataset doesn't necessarily load everything in memory). It seems safer to me to implement a tiny wrapper on top of the original self.indexed_dataset to give access only to the samples in good_idxes (i.e. essentially implement .select() if it's not available).
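A minimal sketch of the kind of wrapper being suggested, assuming the downstream code only needs __len__ and __getitem__; it is not the PR's implementation, just an illustration of select()-like behaviour that avoids materialising the whole dataset.

# Illustrative wrapper: lazily restrict an index-addressable dataset to good_idxes
# without copying its contents into memory.
class SelectedDataset:
    def __init__(self, base_dataset, good_idxes):
        self.base = base_dataset
        self.idxes = list(good_idxes)

    def __len__(self):
        return len(self.idxes)

    def __getitem__(self, i):
        return self.base[self.idxes[i]]

# Hypothetical usage in place of the list comprehension above:
# if hasattr(self.indexed_dataset, "select"):
#     self.indexed_dataset = self.indexed_dataset.select(good_idxes)
# else:
#     self.indexed_dataset = SelectedDataset(self.indexed_dataset, good_idxes)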
The way the code works now, the only way to get a self.indexed_dataset which isn't a PyArrow datasets object is by loading the jsonl manually, and for that, there's no way that I know of to selectively load certain indexes without blowing up memory.
# unfortunately, GPTSFTDataset uses datasets.load_dataset for loading JSONL files, which causes massive issues if your JSONL file
# has irregular/jagged fields, for example when including metadata for the code/math Verifier. HF's datasets library can't parse
# these fields because it expands everything to a Pandas table under the bonnet, which means it needs an immutable, old-fashioned
# SQL-like schema which is consistent for every sample, and having a dynamic payload for a field per sample will cause this to break.
# Hence, we need a hack-around that allows us to load data which contains Verifier metadata, and the only way to do that is to load
# the data using old-fashioned json.loads(), which is what we do here when we detect that the datasets.load_dataset method failed.
def _load_dataset(self):
    try:
        super()._load_dataset()
    except:
        with open(self.file_path, "r", encoding="utf_8") as fr:
            self.indexed_dataset = [json.loads(line) for line in fr]
Trying to understand exactly when this problem occurs -- shouldn't hf_dataset be set to False when using some "irregular" JSONL file? And if that's the case, wouldn't it solve this problem?
It's rather difficult to describe with text alone, but I can show how/when the issue occurs. It's an issue outside of NeMo+Aligner altogether; you can replicate the problem with just HuggingFace's datasets library and my jsonl file which has the verifier fields added in.
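A minimal sketch of the kind of failure being described, using only HuggingFace's datasets library and a made-up JSONL file whose verifier field changes type between lines. Whether it raises depends on the datasets/pyarrow version, so treat it as illustrative rather than a guaranteed repro.

# Illustrative repro: a jagged "verifier" field breaks datasets.load_dataset,
# while a plain json.loads fallback handles it fine. File contents are made up.
import json
from datasets import load_dataset

rows = [
    {"prompt": "1+1?", "response": "2", "verifier": {"tests": ["assert ans == 2"]}},
    {"prompt": "hi", "response": "hello", "verifier": "none"},  # different type for the same field
]
with open("jagged.jsonl", "w", encoding="utf_8") as fw:
    for r in rows:
        fw.write(json.dumps(r) + "\n")

try:
    ds = load_dataset("json", data_files="jagged.jsonl", split="train")
except Exception as err:  # Arrow cannot unify the conflicting per-line schemas
    print("datasets.load_dataset failed, falling back to json.loads:", err)
    with open("jagged.jsonl", "r", encoding="utf_8") as fr:
        ds = [json.loads(line) for line in fr]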
# to loading a jsonl manually. In that case, we need to ensure each sample sent to _process_example is immutable or else
# you will have corrupted text, see here: https://github.com/NVIDIA/NeMo/blob/25b7d3d7f217a11f85ed23b5917b7b840f330e2f/nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_chat_dataset.py#L204
def _process_example(self, example):
    if hasattr(self.indexed_dataset, "select"):
Just taking a note to revisit this if statement, depending on the resolution of my other comments in this file.
nemo_aligner/experimental/self_rewarding/conf/gpt_self_rewarding_llama3.yaml
Another question related to the change for prepare_for_inference in 7178c88.
self.model.finish_inference()
if self.use_trtllm_generation:
    self.trtllm_generate.free()
I struggle to parse this function so I may be wrong, but it seems to me that we don't have a 1:1 mapping between self.model.prepare_for_inference() and self.model.finish_inference(), because this block here is under a for batch in divide_chunks(final_buffer, original_gbs_size): loop => we may end up running it multiple times?
I'm also a bit suspicious of the fact we call get_rewards_meta() after this block (because this function doesn't set prepare_for_inference=True when it calls get_generations()).
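For reference, below is a sketch of the 1:1 pairing being asked about, written as a hypothetical method-body fragment using the names from the snippet above: prepare once before the chunk loop, then finish/free once after it.

# Hypothetical fragment: pair prepare_for_inference()/finish_inference() exactly once
# around the chunk loop instead of inside it.
self.model.prepare_for_inference()
try:
    for batch in divide_chunks(final_buffer, original_gbs_size):
        # generate/judge this chunk; do not re-prepare or free per chunk
        ...
finally:
    self.model.finish_inference()
    if self.use_trtllm_generation:
        self.trtllm_generate.free()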
This was a good catch, thanks. I believe I have fixed it now, please have a look
What does this PR do?
Adds support for the Self-Rewarding and Meta-Rewarding algorithms from the following two papers:
https://arxiv.org/abs/2401.10020
https://arxiv.org/abs/2407.19594
Changelog
Usage
Please see the new tutorial document at:
docs/user-guide/self_rewarding.rst
Before your PR is "Ready for review"
Pre checks:
Checklist when contributing a new algorithm
Does the trainer support max_steps=-1 and validation?
Additional Information