
feat: Self-Rewarding Algorithm with TRT Support #321

Open · wants to merge 314 commits into base: main

Conversation

trias702 (Collaborator)

What does this PR do?

Adds support for the Self-Rewarding and Meta-Rewarding algorithms from the following two papers:

https://arxiv.org/abs/2401.10020
https://arxiv.org/abs/2407.19594

Changelog

  • Please update CHANGELOG.md under the next version with the high-level changes in this PR.

Usage

Please see the new tutorial document at: docs/user-guide/self_rewarding.rst

Before your PR is "Ready for review"

Pre checks:

Checklist when contributing a new algorithm

  • Does the trainer resume and restore all model states?
  • Does the trainer support all parallelism techniques (PP, TP, DP)?
  • Does the trainer support max_steps=-1 and validation?
  • Does the trainer only call APIs defined in alignable_interface.py?
  • Does the trainer have proper logging?

Additional Information

  • Related to # (issue)

jiemingz and others added 30 commits April 17, 2024 11:38
jgerh (Collaborator) commented Jan 21, 2025

Please let me know when all copyedits to the CHANGELOG.md and docs/user-guide/self_rewarding.rst files made in the November 26, 2024 review have been addressed and the threads resolved. I can then verify the changes and approve. Let me know if you have any questions about the Tech Pubs process for reviewing PRs.

odelalleau (Collaborator) left a comment:

WIP review

nemo_aligner/data/nlp/builders.py (thread resolved)
odelalleau (Collaborator) left a comment:

Follow-up on TruncatedGPTSFTChatDataset (needs some scrutiny since it's not in experimental)

nemo_aligner/data/nlp/datasets.py (thread resolved)
nemo_aligner/data/nlp/datasets.py (thread resolved)
nemo_aligner/data/nlp/datasets.py (outdated, resolved)
nemo_aligner/data/nlp/datasets.py (thread resolved)
nemo_aligner/data/nlp/builders.py (outdated, resolved)
CHANGELOG.md (outdated, resolved)
CHANGELOG.md (outdated, resolved)
nemo_aligner/data/nlp/builders.py (thread resolved)
odelalleau (Collaborator) left a comment:

A few comments, mostly related to the limit_train_batches change.

docs/user-guide-experimental/generation.rst (outdated, resolved)
nemo_aligner/data/nlp/builders.py (outdated, resolved)
nemo_aligner/data/nlp/builders.py (outdated, resolved)
nemo_aligner/data/nlp/builders.py (outdated, resolved)
examples/nlp/gpt/train_gpt_dpo.py (outdated, resolved)
jgerh (Collaborator) left a comment:

Completed the tech pubs review of docs/user-guide-experimental/generation.rst and docs/user-guide-experimental/self_rewarding.rst and provided copyedits.


All algorithms in NeMo Aligner are compatible with any GPT-based model from Megatron Core (i.e., those with mcore_gpt=True in the configuration). For this tutorial, we will demonstrate the generation pipeline using a 2B GPT model with 4096 sequence length <https://huggingface.co/nvidia/GPT-2B-001>__. This tutorial is also applicable to other GPT models, such as Llama models, regardless of their size.

Obtaining a pretrained model

Fix capitalization and change the procedural heading to an imperative verb:

Obtain a Pretrained Model

@@ -0,0 +1,235 @@
.. include:: /content/nemo.rsts

Model Generation with Data Parallelism and TRT

Revise the heading for SEO:

Model Generation with Data Parallelism and TensorRT (TRT)

Comment on lines +6 to +8
The NeMo framework supports efficient model generation via the NeMo Aligner codebase.

All algorithms in NeMo Aligner are compatible with any GPT-based model from Megatron Core (i.e., those with mcore_gpt=True in the configuration). For this tutorial, we will demonstrate the generation pipeline using a 2B GPT model with 4096 sequence length <https://huggingface.co/nvidia/GPT-2B-001>__. This tutorial is also applicable to other GPT models, such as Llama models, regardless of their size.

Suggest a revision to the introductory text, including the purpose and other copyedits:

This tutorial demonstrates efficient model generation using NeMo Framework and the NeMo-Aligner codebase. It shows how to set up a 2B GPT model with a sequence length of 4096, available on Hugging Face <https://huggingface.co/nvidia/GPT-2B-001>__, and applies to other models like Llama.

The tutorial covers obtaining and preparing a pretrained model, configuring parameters, and running the generation process. It highlights using aligned models for better outputs and provides steps for terminal and Slurm execution, ensuring efficient data parallelism and handling TransformerEngine issues. All NeMo-Aligner algorithms work with any GPT-based model from Megatron Core.


Obtaining a pretrained model
############################
To start, we must first get an aligned model to generate responses from. There are 2 models we recommend to get started. The rest of the tutorial will work with either model, but for demonstration purposes we will use the smaller 2B model.

Suggested revision and grammar fix:

To get started, we need an aligned model for generating responses. We recommend two models: 2B GPT and LLaMa2 7B. While the tutorial works with either, we will use the smaller 2B model for demonstration purposes.

.. tab-item:: 2B GPT
:sync: key1

#. Get the 2B checkpoint via ``wget https://huggingface.co/nvidia/GPT-2B-001/resolve/main/GPT-2B-001_bf16_tp1.nemo``

Add a period:

#. Get the 2B checkpoint via wget https://huggingface.co/nvidia/GPT-2B-001/resolve/main/GPT-2B-001_bf16_tp1.nemo.

Comment on lines +188 to +197
- chosen_lengths: average token length of chosen responses (average taken across GBS)
- reject_lengths: as above but for rejected responses
- chosen_generated_rewards: the average reward (across GBS) generated by the LLM-as-a-judge for chosen responses
- rejected_generated_rewards: as above but for rejected responses
- rewards_chosen_mean: see below for a definition of what reward means in this context
- rewards_rejected_mean: as above but for rejected responses
- bad_samples_per_GBS: the percentage of samples in a GBS which are excluded from training because of bad output from the LLM-as-a-judge (could be caused by parse errors, or all responses being judge with the same score, etc)
- bad_ends_per_GBS: only valid if using TRT, this tracks the percentage of each GBS where TRT generates incorrect stop tokens (should be really low, < 1%)
- preference_loss: the raw DPO variant loss
- sft_loss: if adding an SFT loss (categorical cross-entropy loss) for the chosen response, then you can see that raw loss here

Fix capitalization and punctuation:

  • chosen_lengths: Average token length of chosen responses (average taken across GBS).
  • reject_lengths: Same as above, but for rejected responses.
  • chosen_generated_rewards: The average reward (across GBS) generated by the LLM-as-a-judge for chosen responses.
  • rejected_generated_rewards: Same as above, but for rejected responses.
  • rewards_chosen_mean: See below for a definition of what "reward" means in this context.
  • rewards_rejected_mean: Same as above, but for rejected responses.
  • bad_samples_per_GBS: The percentage of samples in a GBS that are excluded from training due to bad output from the LLM-as-a-judge (could be caused by parse errors, all responses being judged with the same score, etc.).
  • bad_ends_per_GBS: Only valid if using TRT, this tracks the percentage of each GBS where TRT generates incorrect stop tokens (should be really low, < 1%).
  • preference_loss: The raw DPO variant loss.
  • sft_loss: If adding an SFT loss (categorical cross-entropy loss) for the chosen response, you can see that raw loss here.

Comment on lines +208 to +214
* global_batch_size: we recommend using 64, and going up to 128 only for large models (70B+) that are also training with large datasets
* iterations/epochs: the original paper uses 3 iterations with 1 epoch per iteration, and we find this to be sufficient for most use cases
* learning rate: for SFT/aligned models, we recommend a smaller LR, between 3e-7 and 1e-7. If training a foundational model, then something between 3e-6 to 9e-7.
* ref_policy_kl_penalty: we did not see large changes from perturbations to this value; we recommend 0.1 - 0.001
* length_control: depends very much on model size and data, but we found good results with [0,0,0.1]
* use_meta_judge: we have found stronger results when settings this to true, which is in line with the paper's results
* meta_judge_pcnt: we recommend you do not set this higher than 0.15 (15%). Any higher, and we have observed that the llm-as-a-judge model starts to output identical scores for every response (always a 5)

Revise, and fix punctuation and capitalization:

  • global_batch_size: We recommend using 64 and going up to 128 only for large models (70B+) that are also training with large datasets.
  • iterations/epochs: The original paper uses 3 iterations with 1 epoch per iteration. We find this to be sufficient for most use cases.
  • learning rate: For SFT/aligned models, we recommend a smaller LR, between 3e-7 and 1e-7. If training a foundational model, then something between 3e-6 and 9e-7.
  • ref_policy_kl_penalty: We did not see large changes from perturbations to this value; we recommend 0.1 to 0.001.
  • length_control: Depends very much on model size and data, but we found good results with [0,0,0.1].
  • use_meta_judge: We have found stronger results when setting this to true, which is in line with the paper's results.
  • meta_judge_pcnt: We recommend you do not set this higher than 0.15 (15%). Any higher, and we have observed that the LLM-as-a-judge model starts to output identical scores for every response (always a 5).

All metrics will be grouped by either ``train/`` or ``val/`` in WandB, representing whether that metric is from the training or validation set, respectively.
You can also see a table which will print out the prompt, chosen response, and rejected response for each validation step. This allows you to keep track of response quality and hallucinations.

When it comes to ideal hyperparameters for Self-Rewarding training, much will depend on the characteristics of your SFT (or base/foundational) model and your training data, so there is no one-size-fits-all parameter set which will work in all cases.

Fix capitalization:

When it comes to ideal hyperparameters for self-rewarding training, much will depend on the characteristics of your SFT (or base/foundational) model and your training data, so there is no one-size-fits-all parameter set which will work in all cases.

You can also see a table which will print out the prompt, chosen response, and rejected response for each validation step. This allows you to keep track of response quality and hallucinations.

When it comes to ideal hyperparameters for Self-Rewarding training, much will depend on the characteristics of your SFT (or base/foundational) model and your training data, so there is no one-size-fits-all parameter set which will work in all cases.
Additionally, Self-Rewarding (with or without meta) is a complex algorithm with a lot of moving pieces and a lot of parameters, so finding what works well for your model and data can be difficult.

Fix capitalization:

Additionally, self-rewarding training (with or without meta) is a complex algorithm with a lot of moving pieces and a lot of parameters, so finding what works well for your model and data can be difficult.


When it comes to ideal hyperparameters for Self-Rewarding training, much will depend on the characteristics of your SFT (or base/foundational) model and your training data, so there is no one-size-fits-all parameter set which will work in all cases.
Additionally, Self-Rewarding (with or without meta) is a complex algorithm with a lot of moving pieces and a lot of parameters, so finding what works well for your model and data can be difficult.
Below are some of observations from the Nvidia Alignment team as to what parameters we have seen work well:

Fix capitalization:

Below are some of observations from the NVIDIA Alignment team as to what parameters we have seen work well:

jgerh (Collaborator) commented Jan 29, 2025

Completed the tech pubs review of CHANGELOG.md. See attached file for extensive edits.
CHANGELOG_edits.md

trias702 and others added 2 commits January 30, 2025 20:14
odelalleau (Collaborator) left a comment:

Asking about the new verifier code that crept in within deb3604

nemo_aligner/utils/verifiers/code_verifier.py (outdated, resolved)
odelalleau (Collaborator) left a comment:

A few more comments / questions

if hasattr(self.indexed_dataset, "select"):
self.indexed_dataset = self.indexed_dataset.select(good_idxes)
else:
self.indexed_dataset = [x for i, x in enumerate(self.indexed_dataset) if i in good_idxes]

I'm concerned with blowing up memory here, in case of large datasets (from what I can tell, indexed_dataset doesn't necessarily load everything in memory). It seems safer to me to implement a tiny wrapper on top of the original self.indexed_dataset to give access only to the samples in good_idxes (i.e. essentially implement .select() if it's not available).
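
A minimal sketch of the kind of wrapper suggested here; the class name and its drop-in usage are illustrative assumptions, not code from this PR:

    class _SelectedDataset:
        """Lazily exposes only the samples at good_idxes of an underlying dataset."""

        def __init__(self, dataset, good_idxes):
            self._dataset = dataset
            self._good_idxes = list(good_idxes)

        def __len__(self):
            return len(self._good_idxes)

        def __getitem__(self, idx):
            # Map the filtered index back to the original index so the underlying
            # dataset keeps whatever lazy loading it already does.
            return self._dataset[self._good_idxes[idx]]

    # Hypothetical drop-in replacement for the else branch quoted above:
    # self.indexed_dataset = _SelectedDataset(self.indexed_dataset, good_idxes)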

trias702 (Author) replied:

The way the code works now, the only way to get a self.indexed_dataset which isn't a PyArrow datasets object is by loading the JSONL manually, and for that there's no way I know of to selectively load certain indexes without blowing up memory.

Comment on lines +1029 to +1040
# unfortunately, GPTSFTDataset uses datasets.load_dataset for loading JSONL files, which causes massive issues if your JSONL file
# has irregular/jagged fields, for example when including metadata for the code/math Verifier. HF's datasets library can't parse
# these fields because it expands everything to a Pandas table under the bonnet, which means it needs an immutable, old-fashioned
# SQL-like schema which is consistent for every sample, and having dynamic payload for a field per sample will cause this to break
# Hence, we need a hack-around that allows us to load data which contains Verifier metadata, and the only way to do that is to load
# the data using old-fashioned json.loads(), which is what we do here when we detect that the datasets.load_datasets method failed
def _load_dataset(self):
try:
super()._load_dataset()
except:
with open(self.file_path, "r", encoding="utf_8") as fr:
self.indexed_dataset = [json.loads(line) for line in fr]

Trying to understand exactly when this problem occurs -- shouldn't hf_dataset be set to False when using some "irregular" JSONL file? And if that's the case, wouldn't it solve this problem?

trias702 (Author) replied:

It's rather difficult to describe with text alone, but I can show how and when the issue occurs. It's an issue outside of NeMo and Aligner altogether; you can replicate the problem with just Hugging Face's datasets library and my JSONL file, which has the verifier fields added in.
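
A minimal sketch of the kind of failure being described, assuming the verifier metadata field mixes types across rows; the file name and field names here are made up for illustration:

    import json

    from datasets import load_dataset  # Hugging Face datasets

    # Two rows whose "verifier_metadata" field has a different type per line.
    rows = [
        {"text": "prompt A", "verifier_metadata": {"tests": ["assert f(1) == 2"]}},
        {"text": "prompt B", "verifier_metadata": "n/a"},
    ]
    with open("/tmp/jagged.jsonl", "w", encoding="utf_8") as fw:
        fw.writelines(json.dumps(r) + "\n" for r in rows)

    try:
        ds = load_dataset("json", data_files="/tmp/jagged.jsonl")
    except Exception as err:
        # Arrow cannot unify dict vs. string for the same column, so loading fails.
        print(f"datasets.load_dataset failed: {err}")
        # Fallback mirroring the PR's _load_dataset work-around:
        with open("/tmp/jagged.jsonl", "r", encoding="utf_8") as fr:
            indexed_dataset = [json.loads(line) for line in fr]
        print(f"loaded {len(indexed_dataset)} samples via json.loads")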

# to loading a jsonl manually. In that case, we need to ensure each sample sent to _process_example is immutable or else
# you will have corrupted text, see here: https://github.com/NVIDIA/NeMo/blob/25b7d3d7f217a11f85ed23b5917b7b840f330e2f/nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_chat_dataset.py#L204
def _process_example(self, example):
if hasattr(self.indexed_dataset, "select"):
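
A minimal sketch of the immutability guard the code comment above describes; the base class comes from the linked NeMo module, while the subclass name and exact condition are assumptions rather than the PR's actual implementation:

    import copy

    from nemo.collections.nlp.data.language_modeling.megatron.gpt_sft_chat_dataset import GPTSFTChatDataset


    class TruncatedGPTSFTChatDataset(GPTSFTChatDataset):
        def _process_example(self, example):
            if hasattr(self.indexed_dataset, "select"):
                # HF/PyArrow-backed datasets hand out fresh objects on each access.
                return super()._process_example(example)
            # Manual json.loads path: the same dict object is returned on every
            # access, so deep-copy before processing to keep the stored sample intact.
            return super()._process_example(copy.deepcopy(example))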

Just taking a note to revisit this, depending on the resolution of my other comments in this file.

odelalleau (Collaborator) left a comment:

Another question related to the change for prepare_for_inference in 7178c88

Comment on lines 1354 to 1356
self.model.finish_inference()
if self.use_trtllm_generation:
self.trtllm_generate.free()

I struggle to parse this function so I may be wrong, but it seems to me that we don't have a 1:1 mapping between self.model.prepare_for_inference() and self.model.finish_inference(), because this block is under a for batch in divide_chunks(final_buffer, original_gbs_size): loop, so we may end up running it multiple times.

I'm also a bit suspicious of the fact that we call get_rewards_meta() after this block (because that function doesn't set prepare_for_inference=True when it calls get_generations()).
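
A structural sketch of the 1:1 pairing being asked for; divide_chunks, final_buffer, and the model/TRT attributes come from the quoted snippet and the comment above, while the method name and everything else are assumed for illustration:

    def _generate_and_judge(self, final_buffer, original_gbs_size):
        # Enter inference mode once, before iterating over the chunks.
        self.model.prepare_for_inference()
        try:
            for batch in divide_chunks(final_buffer, original_gbs_size):
                # Generate and judge each chunk here without re-entering or
                # tearing down inference mode inside the loop.
                ...
        finally:
            # Tear down exactly once, mirroring the single prepare call above.
            self.model.finish_inference()
            if self.use_trtllm_generation:
                self.trtllm_generate.free()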

trias702 (Author) replied:

This was a good catch, thanks. I believe I have fixed it now; please have a look.

trias702 and others added 2 commits February 7, 2025 12:50
Labels: Algorithms, CI, documentation (Improvements or additions to documentation), Run CICD (Set + un-set to retrigger), Utils
Projects: None yet
7 participants