feat: Self-Rewarding Algorithm with TRT Support #321

Open
wants to merge 316 commits into
base: main
Changes from 250 commits
Commits
316 commits
d81caef
fix build
gshennvm Apr 18, 2024
92c19f6
fix bug
gshennvm Apr 18, 2024
7088f54
add groupnorm build
jiemingz Apr 19, 2024
472a56c
upgrade to latest te and mcore
gshennvm Apr 18, 2024
032bf35
Merge remote-tracking branch 'origin/dev' into aligner_trt_build
gshennvm Apr 19, 2024
d5f55f5
fix
gshennvm Apr 19, 2024
c7cdca1
specify max token
Apr 20, 2024
04d02c8
fix
gshennvm Apr 20, 2024
56ccacf
Merge remote-tracking branch 'origin/geshen/main_trt' into aligner_tr…
gshennvm Apr 20, 2024
d23865f
fix critic checkpoint loading
gshennvm Apr 22, 2024
3c21c81
add assert
gshennvm Apr 22, 2024
2c99dcb
fix bug
gshennvm Apr 22, 2024
9e8526d
fix
gshennvm Apr 22, 2024
c1daeb9
fix
gshennvm Apr 23, 2024
e16c357
update dockerfile
gshennvm Apr 23, 2024
410eaf5
update to 24.03.01 deps
gshennvm Apr 23, 2024
e405432
fix
gshennvm Apr 24, 2024
07cfa67
update dockerfile
gshennvm Apr 24, 2024
b2dfee0
add dockerfileg
gshennvm Apr 26, 2024
63cd6b3
fix trtllm patch
Apr 29, 2024
6901348
clamp output with warning
Apr 29, 2024
74a0bb1
fix
gshennvm Apr 29, 2024
b6a05fd
remove debug statements
gshennvm Apr 30, 2024
db2701b
Merge remote-tracking branch 'origin/main' into aligner_trt_build
gshennvm Apr 30, 2024
8dd5c59
add debug info
gshennvm May 6, 2024
b5d6f88
bump pytrition version
gshennvm May 6, 2024
5464827
add critic speed
gshennvm May 7, 2024
00e4298
critic speedup
gshennvm May 7, 2024
fe6864b
Merge remote-tracking branch 'origin/geshen/critic_refactor' into ges…
gshennvm May 7, 2024
80579ec
fix
gshennvm May 13, 2024
f81f55a
add pad sequence length
gshennvm May 14, 2024
4a034f4
dockerfile
gshennvm May 15, 2024
66b5a54
higher stability
gshennvm May 16, 2024
7841381
Merge remote-tracking branch 'origin/main' into geshen/debug_critic
gshennvm May 16, 2024
1779c51
add
gshennvm May 16, 2024
e357ef9
add hack for ckpt
gshennvm May 24, 2024
6b606e8
fix conf
gshennvm May 24, 2024
a669837
no import
gshennvm May 24, 2024
ada5f45
add
gshennvm May 24, 2024
393acc6
fix
gshennvm May 25, 2024
b6a4d59
run through
gshennvm May 25, 2024
977e6e7
fix
gshennvm May 25, 2024
621718d
adaptive
gshennvm May 25, 2024
6109b8b
output tensor
gshennvm May 25, 2024
866c22b
add logging
gshennvm May 26, 2024
02aa2b8
fix for llama3
gshennvm May 28, 2024
e6f27c5
disable last checkpoint
gshennvm May 31, 2024
c689d2a
fix padding bug
gshennvm Jun 1, 2024
cd4aaa5
add critic warmup
gshennvm Jun 1, 2024
993e358
Revert "add"
gshennvm Jun 8, 2024
28fcaf3
Merge remote-tracking branch 'origin/main' into geshen/trt_llm_to_main
gshennvm Jun 8, 2024
ef347e5
fix module missing bug
gshennvm Jun 11, 2024
752d0bd
Ensure critic server does not squeeze out a singleton batch dim (#199)
terrykong Jun 11, 2024
78e6536
Merge branch 'geshen/llama3_rlhf' into geshen/trt_llm_to_main
gshennvm Jun 12, 2024
4de3eeb
Merge branch 'geshen/trt_llm_to_main' of github.com:NVIDIA/NeMo-Align…
gshennvm Jun 12, 2024
8a39881
TRTLLM PP wrong format WAR
jiemingz May 17, 2024
666e969
docker file branch
gshennvm Jun 12, 2024
3bec1bc
fix config
gshennvm Jun 12, 2024
3e7ca5f
remove prints
gshennvm Jun 12, 2024
12a0aae
remove print
gshennvm Jun 12, 2024
3956b6d
remove unneeded statement
gshennvm Jun 12, 2024
e090663
no save topk
gshennvm Jun 12, 2024
af83947
Merge remote-tracking branch 'origin/main' into geshen/trt_llm_to_main
gshennvm Jun 24, 2024
cc03b76
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 24, 2024
605bda1
critic speedup
gshennvm Jun 24, 2024
b3dedfd
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 24, 2024
9fb90ff
fix
gshennvm Jun 24, 2024
bf62bcc
Merge branch 'geshen/critic_speedup' of github.com:NVIDIA/NeMo-Aligne…
gshennvm Jun 24, 2024
aea50ad
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 24, 2024
0a9416e
fix
gshennvm Jun 24, 2024
41ffeb5
Merge branch 'geshen/critic_speedup' of github.com:NVIDIA/NeMo-Aligne…
gshennvm Jun 24, 2024
e3d85bf
add parallel_state
gshennvm Jun 24, 2024
1225041
fix
gshennvm Jun 24, 2024
5315fc5
rename
gshennvm Jun 24, 2024
9c6141b
fix
gshennvm Jun 24, 2024
2cef93e
fix
gshennvm Jun 24, 2024
f5bd8c5
pull changes from degert/spin-trt-beta (#220)
gshennvm Jun 25, 2024
cbd7095
add text input support
gshennvm Jun 25, 2024
c72e4c1
clean up location of tokenization
gshennvm Jun 25, 2024
220cf17
remove useless imports
gshennvm Jun 25, 2024
1ed3c46
refactor rm server
gshennvm Jun 26, 2024
aa4249d
remove run rm file
gshennvm Jun 26, 2024
22615b7
remove preferred batch size logic
gshennvm Jun 26, 2024
95a9507
add comment
gshennvm Jun 26, 2024
0fdb124
allow users to specify their own preferred batch size
gshennvm Jun 26, 2024
031287c
add comments and add changelog
gshennvm Jun 26, 2024
fa37930
update changelog
gshennvm Jun 26, 2024
c83a98d
remove old reward model callable
gshennvm Jun 26, 2024
a9f722a
inference should be done with collect loss data =True
gshennvm Jun 26, 2024
15460d2
Merge branch 'geshen/critic_speedup' into geshen/trt_llm_to_main
gshennvm Jun 27, 2024
f6c09ea
fix issues with merge
gshennvm Jun 27, 2024
1d85152
cleanup configs
gshennvm Jun 27, 2024
b3078c5
add strip to sequence length
gshennvm Jun 27, 2024
1eb0823
Merge remote-tracking branch 'origin/geshen/critic_speedup' into gesh…
gshennvm Jun 27, 2024
6203e90
change
gshennvm Jun 27, 2024
e51c45f
fix
gshennvm Jun 27, 2024
1b220f1
clean actor
gshennvm Jun 28, 2024
0815b56
backwards compatibility in actor
gshennvm Jun 28, 2024
8d75cbf
Apply suggestions from code review
gshennvm Jun 28, 2024
ab4c549
modify changelog
gshennvm Jun 28, 2024
82fb3a1
fixup! modify changelog
gshennvm Jun 28, 2024
12f85d2
add comments to ppo_critic config
gshennvm Jun 28, 2024
f90f4d6
add note on breaking change in inference rm
gshennvm Jun 28, 2024
527557d
change inference mbs to 4
gshennvm Jun 28, 2024
fe9d288
add comments for inference rm config
gshennvm Jun 28, 2024
f3124d3
revert gbs flag back to previous in ppo critic
gshennvm Jun 28, 2024
efadcae
delete unused variable
gshennvm Jun 28, 2024
35a2895
Update nemo_aligner/algorithms/critic_server_trainer.py
gshennvm Jun 28, 2024
4487932
remove add_eos arg, and update attribute annotate script
gshennvm Jun 29, 2024
7e7f27b
Merge branch 'geshen/critic_speedup' of github.com:NVIDIA/NeMo-Aligne…
gshennvm Jun 29, 2024
e9c7b39
no mutation on inputs when processing them for inference
gshennvm Jun 29, 2024
c6f6da4
fix bug when padding
gshennvm Jun 29, 2024
ebb69f4
add comment for forward_micro_batch_size in training_rm.yaml
gshennvm Jun 29, 2024
2775e81
change non_blocking to use sync
gshennvm Jun 29, 2024
fe0399f
Merge branch 'geshen/critic_speedup' into geshen/trt_llm_to_main
gshennvm Jun 30, 2024
ffa253f
nemo export api changes
jiemingz Jul 1, 2024
7ca9e34
upgrade to newer nemo export
gshennvm Jul 1, 2024
8181168
fix imports
gshennvm Jul 1, 2024
4d0853d
Communicator hang fix in the actor loop (#200)
terrykong Jul 1, 2024
ec548b8
add nemo guard for when things don't stop properly
gshennvm Jul 3, 2024
ce7a07f
cleanup communicator clean
gshennvm Jul 3, 2024
bb2fc48
fix
gshennvm Jul 3, 2024
606f690
critic speedup
gshennvm Jun 24, 2024
f48dc29
middle of PP should be broadcasted as well
gshennvm Jul 11, 2024
708bc24
update with critic changes
gshennvm Jul 11, 2024
48ad685
Merge remote-tracking branch 'origin/main' into geshen/trt_llm_to_main
gshennvm Jul 11, 2024
d053475
general cleanup
gshennvm Jul 11, 2024
0b4a92d
add checker for trt
gshennvm Jul 11, 2024
b72a5ec
remove comments
gshennvm Jul 11, 2024
984acaa
fix
gshennvm Jul 12, 2024
c11e1d7
fix
gshennvm Jul 12, 2024
14a9926
another fix
gshennvm Jul 12, 2024
7c2fc3e
add typing
gshennvm Jul 12, 2024
fe02867
cleanup
gshennvm Jul 12, 2024
02ad2fa
ppo trainer should use stop and get time
gshennvm Jul 12, 2024
24c53be
add some comments
gshennvm Jul 12, 2024
8a25e5e
critic warmup should have good default
gshennvm Jul 12, 2024
24f138a
added ppo in changelog
gshennvm Jul 12, 2024
9c72c53
add comments
gshennvm Jul 12, 2024
5ed9cd8
Avoids crash in PPOTrainer if using adafactor w/o learning rate (#234)
terrykong Jul 12, 2024
8b6627a
rename
gshennvm Jul 12, 2024
56032c8
Merge branch 'geshen/trt_llm_to_main' of github.com:NVIDIA/NeMo-Align…
gshennvm Jul 12, 2024
1e17f8b
Raise exceptions if using trtllm and use_Greedy in sampling params is…
terrykong Jul 12, 2024
e0a94d0
fix bugs
gshennvm Jul 12, 2024
280ad36
Initial commit of TRT version of self-rewarding
trias702 Jul 13, 2024
3af712d
Fixed bug again for limit_train_batches
trias702 Jul 13, 2024
835b3b3
cleanup pad id handling when PP > 1
gshennvm Jul 13, 2024
2b95331
fix issue with PP > 1 check
gshennvm Jul 14, 2024
261269a
add is_end logic
gshennvm Jul 14, 2024
83ba660
add is end logic
gshennvm Jul 14, 2024
f3912e7
add is end logic
gshennvm Jul 14, 2024
09d2783
fix
gshennvm Jul 14, 2024
5105ed9
fix
gshennvm Jul 14, 2024
a2bf8a0
fix another bug
gshennvm Jul 15, 2024
d9d45d6
Merge remote-tracking branch 'origin/main' into geshen/trt_llm_to_main
gshennvm Jul 15, 2024
f00d09e
update changelog
gshennvm Jul 15, 2024
830e599
update dockerfile
gshennvm Jul 15, 2024
09c357c
Update the hash of the conversion script to include TE fix for mcore …
terrykong Jul 16, 2024
c41cc08
update docs
gshennvm Jul 16, 2024
ab8c97b
Lots of bugfixes
trias702 Jul 17, 2024
92fec51
Exposed repetition_penalty to TRT and added in intra-mb randomness
trias702 Jul 17, 2024
7f7d4f9
removed assert statements in trt_llm
trias702 Jul 18, 2024
395cf04
add repetition penalty
gshennvm Jul 18, 2024
3a8da14
random seed and better clamping
gshennvm Jul 18, 2024
4174248
add memory logging
gshennvm Jul 18, 2024
36d8ab4
make it clear seed can be None
gshennvm Jul 18, 2024
82e8793
fix seed args
gshennvm Jul 18, 2024
7506be0
clear memory
gshennvm Jul 19, 2024
c8b88c3
Fixed the PP bug I was getting
trias702 Jul 19, 2024
1d1f051
add more clear memory
gshennvm Jul 20, 2024
aae1cd3
remove assert
gshennvm Jul 21, 2024
b4afcf6
RC-1 status reached
trias702 Jul 25, 2024
dab8c61
fix unloading
gshennvm Jul 25, 2024
5651e20
Added new generation algo
trias702 Jul 27, 2024
ff3bc1c
optimisations for generation code
trias702 Jul 30, 2024
aadc662
merged and made more bugfixes
trias702 Aug 1, 2024
314c217
Fixed final trt decoding bug for pp > 1 and llama3 tokenizer
trias702 Aug 3, 2024
2fdb5b0
RC-2 status reached
trias702 Aug 5, 2024
31f4ba3
Integrated incorrect ref log probs fixes from PR 228
trias702 Aug 5, 2024
7dd511c
Added length control methodology to self rewarding
trias702 Aug 9, 2024
11d22e8
Fixed bug with length control
trias702 Aug 12, 2024
de8dda8
Fixed LocalNonpersistentObject error and added meta judge logic
trias702 Aug 17, 2024
4f81524
Attempts to fix the oscillation issue
trias702 Aug 27, 2024
723f55f
Stable fix for meta judge oscillation
trias702 Sep 5, 2024
acd4d07
Moved the templates to the conf file instead
trias702 Sep 24, 2024
a249a44
Merged to latest aligner main
trias702 Sep 24, 2024
ce687ed
Merge remote-tracking branch 'origin' into degert/self-rewarding-trt
trias702 Sep 25, 2024
005702b
Fixes to yaml tab issues
trias702 Sep 25, 2024
a35e359
Far enough along that I can file the PR
trias702 Sep 26, 2024
fc1def7
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 26, 2024
eab62af
Added self-rewarding rst doc
trias702 Sep 26, 2024
8805db5
Merge branch 'degert/self-rewarding-trt' of https://github.com/NVIDIA…
trias702 Sep 26, 2024
fb61b86
Fixed bad merge in utils.py
trias702 Sep 26, 2024
a7817b3
Removed trt patch file
trias702 Sep 26, 2024
71029d9
Fixed ordering of clamp bug
trias702 Oct 1, 2024
9051695
Merge remote-tracking branch 'origin' into degert/self-rewarding-trt
trias702 Oct 1, 2024
35f8ee4
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 1, 2024
ab2d3ce
Merge branch 'degert/self-rewarding-trt' of https://github.com/NVIDIA…
trias702 Oct 1, 2024
294afcd
Merge remote-tracking branch 'origin' into degert/self-rewarding-trt
trias702 Oct 24, 2024
c465f30
Added CI test for SPIN
trias702 Oct 26, 2024
2092084
Added CI tests for self-rewarding
trias702 Oct 29, 2024
99c7bcd
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 29, 2024
4e435f7
Added CI tests for generation
trias702 Oct 29, 2024
3d429fb
Merged in v13 TRT changes
trias702 Nov 1, 2024
b460a14
Removed max_input_tokens from TRT algos
trias702 Nov 1, 2024
780e8ab
Made config yaml fixes in response to initial comments
trias702 Nov 18, 2024
cc487fb
Updated to main branch
trias702 Nov 18, 2024
83e830a
Removed generation_batch_size param from TRT
trias702 Nov 18, 2024
5b7aae3
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 18, 2024
a1f9620
Minor fixes for new TRT api
trias702 Nov 20, 2024
01aced0
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 20, 2024
224de3d
SPIN bug fixes and migrated generation to work with TRT v13
trias702 Nov 21, 2024
82ff16d
Merge branch 'degert/self-rewarding-trt' of https://github.com/NVIDIA…
trias702 Nov 21, 2024
c608520
Changes to self_rewarding.yaml in response to review comments
trias702 Nov 22, 2024
e4d36b6
Added Torch Dynamo logic to self-rewarding
trias702 Nov 25, 2024
34e4994
Fixed minor issue with TRT v13 compatibility
trias702 Nov 26, 2024
4314347
Trying new fix for truncation for SPIN
trias702 Nov 27, 2024
92c6ee3
Update examples/nlp/gpt/conf/gpt_generation.yaml
trias702 Dec 3, 2024
260eb90
Update examples/nlp/gpt/conf/gpt_generation.yaml
trias702 Dec 3, 2024
f93acab
Update examples/nlp/gpt/conf/gpt_generation.yaml
trias702 Dec 3, 2024
e4b1712
Update examples/nlp/gpt/conf/gpt_generation.yaml
trias702 Dec 3, 2024
e18d2fc
Updates for PR review
trias702 Dec 5, 2024
d646801
Made fixes in response to PR review
trias702 Dec 6, 2024
de7b8aa
Moved limit_train_batches logic to build_dataloader and tried to make…
trias702 Dec 6, 2024
449bee7
Fixed SPIN metrics code
trias702 Dec 6, 2024
0e35eb4
Changed how meta_judge_pcnt applies to local DP batch
trias702 Dec 18, 2024
7178c88
Bugfix for the PP > 1 RNG issue
trias702 Jan 18, 2025
b93549f
Added some temp logging
trias702 Jan 23, 2025
0b04388
Merged in main
trias702 Jan 23, 2025
0b06d95
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 23, 2025
a5c9614
Initial commit for new experimental structure
trias702 Jan 23, 2025
cf58ead
Initial commit for new experimental structure
trias702 Jan 23, 2025
bbaf99f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 23, 2025
139d739
Moved self-rewarding and generation tests to new experimental folder
trias702 Jan 23, 2025
1794fc1
Added documentation for generation and made some fixes in response to…
trias702 Jan 25, 2025
5e163ef
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 25, 2025
deb3604
Updated for PR review suggestions
trias702 Jan 29, 2025
58a9c4c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 29, 2025
d895625
Fixed compute_num_steps_per_epoch since moving the limit_train_batche…
trias702 Jan 29, 2025
f4179c5
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 29, 2025
8239056
Fixed compute_num_steps_per_epoch issue for generation too
trias702 Jan 29, 2025
8925c02
Added copyright to files which didn't have it
trias702 Jan 29, 2025
6365f73
Changes for PR review
trias702 Jan 31, 2025
8c9d4f7
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 31, 2025
cfe86fa
Removed Verifier util for now, and fixed limit_train_batches for KTO
trias702 Jan 31, 2025
18f17f5
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 31, 2025
d3e55ee
Moved TRT inference deactivation outside for loop and added boolean f…
trias702 Feb 7, 2025
bfc81c6
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 7, 2025
8fa17bb
Fixed meta-reward generation bug accidentally using the actor policy …
trias702 Feb 13, 2025
24e857a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 13, 2025
1 change: 1 addition & 0 deletions .github/workflows/cicd-main.yml
@@ -128,6 +128,7 @@ jobs:
- reinforce-llama3-pp2-reshard
- dpo-llama3
- dpo-llama3-pack
- spin-llama3
- kd-llama3
- sft-llama3
- rm-llama3
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -44,6 +44,8 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
- Using latest TransformerEngine versions may require `++model.dist_ckpt_load_strictness=log_all` when loading from a older pre-existing checkpoint to not error out.
- NeMo-Aligner now requires Megatron-LM==0.9.0 for the APIs to calculate the microbatch sizes (API introduced `megatron.core.num_microbatches_calculator.reconfigure_num_microbatch_calculator`).
- NeMo-Aligner now requires a version of NeMo with this change to how the MoE spec is handled: https://github.com/NVIDIA/NeMo/pull/9035 .
- Using the new `TruncatedGPTSFTChatDataset` with max_seq_length > 0 will result in a finalised, loaded dataset which has a different number of samples
than what was previously loaded with `GPTSFTChatDataset`

### Bug Fixes
- It is now required, for stability, to add `export NCCL_ALGO=...` to scripts launching PPO training loop. Please see the [RLHF docs](./docs/user-guide/rlhf.rst) for information.
@@ -56,7 +58,9 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)

### New Features and Optimizations
- Implement Kahneman-Tversky Optimization (KTO).
- Sequence packing is now supported when running SFT with SFTChatDataset.
trias702 marked this conversation as resolved.
- Sequence packing is now supported when running SFT with prompt-response datasets.
- [EXPERIMENTAL] Implement the [Self-Rewarding](https://arxiv.org/abs/2401.10020) and [Meta-Rewarding](https://arxiv.org/abs/2407.19594) algorithms.

### Breaking Changes

235 changes: 235 additions & 0 deletions docs/user-guide-experimental/generation.rst
@@ -0,0 +1,235 @@
.. include:: /content/nemo.rsts

Model Generation with Data Parallelism and TRT
Collaborator

revise heading for SEO

Model Generation with Data Parallelism and TensorRT (TRT)

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

The NeMo framework supports efficient model generation via the NeMo Aligner codebase.

All algorithms in NeMo Aligner are compatible with any GPT-based model from Megatron Core (i.e., those with mcore_gpt=True in the configuration). For this tutorial, we will demonstrate the generation pipeline using a `2B GPT model with 4096 sequence length <https://huggingface.co/nvidia/GPT-2B-001>`__. This tutorial is also applicable to other GPT models, such as Llama models, regardless of their size.
Comment on lines +6 to +8
Collaborator

suggest a revision to the introductory text, including the purpose and other copyedits

This tutorial demonstrates efficient model generation using NeMo Framework and the NeMo-Aligner codebase. It shows how to set up a 2B GPT model with a sequence length of 4096, available on `Hugging Face <https://huggingface.co/nvidia/GPT-2B-001>`__, and applies to other models like Llama.

The tutorial covers obtaining and preparing a pretrained model, configuring parameters, and running the generation process. It highlights using aligned models for better outputs and provides steps for terminal and Slurm execution, ensuring efficient data parallelism and handling TransformerEngine issues. All NeMo-Aligner algorithms work with any GPT-based model from Megatron Core.


Obtaining a pretrained model
Collaborator

fix capitalization and change procedural heading to an imperative verb

Obtain a Pretrained Model

############################
To start, we must first get an aligned model to generate responses from. There are 2 models we recommend to get started. The rest of the tutorial will work with either model, but for demonstration purposes we will use the smaller 2B model.
Collaborator

suggested revision and fix grammar

To get started, we need an aligned model for generating responses. We recommend two models: 2B GPT and LLaMa2 7B. While the tutorial works with either, we will use the smaller 2B model for demonstration purposes.


.. tab-set::

.. tab-item:: 2B GPT
:sync: key1

#. Get the 2B checkpoint via ``wget https://huggingface.co/nvidia/GPT-2B-001/resolve/main/GPT-2B-001_bf16_tp1.nemo``
Collaborator

add a period

#. Get the 2B checkpoint via ``wget https://huggingface.co/nvidia/GPT-2B-001/resolve/main/GPT-2B-001_bf16_tp1.nemo``.

#. Extract the NeMo File to a folder with ``mkdir model_checkpoint && tar -xvf GPT-2B-001_bf16_tp1.nemo -C model_checkpoint``
Collaborator

add a period, fix capitalization

  #. Extract the NeMo file to a folder with ``mkdir model_checkpoint && tar -xvf GPT-2B-001_bf16_tp1.nemo -C model_checkpoint``.

#. And then run the script to convert from old NeMo checkpoint to Megatron-Core checkpoint. The script is located `here <https://github.com/NVIDIA/NeMo/blob/86b198ff93438d454f9c7f3550bcfb7d4e59feab/scripts/nlp_language_modeling/convert_nemo_gpt_to_mcore.py>`__.
Collaborator

revise (remove And then) and add definite article "the"

#. Run the script to convert from the old NeMo checkpoint to the Megatron-Core checkpoint. The script is located `here <https://github.com/NVIDIA/NeMo/blob/86b198ff93438d454f9c7f3550bcfb7d4e59feab/scripts/nlp_language_modeling/convert_nemo_gpt_to_mcore.py>`__.

.. code-block:: bash

python convert_nemo_gpt_to_mcore.py \
--in-folder ./model_checkpoint \
--out-file ./mcore_gpt.nemo

.. tab-item:: LLaMa2 7B
:sync: key2

#. Download the `Llama 2 7B LLM model and tokenizer <https://huggingface.co/meta-llama/Llama-2-7b>`__ into the models folder.
#. Convert the LLaMa2 LLM into ``.nemo`` format
Collaborator

add a colon

#. Convert the LLaMa2 LLM into ``.nemo`` format:

.. code-block:: bash

python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
--input_name_or_path /path/to/llama --output_path /output_path/mcore_gpt.nemo

After these steps you should have a file ``mcore_gpt.nemo`` to use in NeMo-Aligner.
Collaborator

add a comma

After these steps, you should have a file, ``mcore_gpt.nemo``, to use in NeMo-Aligner.


.. note::
Mcore models use TransformerEngine as a backend, and it tries to find efficient kernels. But depending on the GPU you have it may not find them. If you ever face errors that relate to kernel finding set these variables on top of your script.
Collaborator

revise

Mcore models utilize TransformerEngine as a backend to find efficient kernels. However, depending on your GPU, it may not always succeed. If you encounter errors related to kernel finding, set these variables at the top of your script.


.. code-block:: bash

export NVTE_MASKED_SOFTMAX_FUSION=0
export NVTE_FLASH_ATTN=0
export NVTE_FUSED_ATTN=0

Additionally, TransformerEngine is non-deterministic by default, meaning subsequent runs of generation using identical parameters will produce different results, which is not ideal for generation.
Helpfully, TransformerEngine exposes a flag to set if you want to guarantee deterministic generation runs:
Comment on lines +49 to +50
Collaborator

revise

Additionally, TransformerEngine is non-deterministic by default. This means that running the same generation with identical parameters multiple times will yield different results, which is not ideal for generation consistency. Fortunately, TransformerEngine provides a flag that you can set to ensure deterministic generation runs:


.. code-block:: bash

export NVTE_ALLOW_NONDETERMINISTIC_ALGO=0
export NVTE_MASKED_SOFTMAX_FUSION=0

Aligned vs Foundational (base) model for Generation
Collaborator

fix capitalization and punctuation

Aligned Model vs. Foundational (Base) Model for Generation

###################################################
Generation can be run on either base/foundational models, that is, models which have only been trained on autoregressive language prediction tasks and not on instruction following tasks,
or, you can also run generation on models which have been aligned on instruction-based or preference-based datasets as well, similar to DPO/PPO. Either model will work, but you will get much higher quality
responses (generations) from an aligned model, and we highly recommend using an aligned model for generation if you want high quality responses.
Comment on lines +59 to +61
Collaborator

suggested revision to break up long sentences and fix punctuation

Generation can be executed on either base or foundational models. These are models that have been trained solely on autoregressive language prediction tasks and not on instruction-following tasks. Alternatively, you can run generation on models that have been aligned with instruction-based or preference-based datasets, similar to DPO/PPO. Both types of models are capable of performing generation. However, you will achieve significantly higher quality responses (generations) from an aligned model. Therefore, we highly recommend using an aligned model for generation if you want high-quality responses.


Data Format for Generation
##########################

The input files for generation in Aligner use the exact same format of .jsonl files as used by SFT in Nemo and Aligner. Please see the data formatting section of SFT to understand the data format necessary for Self-Rewarding :ref:`SFT guide <sft>`
Please note that Aligner generation does not support the use of mmap or binary files, only .jsonl files in the SFT format.
Comment on lines +66 to +67
Collaborator

revise and use full name/capitalization for NeMo-Aligner and NeMo

The input files for generation in NeMo-Aligner use the same format of .jsonl files as those used by SFT in NeMo and NeMo-Aligner. Please see the data formatting section of SFT to understand the necessary data format for self-rewarding training :ref:`SFT guide <sft>`. Note that NeMo-Aligner generation does not support the use of mmap or binary files, only .jsonl files in SFT format.
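
Since only .jsonl input is accepted, a quick pre-flight check can catch a malformed file before a generation job is launched. The sketch below is an illustration only and is not part of NeMo-Aligner: the function name `count_jsonl_records` is hypothetical, it only verifies that every line parses as a JSON object, and it does not check the SFT field names (those are defined in the SFT guide). Treating blank lines as errors is a conservative assumption.

```python
import json


def count_jsonl_records(path: str) -> int:
    """Count records in a .jsonl file, raising ValueError on the first bad line.

    Hypothetical helper: it checks JSON well-formedness only, not the SFT schema.
    """
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                raise ValueError(f"line {line_no}: blank line in .jsonl file")
            record = json.loads(line)  # raises json.JSONDecodeError (a ValueError)
            if not isinstance(record, dict):
                raise ValueError(f"line {line_no}: expected a JSON object")
            count += 1
    return count
```

Calling `count_jsonl_records("/path/to/generation_sft_format.jsonl")` returns the number of prompts in the file, which is also the number to compare against the global batch size when checking for dropped samples.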


Running Generation in Aligner
Collaborator

change procedural heading to imperative verb and add hyphen

Run Generation in NeMo-Aligner

#############################

Once your data is processed into the correct format you are ready to begin generation. You must start with a pretrained or aligned model. For this section we will use the aligned model from the previous section for generation.
Collaborator

revise

After processing your data into the correct format, you can begin generation. Start with a pretrained or aligned model. For this section, we'll use the aligned model from the previous section to generate the content.

For the purposes of the following sections, we'll assume your generation jsonl file is located in ``/path/to/generation_sft_format.jsonl``.
Collaborator

fix punctuation

For the purposes of the following sections, we'll assume your generation .jsonl file is located in ``/path/to/generation_sft_format.jsonl``.


The key parameters for generation are located under ``model.generation`` and include the following:

``model.generation.num_responses_to_gen`` - controls how many responses you want the model to generate per prompt
Collaborator

add a period

``model.generation.num_responses_to_gen`` - controls how many responses you want the model to generate per prompt.


The following block shows the standard Nemo sampling params for generating responses, which are the same as we use across all Nemo and Aligner codebases:
Collaborator

revise and use full name/capitalization for NeMo-Aligner and NeMo

The following block shows the standard NeMo sampling params for generating responses, which are the same as we use across all NeMo and NeMo-Aligner codebases:


.. code-block:: yaml
sampling_params:
use_greedy: False
temperature: 1.0
top_k: 0
top_p: 1.0
repetition_penalty: 1.0
add_BOS: False
all_probs: False
compute_logprob: False
end_strings: ["<|endoftext|>", "<extra_id_1>"]

# length argument for autoregressive sampling
# max length means max amount of tokens to generate
length_params:
max_length: ${int_div:${model.encoder_seq_length}, 2}
min_length: 1

Finally, we have the TRT parameters, which allows for faster TRTLLM-based response generation:
Collaborator

fix subject-verb agreement

Finally, we have the TRT parameters, which allow for faster TRTLLM-based response generation:


.. code-block:: yaml

   trt_llm:
     enable: True # use this to turn TRT on/off
     # reshard: False # reshard is not supported in generation

     # TRTLLM preallocates activation memory according to the number of input tokens
     max_input_len: ${subtract:${model.encoder_seq_length}, ${model.generation.length_params.max_length}}

     model_type: gptnext # can be gptj, gptnext, llama, gemma, falcon

     # Generation does not have a training stage, so there is no need to unload the engine.
     unload_engine_train: False
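
To make the two length interpolations concrete, here is a small arithmetic sketch. It assumes that ``int_div`` performs integer (floor) division and that ``subtract`` performs plain subtraction, as their names suggest. The sequence length of 4096 is only an example value, and ``generation_lengths`` is a hypothetical helper, not part of NeMo-Aligner.

```python
def generation_lengths(encoder_seq_length: int) -> tuple[int, int]:
    """Mirror the two config interpolations with plain arithmetic."""
    # ${int_div:${model.encoder_seq_length}, 2}, assumed to be floor division
    max_length = encoder_seq_length // 2
    # ${subtract:${model.encoder_seq_length}, <max_length>}, assumed plain subtraction
    max_input_len = encoder_seq_length - max_length
    return max_length, max_input_len


max_length, max_input_len = generation_lengths(4096)
print(max_length, max_input_len)  # 2048 2048
```

Under these assumptions, the prompt budget (``max_input_len``) and the generation budget (``max_length``) always sum to ``model.encoder_seq_length``.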


Keep in mind that Aligner generation utilises data parallelism to speed up generation. This means that your input data file will be divided by GBS, and data which is
not cleanly divisible by GBS will be dropped starting from the end of the file. For example, if your data file has 11639 samples with a GBS of 32, this means that
11639 mod 32 = 23 samples will be dropped and not generated. To avoid this, you can either reduce your data parallelism to 1, or you can pad your data file up to the nearest
multiple of your GBS (you can pad with basic prompts like "how are you"). Additionally, if you truncate your input data using the ``model.data.train_ds.max_seq_length`` parameter,
then your data will be reduced even further. Truncation applies before the DP truncation.
Comment on lines +115 to +119
Collaborator

suggested revision

.. note::

   NeMo-Aligner generation uses data parallelism to speed up generation. Your input data file will be divided by GBS. Data not cleanly divisible by GBS will be dropped from the end of the file.

   **Example**:
   If your data file has 11,639 samples with a GBS of 32, then 11,639 mod 32 = 23 samples will be dropped and not generated.

   To avoid this, you can either:

   - Reduce your data parallelism to 1.
   - Pad your data file up to the nearest multiple of your GBS (e.g., with basic prompts like "how are you?").

   Additionally, if you truncate your input data using the ``model.data.train_ds.max_seq_length`` parameter, your data will be reduced even further. Truncation applies before the DP truncation.
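
The padding option can be sketched in a few lines of Python. This is a minimal illustration, not a utility shipped with NeMo-Aligner: ``samples_dropped`` and ``pad_records`` are hypothetical helpers, and the ``"input"`` field name is an assumption that should be replaced with whatever keys your SFT-format file actually uses.

```python
def samples_dropped(num_samples: int, gbs: int) -> int:
    """Number of trailing samples that do not fill a complete global batch."""
    return num_samples % gbs


def pad_records(records: list, gbs: int, pad_record: dict) -> list:
    """Return a copy of records padded up to the next multiple of gbs."""
    remainder = len(records) % gbs
    if remainder == 0:
        return list(records)
    # Append copies of the filler record until the length is divisible by gbs.
    return list(records) + [dict(pad_record) for _ in range(gbs - remainder)]


print(samples_dropped(11639, 32))  # 23

# "input" is a placeholder key used only for this illustration.
padded = pad_records([{"input": "x"}] * 11639, 32, {"input": "how are you?"})
print(len(padded))  # 11648
```

The padded prompts still produce responses in the output file, so you may want to drop the last ``gbs - remainder`` records after generation.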


With your data prepared, you can now run generation. We demonstrate two techniques below, one using cmdline inputs directly, and another demonstrating the use of SLURM.
Collaborator

revise

With your data prepared, you can now run the generation. We demonstrate two techniques below: one using command-line inputs directly, and another demonstrating the use of SLURM.



.. tab-set::

    .. tab-item:: Terminal
        :sync: key3

        To run generation directly in the terminal:

        .. code-block:: bash

            export GPFS="/path/to/nemo-aligner-repo"
            export TRAIN_DATA_PATH="/path/to/generation_sft_format.jsonl"

            python -u ${GPFS}/examples/nlp/gpt/run_generation.py \
               trainer.num_nodes=1 \
               trainer.devices=8 \
               model.micro_batch_size=1 \
               model.global_batch_size=32 \
               pretrained_checkpoint.restore_from_path=/path/to/megatron_gpt_sft.nemo \
               "model.data.train_ds.file_path=${TRAIN_DATA_PATH}" \
               exp_manager.create_wandb_logger=false \
               exp_manager.wandb_logger_kwargs.project=null \
               exp_manager.wandb_logger_kwargs.name=null \
               exp_manager.explicit_log_dir=/results \
               ++model.sequence_parallel=false \
               ++model.apply_rope_fusion=false \
               trainer.generation.max_epochs=1 \
               model.generation.num_responses_to_gen=1 \
               trainer.generation.trt_llm.enable=true

    .. tab-item:: Slurm
        :sync: key4

        To run generation with Slurm, use the script below. The script uses 4 nodes, but you can change the node count to something different:

        .. code-block:: bash

            #!/bin/bash
            #SBATCH -A <<ACCOUNT NAME>>
            #SBATCH -p <<PARTITION NAME>>
            #SBATCH -N 4
            #SBATCH -t 4:00:00
            #SBATCH -J <<JOB NAME>>
            #SBATCH --ntasks-per-node=8
            #SBATCH --gpus-per-node 8
            #SBATCH --exclusive
            #SBATCH --overcommit

            GPFS="/path/to/nemo-aligner-repo"
            PRETRAINED_CHECKPOINT_NEMO_FILE="/path/to/megatron_gpt_sft.nemo"

            TRAIN_DATA_PATH="/path/to/generation_sft_format.jsonl"

            PROJECT="<<WANDB PROJECT>>"

            CONTAINER=<<<CONTAINER>>> # use the latest NeMo Training container, Aligner will work there
            MOUNTS="--container-mounts=${GPFS}:${GPFS},${TRAIN_DATA_PATH}:${TRAIN_DATA_PATH},${PRETRAINED_CHECKPOINT_NEMO_FILE}:${PRETRAINED_CHECKPOINT_NEMO_FILE}"

            RESULTS_DIR="/path/to/result_dir"

            OUTFILE="${RESULTS_DIR}/rm-%j_%t.out"
            ERRFILE="${RESULTS_DIR}/rm-%j_%t.err"
            mkdir -p ${RESULTS_DIR}

            read -r -d '' cmd <<EOF
            echo "*******STARTING********" \
            && echo "---------------" \
            && echo "Starting generation" \
            && cd ${GPFS} \
            && export PYTHONPATH="${GPFS}:${PYTHONPATH}" \
            && export NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 \
            && export NVTE_MASKED_SOFTMAX_FUSION=0 \
            && export HYDRA_FULL_ERROR=1 \
            && python -u ${GPFS}/examples/nlp/gpt/run_generation.py \
               trainer.num_nodes=${SLURM_JOB_NUM_NODES} \
               trainer.devices=8 \
               pretrained_checkpoint.restore_from_path='${PRETRAINED_CHECKPOINT_NEMO_FILE}' \
               "model.data.train_ds.file_path=${TRAIN_DATA_PATH}" \
               model.micro_batch_size=1 \
               model.global_batch_size=32 \
               ++model.sequence_parallel=false \
               ++model.apply_rope_fusion=false \
               exp_manager.explicit_log_dir=${RESULTS_DIR} \
               exp_manager.create_wandb_logger=False \
               exp_manager.wandb_logger_kwargs.name=null \
               exp_manager.wandb_logger_kwargs.project=null \
               trainer.generation.max_epochs=1 \
               model.generation.num_responses_to_gen=1 \
               trainer.generation.trt_llm.enable=true
            EOF

            srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
            set +x

The output file containing the responses will be located in ``${RESULTS_DIR}/generations/generations.jsonl``. All responses will be stored to this file as they
are generated, and even if your generation process abruptly terminates, it will resume where it left off once restarted. Once generation is complete all of your
responses will be located in this file.

Comment on lines +217 to +220
Collaborator

suggested revision

The output file containing the responses will be located in ${RESULTS_DIR}/generations/generations.jsonl. All responses will be stored in this file as they are generated. If your generation process abruptly terminates, it will resume where it left off once restarted. Upon completion, all responses will be located in this file.

The output is a .jsonl file in which each line is a JSON object of the following form:

.. code-block:: json

   {
     step: the step number in the epoch,
     consumed_samples: the number of samples consumed so far of the input dataset,
     prompt: the prompt passed to the model,
     responses: a list of length ``model.generation.num_responses_to_gen`` which contains all of the responses to the input prompt
   }

The step and consumed_samples fields are not needed by the end user, but they're there so that the process can correctly resume if it unexpectedly goes down in the middle
of a generation run.
Comment on lines +230 to +231
Collaborator

suggested revision

You do not need the step and consumed_samples fields. However, they are included to ensure that the process can accurately resume if it is unexpectedly interrupted during a generation run.
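
As an illustration of consuming this file, the sketch below reads records line by line and keeps only the ``prompt`` and ``responses`` fields described above. ``iter_generations`` is a hypothetical helper, and the sample record is made up for the example.

```python
import json
from typing import Iterable, Iterator


def iter_generations(lines: Iterable[str]) -> Iterator[tuple[str, list]]:
    """Yield (prompt, responses) pairs from .jsonl lines, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # step and consumed_samples are only used for resuming, so ignore them here.
        yield record["prompt"], record["responses"]


# A made-up record in the documented shape.
sample = ['{"step": 0, "consumed_samples": 32, "prompt": "Hi", "responses": ["Hello!"]}']
for prompt, responses in iter_generations(sample):
    print(prompt, responses)
```

For a real run, pass an open file handle instead of ``sample``, for example ``with open(path) as f: pairs = list(iter_generations(f))``.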


Please note that the responses will contain all raw tokens which the model generated, this includes all special headers, turn starts/ends, and BOS/EOS tokens. To get a "clean" output
the end user must filter this out themselves via some sort of post-processing step (which is not currently provided).
Comment on lines +233 to +234
Collaborator

suggested revision

Please note that the responses will include all raw tokens generated by the model. This includes special headers, turn starts/ends, and BOS/EOS tokens. To obtain a "clean" output, you will need to filter these out through a post-processing step, which is not currently provided.
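
One possible starting point for such a post-processing step is sketched below. It only truncates a response at the first end string, using the two ``end_strings`` values from the sampling parameters shown earlier. It does not remove headers or turn markers at the start of a response, since those depend on your prompt template. ``truncate_at_end_string`` is a hypothetical helper, not an existing NeMo-Aligner function.

```python
# The two end strings listed under sampling_params.end_strings.
END_STRINGS = ("<|endoftext|>", "<extra_id_1>")


def truncate_at_end_string(response: str, end_strings=END_STRINGS) -> str:
    """Cut the response at the earliest end string, if any is present."""
    cut = len(response)
    for marker in end_strings:
        idx = response.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    # Drop trailing whitespace left in front of the marker.
    return response[:cut].rstrip()


print(truncate_at_end_string("Paris is the capital.<extra_id_1>User"))
```

Extend ``END_STRINGS`` with any other special tokens your model emits before relying on the cleaned text.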

