Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nemo-automodel checkpoint-io refactor #12070

Draft
wants to merge 100 commits into
base: main
Choose a base branch
from

Conversation

akoumpa
Copy link
Member

@akoumpa akoumpa commented Feb 5, 2025

Changes:

  1. Centralize all checkpoint-related functionality in HFCheckpointIO, which is responsible for saving the model, optimizer and trainer state-dicts. By Centralizing the checkpoint-io, it can be shared across different strategies (e.g. SingleDeviceStrategy, DDPStrategy, FSDP2Strategy).
  2. Non-MegatronStrategy do not define a consumed_samples attribute, and instead the step attribute is used.

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

@akoumpa akoumpa changed the title init commit nemo-automodel checkpoint-io refactor Feb 5, 2025
@akoumpa akoumpa force-pushed the akoumparouli/nemo_automodel_checkpoint_io_refactor branch 5 times, most recently from 2ebc0ed to 62c9b22 Compare February 6, 2025 06:20
nemo/lightning/io/hf.py Fixed Show fixed Hide fixed
nemo/lightning/io/hf.py Fixed Show fixed Hide fixed
nemo/lightning/io/hf.py Fixed Show fixed Hide fixed
nemo/lightning/io/hf.py Fixed Show fixed Hide fixed
nemo/lightning/io/hf.py Fixed Show fixed Hide fixed
nemo/lightning/io/hf.py Fixed Show fixed Hide fixed
@akoumpa akoumpa force-pushed the akoumparouli/nemo_automodel_checkpoint_io_refactor branch 2 times, most recently from 162eaa2 to 103326b Compare February 6, 2025 06:34
examples/llm/sft/hf.py Fixed Show fixed Hide fixed
@akoumpa akoumpa force-pushed the akoumparouli/nemo_automodel_checkpoint_io_refactor branch 6 times, most recently from 5a687db to e4327a3 Compare February 7, 2025 18:24
nemo/lightning/io/hf.py Fixed Show fixed Hide fixed
nemo/lightning/io/hf.py Fixed Show fixed Hide fixed
nemo/lightning/io/hf.py Fixed Show fixed Hide fixed
nemo/lightning/io/hf.py Fixed Show fixed Hide fixed
@akoumpa akoumpa force-pushed the akoumparouli/nemo_automodel_checkpoint_io_refactor branch 13 times, most recently from ef2c811 to cf72230 Compare February 9, 2025 20:50
@akoumpa akoumpa force-pushed the akoumparouli/nemo_automodel_checkpoint_io_refactor branch from 04afe44 to ec8b098 Compare February 13, 2025 09:38
akoumpa and others added 4 commits February 13, 2025 01:38
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Comment on lines +263 to +269
# if trainer.state.fn == TrainerFn.FITTING:
# # Load optimizer
# trainer.strategy.load_optimizer_state_dict(adapter_state)
# # Load lr scheduler
# if (lr_schedulers := adapter_state.get('lr_schedulers', None)) is not None:
# for config, lrs_state in zip(trainer.lr_scheduler_configs, lr_schedulers):
# config.scheduler.load_state_dict(lrs_state)

Check notice

Code scanning / CodeQL

Commented-out code Note

This comment appears to contain commented-out code.

Copilot Autofix AI about 18 hours ago

To fix the problem, we should remove the commented-out code entirely. This will help in maintaining a clean and readable codebase. The specific lines to be removed are 264 to 270 in the file nemo/lightning/pytorch/callbacks/peft.py.

Suggested changeset 1
nemo/lightning/pytorch/callbacks/peft.py

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/nemo/lightning/pytorch/callbacks/peft.py b/nemo/lightning/pytorch/callbacks/peft.py
--- a/nemo/lightning/pytorch/callbacks/peft.py
+++ b/nemo/lightning/pytorch/callbacks/peft.py
@@ -263,9 +263,2 @@
         # trainer.lightning_module.configure_optimizers()
-        # if trainer.state.fn == TrainerFn.FITTING:
-        #     # Load optimizer
-        #     trainer.strategy.load_optimizer_state_dict(adapter_state)
-        #     # Load lr scheduler
-        #     if (lr_schedulers := adapter_state.get('lr_schedulers', None)) is not None:
-        #         for config, lrs_state in zip(trainer.lr_scheduler_configs, lr_schedulers):
-        #             config.scheduler.load_state_dict(lrs_state)
 
EOF
@@ -263,9 +263,2 @@
# trainer.lightning_module.configure_optimizers()
# if trainer.state.fn == TrainerFn.FITTING:
# # Load optimizer
# trainer.strategy.load_optimizer_state_dict(adapter_state)
# # Load lr scheduler
# if (lr_schedulers := adapter_state.get('lr_schedulers', None)) is not None:
# for config, lrs_state in zip(trainer.lr_scheduler_configs, lr_schedulers):
# config.scheduler.load_state_dict(lrs_state)

Copilot is powered by AI and may make mistakes. Always verify output.
Positive Feedback
Negative Feedback

Provide additional feedback

Please help us improve GitHub Copilot by sharing more details about this comment.

Please select one or more of the options
@akoumpa akoumpa added Run CICD and removed Run CICD labels Feb 13, 2025
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
@akoumpa akoumpa force-pushed the akoumparouli/nemo_automodel_checkpoint_io_refactor branch from 08f0c82 to fb995a8 Compare February 13, 2025 16:20
Signed-off-by: Alexandros Koumparoulis <[email protected]>
@akoumpa akoumpa force-pushed the akoumparouli/nemo_automodel_checkpoint_io_refactor branch from bfb4b87 to 75255dc Compare February 13, 2025 16:21
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Signed-off-by: Alexandros Koumparoulis <[email protected]>
@akoumpa akoumpa force-pushed the akoumparouli/nemo_automodel_checkpoint_io_refactor branch from 7d302fd to 7bcd293 Compare February 13, 2025 16:23
@akoumpa akoumpa added Run CICD and removed Run CICD labels Feb 13, 2025
Signed-off-by: Alexandros Koumparoulis <[email protected]>
@akoumpa akoumpa force-pushed the akoumparouli/nemo_automodel_checkpoint_io_refactor branch from 10d67b5 to 7ddb611 Compare February 13, 2025 16:29
@akoumpa akoumpa added Run CICD and removed Run CICD labels Feb 13, 2025
Signed-off-by: Alexandros Koumparoulis <[email protected]>
@akoumpa akoumpa force-pushed the akoumparouli/nemo_automodel_checkpoint_io_refactor branch from 51c9bb6 to a173c05 Compare February 13, 2025 16:32
@akoumpa akoumpa added Run CICD and removed Run CICD labels Feb 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant