
add FSDP QLoRA test and revert failing PR #403

Merged · 3 commits merged into pytorch:main on Jun 21, 2024
Conversation

weifengpy (Contributor) commented Jun 19, 2024

Fixes the error when running torchtune QLoRA + FSDP2 (#380):
`TypeError: nf4_detach() missing 1 required positional argument: 'args'`

torchtune repro commands:

tune download meta-llama/Llama-2-7b-hf --output-dir /tmp/Llama-2-7b-hf --hf-token <HF_TOKEN>
tune run --nnodes 1 --nproc_per_node 4 lora_finetune_fsdp2 --config llama2/7B_qlora enable_activation_checkpointing=False

This PR:
  • reverts the NF4 changes from "Factor out dispatch and layout registration table" (#360)
  • adds an FSDP2 + QLoRA multi-GPU test (`pytest -s test/dtypes/test_nf4.py -k test_qlora`): an end-to-end FSDP2 + QLoRA test
  • adds an NF4 clone test (`pytest -s test/dtypes/test_nf4.py -k test_tensor_copy`): torchtune implemented NF4 clone; this upstreams it to TorchAO. It is needed because the unit test calls `copy.deepcopy(model)` (see the sketch after this list)
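For context on the clone piece, here is a minimal sketch of what an NF4 clone override can look like in torchao's decorator style. The handler signature `(aten_op, args, kwargs)` and the dequantize/requantize round trip via `get_original_weight()`/`to_nf4()` are assumptions for illustration, not necessarily the exact code this PR lands:

```python
import torch
# assumed imports: torchao.dtypes.nf4tensor exposes to_nf4 and an
# `implements`-style registration decorator in this sketch
from torchao.dtypes.nf4tensor import implements, to_nf4

@implements([torch.ops.aten.clone.default])
def nf4_clone(aten_op, args, kwargs=None):
    # args[0] is the NF4Tensor being cloned, e.g. via copy.deepcopy(model):
    # Tensor.__deepcopy__ falls back to clone() for tensor subclasses.
    nf4tensor = args[0]
    # Dequantize to a plain Tensor, then requantize into a fresh NF4Tensor.
    # Assumes the default quantization block sizes match the original's.
    return to_nf4(nf4tensor.get_original_weight())
```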

pytorch-bot bot commented Jun 19, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/403

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit a9f6cca with merge base 6b0ca2d:

BROKEN TRUNK - The following job failed but was already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 19, 2024
@weifengpy weifengpy marked this pull request as draft June 19, 2024 03:02
@weifengpy weifengpy marked this pull request as ready for review June 19, 2024 22:02
@@ -11,10 +11,6 @@
from torch import Tensor
from torch.distributed.device_mesh import DeviceMesh
from torch._prims_common import make_contiguous_strides_for
from torchao.dtypes.utils import (
weifengpy (Contributor, Author) commented:
#360 consolidated `_implements` and `_ATEN_OP_OR_TORCH_FN_TABLE`, but it breaks torchtune. Reverting for now to unblock torchtune quickly.
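To make the reverted change concrete: as described in this thread, #360 collapsed separate registration tables into one shared table. The sketch below uses hypothetical names to show the shape of that consolidation; it is not the exact torchao code:

```python
# Before #360 (sketch): one table per dispatch protocol.
_ATEN_OP_TABLE = {}   # keyed by aten ops, consulted in __torch_dispatch__
_TORCH_FN_TABLE = {}  # keyed by torch functions, consulted in __torch_function__

# After #360 (sketch): a single shared table and decorator.
_ATEN_OP_OR_TORCH_FN_TABLE = {}

def _implements(ops):
    """Register one handler for a list of aten ops or torch functions."""
    def decorator(handler):
        for op in ops:
            _ATEN_OP_OR_TORCH_FN_TABLE[op] = handler
        return handler
    return decorator
```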

Contributor commented:

Do you know how exactly this breaks torchtune? Is it a versioning issue between saved models and this new model?

weifengpy (Contributor, Author) commented Jun 20, 2024:

The error is `TypeError: nf4_detach() missing 1 required positional argument: 'args'`, so something is incompatible around `_ATEN_OP_OR_TORCH_FN_TABLE[func](*args, **kwargs)`.

The error shows up when people start training in torchtune for the first time.
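Here is a minimal, self-contained sketch of that failure mode (hypothetical table and decorator in the shape of the sketch above, not torchao's exact internals): the handler keeps the pre-#360 `(aten_op, args, kwargs)` signature, but the consolidated call site unpacks `*args`, so the first tensor argument binds to `aten_op` and `args` is left unfilled.

```python
import torch

_ATEN_OP_OR_TORCH_FN_TABLE = {}  # hypothetical stand-in for torchao's table

def implements(ops):
    def decorator(handler):
        for op in ops:
            _ATEN_OP_OR_TORCH_FN_TABLE[op] = handler
        return handler
    return decorator

# Handler keeps the pre-#360 signature: (aten_op, args, kwargs).
@implements([torch.ops.aten.detach.default])
def nf4_detach(aten_op, args, kwargs=None):
    return args[0]

func = torch.ops.aten.detach.default
args, kwargs = (torch.zeros(4),), {}

# Old call convention: works.
_ATEN_OP_OR_TORCH_FN_TABLE[func](func, args, kwargs)

# New call convention unpacks args, so the tensor binds to `aten_op`:
# TypeError: nf4_detach() missing 1 required positional argument: 'args'
_ATEN_OP_OR_TORCH_FN_TABLE[func](*args, **kwargs)
```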

Contributor commented:

@jerryzh168 any thoughts on why this is happening? Otherwise, are you okay with undoing your changes?

Contributor commented:

@drisspg to add a bit more to what @weifengpy already said, the full stack trace is here.

Contributor commented:

[Screenshot: stack trace, 2024-06-21]

I think there is something weird going on with the torch_function dispatch, where the args are becoming the arg name for aten. If someone wants to track down how this should work and why, I am for it, but otherwise I will approve to unblock.
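In other words, the call site and the handler signature disagree. A sketch of a compatible call site, continuing the hypothetical table above and matching its `(aten_op, args, kwargs)` handlers; the subclass name is illustrative:

```python
import torch

class MyNF4Tensor(torch.Tensor):
    """Hypothetical tensor subclass, continuing the table sketch above."""
    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        # Pass the op plus the *packed* args/kwargs through to the handler,
        # rather than unpacking *args into it.
        if func in _ATEN_OP_OR_TORCH_FN_TABLE:
            return _ATEN_OP_OR_TORCH_FN_TABLE[func](func, args, kwargs or {})
        raise NotImplementedError(f"no NF4 handler registered for {func}")
```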

def test_qlora_fsdp2(self):
    from torch.distributed._composable.fsdp import CPUOffloadPolicy, OffloadPolicy

    self.run_subtests(
weifengpy (Contributor, Author) commented:

The e2e multi-GPU FSDP + QLoRA test should be able to catch regressions in the future.
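For readers who want the rough shape of such a test, here is a sketch only: the `LoRALinearNF4` module and the wiring below are illustrative assumptions, not the test added in this PR (that lives in `test/dtypes/test_nf4.py -k test_qlora`), and it requires an initialized process group with CUDA devices.

```python
import torch
import torch.nn as nn
from torch.distributed._composable.fsdp import fully_shard
from torchao.dtypes.nf4tensor import linear_nf4, to_nf4

class LoRALinearNF4(nn.Module):
    """Hypothetical QLoRA-style linear: frozen NF4 base weight + LoRA adapters."""
    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.weight = nn.Parameter(to_nf4(torch.randn(dim, dim)), requires_grad=False)
        self.lora_a = nn.Linear(dim, rank, bias=False)
        self.lora_b = nn.Linear(rank, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return linear_nf4(x, self.weight) + self.lora_b(self.lora_a(x))

def qlora_fsdp2_smoke_test() -> None:
    # Assumes torch.distributed is initialized (e.g. via torchrun) with CUDA.
    model = nn.Sequential(LoRALinearNF4(64), LoRALinearNF4(64)).cuda()
    for layer in model:
        fully_shard(layer)   # FSDP2: shard each submodule...
    fully_shard(model)       # ...then the root module
    out = model(torch.randn(4, 64, device="cuda"))
    out.sum().backward()     # exercises NF4 detach/copy paths under FSDP2
```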

@weifengpy weifengpy merged commit 2eb08be into pytorch:main Jun 21, 2024
12 of 13 checks passed
dbyoung18 pushed a commit to dbyoung18/ao that referenced this pull request Jul 31, 2024
* add FSDP QLoRA test and revert failing PR

* check pytorch version and cuda for ci

* revert linter

Labels: CLA Signed