
Model doesn't train even when ModuleValidator.validate yields no errors #672

Open
sarakodeiri opened this issue Sep 11, 2024 · 4 comments

@sarakodeiri
🐛 Bug

We're trying to privately fine-tune a ViT-B/16 model (link) on CIFAR-10. The non-private version uses MultiheadAttention, which is not compatible with DP. ModuleValidator.fix resolves this by replacing it with DPMultiheadAttention, after which ModuleValidator.validate reports no errors. However, the model fails to train and throws the following error:
[NotImplementedError("Model contains a trainable layer with buffers that Opacus doesn't currently support

To fix this, I referred to previous issue #454 and changed the hook style to "ew" (Expanded Weights). The model, optimizer, and train_loader are then created without errors, but in the training loop another error shows up:
RuntimeError: Expanded Weights encountered but cannot handle function view
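
For completeness, the Expanded Weights switch amounts to something like this (a sketch; in the Opacus versions I have used, the relevant argument on PrivacyEngine.make_private is called grad_sample_mode, and the noise/clipping values below are placeholders):

    from opacus import PrivacyEngine

    privacy_engine = PrivacyEngine()
    model, optimizer, train_loader = privacy_engine.make_private(
        module=model,
        optimizer=optimizer,
        data_loader=train_loader,
        noise_multiplier=1.0,   # placeholder value
        max_grad_norm=1.0,      # placeholder value
        grad_sample_mode="ew",  # "hooks" raises the NotImplementedError above
    )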

I don't know how to proceed from here. Any help is appreciated. Thank you!

To Reproduce

Colab link: Colab

Steps to reproduce the behavior:

  1. Run Colab file
  2. Attempt to create the privacy engine with hook_style set to "hooks" (this raises the NotImplementedError above)
  3. Change the hook style to "ew" and run the training loop (this raises the RuntimeError above)

Expected behavior

I expect the ViT-B/16 model to be ready to train, especially since ModuleValidator.validate reports no errors for the architecture and its modules.

Environment

  • PyTorch Version (e.g., 1.0): 2.2.1
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): conda
  • Build command you used (if compiling from source): Not compiling from source.
  • Python version: 3.12.2
  • CUDA/cuDNN version: N/A
  • GPU models and configuration: Didn't use GPUs yet, but will use A50/A100 once it gets resolved.
  • Any other relevant information: N/A
@EnayatUllah
Contributor

Hi, thanks for raising this! We are currently working on fixing these incompatibility issues with Expanded Weights. In the meantime, I would suggest you try the hooks mode: identify which parts of the model use buffers and replace them with similar buffer-free modules.
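
As a starting point, you can enumerate which submodules directly own buffers with standard PyTorch introspection, e.g. (a quick sketch):

    # Print every submodule that directly owns a buffer; these are the
    # candidates to replace with buffer-free equivalents.
    for name, module in model.named_modules():
        buffers = list(module.named_buffers(recurse=False))
        if buffers:
            print(name, type(module).__name__, [b_name for b_name, _ in buffers])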

@sarakodeiri
Author

Hi, thank you so much for answering!

  1. How could I identify the parts that use buffers? There's a CLIP wrapper around the model, and Opacus's error message flags the entire module as the problem, like this:

    NotImplementedError: Model contains a trainable layer with buffers that Opacus doesn't currently support: CLIP(
      (visual): VisionTransformer(
        (conv1): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16), bias=False)
        (patch_dropout): Identity()
        (ln_pre): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (transformer): Transformer(
          (resblocks): ModuleList(
            (0-11): 12 x ResidualAttentionBlock(
              (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (attn): DPMultiheadAttention(
                (qlinear): Linear(in_features=768, out_features=768, bias=True)
                (klinear): Linear(in_features=768, out_features=768, bias=True)
                (vlinear): Linear(in_features=768, out_features=768, bias=True)
                (out_proj): Linear(in_features=768, out_features=768, bias=True)
                (dropout): Dropout(p=0.0, inplace=False)
              )
              (ls_1): Identity()
              (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (mlp): Sequential(
                (c_fc): Linear(in_features=768, out_features=3072, bias=True)
                (gelu): QuickGELU()
                (c_proj): Linear(in_features=3072, out_features=768, bias=True)
              )
              (ls_2): Identity()
            )
          )
        )
        (ln_post): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      )
      (transformer): Transformer(
        (resblocks): ModuleList(
          (0-11): ...

    (I'm no expert, but a more specific error message would be useful.)

  2. Say I correctly identify the problematic parts. Part of what makes ModuleValidator.fix so useful is that it swaps in compatible modules and transfers the weights accordingly. Is there a best practice or tested approach for moving the weights manually? (A rough sketch of what I mean appears after this list.)

  3. Maybe too broad of a question: Is there any documentation/report specifying what's incompatible with Opacus, since ModuleValidator.validate doesn't seem to cover everything?

  4. An even broader question, for my curiosity only: Can all non-private models be made private with Opacus? Or have there been cases where models can't be made private?
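
To make question 2 concrete, here is the kind of manual swap I have in mind (a hypothetical sketch mimicking the BatchNorm -> GroupNorm replacement that ModuleValidator.fix performs; swap_batchnorm_with_groupnorm is my own name):

    import torch.nn as nn

    def swap_batchnorm_with_groupnorm(model: nn.Module, num_groups: int = 32) -> None:
        # Replace each BatchNorm2d with a GroupNorm and carry the affine
        # parameters over (both have weight/bias of shape [num_channels]),
        # assuming num_features is divisible by num_groups.
        for name, child in model.named_children():
            if isinstance(child, nn.BatchNorm2d):
                gn = nn.GroupNorm(num_groups, child.num_features, affine=child.affine)
                if child.affine:
                    gn.weight.data.copy_(child.weight.data)
                    gn.bias.data.copy_(child.bias.data)
                setattr(model, name, gn)
            else:
                swap_batchnorm_with_groupnorm(child, num_groups)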

Thanks again!

@HuanyuZhang
Contributor

For now, we rely on both ModuleValidator and GradSampleModule.validate() to check compatibility. For the latter, in strict mode, GSM throws an error when the module includes a buffer (https://github.com/pytorch/opacus/blob/main/opacus/grad_sample/grad_sample_module.py#L108). The error can be muted by setting strict=False.
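
Concretely, something along these lines (a sketch, wrapping the model manually rather than through PrivacyEngine.make_private):

    from opacus.grad_sample import GradSampleModule

    # strict=True (the default) raises NotImplementedError for trainable
    # layers with buffers; strict=False mutes the error (in the versions
    # I have checked, it is downgraded to a warning).
    gsm = GradSampleModule(model, batch_first=True, strict=False)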

@HuanyuZhang
Contributor

  1. If you do not wrap the whole model in CLIP, the validator should be able to tell you which submodule includes a buffer.
  2. I do not fully understand your question. Would you mind explaining more?
  3. Thanks for your suggestion. Indeed, we have a plan to update the documentation to improve clarity.
  4. Let us separate this question into two parts:
  • Can any non-private model be trained by DP-SGD? The answer is no. For modules with buffers that reveal private information (like BatchNorm), or CLIP-style losses (non-linear across the samples in a mini-batch), I am not aware of a way to train them with DP-SGD.
  • Can any DP-SGD-compatible model be trained by Opacus? The answer is yes, though potentially with minor code tweaks. One common issue is that Opacus requires the batch dimension to sit in a consistent position across submodules; some custom modules violate this assumption with permutation operations, which leads to gradient-mismatch errors (e.g., Training a simple transformer model with Opacus produces runtime error due to mismatch in dimensions #666). A toy illustration follows below.
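
To illustrate the batch-dimension point (a hypothetical module, not taken from the issue):

    import torch.nn as nn

    class SeqFirstBlock(nn.Module):
        # Toy module that moves the batch dimension away from position 0
        # internally -- the pattern that confuses Opacus's per-sample
        # gradient bookkeeping and triggers mismatch errors like #666.
        def __init__(self, dim: int = 16):
            super().__init__()
            self.linear = nn.Linear(dim, dim)

        def forward(self, x):            # x: [batch, seq, dim]
            x = x.permute(1, 0, 2)       # -> [seq, batch, dim]; batch is no longer dim 0
            x = self.linear(x)           # hooks now treat seq as the batch dimension
            return x.permute(1, 0, 2)    # back to [batch, seq, dim]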
