BUG: Overwriting with DataParallel #210

Open
EWitting opened this issue Jun 25, 2024 · 0 comments
Labels
00 - Bug: Something isn't working
11 - Trainer: Issues related to model trainers
needs milestone: Marks an issue as needing a milestone

Comments

@EWitting (Contributor)

We automatically enable CUDA and DataParallel when one or more GPUs are available. This is reasonable, since you can simply set CUDA_VISIBLE_DEVICES as an environment variable, although it does make testing less convenient.
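
A minimal sketch of that auto-enable behaviour as I understand it; the Trainer class and attribute names are hypothetical stand-ins, not the project's actual code:

```python
import torch
import torch.nn as nn

class Trainer:
    def __init__(self, model: nn.Module):
        # Assumed current behaviour: if any GPU is visible, move the model
        # to CUDA and wrap it in DataParallel, overwriting self.model in place.
        self.model = model
        if torch.cuda.device_count() >= 1:
            self.model = nn.DataParallel(self.model.cuda())
```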

The main problem is that we overwrite self.model with the DataParallel-wrapped model. As a result, self.model can be either the architecture itself or a wrapped architecture, depending on whether a GPU is used. This sometimes makes it more complicated to access part of the architecture.
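
For example, any code that needs a submodule currently has to branch on whether the wrapper is present (the encoder attribute below is a hypothetical name used for illustration):

```python
import torch.nn as nn

def get_encoder(model: nn.Module) -> nn.Module:
    # "encoder" is a hypothetical submodule name.
    if isinstance(model, nn.DataParallel):
        return model.module.encoder  # wrapped: unwrap via .module first
    return model.encoder             # unwrapped: direct attribute access
```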

Most importantly, saving self.model also saves the DataParallel wrapper, which references cuda:0 and cuda:1. If the same model is then loaded on a PC with a different number of GPUs, loading crashes, and model.cpu() doesn't work either. A solution would probably be to track two attributes separately: self.module, holding the architecture and weights used for saving and loading, and self.model, used for training and inference and set in the post-init depending on the available hardware.
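
A rough sketch of that split, assuming a Trainer-style class; the names, the use of __init__ rather than the post-init hook, and the state_dict-based save/load are all illustrative assumptions, not the project's actual API:

```python
import torch
import torch.nn as nn

class Trainer:
    def __init__(self, architecture: nn.Module):
        # self.module always holds the bare architecture and its weights.
        self.module = architecture
        # self.model is what training/inference runs on; it may be a wrapper.
        if torch.cuda.device_count() >= 1:
            self.module = self.module.cuda()
            self.model = nn.DataParallel(self.module)
        else:
            self.model = self.module

    def save(self, path: str) -> None:
        # Saving only the bare module avoids baking the DataParallel wrapper
        # (and its cuda:N device references) into the checkpoint.
        torch.save(self.module.state_dict(), path)

    def load(self, path: str) -> None:
        # map_location="cpu" keeps loading independent of how many GPUs the
        # saving machine had.
        state = torch.load(path, map_location="cpu")
        self.module.load_state_dict(state)
```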

@EWitting added the "00 - Bug" and "11 - Trainer" labels on Jun 25, 2024
@schobbejak added the "needs milestone" label on Jun 25, 2024
@schobbejak changed the title from "BUG: overwriting with DataParallel" to "BUG: Overwriting with DataParallel" on Aug 12, 2024