Training Freezes Before Starting #2

Open
yukiarimo opened this issue May 11, 2024 · 5 comments
@yukiarimo

(tiny-audio-diffusion) yuki@yuki tiny-audio-diffusion % python train.py exp=drum_diffusion trainer.gpus=1 datamodule.dataset.path=/Users/yuki/Downloads/tiny-audio-diffusion/samples
GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[2024-05-11 02:16:26,217][main.utils][INFO] - Disabling python warnings! <config.ignore_warnings=True>
Global seed set to 12345
[2024-05-11 02:16:26,220][__main__][INFO] - Instantiating datamodule <main.diffusion_module.Datamodule>.
[2024-05-11 02:16:27,005][__main__][INFO] - Instantiating model <main.diffusion_module.Model>.
[2024-05-11 02:16:27,183][__main__][INFO] - Instantiating callback <pytorch_lightning.callbacks.RichProgressBar>.
[2024-05-11 02:16:27,183][__main__][INFO] - Instantiating callback <pytorch_lightning.callbacks.ModelCheckpoint>.
[2024-05-11 02:16:27,185][__main__][INFO] - Instantiating callback <pytorch_lightning.callbacks.RichModelSummary>.
[2024-05-11 02:16:27,186][__main__][INFO] - Instantiating callback <main.diffusion_module.SampleLogger>.
[2024-05-11 02:16:27,187][__main__][INFO] - Instantiating logger <pytorch_lightning.loggers.wandb.WandbLogger>.
wandb: Currently logged in as: yukiarimo. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.17.0
wandb: Run data is saved locally in /Users/yuki/Downloads/tiny-audio-diffusion/logs/wandb/run-20240511_021628-7k1pjexi
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run unconditional_diffusion
wandb: ⭐️ View project at https://wandb.ai/yukiarimo/wandbprojectname
wandb: 🚀 View run at https://wandb.ai/yukiarimo/wandbprojectname/runs/7k1pjexi
[2024-05-11 02:16:33,399][__main__][INFO] - Instantiating trainer <pytorch_lightning.Trainer>.
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[2024-05-11 02:16:33,438][__main__][INFO] - Logging hyperparameters!
[2024-05-11 02:16:33,456][__main__][INFO] - Starting training.
┏━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃   ┃ Name                ┃ Type           ┃ Params ┃
┡━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ 0 │ model               │ DiffusionModel │ 31.6 M │
│ 1 │ model.net           │ Module         │ 31.6 M │
│ 2 │ model.diffusion     │ VDiffusion     │ 31.6 M │
│ 3 │ model.sampler       │ VSampler       │ 31.6 M │
│ 4 │ model_ema           │ EMA            │ 63.1 M │
│ 5 │ model_ema.ema_model │ DiffusionModel │ 31.6 M │
└───┴─────────────────────┴────────────────┴────────┘
Trainable params: 31.6 M                                                        
Non-trainable params: 31.6 M                                                    
Total params: 63.1 M                                                            
Total estimated model params size (MB): 126                                     
GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[the four-line banner above appears eight times in a row in the original log, after which nothing further is printed]
@crlandsc
Owner

Hi @yukiarimo. What you shared are pretty standard logs, so they don't really point to what your issue might be. I have not tested this repo on MPS, only on NVIDIA GPUs and CPUs, so I would start there (i.e. remove the trainer.gpus=1 argument).

@crlandsc crlandsc closed this as completed Jul 5, 2024
@dillfrescott

I'm getting the same issue on an RTX 4090. It just stops at:

Trainable params: 31.6 M
Non-trainable params: 31.6 M
Total params: 63.1 M
Total estimated model params size (MB): 126

and nothing happens after that.

@crlandsc
Owner

Hi @dillfrescott. This is a hard problem to diagnose with so little information. To start narrowing down the possibilities, it would be worth trying to train a PyTorch Lightning model from a different repo. You should also check that your dependencies are aligned, as mismatched versions can cause strange issues. I wish I could offer more insight, but it's hard to tell without access to your setup.

@kimbring2

@dillfrescott @crlandsc It seems like there is a problem related to multiprocess data loading. I solved the freezing issue by setting num_workers to 1.
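For context, here is a minimal stdlib-only sketch (hypothetical, not from this repo) of why multiprocess data loading can be fragile here: with Python's "spawn" start method (the default on macOS and Windows), each worker process re-imports the main module, which is consistent with the Lightning banner repeating in the log above and can hang the run if top-level code isn't guarded.

```python
# Hypothetical sketch: under the "spawn" start method, every worker
# re-imports this module, so any unguarded top-level code runs once per
# worker -- the likely reason the Lightning banner repeats in the log above.
import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    # Without the __main__ guard, each spawned worker would re-execute the
    # module top level and try to create its own pool, deadlocking the run.
    with mp.Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3]))  # prints [1, 4, 9]
```

Note that in PyTorch's DataLoader, num_workers=0 loads data in the main process with no worker processes at all, which is the safest baseline when debugging hangs like this; num_workers=1 still spawns one worker.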

@crlandsc
Owner

@kimbring2 Good find! It may be a versioning thing then. Multi-worker data loading worked when I trained the original models, but I have had num_workers issues in other, more recent projects that use Lightning too.

@crlandsc crlandsc reopened this Nov 30, 2024