Training Freezes Before Starting #2

Open
yukiarimo opened this issue May 11, 2024 · 5 comments
@yukiarimo

(tiny-audio-diffusion) yuki@yuki tiny-audio-diffusion % python train.py exp=drum_diffusion trainer.gpus=1 datamodule.dataset.path=/Users/yuki/Downloads/tiny-audio-diffusion/samples
GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[2024-05-11 02:16:26,217][main.utils][INFO] - Disabling python warnings! <config.ignore_warnings=True>
Global seed set to 12345
[2024-05-11 02:16:26,220][__main__][INFO] - Instantiating datamodule <main.diffusion_module.Datamodule>.
[2024-05-11 02:16:27,005][__main__][INFO] - Instantiating model <main.diffusion_module.Model>.
[2024-05-11 02:16:27,183][__main__][INFO] - Instantiating callback <pytorch_lightning.callbacks.RichProgressBar>.
[2024-05-11 02:16:27,183][__main__][INFO] - Instantiating callback <pytorch_lightning.callbacks.ModelCheckpoint>.
[2024-05-11 02:16:27,185][__main__][INFO] - Instantiating callback <pytorch_lightning.callbacks.RichModelSummary>.
[2024-05-11 02:16:27,186][__main__][INFO] - Instantiating callback <main.diffusion_module.SampleLogger>.
[2024-05-11 02:16:27,187][__main__][INFO] - Instantiating logger <pytorch_lightning.loggers.wandb.WandbLogger>.
wandb: Currently logged in as: yukiarimo. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.17.0
wandb: Run data is saved locally in /Users/yuki/Downloads/tiny-audio-diffusion/logs/wandb/run-20240511_021628-7k1pjexi
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run unconditional_diffusion
wandb: ⭐️ View project at https://wandb.ai/yukiarimo/wandbprojectname
wandb: 🚀 View run at https://wandb.ai/yukiarimo/wandbprojectname/runs/7k1pjexi
[2024-05-11 02:16:33,399][__main__][INFO] - Instantiating trainer <pytorch_lightning.Trainer>.
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[2024-05-11 02:16:33,438][__main__][INFO] - Logging hyperparameters!
[2024-05-11 02:16:33,456][__main__][INFO] - Starting training.
┏━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃   ┃ Name                ┃ Type           ┃ Params ┃
┡━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ 0 │ model               │ DiffusionModel │ 31.6 M │
│ 1 │ model.net           │ Module         │ 31.6 M │
│ 2 │ model.diffusion     │ VDiffusion     │ 31.6 M │
│ 3 │ model.sampler       │ VSampler       │ 31.6 M │
│ 4 │ model_ema           │ EMA            │ 63.1 M │
│ 5 │ model_ema.ema_model │ DiffusionModel │ 31.6 M │
└───┴─────────────────────┴────────────────┴────────┘
Trainable params: 31.6 M                                                        
Non-trainable params: 31.6 M                                                    
Total params: 63.1 M                                                            
Total estimated model params size (MB): 126                                     
GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[the four-line banner above appears eight times in a row in the original log, after which nothing further is printed]
@crlandsc
Owner

Hi @yukiarimo. What you shared are pretty standard logs, so they don't really point to what your issue might be. I have not tested this repo on MPS, only on NVIDIA GPUs and CPUs, so I would start there (i.e. remove the trainer.gpus=1 argument).

@crlandsc crlandsc closed this as completed Jul 5, 2024
@dillfrescott

I'm getting the same issue on an RTX 4090. It just stops at:

Trainable params: 31.6 M
Non-trainable params: 31.6 M
Total params: 63.1 M
Total estimated model params size (MB): 126

and nothing happens after that.

@crlandsc
Owner

Hi @dillfrescott. This is a hard problem to diagnose with so little information. To start narrowing down the possibilities, it would be worth trying to train a PyTorch Lightning model from a different repo. You should also check that your dependencies are aligned, as mismatched versions can cause strange issues. I wish I could offer more insight, but it's hard to tell without access to your setup.

@kimbring2

@dillfrescott @crlandsc It seems like there is a problem related to multiprocess data loading. I solved the freezing issue by setting num_workers to 1.
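For context, here is a minimal stdlib-only sketch (hypothetical, not from this repo) of why multiprocess data loading can be fragile here: with Python's "spawn" start method (the default on macOS and Windows), each worker process re-imports the main module, which is consistent with the Lightning banner repeating in the log above and can hang the run if top-level code isn't guarded.

```python
# Hypothetical sketch: under the "spawn" start method, every worker
# re-imports this module, so any unguarded top-level code runs once per
# worker -- the likely reason the Lightning banner repeats in the log above.
import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    # Without the __main__ guard, each spawned worker would re-execute the
    # module top level and try to create its own pool, deadlocking the run.
    with mp.Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3]))  # prints [1, 4, 9]
```

Note that in PyTorch's DataLoader, num_workers=0 loads data in the main process with no worker processes at all, which is the safest baseline when debugging hangs like this; num_workers=1 still spawns one worker.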

@crlandsc
Owner

@kimbring2 Good find! It may be a versioning thing then. Multi-worker data loading worked when I trained the original models, but I have had num_workers issues in other, more recent projects that use Lightning too.

@crlandsc crlandsc reopened this Nov 30, 2024