What happened?
- Adjusted the batch size (also tried it another time while changing the model data type)
- Pressed start
- Training could not proceed due to a CUDA error, which I suspect is caused by something not being cleaned up properly.

This has a 100% repro rate for me. Tried four separate times whilst testing claims in the Discord of XYZ not working.
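If I had to guess from the traceback below, the offload cache tensors stay pinned after the first run, so the restart tries to pin the same host memory again. Here is a minimal sketch of that failure mode (not OneTrainer code; it assumes the same torch.cuda.cudart() host-registration call that pin_tensor_ appears to use, and needs a CUDA build of PyTorch):

```python
import torch

def try_pin(x: torch.Tensor) -> int:
    """Attempt to page-lock x's host storage; return the raw CUDA error code."""
    cudart = torch.cuda.cudart()
    err = cudart.cudaHostRegister(x.data_ptr(), x.numel() * x.element_size(), 0)
    return err.value

t = torch.empty(1 << 20, dtype=torch.float32)  # ordinary pageable host tensor
print(try_pin(t))  # expected: 0   (cudaSuccess)
print(try_pin(t))  # expected: 712 (cudaErrorHostMemoryAlreadyRegistered), same code as in the log below
```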
What did you expect would happen?
That I would be able to continue training with new settings after waiting for training to stop
Relevant log output
TensorBoard 2.18.0 at http://localhost:6006/ (Press CTRL+C to quit)
Traceback (most recent call last):
File "C:\repos\OneTrainer\modules\ui\TrainUI.py", line 560, in __training_thread_function
trainer.start()
File "C:\repos\OneTrainer\modules\trainer\GenericTrainer.py", line 135, in start
self.model_setup.setup_train_device(self.model, self.config)
File "C:\repos\OneTrainer\modules\modelSetup\StableDiffusion3LoRASetup.py", line 245, in setup_train_device
model.transformer_to(self.train_device)
File "C:\repos\OneTrainer\modules\model\StableDiffusion3Model.py", line 164, in transformer_to
self.transformer_offload_conductor.to(device)
File "C:\repos\OneTrainer\modules\util\LayerOffloadConductor.py", line 508, in to
offload_quantized(module, self.__temp_device, allocator=allocator.allocate_like)
File "C:\repos\OneTrainer\modules\util\quantization_util.py", line 184, in offload_quantized
tensor = allocator(module.weight)
File "C:\repos\OneTrainer\modules\util\LayerOffloadConductor.py", line 84, in allocate_like
self.__layer_allocator.ensure_allocation(cache_tensor_index)
File "C:\repos\OneTrainer\modules\util\LayerOffloadConductor.py", line 201, in ensure_allocation
pin_tensor_(self.cache_tensors[cache_tensor_index])
File "C:\repos\OneTrainer\modules\util\torch_util.py", line 190, in pin_tensor_
raise RuntimeError(f"CUDA Error while trying to pin memory. error: {err.value}, ptr: {x.data_ptr()}, size: {x.numel() * x.element_size()}")
RuntimeError: CUDA Error while trying to pin memory. error: 712, ptr: 2575826944256, size: 1653732967
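If I'm reading the CUDA error codes right, 712 is cudaErrorHostMemoryAlreadyRegistered, i.e. cudaHostRegister was called on a host range that is still registered from the previous run. I haven't tested a fix, but one possible direction is a tolerant variant of pin_tensor_ that treats an already-registered range as already pinned rather than raising. The names and constants below are mine, purely a sketch:

```python
import torch

CUDA_SUCCESS = 0
CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED = 712  # cudaErrorHostMemoryAlreadyRegistered

def pin_tensor_tolerant_(x: torch.Tensor) -> None:
    """Hypothetical variant of pin_tensor_ that tolerates a stale registration."""
    cudart = torch.cuda.cudart()
    ptr, size = x.data_ptr(), x.numel() * x.element_size()
    err = cudart.cudaHostRegister(ptr, size, 0)
    if err.value == CUDA_ERROR_HOST_MEMORY_ALREADY_REGISTERED:
        # The range is still page-locked from an earlier run; treat it as pinned.
        return
    if err.value != CUDA_SUCCESS:
        raise RuntimeError(
            f"CUDA Error while trying to pin memory. "
            f"error: {err.value}, ptr: {ptr}, size: {size}"
        )
```

The other obvious direction would be to cudaHostUnregister the cache tensors when a run stops, so the next run starts from a clean state; I don't know which fits the offload conductor's design better.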
Output of pip freeze