
[Bug]: Cancelling a training session that uses CPU offloading reliably soft-bricks further training; must close and reopen #574

Open
O-J1 opened this issue Nov 17, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@O-J1
Collaborator

O-J1 commented Nov 17, 2024

What happened?

  1. Pressed stop.
  2. Waited for it to stop.
  3. Adjusted the batch size (also tried this another time while changing the model data type).
  4. Pressed start.
  5. Training could not proceed due to a CUDA error, which I suspect is caused by something not being cleaned up properly (see the sketch after this list).

This has a 100% repro rate for me. I tried it four separate times whilst testing claims in the Discord that XYZ was not working.
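
As a hypothetical way to verify the cleanup theory, the snippet below checks whether a CPU tensor's backing memory is still registered (pinned) with the CUDA runtime after a cancelled run. This is my own sketch, not OneTrainer code: the helper name `is_host_registered` and the cudart DLL name are assumptions.

```python
# Hypothetical diagnostic, not OneTrainer code: check whether a CPU tensor's
# backing memory is still registered (pinned) with the CUDA runtime.
# cudaHostGetFlags() only succeeds for pinned/registered host pointers and
# returns an error otherwise. The DLL name assumes a CUDA 12 runtime on
# Windows; use "libcudart.so" on Linux.
import ctypes

import torch

cudart = ctypes.CDLL("cudart64_12.dll")


def is_host_registered(t: torch.Tensor) -> bool:
    flags = ctypes.c_uint(0)
    err = cudart.cudaHostGetFlags(ctypes.byref(flags), ctypes.c_void_p(t.data_ptr()))
    return err == 0  # 0 == cudaSuccess


x = torch.empty(16, dtype=torch.float32)
print(is_host_registered(x))               # False: plain pageable memory
print(is_host_registered(x.pin_memory()))  # True: freshly pinned copy
```

If cached offload tensors from the cancelled run still report as registered when the next run starts, that would explain why the next pin attempt collides.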

What did you expect would happen?

That I would be able to continue training with the new settings after waiting for training to stop.

Relevant log output

TensorBoard 2.18.0 at http://localhost:6006/ (Press CTRL+C to quit)
Traceback (most recent call last):
  File "C:\repos\OneTrainer\modules\ui\TrainUI.py", line 560, in __training_thread_function
    trainer.start()
  File "C:\repos\OneTrainer\modules\trainer\GenericTrainer.py", line 135, in start
    self.model_setup.setup_train_device(self.model, self.config)
  File "C:\repos\OneTrainer\modules\modelSetup\StableDiffusion3LoRASetup.py", line 245, in setup_train_device
    model.transformer_to(self.train_device)
  File "C:\repos\OneTrainer\modules\model\StableDiffusion3Model.py", line 164, in transformer_to
    self.transformer_offload_conductor.to(device)
  File "C:\repos\OneTrainer\modules\util\LayerOffloadConductor.py", line 508, in to
    offload_quantized(module, self.__temp_device, allocator=allocator.allocate_like)
  File "C:\repos\OneTrainer\modules\util\quantization_util.py", line 184, in offload_quantized
    tensor = allocator(module.weight)
  File "C:\repos\OneTrainer\modules\util\LayerOffloadConductor.py", line 84, in allocate_like
    self.__layer_allocator.ensure_allocation(cache_tensor_index)
  File "C:\repos\OneTrainer\modules\util\LayerOffloadConductor.py", line 201, in ensure_allocation
    pin_tensor_(self.cache_tensors[cache_tensor_index])
  File "C:\repos\OneTrainer\modules\util\torch_util.py", line 190, in pin_tensor_
    raise RuntimeError(f"CUDA Error while trying to pin memory. error: {err.value}, ptr: {x.data_ptr()}, size: {x.numel() * x.element_size()}")
RuntimeError: CUDA Error while trying to pin memory. error: 712, ptr: 2575826944256, size: 1653732967
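
For context on the error code: 712 corresponds to cudaErrorHostMemoryAlreadyRegistered in the CUDA runtime, i.e. cudaHostRegister() was asked to pin a host address range that is still registered from an earlier call. The minimal sketch below (my own, not OneTrainer code; the cudart DLL name is an assumption) reproduces that code by registering the same CPU tensor twice without unregistering in between, which is consistent with the cancelled run leaving its pinned cache tensors registered.

```python
# Minimal sketch, not OneTrainer code: registering the same host range twice
# without an intervening cudaHostUnregister() yields CUDA error 712
# (cudaErrorHostMemoryAlreadyRegistered). The DLL name assumes a CUDA 12
# runtime on Windows; use "libcudart.so" on Linux.
import ctypes

import torch

cudart = ctypes.CDLL("cudart64_12.dll")

x = torch.empty(1024, dtype=torch.float32)  # plain pageable CPU tensor
ptr = ctypes.c_void_p(x.data_ptr())
size = ctypes.c_size_t(x.numel() * x.element_size())

print(cudart.cudaHostRegister(ptr, size, ctypes.c_uint(0)))  # 0: first pin succeeds
print(cudart.cudaHostRegister(ptr, size, ctypes.c_uint(0)))  # 712: range already registered
print(cudart.cudaHostUnregister(ptr))                        # 0: unregister so it can be pinned again
```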

Output of pip freeze

absl-py==2.1.0
accelerate==1.0.1
aiohappyeyeballs==2.4.3
aiohttp==3.11.2
aiosignal==1.3.1
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
async-timeout==5.0.1
attrs==24.2.0
bitsandbytes==0.44.1
certifi==2024.8.30
charset-normalizer==3.4.0
cloudpickle==3.1.0
colorama==0.4.6
coloredlogs==15.0.1
contourpy==1.3.1
customtkinter==5.2.2
cycler==0.12.1
dadaptation==3.2
darkdetect==0.8.0
-e git+https://github.com/huggingface/diffusers.git@e45c25d03aeb0a967d8aaa0f6a79f280f6838e1f#egg=diffusers
filelock==3.16.1
flatbuffers==24.3.25
fonttools==4.55.0
frozenlist==1.5.0
fsspec==2024.10.0
ftfy==6.3.1
grpcio==1.68.0
huggingface-hub==0.26.2
humanfriendly==10.0
idna==3.10
importlib_metadata==8.5.0
invisible-watermark==0.2.0
Jinja2==3.1.4
kiwisolver==1.4.7
lightning-utilities==0.11.8
lion-pytorch==0.2.2
Markdown==3.7
markdown-it-py==3.0.0
MarkupSafe==3.0.2
matplotlib==3.9.2
mdurl==0.1.2
-e git+https://github.com/Nerogar/mgds.git@f9edb99bea18da54440c4600894027706b5172ce#egg=mgds
mpmath==1.3.0
multidict==6.1.0
networkx==3.4.2
numpy==1.26.4
nvidia-ml-py==12.560.30
omegaconf==2.3.0
onnxruntime-gpu==1.19.2
open_clip_torch==2.28.0
opencv-python==4.10.0.84
packaging==24.2
pillow==11.0.0
platformdirs==4.3.6
pooch==1.8.2
prodigyopt==1.0
propcache==0.2.0
protobuf==5.28.3
psutil==6.1.0
pydantic==2.9.2
pydantic_core==2.23.4
Pygments==2.18.0
pyparsing==3.2.0
pyreadline3==3.5.4
python-dateutil==2.9.0.post0
pytorch-lightning==2.4.0
pytorch_optimizer==3.1.2
PyWavelets==1.7.0
PyYAML==6.0.2
regex==2024.11.6
requests==2.32.3
rich==13.9.4
safetensors==0.4.5
scalene==1.5.45
schedulefree==1.2.7
sentencepiece==0.2.0
six==1.16.0
sympy==1.13.1
tensorboard==2.18.0
tensorboard-data-server==0.7.2
timm==1.0.11
tokenizers==0.20.3
torch==2.5.1+cu124
torchmetrics==1.6.0
torchvision==0.20.1+cu124
tqdm==4.66.6
transformers==4.46.0
typing_extensions==4.12.2
urllib3==2.2.3
wcwidth==0.2.13
Werkzeug==3.1.3
xformers==0.0.28.post3
yarl==1.17.1
zipp==3.21.0