Weird error that appears after running several epochs #149
Comments
I guess this error was caused by running out of GPU memory, so I added torch.cuda.empty_cache() after each epoch. It works for me now. |
I will leave this open, this bug is peculiar. Could you please provide your system information if possible? It may help others later. Thank you @hitxiaoting |
Thanks, vqdang, I have updated my system information. |
Hi @hitxiaoting, many thanks for raising the issue. I ran into the same problem when retraining HoverNet with our own dataset. To be clear about your proposed solution: did you add "torch.cuda.empty_cache()" at the end of def run() in the run_train.py script? |
Hi, I added this line to the file ./run_utils/engine.py at the end of the function "run()". Hope it helps. |
Many thanks, will have a try. |
Some feedback for @vqdang: on my side, the error still occurs with the solution proposed above. |
Are there any other solutions for this error? |
inst_com = list(measurements.center_of_mass(inst_map))
The above modification in targets.py worked for me. |
Same, I'm getting:
|
@sumanthdonapati your solution worked for me! thanks! |
vqdang#149: sometimes inst_com contains NaN values, and I don't know why they occur. I am just replacing the NaN values with zero so that training does not stop.
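For anyone who wants to try the workaround described above, here is a minimal sketch of such a NaN guard. The helper name round_center_of_mass and its fallback parameter are hypothetical and not part of hover_net; the sketch only assumes that inst_com is a list of floats, as in the line `inst_com[0] = int(inst_com[0] + 0.5)` from the traceback. Keep in mind that substituting zero hides whatever produced the NaN rather than fixing it.

```python
import math


def round_center_of_mass(inst_com, fallback=0):
    """Round each centroid coordinate to the nearest int, mapping NaN to `fallback`.

    Hypothetical helper for illustration, not part of hover_net.
    """
    rounded = []
    for coord in inst_com:
        if math.isnan(coord):
            # int(nan) raises ValueError, so substitute the fallback instead
            rounded.append(fallback)
        else:
            # same rounding as in the traceback: add 0.5, then truncate
            rounded.append(int(coord + 0.5))
    return rounded


print(round_center_of_mass([12.4, float("nan")]))
```

A stricter variant could skip the instance entirely when a NaN is found instead of placing its centre at zero; which is preferable depends on how the targets are used downstream.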
@sumanthdonapati @vqdang
# expand the box by 2px
# Because we first pad the ann at line 207, the bboxes
# will remain valid after expansion
inst_box[0] -= 2
inst_box[2] -= 2
inst_box[1] += 2
inst_box[3] += 2
Is this comment referring to obsolete code that is no longer used? In my case before expanding: This can be avoided by limiting the expansion with:
inst_box[0] = max(inst_box[0]-2, 0)
inst_box[2] = max(inst_box[2]-2, 0)
inst_box[1] = min(inst_box[1]+2, orig_ann.shape[0])
inst_box[3] = min(inst_box[3]+2, orig_ann.shape[1]) |
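The clamping idea above can be sketched as a small standalone function. This is only an illustration: expand_box, ann_shape and margin are hypothetical names, and the box layout [rmin, rmax, cmin, cmax] is an assumption inferred from which indices are compared against shape[0] and shape[1] in the snippet. The slicing lines show one plausible reason an unclamped box is a problem: a negative start index wraps around in Python slicing and can select nothing.

```python
def expand_box(inst_box, ann_shape, margin=2):
    """Expand [rmin, rmax, cmin, cmax] by `margin`, clamped to the image bounds.

    Hypothetical helper; the box layout is an assumption, not confirmed by the repo.
    """
    rmin, rmax, cmin, cmax = inst_box
    return [
        max(rmin - margin, 0),             # do not go above the first row
        min(rmax + margin, ann_shape[0]),  # do not go past the last row
        max(cmin - margin, 0),             # do not go left of the first column
        min(cmax + margin, ann_shape[1]),  # do not go past the last column
    ]


# A negative start index wraps around, so the unclamped slice is empty
row = list(range(20))
unclamped_len = len(row[0 - 2:5])        # start index -2
clamped_len = len(row[max(0 - 2, 0):5])  # start index 0
print(unclamped_len, clamped_len)

print(expand_box([0, 10, 1, 19], (12, 20)))
```

If the crop for an instance comes out empty, any centroid computed on it has nothing to average over, which would be consistent with the NaN reported in this issue.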
Hi @vqdang, when I train hover-net on the Kumar dataset, it crashes with this error after several epochs.
If there is a problem with the training data, why can the other epochs be trained smoothly?
My system information is as follows:
Linux version 3.10.0-1160.24.1.el7.x86_64 ([email protected])
(gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) ) #1 SMP Thu Apr 8 19:51:47 UTC 2021
Python 3.7.4
Cuda compilation tools, release 10.2, V10.2.89
pytorch 1.8.1+cu102
Traceback (most recent call last):
  File "run_train.py", line 309, in <module>
    trainer.run()
  File "run_train.py", line 293, in run
    phase_info, engine_opt, save_path, prev_log_dir=prev_save_path
  File "run_train.py", line 268, in run_once
    main_runner.run(opt["nr_epochs"])
  File "/home/tingxiao/code/hover_net/run_utils/engine.py", line 172, in run
    for data_batch in self.dataloader:
  File "/home/tingxiao/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/home/tingxiao/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
    return self._process_data(data)
  File "/home/tingxiao/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
    data.reraise()
  File "/home/tingxiao/anaconda3/lib/python3.7/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 11.
Original Traceback (most recent call last):
  File "/home/tingxiao/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/tingxiao/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/tingxiao/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/tingxiao/code/hover_net/dataloader/train_loader.py", line 105, in __getitem__
    inst_map, self.mask_shape, **self.target_gen_kwargs
  File "/home/tingxiao/code/hover_net/models/hovernet/targets.py", line 102, in gen_targets
    hv_map = gen_instance_hv_map(ann, crop_shape)
  File "/home/tingxiao/code/hover_net/models/hovernet/targets.py", line 60, in gen_instance_hv_map
    inst_com[0] = int(inst_com[0] + 0.5)
ValueError: cannot convert float NaN to integer
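The last frame of the traceback can be reproduced in isolation, which suggests the failure comes from the data pipeline rather than from GPU memory. The sketch below uses a hand-written NaN as a stand-in for the centroid; how the NaN arises inside gen_instance_hv_map is not shown here and remains an assumption.

```python
# Stand-in for a centroid that came back as NaN (assumed, not taken from real data)
inst_com = [float("nan"), 5.2]

try:
    # Same statement as targets.py line 60 in the traceback
    inst_com[0] = int(inst_com[0] + 0.5)
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

Because the conversion only fails for samples whose centroid is NaN, most batches would pass through untouched and the error would surface only when such a sample is drawn.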