Weird errors, this error will appear after running several epochs; #149

Open · hitxiaoting opened this issue Jul 25, 2021 · 12 comments
Labels: bug (Something isn't working)

Comments
@hitxiaoting commented Jul 25, 2021

Hi @vqdang, when I train HoVer-Net on the Kumar dataset, this error appears after several epochs.
If the training data were the problem, why do the other epochs train smoothly?
My system information is as follows:

Linux version 3.10.0-1160.24.1.el7.x86_64 ([email protected])
(gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) ) #1 SMP Thu Apr 8 19:51:47 UTC 2021
Python 3.7.4
Cuda compilation tools, release 10.2, V10.2.89
pytorch 1.8.1+cu102

Traceback (most recent call last):
  File "run_train.py", line 309, in <module>
    trainer.run()
  File "run_train.py", line 293, in run
    phase_info, engine_opt, save_path, prev_log_dir=prev_save_path
  File "run_train.py", line 268, in run_once
    main_runner.run(opt["nr_epochs"])
  File "/home/tingxiao/code/hover_net/run_utils/engine.py", line 172, in run
    for data_batch in self.dataloader:
  File "/home/tingxiao/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/home/tingxiao/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
    return self._process_data(data)
  File "/home/tingxiao/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
    data.reraise()
  File "/home/tingxiao/anaconda3/lib/python3.7/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 11.
Original Traceback (most recent call last):
  File "/home/tingxiao/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/tingxiao/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/tingxiao/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/tingxiao/code/hover_net/dataloader/train_loader.py", line 105, in __getitem__
    inst_map, self.mask_shape, **self.target_gen_kwargs
  File "/home/tingxiao/code/hover_net/models/hovernet/targets.py", line 102, in gen_targets
    hv_map = gen_instance_hv_map(ann, crop_shape)
  File "/home/tingxiao/code/hover_net/models/hovernet/targets.py", line 60, in gen_instance_hv_map
    inst_com[0] = int(inst_com[0] + 0.5)
ValueError: cannot convert float NaN to integer

@hitxiaoting (Author)

I guess this error was caused by running out of GPU memory, so I added torch.cuda.empty_cache() after each epoch; it works for me now.
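
For readers unfamiliar with the call, here is a standalone sketch (not hover_net code; the report_cuda_memory helper is made up for illustration) of what torch.cuda.empty_cache() does. Note that it only returns cached blocks whose tensors are no longer referenced, so it is a mitigation rather than a guaranteed fix:

    import torch

    def report_cuda_memory(tag: str) -> None:
        """Print allocated vs. reserved (cached) CUDA memory in MiB."""
        alloc = torch.cuda.memory_allocated() / 2 ** 20
        reserved = torch.cuda.memory_reserved() / 2 ** 20
        print(f"{tag}: allocated={alloc:.1f} MiB, reserved={reserved:.1f} MiB")

    if torch.cuda.is_available():
        x = torch.randn(4096, 4096, device="cuda")  # stand-in for one epoch's tensors
        report_cuda_memory("after work")
        del x                      # drop the reference first ...
        torch.cuda.empty_cache()   # ... then release the cached blocks back to the driver
        report_cuda_memory("after empty_cache")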

@vqdang added the bug (Something isn't working) label, and added and then removed the library problem (Indicating bugs etc. due to the underlying libraries) label, on Jul 26, 2021
@vqdang (Owner) commented Jul 26, 2021

I will leave this open; this bug is peculiar. Could you please provide your system information if possible? It may help others later. Thank you @hitxiaoting

@hitxiaoting (Author)

> I will leave this open; this bug is peculiar. Could you please provide your system information if possible? It may help others later. Thank you @hitxiaoting

Thanks @vqdang, I have updated my system information.

@vqdang changed the title from "Weried errors, this error will appear after running several epochs;" to "Weird errors, this error will appear after running several epochs;" on Jul 26, 2021
@cbhindex

Hi @hitxiaoting, many thanks for raising the issue. I met the same problem when retraining HoVer-Net on our own dataset. To clarify your proposed solution: did you add "torch.cuda.empty_cache()" at the end of def run() in the run_train.py script?

@hitxiaoting (Author)

> To clarify your proposed solution: did you add "torch.cuda.empty_cache()" at the end of def run() in the run_train.py script?

Hi, I added this line at the end of the run() function in ./run_utils/engine.py; hope it helps.

            pbar.update()
        pbar.close()  # to flush out the bar before doing end of epoch reporting
        self.state.curr_epoch += 1
        self.__trigger_events(Events.EPOCH_COMPLETED)
        torch.cuda.empty_cache()  ########## to free some GPU for a cuda out of Mem bug

        # TODO: [CRITICAL] align the protocol
        self.state.run_accumulated_output.append(
            self.state.epoch_accumulated_output
        )

    return

@cbhindex

> Hi, I added this line at the end of the run() function in ./run_utils/engine.py; hope it helps.

Many thanks, I will give it a try.

@cbhindex

Some feedback for @vqdang: on my side, the error does not go away with the solution proposed above.

@sumanthdonapati

Are there any other solutions for this error?

@sumanthdonapati

inst_com = list(measurements.center_of_mass(inst_map))
inst_com = [0 if math.isnan(x) else x for x in inst_com]  # added: replace NaN with 0
inst_com[0] = int(inst_com[0] + 0.5)

The above modification in targets.py worked for me.
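
For context, a minimal standalone sketch (not repo code) of why center_of_mass produces the NaN here and how the guard behaves: an all-zero instance crop has zero total mass, so its centre of mass is undefined.

    import math
    import numpy as np
    from scipy.ndimage import center_of_mass

    empty_crop = np.zeros((2, 140), dtype=np.uint8)  # an instance crop that ended up all zeros

    inst_com = list(center_of_mass(empty_crop))               # (nan, nan): zero total mass
    inst_com = [0 if math.isnan(x) else x for x in inst_com]  # the guard above
    print(inst_com)  # [0, 0] -> int(inst_com[0] + 0.5) no longer raises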

@jorgegaticav

Same, I'm getting:

C:\Python310\lib\site-packages\scipy\ndimage\measurements.py:1406: RuntimeWarning: invalid value encountered in double_scalars
  results = [sum(input * grids[dir].astype(float), labels, index) / normalizer
Traceback (most recent call last):
  File "E:\ai\hover_net\run_train.py", line 305, in <module>
    trainer.run()
  File "E:\ai\hover_net\run_train.py", line 288, in run
    self.run_once(
  File "E:\ai\hover_net\run_train.py", line 265, in run_once
    main_runner.run(opt["nr_epochs"])
  File "E:\ai\hover_net\run_utils\engine.py", line 173, in run
    for data_batch in self.dataloader:
  File "C:\Python310\lib\site-packages\torch\utils\data\dataloader.py", line 530, in __next__
    data = self._next_data()
  File "C:\Python310\lib\site-packages\torch\utils\data\dataloader.py", line 570, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "C:\Python310\lib\site-packages\torch\utils\data\_utils\fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "C:\Python310\lib\site-packages\torch\utils\data\_utils\fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "E:\ai\hover_net\dataloader\train_loader.py", line 119, in __getitem__
    target_dict = self.target_gen_func(
  File "E:\ai\hover_net\models\hovernet\targets.py", line 103, in gen_targets
    hv_map = gen_instance_hv_map(ann, crop_shape)
  File "E:\ai\hover_net\models\hovernet\targets.py", line 61, in gen_instance_hv_map
    inst_com[0] = int(inst_com[0] + 0.5)
ValueError: cannot convert float NaN to integer
Processing: |###############################################################################################1                                        | 14/20[23:06<09:54,99.02s/it]Batch = 16.12650|EMA = 85.26806

@jorgegaticav

@sumanthdonapati your solution worked for me! thanks!

@vqdang pinned this issue on Dec 1, 2022
sumanthdonapati added a commit to sumanthdonapati/hover_net that referenced this issue on Feb 2, 2023:
vqdang#149: sometimes inst_com contains NaN values; I do not know why they occur. I am just replacing the NaN values with zero so that training does not stop.
@Mgryn commented Mar 4, 2023

@sumanthdonapati @vqdang
I have encountered the same error; it actually comes from the padding:

# expand the box by 2px
# Because we first pad the ann at line 207, the bboxes
# will remain valid after expansion
inst_box[0] -= 2
inst_box[2] -= 2
inst_box[1] += 2
inst_box[3] += 2

Is the comment referring to some obsolete code that is no longer used?

In my case, before expanding:
inst_box = [0, 254, 118, 256]
After expanding:
inst_box = [-2, 256, 116, 258]
This resulted in inst_map having the shape (2, 140) and containing only a region outside the original bounding box, i.e. all zeros. The NaN values come from calculating the center of mass of an array that contains only zeros.

This can be avoided by limiting the expansion with:

inst_box[0] = max(inst_box[0]-2, 0)
inst_box[2] = max(inst_box[2]-2, 0)
inst_box[1] = min(inst_box[1]+2, orig_ann.shape[0])
inst_box[3] = min(inst_box[3]+2, orig_ann.shape[1])
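
For completeness, a self-contained sketch (the array and box values are illustrative, chosen to match the shapes above, not taken from the repo) showing how the negative start index from the naive expansion produces the all-zero crop:

    import numpy as np

    ann = np.zeros((256, 256), dtype=np.int32)
    ann[0:254, 118:256] = 1                  # an instance touching the top border

    inst_box = [0, 254, 118, 256]            # [rmin, rmax, cmin, cmax]
    inst_box[0] -= 2; inst_box[1] += 2       # naive 2 px expansion ...
    inst_box[2] -= 2; inst_box[3] += 2       # ... gives [-2, 256, 116, 258]

    crop = ann[inst_box[0]:inst_box[1], inst_box[2]:inst_box[3]]
    print(crop.shape)  # (2, 140): the -2 start wraps around to the last two rows
    print(crop.sum())  # 0 -> center_of_mass of this crop is (nan, nan)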
