
question about the loss #4

Open
anyang1996 opened this issue May 21, 2020 · 13 comments

@anyang1996

When training on the KITTI dataset, two of the printed losses (mean loss and mean box loss) always increased, while the other loss terms decreased. Have you ever encountered such a problem?

Thanks~

@qiqihaer
Owner

I've never seen a problem like this. Have you changed the code?

@anyang1996
Author

Thanks for the reply.

Actually, I didn't change the code, and I used the provided data (./kitti/gt_database/train_gt_database_3level_Car.pkl) for training.

@qiqihaer
Owner

You can check log_train.txt in the log_kitti folder; that's the training log for 200 epochs. I also trained for 65 epochs again to check the code. The problem you mentioned did not come up in either experiment. You can try cloning the code again.
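For comparing the two runs, a quick way is to pull the "mean loss" values out of that log. A minimal sketch (it assumes log lines of the form "mean loss: 247.758943", as in the excerpts posted further down in this thread, so adjust the path and pattern if yours differ):

import re

# Collect every "mean loss" value from the training log so the trend can be eyeballed.
losses = []
with open('log_kitti/log_train.txt') as f:
    for line in f:
        m = re.match(r'\s*mean loss:\s*([\d.]+)', line)
        if m:
            losses.append(float(m.group(1)))

print('first 5 values:', losses[:5])
print('last 5 values:', losses[-5:])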

@anyang1996
Author

Sorry to bother you again. I re-downloaded the code without any changes and tried several times, but I still get the same result. If possible, could you send the latest version of the code you used yesterday to [email protected]?

Thanks so much.

@qiqihaer
Owner

I have sent the code to you.

@hova88

hova88 commented Jun 17, 2020

Has this been solved? I'm running into the same problem.

@qiqihaer
Owner

Why does this problem occur? It has never come up in my training.

@anyang1996
Author

anyang1996 commented Jun 17, 2020 via email

@qiqihaer
Owner

> Change num_workers in the DataLoader part of train.py to 32! Anything other than 1 should be fine. I don't really know why, though; I've only just started with PyTorch, I had been using TF before...

What could be the reason for this? How would num_workers affect training at all? Could you email me the log of the non-converging loss so I can take a look at what the problem is?
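For reference, this is the kind of change being discussed. A minimal, self-contained sketch (the dataset here is a dummy stand-in rather than the repo's KITTI dataset; in train.py the same num_workers argument sits on the training DataLoader):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset so the snippet runs on its own; the only difference between the two loaders is num_workers.
dummy = TensorDataset(torch.randn(64, 3), torch.randint(0, 2, (64,)))

loader_main_process = DataLoader(dummy, batch_size=8, shuffle=True, num_workers=0)   # batches loaded in the main process
loader_with_workers = DataLoader(dummy, batch_size=8, shuffle=True, num_workers=32)  # 32 worker processes, as suggested above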

@hova88

hova88 commented Jun 18, 2020

What's your email? I'll send you my log. I think the problem is probably in how the loss function was adapted for KITTI, but I haven't figured it out yet either. Here are two excerpts for everyone to look at; box_loss and center_loss keep accumulating.

Snippet 1

**** EVAL EPOCH 009 END****
**** EPOCH 010 ****
Current learning rate: 0.001000
Current BN decay momentum: 0.500000
2020-06-17 13:28:24.704650
---- batch: 010 ----
mean box_loss: 21.451081
mean center_loss: 21.343650
mean heading_acc: 0.000000
mean heading_cls_loss: 0.661935
mean heading_reg_loss: 0.039943
mean loss: 247.758943
mean neg_ratio: 0.998730
mean obj_acc: 0.999902
mean objectness_loss: 0.000772
mean pos_ratio: 0.000098
mean sem_acc: 0.200000
mean sem_cls_loss: 0.009268
mean size_acc: 0.200000
mean size_cls_loss: 0.008473
mean size_reg_loss: 0.000448
mean vote_loss: 3.323501

Snippet 2

**** EVAL EPOCH 049 END****
**** EPOCH 050 ****
Current learning rate: 0.001000
Current BN decay momentum: 0.125000
2020-06-17 17:01:50.644762
---- batch: 010 ----
mean box_loss: 82.634578
mean center_loss: 82.634578
mean heading_acc: 0.000000
mean heading_cls_loss: 0.000000
mean heading_reg_loss: 0.000000
mean loss: 901.370782
mean neg_ratio: 0.999512
mean obj_acc: 1.000000
mean objectness_loss: 0.000021
mean pos_ratio: 0.000000
mean sem_acc: 0.000000
mean sem_cls_loss: 0.000000
mean size_acc: 0.000000
mean size_cls_loss: 0.000000
mean size_reg_loss: 0.000000
mean vote_loss: 7.502488
---- batch: 020 ----
mean box_loss: 109.684003
mean center_loss: 109.675870
mean heading_acc: 0.100000
mean heading_cls_loss: 0.062090
mean heading_reg_loss: 0.000375
mean loss: 1182.991797
mean neg_ratio: 0.999512
mean obj_acc: 0.999951
mean objectness_loss: 0.000595
mean pos_ratio: 0.000049
mean sem_acc: 0.100000
mean sem_cls_loss: 0.000003
mean size_acc: 0.100000
mean size_cls_loss: 0.000004
mean size_reg_loss: 0.001548
mean vote_loss: 8.614880

Later, I also modified eval.py. Whether or not I use --use_3d_nms --use_cls_nms --per_class_proposal, I get the following error:

Traceback (most recent call last):
  File "eval.py", line 210, in <module>
    eval()
  File "eval.py", line 207, in eval
    loss = evaluate_one_epoch()
  File "eval.py", line 178, in evaluate_one_epoch
    batch_pred_map_cls = parse_predictions(end_points, CONFIG_DICT)
  File "/home/hova/Documents/Git_projects/votenet_kitti/models/kitti_ap_helper.py", line 133, in parse_predictions
    assert (len(pick) > 0)

After I commented out assert (len(pick) > 0), I tried to dump the results, but they were completely unusable.

You can refer to this issue.
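That assertion typically fires when the NMS step inside parse_predictions is handed no valid boxes at all (for example, everything was filtered out beforehand), which would be consistent with the diverging box/center loss above. A self-contained toy illustration of that failure mode (not the repo's code; the threshold filter below is only a stand-in for whatever filtering precedes the NMS call):

import numpy as np

def surviving_indices(scores, thresh):
    # Stand-in for the filtering that runs before NMS: keep boxes whose score clears the threshold.
    return np.where(scores > thresh)[0]

scores = np.array([0.01, 0.02, 0.005])   # the kind of scores a diverged model tends to produce
pick = surviving_indices(scores, 0.05)

if len(pick) == 0:
    print('no boxes survive, which is where assert (len(pick) > 0) raises AssertionError')
else:
    print('surviving boxes:', pick)

Commenting the assert out, as noted above, only hides the symptom; the underlying problem is still the training divergence.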

@anyang1996
Author

> Could you email me the log of the non-converging loss so I can take a look at what the problem is?

I deleted the old log, so I'll retrain and then send it to you. Also, I wrote it wrong earlier: it's 0 that has to be avoided, not 1. I've corrected that.

@hova88

hova88 commented Jun 18, 2020

I looked it up: num_workers is a setting for CPU-to-GPU data loading, so in theory it should only affect training time. Why would it affect training accuracy?
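One thing num_workers changes besides speed is where the augmentation code in __getitem__ runs: with num_workers=0 it all runs in the main process, while with workers it runs in separate processes whose random-number state is set up differently. Whether that explains the divergence here is unverified; purely as a diagnostic, here is a sketch of making the per-worker numpy seeding explicit, under the assumption (not checked against this repo) that the dataset does numpy-based augmentation in __getitem__:

import numpy as np
import torch
from torch.utils.data import DataLoader

def worker_init_fn(worker_id):
    # torch.initial_seed() already differs per worker; reuse it to seed numpy so every
    # worker draws its augmentations from a distinct, reproducible numpy state.
    np.random.seed(torch.initial_seed() % 2**32)

# Hypothetical usage (train_dataset stands for whatever dataset object train.py builds):
# loader = DataLoader(train_dataset, batch_size=8, shuffle=True,
#                     num_workers=32, worker_init_fn=worker_init_fn)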

@anyang1996
Author

I'm not sure, but you can try it yourself: change num_workers from 0 to 32, keep everything else the same, and watch how the loss changes. The loss trend you just posted looks the same as mine did before.
