
Optimize Backward Time Complexity to O(MK) #1

Open
wants to merge 6 commits into base: caffe-face

Conversation

@mfs6174 commented Oct 21, 2016

In the original implementation, the time complexity of the backward pass of the center loss layer is O(MK+NM). It will be very slow when training with a large number of classes, since the backward running time grows with the number of classes (N). Unfortunately, this is a common case when training face recognition models (e.g. 750k unique persons).

This pull request rewrites the backward code. The time complexity is optimized to O(MK) with additional O(N) space. Because M (batch size) << N and K (feature length) << N usually hold for the face recognition problem, this modification improves training speed significantly.
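The idea can be sketched in a few lines (a minimal NumPy sketch of the approach, not the PR's actual C++/CUDA code; the (count + 1) normalization follows the center loss layer's update rule discussed later in this thread):

```python
import numpy as np

def center_backward_fast(features, labels, centers, alpha=0.5):
    """Sketch of an O(MK)-time center update with O(N) extra space.

    The naive backward scans all N classes for every sample (O(NM));
    here we accumulate per-class difference sums and counts only for
    the classes that actually appear in the batch of M samples.
    """
    M, K = features.shape
    variation_sum = np.zeros_like(centers)  # only rows of batch classes are touched
    count = np.zeros(centers.shape[0])      # the extra O(N) space
    for i in range(M):                      # O(MK) total work
        c = labels[i]
        variation_sum[c] += centers[c] - features[i]
        count[c] += 1
    for c in np.unique(labels):             # at most M distinct classes
        centers[c] -= alpha * variation_sum[c] / (count[c] + 1.0)
    return centers
```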

For a GoogLeNet v2 model trained with Everphoto's 750k unique person dataset, on a single Nvidia GTX Titan X, with a batch size of 24 and iter_size = 5, the average backward iteration times for the different cases are:

  1. Softmax only: 230ms
  2. Softmax + Center loss, original implementation: 3485ms, center loss layer: 3332ms
  3. Softmax + Center loss, implementation in this PR: 235.6ms, center loss layer: 5.4ms

That is a more than 600x improvement for the center loss layer.

For the paper author's MNIST example, running on a single GTX Titan X, the training time of the original implementation vs. this PR is 4min20s vs. 3min50s. This shows that even when training on a small dataset with only 10 classes, there is still some improvement.

The PR also fixes the code style to pass Caffe's lint test (make lint).

@jiangxuehan

@mfs6174, have you reproduced the result on LFW or MegaFace? I have trained the model on CASIA and tested on LFW, but it didn't work well (EER ~96.5%).

@mfs6174 (Author)

mfs6174 commented Oct 24, 2016

Hi, @jiangxuehan

I am still working on that with both paper author's code and my code.

I have only tested my PR's code with the MNIST toy example. Starting from the same snapshot and training data (no shuffle during training), my code produced exactly the same center_diff values and nearly the same test result as the author's code.

Which code did you train the model with? Was it the paper author's code or my PR's code? If you can reproduce the result with the paper author's code but not with mine, I will check my code again. If you cannot reproduce the result with the paper author's code either, I will discuss it with you further when I finish my experiments on reproducing the result on LFW.

@chichan01

chichan01 commented Oct 24, 2016

Hi @jiangxuehan and @mfs6174,
I used MTCNN to get the 5-point landmarks and used his provided model. The accuracy on LFW view 2 is just 96.55 +/- 0.229129. If I use his provided features, the accuracy is 98.98 +/- 0.186685.
If I use another way to get the 5-point landmarks with his provided model, the result is 98.75 +/- 0.194754.
If I train from scratch using my 5-point landmarks, the result is 98.47 +/- 0.211986.
I hope this gives you some guidelines. Besides, I found that their system cannot use dropout, as dropout makes the loss go to NaN. Do you have the same problem when dropout is applied?

Would you verify two points? 1. Is the lambda in the paper the "loss_weight"? If so, where can I set alpha?

@jiangxuehan

Hi @mfs6174 @chichan01,
To mfs6174:
I used the paper author's code. Further discussion after you finish your experiments would be appreciated. (By the way, your optimized code is correct and faster.)
To chichan01:
Your result seems reasonable. Did you use the same code/model/data as the author's? Have you made any changes other than the landmarks?
I have tried using dropout on the 512-d fc layer; the loss didn't produce NaN values. I think center loss acts as a regularization term, so it's more reasonable to compare softmax+dropout vs. softmax+center_loss. Have you tried softmax+dropout?

@chichan01

Hi @jiangxuehan,
I am only trying to reproduce their result at the moment, so I did not change anything. I did try to apply center loss in another architecture, but it seems loss_weight needs tuning. Anyway, I will try to compare it with dropout.

@jiangxuehan

@chichan01 :
Could you please send your training log files to me (both the 98.47 run and the NaN run)? I want to compare the loss curves and hope they provide some useful information. My email is [email protected]. Thanks.

@mfs6174 (Author)

mfs6174 commented Oct 24, 2016

Hi, @jiangxuehan @chichan01

Regarding reproducing the face result, I also have some questions. When training the network, did you use the author's prototxt directly (center loss parameters: lr_mult = 1, decay_mult = 2, and loss_weight = 0.008), or change it following the description in the paper (the parameters should be lr_mult = 5, so that alpha = 0.1 * 5 = 0.5, and loss_weight = 0.003)?

@chichan01

chichan01 commented Oct 25, 2016

Hi @jiangxuehan and @mfs6174,
I directly used the author's prototxt. Since their network in the prototxt is not the same as in their paper, I think following the description in the paper may not be right. Certainly, if you test it based on their description, please let me know the result. Also, thank you for telling me how to get alpha. By the way, I only use a subset of the original (not cleaned) CASIA-WebFace, chosen for non-overlap with IJCB, which has only 10,549 subjects.
Another point I would like to highlight is that all of my results use cosine distance without applying PCA. Perhaps applying PCA will improve the performance. However, the question is which dataset they used to train the PCA (LFW or CASIA)?
To @jiangxuehan: I am sorry that I am not able to give you the log files, as they were corrupted. Perhaps I can send them a few days later, if you do not mind.

By the way, were you able to reproduce their MegaFace result?

@jiangxuehan

Hi @mfs6174 @chichan01
As mfs6174 mentioned, center = center - lr_mult * lr * d(center), so alpha = lr_mult * lr. With lr decreasing from 0.1 to 0.01/0.001, we should theoretically set lr_mult to 5/50/500. Besides, weight_decay should be set to 0. Another method is to rewrite the Backward function: center = center - alpha * d(center). If @chichan01 can reproduce the result with this PR, maybe alpha is not so important?
(I have not done any experiments about alpha; the above is just my own understanding. Could @ydwen explain alpha for us? Thanks.)

To chichan01: I am still working on LFW and running into obstacles getting to ~99% accuracy. How about your experiment on MegaFace?
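The arithmetic behind that suggestion can be checked in a toy snippet (assuming the relation alpha = lr_mult * lr described above):

```python
# If alpha = lr_mult * lr, raising lr_mult as the base lr decays
# keeps the effective center update rate alpha constant at 0.5.
lr_schedule = [0.1, 0.01, 0.001]
lr_mult_schedule = [5, 50, 500]
alphas = [lr * m for lr, m in zip(lr_schedule, lr_mult_schedule)]
```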


@ydwen (Owner)

ydwen commented Oct 27, 2016

@jiangxuehan @chichan01 @mfs6174
Sorry for replying this late; I was quite busy these days. Thanks for the commit from @mfs6174; I will check it ASAP.

Here I try to answer some of your questions. If anything is not clear, please feel free to let me know.

Common issues
Dropout: if you want to combine dropout, the current code may not support it. Concretely, if some of the elements in x are dropped, the corresponding elements in the center should be dropped as well. If not, the loss becomes unstable.
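A minimal sketch of that idea (hypothetical, not code from this repo): reuse the dropout keep-mask when forming the center-loss difference, so dropped dimensions of x contribute nothing.

```python
import numpy as np

def masked_center_term(x, center, keep_mask):
    # keep_mask is the same 0/1 mask dropout applied to x, so a dropped
    # feature dimension neither adds to the loss nor pulls the center.
    return keep_mask * (x - center)
```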

Alpha: At the beginning, we used the update strategy as in the paper. It performed well in our experiments (DeepID net & mixed dataset). When refactoring the code, we found that implementing the alpha update in Caffe is not elegant and becomes complicated with multi-GPU training. Finally, we tried lr and weight_decay in Caffe and found it very convenient; it works pretty well (achieving nearly the same performance as before). So you can try modifying lr and weight_decay for different alphas.

Network architecture: At the beginning we used the DeepID net, so we had to implement a local convolution layer. We are not going to release our implementation since it is quite inelegant (^_^). Most importantly, the local convolution layer is complicated and inefficient in time (~8h for a ResNet with 28 layers vs. ~14h for a DeepID net with 6 layers) and space (~100M for the 28-layer ResNet vs. ~200M for the 6-layer DeepID net). Therefore, we use a modified ResNet as our network. It performs better with fewer parameters and less training time.

To @chichan01: I guess the face patch you used is not the same as in the demo or the given model. Please double-check the positions of the eyes, nose, and mouth corners in the cropped face, and use our provided template (5-point landmarks) given in the demo.

To @chichan01: loss_weight is related to the total number of classes, i.e. num_output in fc6. Generally speaking, the more classes, the smaller the loss_weight.

To @jiangxuehan: please provide more details of your experiment (96.5% EER); otherwise I can't give you any hints.

@kkirtac

kkirtac commented Oct 27, 2016

Hi people,
I have tried to train ydwen's prototxt from scratch, using my own dataset of about 1.5M images of 13,650 subjects (each image is resized to 112 pixels in height and 96 pixels in width). However, the center loss is increasing (the softmax loss is decreasing fine). It decreases in the beginning, but after iteration 800 it starts to increase and never decreases again. The values (without the 0.008 multiplier), at the start and at every 100 iterations after that, look somewhat like this: 20, 3, 11, 3, 4, 3, 4, 8, 10, 12, 24, 24, 23, 30, 27, 35, 40, 42, 46, 58, .... I just stopped training after that. What could be wrong with my settings or dataset? I already have other models (with softmax loss) that converged with the same dataset. Is center loss very sensitive to imbalance in the dataset?

@ZHAIXINGZHAIYUE

ZHAIXINGZHAIYUE commented Oct 27, 2016

@chichan01

If I use other way to get 5pts landmarks and use his provide model, result is 98.75 +/-0.194754.

Could you tell me which method you used to get the 5-point landmarks, and how you used those 5 points? Just as in the author's demo code?

@ydwen (Owner)

ydwen commented Oct 27, 2016

@kkirtac There is nothing wrong with the center loss increasing during training. Just make sure the total loss (softmax loss + λ · center loss) is decreasing.

@mfs6174 (Author)

mfs6174 commented Oct 28, 2016

Hi, guys,

I have nearly reproduced the paper author's LFW result with both the author's code and my code. The result on LFW is nearly 98.9% with the cleaned CASIA-WebFace dataset, using the author's network and solver prototxt. The only change is the step size of the learning rate decay, which improves loss stability during training.

With my Everphoto 750k unique person dataset, I found that modifying the usage and the training scheme of the center loss could lead to possibly better results and much faster convergence, with more stable loss values, when training on a dataset with a very large number of classes. I will consider releasing the details after further experiments.

@kkirtac

kkirtac commented Oct 28, 2016

@ydwen Due to the increasing behavior of my center loss, my total loss fluctuates. With a proper configuration, the center loss should not be increasing, I think. How about your trainings? Do you see similarly increasing center loss values?

@chichan01

chichan01 commented Oct 28, 2016

@mfs6174,
Your trained model's performance is very close to ydwen's features. If you apply PCA, you may approach 99.27%; following their paper, I projected ydwen's features into PCA space before computing the cosine angle.
@ydwen, @ZHAIXINGZHAIYUE and @twinsyssy1018,
Thank you for your responses. Your comment basically covers all of my questions. Regarding the landmarks, I did refer to the 5-point landmarks you provided in extractDeepFeature.m for Jennifer_Aniston_0016.jpg (the provided template) and found that MTCNN cannot produce the same 5-point locations. Therefore, I use TCDCN (http://mmlab.ie.cuhk.edu.hk/projects/TCDCN.html) for extraction.
@ydwen,
Regarding dropout, it seems that your model A in the paper did not apply dropout. I found that applying dropout with softmax does improve performance, and model A's performance can then be near model C's without PCA. I also found that model A suffers from overfitting if you increase the number of iterations, while model C does not. Another observation is that softmax with center loss works well in PCA space, but softmax (with or without dropout) does not.

Anyway, I will try to reproduce it on MegaFace in the coming week(s), as LFW is saturated and cannot show the real difference.

@yaxiongchi

@mfs6174
hi, where can I get the cleaned CASIA dataset? I cannot find the cleaned version on the Internet.

@yaxiongchi

@twinsyssy1018, hi there, where can I get the cleaned CASIA dataset?

In reply to twinsyssy1018 (Oct 28): "@mfs6174, I trained on the clean CASIA dataset and reached 98.4% at last. Waiting for the details."

@kkirtac

kkirtac commented Oct 31, 2016

@chichan01 where do you actually put the dropout?

@twinsyssy1018

@chichan01, I also have a question: which dataset did you use to train the PCA? I used LFW to train the PCA and got 99.0%, not as good as your 99.27%.

@twinsyssy1018

@yaxiongchi try this: http://pan.baidu.com/s/1kUdRRJT password: 3zbb. I cannot promise it is the right version, but it is the one I use.

@zjchuyp

zjchuyp commented Nov 1, 2016

@ydwen

Hi, my dataset has about 4M images of 80,000 subjects. When I set loss_weight: 0.008 or loss_weight: 0.0005, softmax_loss cannot decrease. When I set loss_weight: 0.0001, I get the training log below; softmax_loss is decreasing, but center_loss is still very high. I wonder when I will be able to finish training?

I1101 08:16:01.237038 6772 sgd_solver.cpp:106] Iteration 322000, lr = 1e-005
I1101 08:17:48.097226 6772 solver.cpp:228] Iteration 322100, loss = 0.640129
I1101 08:17:48.097226 6772 solver.cpp:244] Train net output #0: center_loss = 4113.34 (* 0.0001 = 0.411334 loss)
I1101 08:17:48.097226 6772 solver.cpp:244] Train net output #1: softmax_loss = 0.228794 (* 1 = 0.228794 loss)
I1101 08:17:48.112826 6772 sgd_solver.cpp:106] Iteration 322100, lr = 1e-005
I1101 08:19:34.614213 6772 solver.cpp:228] Iteration 322200, loss = 0.639561
I1101 08:19:34.614213 6772 solver.cpp:244] Train net output #0: center_loss = 4207.33 (* 0.0001 = 0.420733 loss)
I1101 08:19:34.614213 6772 solver.cpp:244] Train net output #1: softmax_loss = 0.218827 (* 1 = 0.218827 loss)
I1101 08:19:34.614213 6772 sgd_solver.cpp:106] Iteration 322200, lr = 1e-005
I1101 08:21:21.521200 6772 solver.cpp:228] Iteration 322300, loss = 0.529362
I1101 08:21:21.521200 6772 solver.cpp:244] Train net output #0: center_loss = 3991.51 (* 0.0001 = 0.399151 loss)
I1101 08:21:21.521200 6772 solver.cpp:244] Train net output #1: softmax_loss = 0.130211 (* 1 = 0.130211 loss)
I1101 08:21:21.521200 6772 sgd_solver.cpp:106] Iteration 322300, lr = 1e-005
I1101 08:23:08.209789 6772 solver.cpp:228] Iteration 322400, loss = 0.548019
I1101 08:23:08.209789 6772 solver.cpp:244] Train net output #0: center_loss = 3857.7 (* 0.0001 = 0.38577 loss)
I1101 08:23:08.209789 6772 solver.cpp:244] Train net output #1: softmax_loss = 0.162249 (* 1 = 0.162249 loss)

@gzp001015

@twinsyssy1018 Hello, training with the author's network I can only get 98.45% at best, about the same as you. Have you found the crux of the problem yet? Which matters more, the alignment method or the network parameters?

@gzp001015

@mfs6174 Please share your advice: how exactly should I train? The best I can get is 98.45%, and so far I have only adjusted the loss_weight.

@duanLH

duanLH commented Nov 3, 2016

@twinsyssy1018 would you mind giving me the LFW dataset link? I can't find it because the links on the Internet are broken. Thanks.

@duanLH

duanLH commented Nov 3, 2016

@ydwen, I want to change the λ; where should I do that?

@kli-casia

kli-casia commented Dec 21, 2016

@luoyetx @happynear @ydwen
I think @luoyetx points out a very critical problem.
We can't simply use iter_size to simulate a big batch size, since there are dependencies between batches in the calculations in the center loss layer.

It is somewhat similar to Batch Normalization, where we also can't use iter_size to simulate a big batch size.

@louielu1027

@kli-nlpr So, what can we do to simulate a big batch size? Is @luoyetx's modification right?

@kli-casia

@louielu1027
I am not sure. I think the suspicious part is

/(count + (Dtype)1.)

The count is not the same across different batches.

@luoyetx

luoyetx commented Dec 21, 2016

@kli-nlpr @louielu1027
My modification seems not right. Since center_diff doesn't depend on the batch size, it seems not possible to accumulate the diff across small batches. As Caffe itself normalizes the diff by iter_size in this line, the diff will be smaller than usual.

@soldier828

@mfs6174 I'm new to the Caffe source code. I thoroughly read src/caffe/proto/caffe.proto, include/caffe/layers/center_loss_layer.hpp and src/caffe/layers/center_loss_layer.cpp. My confusion is: where is the code that calculates each center vector? I didn't find the relevant code in the mentioned files. Could you give me some clues? Thank you.

@getengqing

@LeonSCZ Can you share the code for testing on LFW?

@LeonSCZ

LeonSCZ commented Dec 23, 2016

@getengqing I'm glad to share it with you; then we can discuss whether there's something wrong with it. Could you tell me your result, and how I should send my code to you, by email or ~?

@AlexeyV11

@luoyetx is it possible to compensate for Caffe's normalization by multiplying by iter_size?

center_diff[index * K + k] += iter_size * variation_sum[index * K + k] / (count + (Dtype)1.);
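A toy check of this compensation idea (a hypothetical sketch mirroring the names in the line above, not the layer's actual code): pre-multiplying by iter_size cancels Caffe's later division of the accumulated diff by iter_size.

```python
def compensated_center_diff(variation_sum, count, iter_size):
    # Pre-scale so that Caffe's diff /= iter_size restores the value
    # a single big batch would have produced.
    return [iter_size * v / (count + 1.0) for v in variation_sum]
```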

@happynear

@luoyetx I can't find an elegant way to solve this problem. Nonetheless, until we find a solution, I think we should still replace = with +=, because small gradients are better than losing most of them.

@happynear

@kli-nlpr ,
I think we should not add weight decay on the centers, because most values in center_diff are 0. Weight decay will keep pushing the centers to 0 if decay_mult > 0.
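A one-line illustration of the point (a hypothetical SGD-style step, not Caffe's solver code): a class absent from the batch has center_diff = 0, so any positive weight decay still shrinks its center.

```python
def sgd_step(center, center_diff, lr, weight_decay):
    # The decay term acts even when the gradient (center_diff) is zero.
    return center - lr * (center_diff + weight_decay * center)
```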

@varun-suresh

I wanted to try the MNIST example and see the effect of center loss in the joint supervision. Unlike the paper, even at lambda = 1, my centers are not as well separated as in the paper. @mfs6174, can you shed some light on this? http://imgur.com/a/PGk0L

@qiufan

qiufan commented Jan 28, 2017

@mfs6174 hi~ I'm also trying to reproduce the result. I have tried many times; however, the best result I got is around 98.5%. Would you share in detail how you changed the step size of the learning rate decay? I would be very grateful!

@rajiv235

rajiv235 commented Feb 9, 2017

Does anyone have a Python implementation for extracting features using pycaffe? If so, could you share the code?

Also, has anyone tried only resizing the faces, without alignment?

Thanks in advance!

@stupiding

@louielu1027 Did you figure out why it is flipud rather than fliplr? I tried with that test image: the Euclidean distance between the two fliplr images is about 25, but between the two flipud images it is only 6.6.

@wangzzdeepir

Hi, could someone explain why the center loss penalizes Euclidean distance, i.e. the Euclidean distances between embeddings are directly optimized, but cosine distance is used instead at test time? Thanks in advance!

@happynear

@wangzzdeepir

If the feature is normalized, cosine distance is equivalent to Euclidean distance.

2 - 2 * cosine = Euclidean^2.
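A quick numeric check of this identity for unit-norm vectors:

```python
import numpy as np

# For unit-norm a and b: ||a - b||^2 = 2 - 2 * cos(a, b).
rng = np.random.default_rng(0)
a = rng.normal(size=4); a /= np.linalg.norm(a)
b = rng.normal(size=4); b /= np.linalg.norm(b)
euclidean_sq = np.sum((a - b) ** 2)
cosine = float(np.dot(a, b))
```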

@wangzzdeepir

@happynear

Thank you for your help. But I am wondering why cosine distance is a must (when optimizing the center loss, the embeddings are not normalized; IMO it would be more straightforward to compare Euclidean distance directly). I did a threshold-search experiment directly comparing Euclidean distance without PCA (to find the accuracy upper bound), but found that comparing Euclidean distance on LFW is slightly worse (about 0.5%) than comparing cosine distance. So I am wondering whether there is any interesting intuition behind this. Thanks again!

@happynear

@wangzzdeepir,
Definitely yes; I am writing a paper describing how to optimize it.

@happynear

Hi @wangzzdeepir, my paper is now on arXiv: https://arxiv.org/abs/1704.06369. The code and models are also released at https://github.com/happynear/NormFace. I hope it helps you.

@wangzzdeepir

@happynear Thanks a lot; your paper gives very intuitive visualizations and a careful derivation. I am going to try it out!

@David-Jin

net: "face_example/face_train_test.prototxt"
test_iter: 100
test_interval: 2000

base_lr: 0.05
lr_policy: "multistep"
gamma: 0.1

stepvalue: 16000
stepvalue: 24000
stepvalue: 28000
max_iter: 28000

display: 100
momentum: 0.9
weight_decay: 0.0005
snapshot: 1000
snapshot_prefix: "face_example/face_snapshot/face_train_test"

solver_mode: GPU

I trained the net; at iteration 3600, loss = nan. I don't know what the problem is.
The training log follows.

.........
I0619 21:08:10.313042 3915 solver.cpp:228] Iteration 3300, loss = 7.92752
I0619 21:08:10.313130 3915 solver.cpp:244] Train net output #0: center_loss = 31.0071 (* 0.008 = 0.248056 loss)
I0619 21:08:10.313153 3915 solver.cpp:244] Train net output #1: softmax_loss = 7.67946 (* 1 = 7.67946 loss)
I0619 21:08:10.313158 3915 sgd_solver.cpp:106] Iteration 3300, lr = 0.1
I0619 21:08:36.171747 3915 solver.cpp:228] Iteration 3400, loss = 7.52846
I0619 21:08:36.171774 3915 solver.cpp:244] Train net output #0: center_loss = 19.4122 (* 0.008 = 0.155298 loss)
I0619 21:08:36.171780 3915 solver.cpp:244] Train net output #1: softmax_loss = 7.37317 (* 1 = 7.37317 loss)
I0619 21:08:36.171799 3915 sgd_solver.cpp:106] Iteration 3400, lr = 0.1
I0619 21:09:02.032228 3915 solver.cpp:228] Iteration 3500, loss = 7.57454
I0619 21:09:02.032327 3915 solver.cpp:244] Train net output #0: center_loss = 26.9467 (* 0.008 = 0.215574 loss)
I0619 21:09:02.032349 3915 solver.cpp:244] Train net output #1: softmax_loss = 7.35897 (* 1 = 7.35897 loss)
I0619 21:09:02.032354 3915 sgd_solver.cpp:106] Iteration 3500, lr = 0.1
I0619 21:09:27.619887 3915 solver.cpp:228] Iteration 3600, loss = nan
I0619 21:09:27.619915 3915 solver.cpp:244] Train net output #0: center_loss = nan (* 0.008 = nan loss)
I0619 21:09:27.619925 3915 solver.cpp:244] Train net output #1: softmax_loss = 87.3365 (* 1 = 87.3365 loss)
I0619 21:09:27.619946 3915 sgd_solver.cpp:106] Iteration 3600, lr = 0.1
I0619 21:09:52.784669 3915 solver.cpp:228] Iteration 3700, loss = nan
I0619 21:09:52.784795 3915 solver.cpp:244] Train net output #0: center_loss = nan (* 0.008 = nan loss)
I0619 21:09:52.784801 3915 solver.cpp:244] Train net output #1: softmax_loss = 87.3365 (* 1 = 87.3365 loss)
I0619 21:09:52.784806 3915 sgd_solver.cpp:106] Iteration 3700, lr = 0.1
I0619 21:10:17.931406 3915 solver.cpp:228] Iteration 3800, loss = nan
I0619 21:10:17.931434 3915 solver.cpp:244] Train net output #0: center_loss = nan (* 0.008 = nan loss)
I0619 21:10:17.931443 3915 solver.cpp:244] Train net output #1: softmax_loss = 87.3365 (* 1 = 87.3365 loss)
I0619 21:10:17.931463 3915 sgd_solver.cpp:106] Iteration 3800, lr = 0.1
I0619 21:10:43.066721 3915 solver.cpp:228] Iteration 3900, loss = nan
I0619 21:10:43.066864 3915 solver.cpp:244] Train net output #0: center_loss = nan (* 0.008 = nan loss)
I0619 21:10:43.066872 3915 solver.cpp:244] Train net output #1: softmax_loss = 87.3365 (* 1 = 87.3365 loss)
I0619 21:10:43.066891 3915 sgd_solver.cpp:106] Iteration 3900, lr = 0.1
I0619 21:11:07.956351 3915 solver.cpp:454] Snapshotting to binary proto file face_example/face_snapshot/face_train_test_iter_4000.caffemodel
I0619 21:11:23.210160 3915 sgd_solver.cpp:273] Snapshotting solver state to binary proto file
.......................
I0619 22:57:06.232100 3915 solver.cpp:454] Snapshotting to binary proto file face_example/face_snapshot/face_train_test_iter_28000.caffemodel
I0619 22:57:19.429186 3915 sgd_solver.cpp:273] Snapshotting solver state to binary proto file face_example/face_snapshot/face_train_test_iter_28000.solverstate
I0619 22:57:19.670956 3915 solver.cpp:317] Iteration 28000, loss = nan
I0619 22:57:19.670977 3915 solver.cpp:337] Iteration 28000, Testing net (#0)
I0619 22:57:23.466572 3915 solver.cpp:404] Test net output #0: center_loss = nan (* 0.008 = nan loss)
I0619 22:57:23.466588 3915 solver.cpp:404] Test net output #1: softmax_loss = 87.3365 (* 1 = 87.3365 loss)
I0619 22:57:23.466593 3915 solver.cpp:322] Optimization Done.
I0619 22:57:23.466938 3915 caffe.cpp:254] Optimization Done.

@tianyu06030020

@David-Jin
Hi, did you solve the center_loss NaN problem? I met the same issue, as follows:
I0710 16:00:59.146741 958 solver.cpp:337] Iteration 0, Testing net (#0)
I0710 16:01:36.799926 958 solver.cpp:404] Test net output #0: center_loss = 11.7722 (* 0.008 = 0.0941774 loss)
I0710 16:01:36.800058 958 solver.cpp:404] Test net output #1: softmax_loss = 8.70807 (* 1 = 8.70807 loss)
I0710 16:01:37.358547 958 solver.cpp:228] Iteration 0, loss = 8.75433
I0710 16:01:37.358587 958 solver.cpp:244] Train net output #0: center_loss = 11.1917 (* 0.008 = 0.0895339 loss)
I0710 16:01:37.358599 958 solver.cpp:244] Train net output #1: softmax_loss = 8.66479 (* 1 = 8.66479 loss)
I0710 16:01:37.358620 958 sgd_solver.cpp:106] Iteration 0, lr = 0.1
I0710 16:02:30.796794 958 solver.cpp:228] Iteration 100, loss = nan
I0710 16:02:30.797089 958 solver.cpp:244] Train net output #0: center_loss = nan (* 0.008 = nan loss)
I0710 16:02:30.797147 958 solver.cpp:244] Train net output #1: softmax_loss = 87.3365 (* 1 = 87.3365 loss)
I0710 16:02:30.797180 958 sgd_solver.cpp:106] Iteration 100, lr = 0.1
I0710 16:03:24.598029 958 solver.cpp:228] Iteration 200, loss = nan
I0710 16:03:24.598278 958 solver.cpp:244] Train net output #0: center_loss = nan (* 0.008 = nan loss)
I0710 16:03:24.598330 958 solver.cpp:244] Train net output #1: softmax_loss = 87.3365 (* 1 = 87.3365 loss)
I0710 16:03:24.598361 958 sgd_solver.cpp:106] Iteration 200, lr = 0.1
I0710 16:04:18.405632 958 solver.cpp:228] Iteration 300, loss = nan
I0710 16:04:18.405877 958 solver.cpp:244] Train net output #0: center_loss = nan (* 0.008 = nan loss)
I0710 16:04:18.405936 958 solver.cpp:244] Train net output #1: softmax_loss = 87.3365 (* 1 = 87.3365 loss)
I0710 16:04:18.405995 958 sgd_solver.cpp:106] Iteration 300, lr = 0.1
I0710 16:05:12.201696 958 solver.cpp:228] Iteration 400, loss = nan
I0710 16:05:12.201946 958 solver.cpp:244] Train net output #0: center_loss = nan (* 0.008 = nan loss)
I0710 16:05:12.201987 958 solver.cpp:244] Train net output #1: softmax_loss = 87.3365 (* 1 = 87.3365 loss)
I0710 16:05:12.202005 958 sgd_solver.cpp:106] Iteration 400, lr = 0.1
I0710 16:06:06.010473 958 solver.cpp:228] Iteration 500, loss = nan
I0710 16:06:06.010663 958 solver.cpp:244] Train net output #0: center_loss = nan (* 0.008 = nan loss)
I0710 16:06:06.010702 958 solver.cpp:244] Train net output #1: softmax_loss = 87.3365 (* 1 = 87.3365 loss)
I0710 16:06:06.010720 958 sgd_solver.cpp:106] Iteration 500, lr = 0.1
I0710 16:06:59.846146 958 solver.cpp:228] Iteration 600, loss = nan
I0710 16:06:59.846403 958 solver.cpp:244] Train net output #0: center_loss = nan (* 0.008 = nan loss)
I0710 16:06:59.846501 958 solver.cpp:244] Train net output #1: softmax_loss = 87.3365 (* 1 = 87.3365 loss)

I use the LFW dataset, and MTCNN to detect and align the faces; the final cropped face size is 182*182. Can you help me? Thanks.

@tianyu06030020

@mfs6174 Hello, when I train with the original author's code and network model, center_loss is always nan. How can I solve this? (The training log is the same as in my previous comment above.)

train_test.prototxt:

layer {
  name: "data"
  type: "ImageData"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    mean_value: 127.5
    mean_value: 127.5
    mean_value: 127.5
    scale: 0.0078125
    mirror: true
  }
  image_data_param {
    source: "data/face/deepfunneled_lfw_alligned_by_mtcnn/train.txt"
    batch_size: 32
    shuffle: true
    #backend: LMDB
  }
}
layer {
  name: "data"
  type: "ImageData"
  top: "data"
  top: "label"
  include {
    phase: TEST
  }
  transform_param {
    mean_value: 127.5
    mean_value: 127.5
    mean_value: 127.5
    scale: 0.0078125
    mirror: true
  }
  image_data_param {
    source: "data/face/deepfunneled_lfw_alligned_by_mtcnn/test.txt"
    batch_size: 32
    #backend: LMDB
    shuffle: true
  }
}

I am using the LFW face dataset. From the comments above, when setting batch_size, must every class have more than one image? Many classes in my train.txt/test.txt have only one image. Could you explain? Thanks!

@CFAndy

CFAndy commented Jul 28, 2017

There's a potential bug in the code of @mfs6174: the center update should use atomic instructions, i.e. atomicAdd(xxx, -xxx).

@goodluckcwl

@mfs6174 If we train a CNN model with softmax loss on a dataset where the number of classes is large, the number of parameters in the final fc layer will be very large. How should we deal with this case?

@bingjielulu

softmax_loss always stays at around 9; I don't know why. I used MTCNN to align the dataset. How can I solve this problem?

@zys1994

zys1994 commented Dec 5, 2018

In the author's experiments he set alpha = 0.5; where do I change this hyperparameter?

@rmxhhh

rmxhhh commented Mar 27, 2019

@ydwen Hi, if I don't download this repo, can I use center_loss by adding a center_loss layer to BVLC/caffe and compiling? Do I need to do anything else besides that?
