Optimize Backward Time Complexity to O(MK) #1
base: caffe-face
Conversation
…is now O(MK) instead of O(NMK); fix the code style to pass the lint test.
@mfs6174, have you reproduced the result on LFW or MegaFace? I have trained the model on CASIA and tested on LFW, but it didn't work well (EER ~96.5%).
Hi @jiangxuehan, I am still working on that with both the paper author's code and my code. I have only tested my PR's code with the MNIST toy example. Starting from the same snapshot and training data (no shuffle during training), my code produced exactly the same center_diff values and nearly the same test result as the author's code. Which code did you train the model with, the paper author's code or my PR's code? If you can reproduce the result with the paper author's code but not with my code, I will check my code again. If you cannot reproduce the result with the paper author's code either, I will discuss it with you further when I finish my experiments reproducing the result on LFW.
Hi @jiangxuehan and @mfs6174, would you mind verifying two points? 1. Is the lambda in the paper the "loss_weight"? If so, where can I set alpha?
Hi, @mfs6174 @chichan01,
Hi @jiangxuehan,
@chichan01:
Regarding reproducing the face result, I also have some questions. When training the network, did you use the author's prototxt directly (center loss parameters: lr_multi = 1, weight_decay_multi = 2, loss_weight = 0.008), or change it following the description in the paper (the parameters should be lr_multi = 5, so alpha is 0.1 * 5 = 0.5, and loss_weight = 0.003)?
Hi @jiangxuehan and @mfs6174, by the way, were you able to reproduce their MegaFace result?
Hi @mfs6174 @chichan01. To chichan01: I am still working on LFW, and running into obstacles getting to ~99% accuracy. How about your experiment on MegaFace?
@jiangxuehan @chichan01 @mfs6174 Here I try to answer some of your questions. If anything is not clear, please feel free to let me know.

Common issues

Alpha: At the beginning, we used the update strategy as in the paper. It performed well in our experiments (deepidnet & mixed dataset). When refactoring the code, we found that the implementation of updating alpha in Caffe is not elegant and becomes complicated with multi-GPU. Finally we tried lr and weight_decay in Caffe and found it very convenient; it works pretty well (achieving nearly the same performance as before). So you can try modifying the lr and weight_decay for different alphas.

Network Architecture: At the beginning we used the deepidnet, so we had to implement the local convolution layer. We are not going to release our implementation since it is quite inelegant (^_^). Most importantly, the local convolution layer is complicated and inefficient in both time (~8h for a 28-layer resnet vs. ~14h for a 6-layer deepidnet) and space (~100M for a 28-layer resnet vs. ~200M for a 6-layer deepidnet). Therefore, we use a modified resnet as our network. It performs better with fewer parameters and less training time.

To @chichan01: I guess the patch you used is not the same as in the demo or the given model. Please double check the positions of the eyes, nose, and mouth corners in the cropped face, and use our provided template (5-point landmarks) given in the demo.

To @chichan01: The loss weight is related to the total number of classes, i.e. num_output in fc6. Generally speaking, the more classes, the smaller the loss weight.

To @jiangxuehan: Please provide more details of your experiment (96.5% EER); otherwise I can't give you any hints.
Hi people,
Could you tell me which method you used to get the 5-point landmarks, and how you use these 5 points? Just as in the author's demo code?
@kkirtac There is nothing wrong with the center loss increasing during training. Just make sure the total loss (softmax loss + lambda * center loss) is decreasing.
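For readers following this exchange, here is a minimal numpy sketch of the quantity being discussed: the joint objective is the softmax loss plus lambda times the center loss (Eq. 2 in the paper). The function and variable names are illustrative, not the layer's actual API:

```python
import numpy as np

def center_loss(features, labels, centers):
    # L_C = 1/2 * sum_i ||x_i - c_{y_i}||^2  (the paper's center-loss term)
    diffs = features - centers[labels]
    return 0.5 * float(np.sum(diffs * diffs))

def total_loss(softmax_loss, features, labels, centers, lam=0.008):
    # The quantity that should decrease during training; the center term
    # alone may fluctuate or even grow while the total still goes down.
    return softmax_loss + lam * center_loss(features, labels, centers)

# Toy batch: three 2-D features from two classes.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
labels = np.array([0, 0, 1])
centers = np.array([[1.0, 0.0], [0.0, 1.0]])
```

Monitoring `total_loss` rather than the center term alone is what ydwen is recommending above.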
Hi guys, I have nearly reproduced the paper author's LFW result with both the author's code and my code. The result on LFW is nearly 98.9% with the cleaned CASIA-WebFace dataset, using the author's network and solver prototxt. The only change is the step size of the learning-rate decay, which improves loss stability during training. With my Everphoto 750k-unique-person dataset, I found that modifying the usage and training scheme of the center loss can lead to possibly better results and much faster convergence, with more stable loss values, when training on a dataset with a very large number of classes. I will consider releasing the details after further experiments.
@ydwen Due to the increasing behavior of my center loss, my total loss fluctuates. With a proper configuration, the center loss should not be increasing, I think. How about your trainings? Do you see similarly increasing center loss values?
@mfs6174, anyway, I will try to reproduce it on MegaFace in the coming week(s), since LFW is saturated and cannot show the real difference.
@twinsyssy1018, hi there, where can I get the clean CASIA dataset?

> @mfs6174, I trained on the clean CASIA dataset and reached 98.4% at last. Waiting for the details.
@chichan01 Where do you actually put the dropout?
@chichan01, I also have a question: which dataset did you use to train the PCA? I used LFW to train the PCA and got 99.0%, not as good as your 99.27%.
@yaxiongchi try this: http://pan.baidu.com/s/1kUdRRJT password: 3zbb. I cannot promise it is the right version, but it is the one I use.
Hi, my dataset has about 4M images of 80,000 subjects. When I set loss_weight: 0.008 or loss_weight: 0.0005, the softmax_loss does not decrease. When I set loss_weight: 0.0001, I get the training log below: the softmax_loss is decreasing, but the center_loss is still very high. I wonder when I can finish the training?
I1101 08:16:01.237038 6772 sgd_solver.cpp:106] Iteration 322000, lr = 1e-005
@twinsyssy1018 Hello, training with the author's network I can also only get 98.45% at best, about the same as you. Have you found the root cause of the problem yet? Which matters more, the alignment method or the network parameters?
@mfs6174 Please kindly share your advice: how exactly should the training be done? The best I can get is 98.45%, and so far I have only adjusted the loss_weight.
@twinsyssy1018 Would you mind giving me the LFW dataset link? I can't find it because the links I found on the Internet are broken. Thanks.
@ydwen, I want to change the lambda; where should I do that?
@luoyetx @happynear @ydwen It is somewhat similar to Batch Normalization, where we can't use iter_size to simulate a big batch size.
@kli-nlpr So, what can we do to simulate a big batch size? Is @luoyetx's modification right?
@louielu1027
The
@kli-nlpr @louielu1027
@mfs6174 I'm new to the Caffe source code. I thoroughly read src/caffe/proto/caffe.proto, include/caffe/layers/center_loss_layer.hpp and src/caffe/layers/center_loss_layer.cpp. My confusion is: where is the code that calculates each center vector? I didn't find the relevant code in the mentioned files. Could you give me some clues? Thank you.
@LeonSCZ Can you share the code for testing on LFW?
@getengqing I'm glad to share it with you; then we can talk about whether there's something wrong with it. Could you tell me your result, and how I should send my code to you, by email or ~?
@luoyetx Is it possible to compensate for Caffe's normalization by multiplying by iter_size?
center_diff[index * K + k] += iter_size * variation_sum[index * K + k] / (count + (Dtype)1.);
@luoyetx I can't find an elegant way to solve this problem. Nonetheless, before we find a solution, I think we should still replace
@kli-nlpr,
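For context on the `count + (Dtype)1.` normalization in the snippet quoted above: the paper's center update divides each per-class residual by one plus the number of batch samples of that class. A numpy sketch of that update rule (names hypothetical; the iter_size compensation under discussion is deliberately omitted):

```python
import numpy as np

def update_centers(features, labels, centers, alpha=0.5):
    # delta_c_j = sum_{i: y_i = j} (c_j - x_i) / (1 + n_j), then
    # c_j <- c_j - alpha * delta_c_j  (the paper's center update rule)
    new_centers = centers.copy()
    for j in np.unique(labels):          # only classes present in the batch move
        mask = labels == j
        n_j = int(mask.sum())
        delta = (centers[j] - features[mask]).sum(axis=0) / (1.0 + n_j)
        new_centers[j] = centers[j] - alpha * delta
    return new_centers
```

Because the denominator depends on the per-class count within one forward pass, splitting a batch across iter_size sub-batches changes the counts, which is why iter_size does not transparently simulate a big batch here.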
I wanted to try the MNIST example and see the effect of center loss in the joint supervision. Unlike the paper, even at lambda = 1, my centers are not well separated as in the paper. @mfs6174, can you shed some light on this? http://imgur.com/a/PGk0L
@mfs6174 Hi, I'm also trying to reproduce the result. I have tried many times; however, the best result I got is around 98.5%. Would you share in detail how you changed the step size of the learning-rate decay? I would be very grateful!
Does anyone have a Python implementation for extracting features using pycaffe? Also, has anyone tried without aligning, only resizing the faces? Thanks in advance!
@louielu1027 Did you figure out why it is flipud rather than fliplr? I tried with that test image: the Euclidean distance between the two fliplr images is 25, but between the two flipud images it is only 6.6.
Hi, could someone explain why the center loss penalizes Euclidean distance, i.e. Euclidean distances between embeddings are directly optimized, but cosine distance is used at test time instead? Thanks in advance!
If the features are normalized, cosine distance is equivalent to Euclidean distance: 2 - 2 * cosine = Euclidean^2.
Thank you for your help. But I am wondering why cosine distance is a must (when optimizing the center loss the embeddings are not normalized, so IMO it would be more straightforward to compare Euclidean distance directly). I did a threshold-search experiment directly comparing Euclidean distance without PCA (to find an accuracy upper bound), but found that comparing Euclidean distance on LFW is slightly worse (by about 0.5%) than comparing cosine distance. So I am wondering whether there is any interesting intuition behind this. Thanks again!
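The identity mentioned above is easy to check numerically: after L2 normalization, squared Euclidean distance is a monotone decreasing function of cosine similarity, so the two rankings agree. A quick sketch (the 128-dimensional random vectors are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=128)
b = rng.normal(size=128)
a /= np.linalg.norm(a)        # L2-normalize both embeddings
b /= np.linalg.norm(b)

cos = float(a @ b)
euc_sq = float(np.sum((a - b) ** 2))
# For unit vectors: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b = 2 - 2 * cos(a, b)
```

Without normalization the equivalence breaks, which is presumably why the ~0.5% gap between raw Euclidean and cosine comparison appears in the experiment above.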
@wangzzdeepir,
Hi @wangzzdeepir, my paper is now on arXiv: https://arxiv.org/abs/1704.06369 . The code and models are also released at https://github.com/happynear/NormFace . Hope it helps.
@happynear Thanks a lot, your paper gives very intuitive visualizations and a careful derivation; I am going to try it out!
net: "face_example/face_train_test.prototxt"
base_lr: 0.05
stepvalue: 16000
display: 100
solver_mode: GPU

I trained the net; at iteration 3600, loss = nan. I don't know what the problem is. .........
@David-Jin I use the LFW dataset and MTCNN to detect and align faces; the final cropped face size is 182*182. Can you help me? Thx.
@mfs6174 Hello, when I train with the original author's code and network model, center_loss is always nan. How can I solve this? train_test.protxt: I use the LFW face dataset. From the comments above, does batch_size have to be set so that every class has more than one image? My train.txt/test.txt contain many classes with only one image. Could you explain? Thanks!
There's a potential bug in the code of @mfs6174
@mfs6174 If we train a CNN model with softmax loss on a dataset where the number of classes is large, the number of parameters in the final fc layer will be too large. How do we deal with this case?
The softmax_loss always stays around 9; I don't know why. I used MTCNN to align the dataset. How can I solve this problem?
In the author's experiments he set alpha = 0.5; where do I change this hyperparameter?
@ydwen Hi, if I don't download this repo, can I use center loss by adding the center_loss layer to BVLC/caffe and compiling? Do I need to do anything else besides that?
In the original implementation, the time complexity of the backward pass of the center loss layer is O(MK+NM). It is very slow when training with a large number of classes, since the backward running time depends on the class number N. Unfortunately, this is a common case when training face recognition models (e.g. 750k unique persons).
This pull request rewrites the backward code. The time complexity is optimized to O(MK) with additional O(N) space. Because M (batch size) << N and K (feature length) << N usually hold for face recognition problems, this modification improves the training speed significantly.
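A minimal numpy sketch of the idea (an illustration, not the PR's actual C++/CUDA code): make a single pass over the M batch samples, keep an O(N) count buffer, and divide only the centers actually touched by the batch, so no loop over all N classes is needed:

```python
import numpy as np

def center_backward_fast(features, labels, centers):
    """Backward-pass sketch: O(MK) time; `count` is the O(N) bookkeeping."""
    M, K = features.shape
    N = centers.shape[0]
    count = np.zeros(N)                       # per-class sample count, O(N)
    variation_sum = np.zeros_like(centers)
    bottom_diff = np.empty_like(features)
    for i in range(M):                        # single pass over the batch
        j = labels[i]
        bottom_diff[i] = features[i] - centers[j]        # dL/dx_i
        variation_sum[j] += centers[j] - features[i]
        count[j] += 1
    center_diff = np.zeros_like(centers)
    touched = np.unique(labels)               # at most M entries, never all N
    center_diff[touched] = (variation_sum[touched]
                            / (count[touched] + 1.0)[:, None])
    return bottom_diff, center_diff
```

The naive version instead scans all N classes when accumulating the center gradient, giving the O(NM) term that dominates for large class counts.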
For a GoogLeNet v2 model trained with Everphoto's 750k-unique-person dataset, on a single Nvidia GTX Titan X, with batch size 24 and iter_size = 5, the average backward iteration time for the different cases is:
There is a more than 600x improvement.
For the paper author's "mnist_example", running on a single GTX Titan X, the training time of the original implementation vs. this PR is 4min20s vs. 3min50s. This shows that even when training on a small dataset with only 10 classes, there is still some improvement.
This PR also fixes the code style to pass Caffe's lint test (make lint).