Criteria for training and testing are mixed up #3

Open
DingQiang2018 opened this issue Dec 28, 2021 · 11 comments

@DingQiang2018

SAT-selective-cls/train.py

Lines 200 to 201 in dc55593

train_loss, train_acc = train(trainloader, model, criterion, optimizer, epoch, use_cuda)
test_loss, test_acc = test(testloader, model, criterion, epoch, use_cuda)

It might be a mistake to use the same criterion in the train function and the test function, since this mixes up the history of the model's predictions on the training set with its history on the test set.

@LayneH
Owner

LayneH commented Dec 31, 2021

Thank you for pointing this out.
This is indeed a bug in our code: we should not pass the SAT criterion to the test() function.

I have rerun the experiments after fixing this bug and found that the performance is slightly improved.
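
For readers following the thread, a minimal sketch of what such a fix could look like, assuming the point is simply to keep the stateful SAT criterion out of evaluation and that a plain cross-entropy loss is acceptable at test time (variable names are taken from the snippet above; the actual commit may differ):

import torch.nn as nn

# Hypothetical fix sketch: the SAT criterion tracks per-sample prediction history,
# so reserve it for training and evaluate with a stateless cross-entropy loss.
test_criterion = nn.CrossEntropyLoss()

train_loss, train_acc = train(trainloader, model, criterion, optimizer, epoch, use_cuda)
test_loss, test_acc = test(testloader, model, test_criterion, epoch, use_cuda)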

@DingQiang2018
Author

Could you push your updated code to this repository? I did not get better performance after I fixed the bug and reran the experiments.

@LayneH
Owner

LayneH commented Jan 5, 2022

Hi,

Please refer to the latest commit. The scripts should produce slightly better results (if the difference is noticeable at all) than the reported ones.

@DingQiang2018
Author

Hi,
I find that even with the updated code, I cannot reproduce the CIFAR10 results reported in your paper. My results are as follows:

Coverage (%)	Mean	Std dev
100	6.008	0.138
95	3.724	0.028
90	2.064	0.045
85	1.187	0.031
80	0.656	0.002
75	0.406	0.051
70	0.298	0.055

As the table shows, the selective error rate at 95% coverage is 3.72%, which is far from the reported (3.37±0.05)%. Could you help me solve this problem?
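
For context on how such numbers are usually computed: the selective error rate at a given coverage ranks the test samples by the model's confidence and measures the error only on the most-confident fraction. A minimal, self-contained sketch with random placeholder data (not the actual CIFAR10 outputs):

import numpy as np

def selective_error(confidence, correct, coverage):
    # Keep the `coverage` fraction of samples with the highest confidence,
    # then report the error rate on that subset.
    n_keep = int(round(coverage * len(confidence)))
    kept = np.argsort(-confidence)[:n_keep]
    return 1.0 - correct[kept].mean()

# Toy usage with random placeholder data.
rng = np.random.default_rng(0)
confidence = rng.random(10000)                      # per-sample confidence/abstention score
correct = (rng.random(10000) < 0.94).astype(float)  # 1.0 where the prediction was correct
print(selective_error(confidence, correct, coverage=0.95))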

@DingQiang2018
Author

Sorry for not explaining mean and standard deviation in my last comment. In that table, they refer to the mean and standard deviation of the selective error rate, respectively, computed over 3 trials.

@LayneH
Owner

LayneH commented Jan 10, 2022

Hi,

It seems that most entries are pretty close to, or better than, the ones reported in the paper, except for the 95% coverage case.

I have checked the experiment logs and found that some of the CIFAR10 experiments (but none of the experiments on other datasets) are based on an earlier implementation of SAT, which slightly differs from the current implementation in this line:

# current implementation
soft_label[torch.arange(y.shape[0]), y] = prob[torch.arange(y.shape[0]), y]
# earlier implementation
soft_label[torch.arange(y.shape[0]), y] = 1

You can try this to see how it affects the performance.
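
To make the difference concrete, a small self-contained illustration of just these two variants; the surrounding soft-label construction is stubbed with a uniform placeholder and does not reproduce the full SAT update:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, num_classes = 4, 10
logits = torch.randn(batch, num_classes)
y = torch.randint(0, num_classes, (batch,))
prob = F.softmax(logits, dim=1)

# Stub: SAT maintains soft labels across epochs; a uniform placeholder stands in here.
soft_label = torch.full((batch, num_classes), 1.0 / num_classes)
idx = torch.arange(y.shape[0])

current = soft_label.clone()
current[idx, y] = prob[idx, y]  # current implementation: true-class weight = predicted probability

earlier = soft_label.clone()
earlier[idx, y] = 1.0           # earlier implementation: true-class weight fixed to 1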

@DingQiang2018
Author

Hi,
I reran the experiments with the earlier implementation of SAT and got the following results. In this table, mean and std dev again refer to the mean and standard deviation of the selective error rate.

Coverage (%)	Mean	Std dev
100	5.854	0.216
95	3.603	0.133
90	1.978	0.117
85	1.109	0.046
80	0.683	0.070
75	0.433	0.044
70	0.303	0.031

The performance is better than that of the current implementation of SAT, but the selective error rate at 95% coverage, 3.603%, is still not as good as the (3.37±0.05)% reported in your paper. Perhaps there was a clerical mistake in the paper?

@Jordy-VL

Interesting reproduction analysis, did this eventually get resolved?
Should one use the main branch for reproductions?

@DingQiang2018
Author

> Interesting reproduction analysis, did this eventually get resolved? Should one use the main branch for reproductions?

No, I gave up. This repository does not provide the random seed manualSeed, making it challenging to reproduce the results.
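
For anyone attempting a reproduction anyway, a typical way to pin the seeds in a PyTorch training script looks like the sketch below; the fixed value of manualSeed here is arbitrary, and this is not code from the repository:

import random
import numpy as np
import torch

manualSeed = 0  # arbitrary fixed value; the repository leaves this unspecified
random.seed(manualSeed)
np.random.seed(manualSeed)
torch.manual_seed(manualSeed)
torch.cuda.manual_seed_all(manualSeed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False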

@Jordy-VL

Might I ask you if you know of any other selective classification methods that 'actually work'?
I was looking into self-adaptive training as well, which seems related.

@DingQiang2018
Author

As far as I know, Deep Ensembles [1] really work and might be the most powerful method. However, considering the heavy computational overhead of ensembles, recent work in selective classification focuses on individual models. These methods (e.g., [2][3]) exhibit only marginal improvement over Softmax Response [4]. The advances in this line of work seem neither significant nor exciting. Nevertheless, my survey might not be comprehensive; more comprehensive surveys can be found in [5][6].
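
For concreteness, a minimal sketch of the two confidence scores mentioned above, with untrained linear models as placeholders (Softmax Response [4] uses the max softmax probability of a single model; a Deep Ensemble [1] first averages the softmax outputs of several independently trained models):

import torch
import torch.nn.functional as F

@torch.no_grad()
def softmax_response_confidence(model, x):
    # Softmax Response: confidence = maximum softmax probability of a single model.
    return F.softmax(model(x), dim=1).max(dim=1).values

@torch.no_grad()
def deep_ensemble_confidence(models, x):
    # Deep Ensemble: average the softmax outputs of independently trained models,
    # then take the maximum of the averaged distribution as the confidence.
    mean_prob = torch.stack([F.softmax(m(x), dim=1) for m in models]).mean(dim=0)
    return mean_prob.max(dim=1).values

# Toy usage with untrained placeholder models.
models = [torch.nn.Linear(32, 10) for _ in range(5)]
x = torch.randn(8, 32)
print(softmax_response_confidence(models[0], x))
print(deep_ensemble_confidence(models, x))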

[1] Lakshminarayanan et al. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In NIPS, 2017.
[2] Liu et al. Deep Gamblers: Learning to Abstain with Portfolio Theory. In NeurIPS, 2019.
[3] Feng et al. Towards Better Selective Classification. In ICLR, 2023.
[4] Geifman and El-Yaniv. Selective Classification for Deep Neural Networks. In NIPS, 2017.
[5] Gawlikowski et al. A Survey of Uncertainty in Deep Neural Networks. arXiv:2107.03342.
[6] Galil et al. What Can We Learn from the Selective Prediction and Uncertainty Estimation Performance of 523 ImageNet Classifiers? In ICLR, 2023.
