Criteria for training and testing are mixed up #3

Open
DingQiang2018 opened this issue Dec 28, 2021 · 11 comments

@DingQiang2018

SAT-selective-cls/train.py

Lines 200 to 201 in dc55593

train_loss, train_acc = train(trainloader, model, criterion, optimizer, epoch, use_cuda)
test_loss, test_acc = test(testloader, model, criterion, epoch, use_cuda)

It might be a mistake to use the same criterion in the train function and the test function, since this mixes up the history of the model's predictions on the training set with its history on the test set.

@LayneH
Owner

LayneH commented Dec 31, 2021

Thank you for pointing this out.
This is indeed a bug in our code: we should not pass the SAT criterion to the test() function.

I have rerun the experiments after fixing this bug and found that the performance is slightly improved.
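
For readers following the thread, a minimal sketch of what such a fix could look like, assuming the point is simply to keep the stateful SAT criterion out of evaluation and that a plain cross-entropy loss is acceptable at test time (variable names are taken from the snippet above; the actual commit may differ):

import torch.nn as nn

# Hypothetical fix sketch: the SAT criterion tracks per-sample prediction history,
# so reserve it for training and evaluate with a stateless cross-entropy loss.
test_criterion = nn.CrossEntropyLoss()

train_loss, train_acc = train(trainloader, model, criterion, optimizer, epoch, use_cuda)
test_loss, test_acc = test(testloader, model, test_criterion, epoch, use_cuda)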

@DingQiang2018
Author

Could you push your updated code to this repository? I did not get better performance after I fixed the bug and reran the experiments.

@LayneH
Owner

LayneH commented Jan 5, 2022

Hi,

Please refer to the latest commit. The scripts should produce slightly better results (if the difference is noticeable at all) than the reported ones.

@DingQiang2018
Author

Hi,
I find that even with the updated code, I cannot reproduce the CIFAR10 results reported in your paper. My results are as follows:

Coverage (%)	Mean	Std dev
100	6.008	0.138
95	3.724	0.028
90	2.064	0.045
85	1.187	0.031
80	0.656	0.002
75	0.406	0.051
70	0.298	0.055

As the table shows, the selective error rate at 95% coverage is 3.72%, which is far from the reported (3.37±0.05)%. Could you help me solve this problem?
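
For context on how such numbers are usually computed: the selective error rate at a given coverage ranks the test samples by the model's confidence and measures the error only on the most-confident fraction. A minimal, self-contained sketch with random placeholder data (not the actual CIFAR10 outputs):

import numpy as np

def selective_error(confidence, correct, coverage):
    # Keep the `coverage` fraction of samples with the highest confidence,
    # then report the error rate on that subset.
    n_keep = int(round(coverage * len(confidence)))
    kept = np.argsort(-confidence)[:n_keep]
    return 1.0 - correct[kept].mean()

# Toy usage with random placeholder data.
rng = np.random.default_rng(0)
confidence = rng.random(10000)                      # per-sample confidence/abstention score
correct = (rng.random(10000) < 0.94).astype(float)  # 1.0 where the prediction was correct
print(selective_error(confidence, correct, coverage=0.95))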

@DingQiang2018
Author

Sorry for not explaining mean and standard deviation in my last comment. In that table, they refer to the mean and standard deviation of the selective error rate, respectively, computed over 3 trials.

@LayneH
Owner

LayneH commented Jan 10, 2022

Hi,

It seems that most entries are pretty close to, or better than, the ones reported in the paper, except for the 95% coverage case.

I have checked the experiment logs and found that some of the CIFAR10 experiments (but none of the experiments on other datasets) are based on an earlier implementation of SAT, which slightly differs from the current implementation in this line:

# current implementation
soft_label[torch.arange(y.shape[0]), y] = prob[torch.arange(y.shape[0]), y]
# earlier implementation
soft_label[torch.arange(y.shape[0]), y] = 1

You can try this to see how it affects the performance.
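
To make the difference concrete, a small self-contained illustration of just these two variants; the surrounding soft-label construction is stubbed with a uniform placeholder and does not reproduce the full SAT update:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, num_classes = 4, 10
logits = torch.randn(batch, num_classes)
y = torch.randint(0, num_classes, (batch,))
prob = F.softmax(logits, dim=1)

# Stub: SAT maintains soft labels across epochs; a uniform placeholder stands in here.
soft_label = torch.full((batch, num_classes), 1.0 / num_classes)
idx = torch.arange(y.shape[0])

current = soft_label.clone()
current[idx, y] = prob[idx, y]  # current implementation: true-class weight = predicted probability

earlier = soft_label.clone()
earlier[idx, y] = 1.0           # earlier implementation: true-class weight fixed to 1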

@DingQiang2018
Author

Hi,
I reran the experiments with the earlier implementation of SAT and got the following results. In this table, mean and std dev again refer to the mean and standard deviation of the selective error rate.

Coverage (%)	Mean	Std dev
100	5.854	0.216
95	3.603	0.133
90	1.978	0.117
85	1.109	0.046
80	0.683	0.070
75	0.433	0.044
70	0.303	0.031

The performance is better than that of the current implementation of SAT, but the selective error rate at 95% coverage, 3.603%, is still not as good as the (3.37±0.05)% reported in your paper. Perhaps there was a clerical mistake in the paper?

@Jordy-VL

Interesting reproduction analysis, did this eventually get resolved?
Should one use the main branch for reproductions?

@DingQiang2018
Author

> Interesting reproduction analysis, did this eventually get resolved? Should one use the main branch for reproductions?

No, I gave up. This repository does not provide the random seed manualSeed, making it challenging to reproduce the results.
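
For anyone attempting a reproduction anyway, a typical way to pin the seeds in a PyTorch training script looks like the sketch below; the fixed value of manualSeed here is arbitrary, and this is not code from the repository:

import random
import numpy as np
import torch

manualSeed = 0  # arbitrary fixed value; the repository leaves this unspecified
random.seed(manualSeed)
np.random.seed(manualSeed)
torch.manual_seed(manualSeed)
torch.cuda.manual_seed_all(manualSeed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False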

@Jordy-VL

Might I ask you if you know of any other selective classification methods that 'actually work'?
I was looking into self-adaptive training as well, which seems related.

@DingQiang2018
Author

As far as I know, Deep Ensembles [1] really work and might be the most powerful method. However, considering the heavy computational overhead of ensembles, recent work in selective classification focuses on individual models. These methods (e.g., [2][3]) exhibit only marginal improvement over Softmax Response [4]. The advances in this line of work seem neither significant nor exciting. Nevertheless, my survey might not be comprehensive; more comprehensive surveys can be found in [5][6].
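
For concreteness, a minimal sketch of the two confidence scores mentioned above, with untrained linear models as placeholders (Softmax Response [4] uses the max softmax probability of a single model; a Deep Ensemble [1] first averages the softmax outputs of several independently trained models):

import torch
import torch.nn.functional as F

@torch.no_grad()
def softmax_response_confidence(model, x):
    # Softmax Response: confidence = maximum softmax probability of a single model.
    return F.softmax(model(x), dim=1).max(dim=1).values

@torch.no_grad()
def deep_ensemble_confidence(models, x):
    # Deep Ensemble: average the softmax outputs of independently trained models,
    # then take the maximum of the averaged distribution as the confidence.
    mean_prob = torch.stack([F.softmax(m(x), dim=1) for m in models]).mean(dim=0)
    return mean_prob.max(dim=1).values

# Toy usage with untrained placeholder models.
models = [torch.nn.Linear(32, 10) for _ in range(5)]
x = torch.randn(8, 32)
print(softmax_response_confidence(models[0], x))
print(deep_ensemble_confidence(models, x))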

[1] Lakshminarayanan et al. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In NIPS, 2017.
[2] Liu et al. Deep Gamblers: Learning to Abstain with Portfolio Theory. In NeurIPS, 2019.
[3] Feng et al. Towards Better Selective Classification. In ICLR, 2023.
[4] Geifman and El-Yaniv. Selective Classification for Deep Neural Networks. In NIPS, 2017.
[5] Gawlikowski et al. A Survey of Uncertainty in Deep Neural Networks. arXiv:2107.03342.
[6] Galil et al. What Can We Learn from the Selective Prediction and Uncertainty Estimation Performance of 523 ImageNet Classifiers? In ICLR, 2023.
