
Question about the final test_acc in cifar10 experiment. #16

Open
somethingsong opened this issue Nov 16, 2023 · 13 comments
@somethingsong

Hi, I tried to reproduce the experimental results in the paper. I am using the following command:

python cv_train.py --dataset_name CIFAR10 --model ResNet9 --mode fedavg --num_clients 200 --num_workers 10 --num_rows 1 --num_cols 50000 --error_type none --local_momentum 0.0 --virtual_momentum 0.9 --max_grad_norm 1.0 --num_devices=1 --lr_scale 0 --local_batch_size -1 --share_ps_gpu

The accuracy seems incorrect; could you help me solve the problem? The logs are:

MY PID: 3424
Namespace(do_test=False, mode='fedavg', robustagg='none', use_tensorboard=False, seed=21, model='ResNet9', do_finetune=False, do_dp_finetune=False, do_checkpoint=False, checkpoint_path='/data/nvme/ashwinee/CommEfficient/CommEfficient/checkpoints/', finetune_path='./finetune', finetuned_from=None, num_results_train=2, num_results_val=2, dataset_name='CIFAR10', dataset_dir='./dataset', do_batchnorm=False, nan_threshold=999, k=50000, num_cols=50000, num_rows=1, num_blocks=20, do_topk_down=False, local_momentum=0.0, virtual_momentum=0.9, weight_decay=0.0005, num_epochs=24, num_fedavg_epochs=1, fedavg_batch_size=-1, fedavg_lr_decay=1.0, error_type='none', lr_scale=0.4, pivot_epoch=5, port=5315, num_clients=200, num_workers=10, device='cuda', num_devices=1, share_ps_gpu=True, do_iid=False, train_dataloader_workers=0, val_dataloader_workers=0, model_checkpoint='gpt2', num_candidates=2, max_history=2, local_batch_size=-1, valid_batch_size=8, microbatch_size=-1, lm_coef=1.0, mc_coef=1.0, max_grad_norm=1.0, personality_permutations=1, eval_before_start=False, checkpoint_epoch=-1, finetune_epoch=12, do_malicious=False, mal_targets=1, mal_boost=1.0, mal_epoch=0, mal_type=None, do_mal_forecast=False, do_pgd=False, do_data_ownership=False, mal_num_clients=-1, layer_freeze_idx=0, mal_layer_freeze_idx=0, mal_num_epochs=1, backdoor=-1, do_perfect_knowledge=False, do_dp=False, dp_mode='worker', l2_norm_clip=1.0, noise_multiplier=0.0, client_lr=0.1)
50000 125
Using BatchNorm: False
grad size 6568640
Finished initializing in 1.91 seconds
epoch lr train_time train_loss train_acc test_loss test_acc total_time
1 0.0800 25.2243 2.3028 0.1038 2.3012 0.1405 30.4418
2 0.1600 23.8377 2.3025 0.1017 2.2936 0.1460 57.5426
3 0.2400 24.0562 2.2886 0.1157 2.2449 0.1507 84.8176
4 0.3200 23.0985 2.2479 0.1461 2.1887 0.1535 111.1938
5 0.4000 22.0071 2.2901 0.0944 2.2941 0.0930 136.4487
6 0.3789 21.9321 2.3150 0.1301 3.3015 0.0997 161.6546
7 0.3579 21.9460 2.3782 0.1078 2.2771 0.1324 186.8818
8 0.3368 21.8156 2.2793 0.1264 2.2281 0.1360 211.9292
9 0.3158 21.6892 2.2410 0.1775 2.2307 0.1417 236.9210
10 0.2947 21.9432 2.2989 0.1024 2.2831 0.1175 262.0983
11 0.2737 21.9095 2.2511 0.1332 2.1657 0.1901 287.2876
12 0.2526 27.3621 2.1729 0.1771 2.1231 0.1734 321.2075
13 0.2316 37.6449 2.1274 0.1580 2.1067 0.2008 365.1934
14 0.2105 32.6825 2.3116 0.1308 2.0721 0.2026 401.1535
15 0.1895 22.3018 2.1435 0.1707 2.0014 0.2332 426.7760
16 0.1684 30.7159 2.0729 0.1982 2.1173 0.2312 460.7642
17 0.1474 22.4368 2.1110 0.2006 2.0027 0.2580 489.7420
18 0.1263 39.1600 2.0538 0.1897 2.0412 0.2377 535.3520
19 0.1053 38.9138 2.0614 0.2156 2.0193 0.2655 580.7346
20 0.0842 21.9821 1.9763 0.2441 2.0301 0.2679 605.9769
21 0.0632 32.8850 1.9892 0.2655 2.0524 0.2711 645.6084
22 0.0421 38.2427 1.9478 0.2627 1.8612 0.3094 690.2626
23 0.0211 38.3396 1.9010 0.2778 1.8869 0.2993 735.1110
HACK STEP
WARNING: LR is 0
WARNING: LR is 0
24 0.0000 33.1543 1.9016 0.2929 1.8394 0.3032 771.5566
done training
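For context, the lr column in the log follows a piecewise-linear (triangular) schedule implied by the flags lr_scale=0.4, pivot_epoch=5, and num_epochs=24: linear warmup to lr_scale at the pivot epoch, then linear decay to zero at the final epoch (which also explains the "WARNING: LR is 0" lines). A minimal sketch that reproduces the logged values, as an illustration rather than the repo's exact code:

```python
def triangular_lr(epoch, lr_scale=0.4, pivot_epoch=5, num_epochs=24):
    """Piecewise-linear LR: ramp to lr_scale at pivot_epoch, then decay to 0."""
    if epoch <= pivot_epoch:
        # linear warmup: 0.08, 0.16, ..., 0.40 for epochs 1..5 in the log above
        return lr_scale * epoch / pivot_epoch
    # linear decay: 0.3789 at epoch 6, ..., 0.0 at epoch 24
    return lr_scale * (num_epochs - epoch) / (num_epochs - pivot_epoch)
```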

@kiddyboots216 (Owner)

Hi, I'm not sure exactly which experimental results you're trying to reproduce. I'm not sure we had any results in the paper with mode=fedavg, num_clients=200, num_workers=10, local_batch_size=-1. Could you tell me which results you're trying to reproduce so I can tell you what hparams to use? Thanks.

@somethingsong (Author)

Thank you very much for your response. I am trying to reproduce these results:
(screenshots of the paper's results tables)
Could you please tell me what hparams to use? I genuinely appreciate your assistance.

@kiddyboots216 (Owner)

python cv_train.py --dataset_name CIFAR10 --model ResNet9 --mode sketch --num_clients 10000 --num_workers 100 --num_rows 1 --num_cols 50000 --error_type virtual --local_momentum 0.0 --virtual_momentum 0.9 --max_grad_norm 10.0 --num_devices=1 --lr_scale 0.4 --local_batch_size -1 --share_ps_gpu

@somethingsong (Author)

Thanks very much! I will try it now.

@somethingsong (Author)

I used the hparams you gave, but I got these results:
MY PID: 9007
Namespace(do_test=False, mode='sketch', use_tensorboard=False, seed=21, model='ResNet9', do_finetune=False, do_checkpoint=False, checkpoint_path='./checkpoint', finetune_path='./finetune', finetuned_from=None, num_results_train=2, num_results_val=2, dataset_name='CIFAR10', dataset_dir='./dataset', do_batchnorm=False, nan_threshold=999, k=50000, num_cols=50000, num_rows=1, num_blocks=20, do_topk_down=False, local_momentum=0.0, virtual_momentum=0.9, weight_decay=0.0005, num_epochs=24, num_fedavg_epochs=1, fedavg_batch_size=-1, fedavg_lr_decay=1, error_type='virtual', lr_scale=0.4, pivot_epoch=5, port=5315, num_clients=10000, num_workers=100, device='cuda', num_devices=1, share_ps_gpu=True, do_iid=False, train_dataloader_workers=0, val_dataloader_workers=0, model_checkpoint='gpt2', num_candidates=2, max_history=2, local_batch_size=-1, valid_batch_size=8, microbatch_size=-1, lm_coef=1.0, mc_coef=1.0, max_grad_norm=10.0, personality_permutations=1, eval_before_start=False, do_dp=False, dp_mode='worker', l2_norm_clip=1.0, noise_multiplier=0.0)
50000 13
Using BatchNorm: False
Finished initializing in 1.40 seconds
HACK STEP
WARNING: LR is 0
/home/user/CommEfficient-master/CommEfficient/utils.py:258: UserWarning: This overload of add_ is deprecated:
add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
add_(Tensor other, *, Number alpha) (Triggered internally at /opt/conda/conda-bld/pytorch_1656352645774/work/torch/csrc/utils/python_arg_parser.cpp:1174.)
grad_vec.add_(args.weight_decay / args.num_workers, weights)
WARNING: LR is 0
epoch lr train_time train_loss train_acc test_loss test_acc down (MiB) up (MiB) total_time
1 0.0800 130.7500 2.2992 0.1194 2.2792 0.1104 64929 1907 134.3366
2 0.1600 128.0034 2.1600 0.1880 1.9858 0.2506 117816 1907 264.4537
3 0.2400 119.8669 2.0835 0.2073 2.4260 0.1237 125085 1907 386.4207
4 0.3200 117.9984 2.3528 0.1367 2.3253 0.1243 131643 1907 506.5437
5 0.4000 117.7414 2.3391 0.1059 2.3031 0.1077 144526 1907 626.3972
The accuracy still seems incorrect.

@kiddyboots216 (Owner)

Can you change num_cols -> 500000?

@somethingsong (Author)

I will try, thanks!

@somethingsong (Author)

The results show some improvement:
epoch lr train_time train_loss train_acc test_loss test_acc down (MiB) up (MiB) total_time
1 0.0800 127.2544 2.2991 0.1230 2.2790 0.1240 58051 19073 130.7855
2 0.1600 127.1086 2.1490 0.1968 1.9640 0.2827 102534 19073 260.0100
3 0.2400 118.3132 1.9173 0.2945 1.7883 0.3346 118299 19073 380.4323
4 0.3200 116.6946 1.7728 0.3523 1.7535 0.3253 118763 19073 499.2499
5 0.4000 116.9935 1.7242 0.3734 1.6723 0.4013 120322 19073 618.3396
6 0.3789 117.6542 1.6713 0.4001 1.4073 0.4968 121114 19073 738.1218
7 0.3579 119.8809 1.4616 0.4794 1.2784 0.5414 121111 19073 860.1666
8 0.3368 120.2340 1.2608 0.5548 1.1843 0.6088 117029 19073 982.5226
9 0.3158 119.8844 1.1325 0.6084 0.9695 0.6726 115286 19073 1104.5195
10 0.2947 118.9704 1.0010 0.6570 0.9692 0.6685 116108 19073 1225.6146
11 0.2737 117.5381 0.9638 0.6698 0.8361 0.7195 116691 19073 1345.2715
12 0.2526 119.0444 0.8628 0.7088 0.8417 0.7188 117465 19073 1466.4860
13 0.2316 121.0632 0.7987 0.7293 0.7648 0.7514 117877 19073 1589.6690
14 0.2105 118.3240 0.7337 0.7524 0.7069 0.7697 119160 19073 1710.1443
15 0.1895 114.6481 0.7083 0.7597 0.7099 0.7623 118396 19073 1826.9046

If I want further improvement, should I try num_rows -> 5?
Thank you very much!

@kiddyboots216 (Owner)

Try this:

bash submit_cifar.sh CIFAR10 ResNet9 fedavg 1000 10 -1 none 24 5 0.2 0 0.9 1 50 0 0 50026 21 1 1 1 0 A 0 0 0 0 1 -1 worker --malicious  --iid    

@miliable commented Dec 7, 2023

Hello, sorry to bother you. I also used the parameters you gave, but I got the following error. How can I solve it?
(screenshots of the error)

@kiddyboots216 (Owner)

It seems the error occurs because the labels we are passing in are just integers denoting the class label, and for some reason the CUDA kernel doesn't work with ints? That's pretty weird. What are your torch and CUDA versions? Can you print out the types of the inputs in the backward pass? Can you try just casting the label to a torch data type?
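A minimal sketch of the casting suggestion above (the names `label`, `target`, and `logits` are illustrative, not from the repo): cross-entropy in PyTorch expects class-index targets as a LongTensor, so a plain Python int can be wrapped before the loss/backward call.

```python
import torch
import torch.nn.functional as F

label = 3  # plain Python int class label from the dataset
target = torch.tensor([label], dtype=torch.long)  # cast to a torch LongTensor

logits = torch.randn(1, 10, requires_grad=True)
loss = F.cross_entropy(logits, target)
loss.backward()  # succeeds once the target is a proper tensor dtype
```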

@miliable commented Dec 8, 2023

My CUDA version is 12.0 and my torch version is 2.1.1

@miliable

Added a line of code to solve the problem:
(screenshot of the added line of code)
