
Question about the final test_acc in cifar10 experiment. #16

Open
somethingsong opened this issue Nov 16, 2023 · 13 comments
@somethingsong

Hi, I tried to reproduce the experimental results in the paper. I am using the following command:

python cv_train.py --dataset_name CIFAR10 --model ResNet9 --mode fedavg --num_clients 200 --num_workers 10 --num_rows 1 --num_cols 50000 --error_type none --local_momentum 0.0 --virtual_momentum 0.9 --max_grad_norm 1.0 --num_devices=1 --lr_scale 0 --local_batch_size -1 --share_ps_gpu

The accuracy seems incorrect; could you help me solve the problem? The logs are:

MY PID: 3424
Namespace(do_test=False, mode='fedavg', robustagg='none', use_tensorboard=False, seed=21, model='ResNet9', do_finetune=False, do_dp_finetune=False, do_checkpoint=False, checkpoint_path='/data/nvme/ashwinee/CommEfficient/CommEfficient/checkpoints/', finetune_path='./finetune', finetuned_from=None, num_results_train=2, num_results_val=2, dataset_name='CIFAR10', dataset_dir='./dataset', do_batchnorm=False, nan_threshold=999, k=50000, num_cols=50000, num_rows=1, num_blocks=20, do_topk_down=False, local_momentum=0.0, virtual_momentum=0.9, weight_decay=0.0005, num_epochs=24, num_fedavg_epochs=1, fedavg_batch_size=-1, fedavg_lr_decay=1.0, error_type='none', lr_scale=0.4, pivot_epoch=5, port=5315, num_clients=200, num_workers=10, device='cuda', num_devices=1, share_ps_gpu=True, do_iid=False, train_dataloader_workers=0, val_dataloader_workers=0, model_checkpoint='gpt2', num_candidates=2, max_history=2, local_batch_size=-1, valid_batch_size=8, microbatch_size=-1, lm_coef=1.0, mc_coef=1.0, max_grad_norm=1.0, personality_permutations=1, eval_before_start=False, checkpoint_epoch=-1, finetune_epoch=12, do_malicious=False, mal_targets=1, mal_boost=1.0, mal_epoch=0, mal_type=None, do_mal_forecast=False, do_pgd=False, do_data_ownership=False, mal_num_clients=-1, layer_freeze_idx=0, mal_layer_freeze_idx=0, mal_num_epochs=1, backdoor=-1, do_perfect_knowledge=False, do_dp=False, dp_mode='worker', l2_norm_clip=1.0, noise_multiplier=0.0, client_lr=0.1)
50000 125
Using BatchNorm: False
grad size 6568640
Finished initializing in 1.91 seconds
epoch lr train_time train_loss train_acc test_loss test_acc total_time
1 0.0800 25.2243 2.3028 0.1038 2.3012 0.1405 30.4418
2 0.1600 23.8377 2.3025 0.1017 2.2936 0.1460 57.5426
3 0.2400 24.0562 2.2886 0.1157 2.2449 0.1507 84.8176
4 0.3200 23.0985 2.2479 0.1461 2.1887 0.1535 111.1938
5 0.4000 22.0071 2.2901 0.0944 2.2941 0.0930 136.4487
6 0.3789 21.9321 2.3150 0.1301 3.3015 0.0997 161.6546
7 0.3579 21.9460 2.3782 0.1078 2.2771 0.1324 186.8818
8 0.3368 21.8156 2.2793 0.1264 2.2281 0.1360 211.9292
9 0.3158 21.6892 2.2410 0.1775 2.2307 0.1417 236.9210
10 0.2947 21.9432 2.2989 0.1024 2.2831 0.1175 262.0983
11 0.2737 21.9095 2.2511 0.1332 2.1657 0.1901 287.2876
12 0.2526 27.3621 2.1729 0.1771 2.1231 0.1734 321.2075
13 0.2316 37.6449 2.1274 0.1580 2.1067 0.2008 365.1934
14 0.2105 32.6825 2.3116 0.1308 2.0721 0.2026 401.1535
15 0.1895 22.3018 2.1435 0.1707 2.0014 0.2332 426.7760
16 0.1684 30.7159 2.0729 0.1982 2.1173 0.2312 460.7642
17 0.1474 22.4368 2.1110 0.2006 2.0027 0.2580 489.7420
18 0.1263 39.1600 2.0538 0.1897 2.0412 0.2377 535.3520
19 0.1053 38.9138 2.0614 0.2156 2.0193 0.2655 580.7346
20 0.0842 21.9821 1.9763 0.2441 2.0301 0.2679 605.9769
21 0.0632 32.8850 1.9892 0.2655 2.0524 0.2711 645.6084
22 0.0421 38.2427 1.9478 0.2627 1.8612 0.3094 690.2626
23 0.0211 38.3396 1.9010 0.2778 1.8869 0.2993 735.1110
HACK STEP
WARNING: LR is 0
WARNING: LR is 0
24 0.0000 33.1543 1.9016 0.2929 1.8394 0.3032 771.5566
done training
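For context, the lr column in the log follows a piecewise-linear (triangular) schedule implied by the flags lr_scale=0.4, pivot_epoch=5, and num_epochs=24: linear warmup to lr_scale at the pivot epoch, then linear decay to zero at the final epoch (which also explains the "WARNING: LR is 0" lines). A minimal sketch that reproduces the logged values, as an illustration rather than the repo's exact code:

```python
def triangular_lr(epoch, lr_scale=0.4, pivot_epoch=5, num_epochs=24):
    """Piecewise-linear LR: ramp to lr_scale at pivot_epoch, then decay to 0."""
    if epoch <= pivot_epoch:
        # linear warmup: 0.08, 0.16, ..., 0.40 for epochs 1..5 in the log above
        return lr_scale * epoch / pivot_epoch
    # linear decay: 0.3789 at epoch 6, ..., 0.0 at epoch 24
    return lr_scale * (num_epochs - epoch) / (num_epochs - pivot_epoch)
```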

@kiddyboots216 (Owner)

Hi, I'm not sure exactly which experimental results you're trying to reproduce. I'm not sure we had any results in the paper with mode=fedavg, num_clients=200, num_workers=10, local_batch_size=-1. Could you tell me which results you're trying to reproduce so I can tell you what hparams to use? Thanks.

@somethingsong (Author)

Thank you very much for your response. I am trying to reproduce these results:
(screenshots of the paper's results tables)
Could you please tell me what hparams to use? I genuinely appreciate your assistance.

@kiddyboots216 (Owner)

python cv_train.py --dataset_name CIFAR10 --model ResNet9 --mode sketch --num_clients 10000 --num_workers 100 --num_rows 1 --num_cols 50000 --error_type virtual --local_momentum 0.0 --virtual_momentum 0.9 --max_grad_norm 10.0 --num_devices=1 --lr_scale 0.4 --local_batch_size -1 --share_ps_gpu

@somethingsong (Author)

Thanks very much! I will try it now.

@somethingsong (Author)

I used the hparams you gave, but I got these results:
MY PID: 9007
Namespace(do_test=False, mode='sketch', use_tensorboard=False, seed=21, model='ResNet9', do_finetune=False, do_checkpoint=False, checkpoint_path='./checkpoint', finetune_path='./finetune', finetuned_from=None, num_results_train=2, num_results_val=2, dataset_name='CIFAR10', dataset_dir='./dataset', do_batchnorm=False, nan_threshold=999, k=50000, num_cols=50000, num_rows=1, num_blocks=20, do_topk_down=False, local_momentum=0.0, virtual_momentum=0.9, weight_decay=0.0005, num_epochs=24, num_fedavg_epochs=1, fedavg_batch_size=-1, fedavg_lr_decay=1, error_type='virtual', lr_scale=0.4, pivot_epoch=5, port=5315, num_clients=10000, num_workers=100, device='cuda', num_devices=1, share_ps_gpu=True, do_iid=False, train_dataloader_workers=0, val_dataloader_workers=0, model_checkpoint='gpt2', num_candidates=2, max_history=2, local_batch_size=-1, valid_batch_size=8, microbatch_size=-1, lm_coef=1.0, mc_coef=1.0, max_grad_norm=10.0, personality_permutations=1, eval_before_start=False, do_dp=False, dp_mode='worker', l2_norm_clip=1.0, noise_multiplier=0.0)
50000 13
Using BatchNorm: False
Finished initializing in 1.40 seconds
HACK STEP
WARNING: LR is 0
/home/user/CommEfficient-master/CommEfficient/utils.py:258: UserWarning: This overload of add_ is deprecated:
add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
add_(Tensor other, *, Number alpha) (Triggered internally at /opt/conda/conda-bld/pytorch_1656352645774/work/torch/csrc/utils/python_arg_parser.cpp:1174.)
grad_vec.add_(args.weight_decay / args.num_workers, weights)
WARNING: LR is 0
epoch lr train_time train_loss train_acc test_loss test_acc down (MiB) up (MiB) total_time
1 0.0800 130.7500 2.2992 0.1194 2.2792 0.1104 64929 1907 134.3366
2 0.1600 128.0034 2.1600 0.1880 1.9858 0.2506 117816 1907 264.4537
3 0.2400 119.8669 2.0835 0.2073 2.4260 0.1237 125085 1907 386.4207
4 0.3200 117.9984 2.3528 0.1367 2.3253 0.1243 131643 1907 506.5437
5 0.4000 117.7414 2.3391 0.1059 2.3031 0.1077 144526 1907 626.3972
The accuracy still seems incorrect.

@kiddyboots216 (Owner)

Can you change num_cols -> 500000?

@somethingsong (Author)

I will try, thanks!

@somethingsong (Author)

The results show some improvement:
epoch lr train_time train_loss train_acc test_loss test_acc down (MiB) up (MiB) total_time
1 0.0800 127.2544 2.2991 0.1230 2.2790 0.1240 58051 19073 130.7855
2 0.1600 127.1086 2.1490 0.1968 1.9640 0.2827 102534 19073 260.0100
3 0.2400 118.3132 1.9173 0.2945 1.7883 0.3346 118299 19073 380.4323
4 0.3200 116.6946 1.7728 0.3523 1.7535 0.3253 118763 19073 499.2499
5 0.4000 116.9935 1.7242 0.3734 1.6723 0.4013 120322 19073 618.3396
6 0.3789 117.6542 1.6713 0.4001 1.4073 0.4968 121114 19073 738.1218
7 0.3579 119.8809 1.4616 0.4794 1.2784 0.5414 121111 19073 860.1666
8 0.3368 120.2340 1.2608 0.5548 1.1843 0.6088 117029 19073 982.5226
9 0.3158 119.8844 1.1325 0.6084 0.9695 0.6726 115286 19073 1104.5195
10 0.2947 118.9704 1.0010 0.6570 0.9692 0.6685 116108 19073 1225.6146
11 0.2737 117.5381 0.9638 0.6698 0.8361 0.7195 116691 19073 1345.2715
12 0.2526 119.0444 0.8628 0.7088 0.8417 0.7188 117465 19073 1466.4860
13 0.2316 121.0632 0.7987 0.7293 0.7648 0.7514 117877 19073 1589.6690
14 0.2105 118.3240 0.7337 0.7524 0.7069 0.7697 119160 19073 1710.1443
15 0.1895 114.6481 0.7083 0.7597 0.7099 0.7623 118396 19073 1826.9046

If I want further improvement, should I try num_rows -> 5?
Thank you very much!

@kiddyboots216 (Owner)

Try this:

bash submit_cifar.sh CIFAR10 ResNet9 fedavg 1000 10 -1 none 24 5 0.2 0 0.9 1 50 0 0 50026 21 1 1 1 0 A 0 0 0 0 1 -1 worker --malicious  --iid    

@miliable commented Dec 7, 2023

Hello, sorry to bother you. I also used the parameters you gave, but I got the following error. How can I solve it?
(screenshots of the error)

@kiddyboots216 (Owner)

It seems the error occurs because the labels we are passing in are just integers denoting the class label, and for some reason the CUDA kernel doesn't work with ints? That's pretty weird. What are your torch and CUDA versions? Can you print out the types of the inputs in the backward pass? Can you try just casting the label to a torch data type?
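A minimal sketch of the casting suggestion above (the names `label`, `target`, and `logits` are illustrative, not from the repo): cross-entropy in PyTorch expects class-index targets as a LongTensor, so a plain Python int can be wrapped before the loss/backward call.

```python
import torch
import torch.nn.functional as F

label = 3  # plain Python int class label from the dataset
target = torch.tensor([label], dtype=torch.long)  # cast to a torch LongTensor

logits = torch.randn(1, 10, requires_grad=True)
loss = F.cross_entropy(logits, target)
loss.backward()  # succeeds once the target is a proper tensor dtype
```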

@miliable commented Dec 8, 2023

My CUDA version is 12.0 and my torch version is 2.1.1

@miliable

Added a line of code to solve the problem:
(screenshot of the added line of code)
