Reproduce the results in the paper #8

Open
jiahuigeng opened this issue Oct 17, 2022 · 3 comments
@jiahuigeng

Hi, I tried to reproduce the experimental results in the paper using the command below, but the logs do not look correct. Could you share the command line you used for the paper? I am very interested in your work and would like to explore sketching techniques further.

python cv_train.py --dataset_name CIFAR10 --iid --num_workers 2 --lr_scale 0.4 --local_momentum=0.0 --num_devices 2 --num_devices=2 --num_clients 2

MY PID: 31280
5315 port in use, trying next...
Namespace(checkpoint_path='./checkpoint', dataset_dir='./dataset', dataset_name='CIFAR10', device='cuda', do_batchnorm=False, do_checkpoint=False, do_dp=False, do_finetune=False, do_iid=True, do_test=False, do_topk_down=False, dp_mode='worker', error_type='none', eval_before_start=False, fedavg_batch_size=-1, fedavg_lr_decay=1, finetune_path='./finetune', finetuned_from=None, k=50000, l2_norm_clip=1.0, lm_coef=1.0, local_batch_size=8, local_momentum=0.0, lr_scale=0.4, max_grad_norm=None, max_history=2, mc_coef=1.0, microbatch_size=-1, mode='sketch', model='ResNet9', model_checkpoint='gpt2', nan_threshold=999, noise_multiplier=0.0, num_blocks=20, num_candidates=2, num_clients=2, num_cols=500000, num_devices=2, num_epochs=24, num_fedavg_epochs=1, num_results_train=2, num_results_val=2, num_rows=5, num_workers=2, personality_permutations=1, pivot_epoch=5, port=5646, seed=21, share_ps_gpu=False, train_dataloader_workers=0, use_tensorboard=False, val_dataloader_workers=0, valid_batch_size=8, virtual_momentum=0, weight_decay=0.0005)
50000 625
Using BatchNorm: False
Finished initializing in 11.00 seconds
miniconda3/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:131: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of lr_scheduler.step() before optimizer.step(). "
CommEfficient/CommEfficient/utils.py:258: UserWarning: This overload of add_ is deprecated:
add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
add_(Tensor other, *, Number alpha) (Triggered internally at ../torch/csrc/utils/python_arg_parser.cpp:1055.)
grad_vec.add_(args.weight_decay / args.num_workers, weights)
epoch lr train_time train_loss train_acc test_loss test_acc down (MiB) up (MiB) total_time
1 0.0800 655.4752 2.3025 0.1009 2.3025 0.1014 0 59606 679.6477
2 0.1600 649.9156 2.3025 0.1008 2.3025 0.1014 0 59606 1343.1710
3 0.2400 649.3290 2.3025 0.1011 2.3025 0.1014 0 59606 2006.0574
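
A quick sanity check on the numbers above, as a minimal sketch that is not part of the repository: for a 10-class problem such as CIFAR10, a model that assigns equal probability to every class has a cross-entropy loss of ln(10) and an accuracy of 1/10. The names `num_classes`, `chance_loss`, and `chance_acc` are illustrative.

```python
import math

# CIFAR10 has 10 classes.
num_classes = 10

# Cross-entropy of a uniform prediction: -log(1 / num_classes) = log(num_classes).
chance_loss = math.log(num_classes)

# Accuracy of guessing uniformly at random.
chance_acc = 1.0 / num_classes

print(f"chance-level loss: {chance_loss:.4f}")
print(f"chance-level accuracy: {chance_acc:.4f}")
```

The logged train and test losses of 2.3025 and accuracies of about 0.10 are approximately these chance-level values, which suggests the model is not learning in this configuration.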

@kiddyboots216
Owner

Hello. Could you try the settings that we use in the paper? That is, don't add the --iid flag, and use the number of workers and number of clients from the paper instead of 2. With 2 clients and 2 workers, you are splitting the entire CIFAR10 dataset into 2 chunks and then training on the entire dataset at each epoch. That setting is nearly identical to full-batch training, so you may need to follow optimization guidelines for large-batch training, such as those behind the LAMB optimizer.
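
The arithmetic behind this point can be sketched as follows. This is a simplified model, not the repository's code: it assumes the dataset is split evenly across clients and that each participating worker contributes its whole shard, as described above. The function `samples_per_round` and the second set of values are hypothetical, chosen only for contrast, and are not the paper's settings.

```python
def samples_per_round(dataset_size, num_clients, num_workers):
    """Samples touched per round under an even split (simplifying assumption)."""
    if num_workers > num_clients:
        raise ValueError("num_workers cannot exceed num_clients")
    shard_size = dataset_size // num_clients  # samples held by each client
    return shard_size * num_workers           # samples used by one round


cifar10_train_size = 50000  # matches the "50000" printed in the log

# The configuration from the issue: 2 clients, 2 workers.
full = samples_per_round(cifar10_train_size, num_clients=2, num_workers=2)
print(full, full / cifar10_train_size)

# Hypothetical contrast: many clients, few workers per round.
small = samples_per_round(cifar10_train_size, num_clients=1000, num_workers=10)
print(small, small / cifar10_train_size)
```

Under these assumptions, the 2-client, 2-worker run covers the whole training set in every round, while a run with many more clients than workers covers only a small fraction of it.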

@Antonio-demo

Hello, I'm trying to reproduce your experimental results with the code provided for this paper, but I cannot get it to run correctly. Could you explain how to run the code correctly?

@kiddyboots216
Owner

Hi @Antonio-demo, I think you can create a new issue and provide some more details, e.g. the command that you are running.
