This tutorial will show you how to set up and start an FL experiment to train ShuffleNet on the FEMNIST dataset using FedScale.
Check the instructions to set up your environment and instructions to download the FEMNIST dataset.
Please make sure you are using the correct environment.
conda activate fedscale
Go to the benchmark/configs/femnist/ directory and modify or create your configuration file to submit your job.
Adjust settings such as the number of participants per round, the aggregation algorithm, the client optimizer, and the training model as needed.
Use fedscale driver submit [conf_yml_path] (or python docker/driver.py submit) to submit your FL job. It will automatically launch aggregator.py and executor.py to start the FL evaluation.
You can either choose to evaluate your large-scale FL experiment over a GPU cluster or test your code on your local machine.
To stop your job:
fedscale driver stop [job_name]
# Or python docker/driver.py stop [job_name] (specified in the yml config)
It is more convenient to first test your code without a GPU cluster.
First add an argument - use_cuda: False under job_conf in your configuration file benchmark/configs/femnist/conf.yml if you are training without using any GPU.
By default, set ps_ip to localhost and worker_ips to localhost:[x], where x is the number of executors you want to run on your local machine.
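The local settings described here can be sketched in Python as edits to an already-parsed configuration. This is a minimal sketch, not FedScale code: it assumes conf.yml has been loaded into a dict, that worker_ips is a list of strings, and that job_conf is a list of single-key mappings (as the "- use_cuda: False" form suggests). The helper name make_local_conf and the starting values are hypothetical.

```python
import copy


def make_local_conf(conf: dict, num_executors: int) -> dict:
    """Return a copy of `conf` adjusted for a CPU-only run on one machine.

    Assumes `conf` mirrors conf.yml: `ps_ip` is a string, `worker_ips` is a
    list of strings, and `job_conf` is a list of single-key dicts.
    """
    if num_executors < 1:
        raise ValueError("num_executors must be at least 1")

    local = copy.deepcopy(conf)  # leave the caller's config untouched
    local["ps_ip"] = "localhost"
    local["worker_ips"] = [f"localhost:[{num_executors}]"]

    # Drop any existing use_cuda entry, then append the CPU-only setting.
    job_conf = [item for item in local.get("job_conf", []) if "use_cuda" not in item]
    job_conf.append({"use_cuda": False})
    local["job_conf"] = job_conf
    return local


# Hypothetical starting config, standing in for a parsed conf.yml.
cluster_conf = {
    "ps_ip": "10.0.0.1",
    "worker_ips": ["10.0.0.2:[4,4]"],
    "job_conf": [{"job_name": "femnist"}, {"use_cuda": True}],
}
local_conf = make_local_conf(cluster_conf, num_executors=2)
print(local_conf["worker_ips"])
print(local_conf["job_conf"])
```

Working on a deep copy keeps the cluster settings intact, so the same parsed file can serve both the local test and a later cluster submission. Reading and writing the YAML file itself is left out of this sketch.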
Then run the following command to start your FL job:
python driver.py start benchmark/configs/femnist/conf.yml
We also provide Jupyter notebook examples for running your code locally. Start the server first, and then run the client.
Once you have followed the instructions to set up your experiment cluster, you will be able to submit your FL job to the cluster!
Change ps_ip and worker_ips in the configuration file to the host names of your nodes, which you can look up with cat /etc/hosts.
For example, setting 10.0.0.2:[4,4] as one of the worker_ips launches 4 executors on each of the first two GPUs on 10.0.0.2 to train your model in a space/time sharing fashion.
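To make the worker_ips notation concrete, the sketch below reads an entry of the form host:[n0,n1,...] and reports how many executors each GPU would receive. It only illustrates the format as described in this tutorial; parse_worker_entry is a hypothetical helper, not the parser FedScale uses internally.

```python
import ast


def parse_worker_entry(entry: str) -> tuple[str, list[int]]:
    """Split 'host:[n0,n1,...]' into the host and the executors per GPU."""
    host, sep, counts_text = entry.partition(":")
    if not host or not sep:
        raise ValueError(f"expected 'host:[counts]', got {entry!r}")

    counts = ast.literal_eval(counts_text)  # safely evaluates the list literal
    if not isinstance(counts, list) or not all(
        isinstance(n, int) and not isinstance(n, bool) and n >= 0 for n in counts
    ):
        raise ValueError(f"expected a list of non-negative integers in {entry!r}")
    return host, counts


host, per_gpu = parse_worker_entry("10.0.0.2:[4,4]")
for gpu_index, num_executors in enumerate(per_gpu):
    print(f"{host} GPU {gpu_index}: {num_executors} executors")
print(f"total executors on {host}: {sum(per_gpu)}")
```

Each position in the list corresponds to one GPU on that node, so [4,4] yields 8 executors in total on 10.0.0.2, while localhost:[2] describes 2 executors on a single device.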
Make sure the node from which you submit the job has access to the computation nodes, and that the code is synchronized across all nodes. Then run the following command to submit your FL job:
fedscale driver submit [conf_yml_path]
# Or python docker/driver.py submit benchmark/configs/femnist/conf.yml
You can find the job log job_name under the path log_path specified in the conf.yml file. To check the training loss or test accuracy, run:
cat job_name_logging |grep 'Training loss'
cat job_name_logging |grep 'FL Testing'
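If you prefer to inspect the log from Python, the same filtering can be sketched as a plain substring match, which is what these grep patterns amount to. The function names are hypothetical, and the sample lines are placeholders rather than real FedScale log output.

```python
from typing import Iterable, Iterator


def matching_lines(lines: Iterable[str], keyword: str) -> Iterator[str]:
    """Yield each line containing `keyword`, with the trailing newline removed."""
    for line in lines:
        if keyword in line:
            yield line.rstrip("\n")


def grep_log(path: str, keyword: str) -> list[str]:
    """Collect the lines of the log file at `path` that contain `keyword`."""
    with open(path, encoding="utf-8", errors="replace") as log_file:
        return list(matching_lines(log_file, keyword))


# Made-up lines for illustration only; real log lines will look different.
sample = [
    "placeholder line with Training loss in it\n",
    "placeholder line about something else\n",
    "placeholder line with FL Testing in it\n",
]
print(list(matching_lines(sample, "Training loss")))
print(list(matching_lines(sample, "FL Testing")))
```

Calling grep_log("job_name_logging", "Training loss") would mirror the first command, assuming the log file sits in the current directory.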
You can also use TensorBoard to better visualize the progress.