Skip to content

Commit

Permalink
self-paced-training: chapter 3 (#3160)
Browse files Browse the repository at this point in the history
Fixes # .

### Description

A few sentences describing the changes proposed in this pull request.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

---------

Co-authored-by: Ziyue Xu <[email protected]>
  • Loading branch information
chesterxgchen and ZiyueXu77 authored Feb 4, 2025
1 parent de829ce commit d20f4cc
Show file tree
Hide file tree
Showing 108 changed files with 4,046 additions and 260 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -73,6 +73,12 @@
"root": {
"level": "INFO",
"handlers": ["consoleHandler", "logFileHandler", "errorFileHandler", "jsonFileHandler", "FLFileHandler"]
},
"nvflare.app_opt.pt.client_api_launcher_executor.PTClientAPILauncherExecutor": {
"level": "ERROR"
},
"nvflare.app_opt.pt.in_process_client_api_executor.PTInProcessClientAPIExecutor": {
"level": "ERROR"
}
}
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -104,19 +104,6 @@ def main():

flare.send(output_model)

print(
f"\n"
f"Result Summary\n"
" Training parameters:\n"
" number of clients = 5\n"
f" round = {round + 1},\n"
f" batch_size = {batch_size},\n"
f" epochs = {epochs},\n"
f" lr = {lr},\n"
f" total data batches = {n_loaders},\n"
f" Metrics: last_loss = {last_loss}\n"
)


if __name__ == "__main__":
main()
Original file line number Diff line number Diff line change
Expand Up @@ -39,47 +39,20 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"id": "951d0fe6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"/home/chester/projects/NVFlare/examples/tutorials/self-paced-training/part-1_federated_learning_introduction/Chapter-1_running_federated_learning_applications/01.1_running_federated_learning_job/code\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/chester/.local/lib/python3.10/site-packages/IPython/core/magics/osm.py:417: UserWarning: This is now an optional IPython functionality, setting dhist requires you to install the `pickleshare` library.\n",
" self.shell.db['dhist'] = compress_dhist(dhist)[-100:]\n"
]
}
],
"outputs": [],
"source": [
"%cd code "
]
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"id": "ecc3a0cc",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Traceback (most recent call last):\n",
" File \"/home/chester/projects/NVFlare/examples/tutorials/self-paced-training/part-1_federated_learning_introduction/Chapter-1_running_federated_learning_applications/01.1_running_federated_learning_job/code/fl_job.py\", line 17, in <module>\n",
" from nvflare.app_opt.pt.job_config.fed_avg import FedAvgJob\n",
"ModuleNotFoundError: No module named 'nvflare'\n"
]
}
],
"outputs": [],
"source": [
"! python3 fl_job.py"
]
Expand Down Expand Up @@ -125,7 +98,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "nvflare_env",
"language": "python",
"name": "python3"
},
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -104,20 +104,10 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"id": "87a13909",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"100.0%\n",
"Extracting /tmp/nvflare/data/cifar10/cifar-10-python.tar.gz to /tmp/nvflare/data/cifar10\n",
"Files already downloaded and verified\n"
]
}
],
"outputs": [],
"source": [
"!python3 code/data/download.py"
]
Expand All @@ -132,28 +122,10 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"id": "08bbe572",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[01;34m/tmp/nvflare/data/cifar10/cifar-10-batches-py/\u001b[0m\n",
"├── \u001b[00mbatches.meta\u001b[0m\n",
"├── \u001b[00mdata_batch_1\u001b[0m\n",
"├── \u001b[00mdata_batch_2\u001b[0m\n",
"├── \u001b[00mdata_batch_3\u001b[0m\n",
"├── \u001b[00mdata_batch_4\u001b[0m\n",
"├── \u001b[00mdata_batch_5\u001b[0m\n",
"├── \u001b[00mreadme.html\u001b[0m\n",
"└── \u001b[00mtest_batch\u001b[0m\n",
"\n",
"0 directories, 8 files\n"
]
}
],
"outputs": [],
"source": [
"!tree /tmp/nvflare/data/cifar10/cifar-10-batches-py/"
]
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -42,7 +42,7 @@
"Each section is designed to provide comprehensive guidance and practical examples to help you implement and customize federated learning in your applications. For detailed instructions and examples, refer to the respective notebooks linked in each section.\n",
"\n",
"\n",
"Now let's move on to the [Chapter 2](../../Chapter-2_develop_federated_learning_applications/02.0_introduction/introduction.ipynb\n",
"Now let's move on to the [Chapter 2](../../chapter-2_develop_federated_learning_applications/02.0_introduction/introduction.ipynb\n",
")"
]
},
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -18,9 +18,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"[Chapter 1.1: Running federated learning applications](./Chapter-1_running_federated_learning_applications/01.0_introduction/introduction.ipynb)\n",
"[Chapter 1.1: Running federated learning applications](./chapter-1_running_federated_learning_applications/01.0_introduction/introduction.ipynb)\n",
"\n",
"[Chapter 1.2: Develop federated learning applications](./Chapter-2_develop_federated_learning_applications/02.0_introduction/introduction.ipynb)\n"
"[Chapter 1.2: Develop federated learning applications](./chapter-2_develop_federated_learning_applications/02.0_introduction/introduction.ipynb)\n"
]
},
{
Expand Down
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Original file line number Diff line number Diff line change
@@ -1,11 +1,170 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"cell_type": "markdown",
"id": "397e8d18-2aab-4aa4-b186-68e8acbfc71a",
"metadata": {},
"outputs": [],
"source": [
"# NVIDIA FLARE's Federated Computing Platform\n",
"\n",
"In this chapter, we will overview the core concepts and system architecture of NVIDIA FLARE (NVFlare). We will explore different aspects of the NVFlare system, simulate deployment, and learn how to interact with the system.\n",
"\n",
"## Federated Learning vs. Federated Computing\n",
"\n",
"At its core, FLARE serves as a federated computing framework, with applications such as Federated Learning and Federated Analytics built upon this foundation. Notably, it is agnostic to datasets, workloads, and domains. Unlike centralized data lake solutions that require copying data to a central location, FLARE brings computing capabilities directly to distributed datasets. This approach ensures that data remains within the compute node, with only pre-approved, selected results shared among collaborators. Moreover, FLARE is system agnostic, offering easy integration with various data processing frameworks through the implementation of the FLARE client. This client facilitates deployment in sub-processes, Docker containers, Kubernetes pods, HPC, or specialized systems.\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "29e73876",
"metadata": {},
"source": [
"## Core Concepts\n",
"\n",
"In NVIDIA FLARE (NVFlare), there are a few core concepts:\n",
"\n",
"* Server-side component: Controller\n",
"* Client-side component: Executor\n",
"* Communication message: Shareable\n",
"* Filtering mechanism\n",
"* Building Block: FLComponent\n",
"* Job\n",
"\n",
"In Part 1, we only encountered Job. We will discuss the rest in this section.\n",
"\n",
"### Controller\n",
"\n",
"The controller is the object that defines the logic for the clients to follow. The controller API makes it possible to create any client coordination logic in a federated learning workflow.\n",
"\n",
"In other words, the controller defines the workflow: i.e., how the federated execution will be carried out. For example, whether the execution is in a round-robin style or scatter & gather style is defined by the controller.\n",
"\n",
"The controller, in most cases, is executed on the FL server. Some refer to this as the server strategy. The controller can also be executed on the client side (referred to as client-side-controller). This can be used to define peer-to-peer styles of workflow, such as swarm learning.\n",
"\n",
"### Executor\n",
"\n",
"The Executor is the object that defines the logic to execute on the client side. It handles the tasks defined by the Controller and responds back to the task requests.\n",
"\n",
"The interaction between the Controller and Executor can be found in the following picture:\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "abe8109c",
"metadata": {},
"source": [
"### Shareable \n",
"\n",
"A [Shareable](https://nvflare.readthedocs.io/en/main/programming_guide/shareable.html) object represents communication between the server and client. Technically, a Shareable object is implemented as a Python dict. This dict contains two kinds of information:\n",
"* Header \n",
" * Peer Properties\n",
" * Cookie \n",
" * Return code\n",
"* Content\n",
"\n",
"In other words, a Shareable is nothing but a dictionary with some metadata information.\n"
]
},
{
"cell_type": "markdown",
"id": "51e9e4fa",
"metadata": {},
"source": [
"The Controller and Executor exchange Shareable\n",
"\n",
"<img src=\"controller_executor_no_filter.png\" alt=\"Controller and executor\" width=\"700\" height=\"400\">\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "187ed5ee",
"metadata": {},
"source": [
"### Filters\n",
"\n",
"NVIDIA FLARE also introduces a filtering mechanism to allow users to limit the input & outputs. Filters in NVIDIA FLARE are a way to transform the Shareable object between the communicating parties. A [Filter](https://nvflare.readthedocs.io/en/main/programming_guide/filters.html) can be used to provide additional processing to shareable data before sending or after receiving from the peer.\n",
"\n",
"<img src=\"controller_worker_flow.png\" alt=\"Controller and executor with filters\" width=\"700\" height=\"400\">\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "fd3c8200",
"metadata": {},
"source": [
"### FLComponent\n",
"\n",
"NVIDIA FLARE is built with components. FLComponent is the building block of all components (Base Class). Controller, Executor, Filter, and Shareable are all types of FLComponent.\n",
"\n",
"The core property of FLComponent is event support. FLComponent is able to fire and receive events, enabling the FLARE system to be an event-driven, pluggable system.\n",
"\n",
"### FLContext\n",
"One of the most important features of NVIDIA FLARE is ```nvflare.apis.fl_context``` to pass data between the FL components. FLContext is available to every method of all FLComponent types (Controller, Aggregator, Filter, Executor).\n",
"\n",
"\n",
"Through the FL Context, the component developer can:\n",
"\n",
"* Get services provided by the underlying infrastructure\n",
"\n",
"* Share data with other components of the FL system, even including components in the peer endpoints (between server and clients)\n",
"\n",
"FLContext can be thought of as a Python dictionary that stores key/value pairs. Data items stored in FLContext are called properties, or props for short. Props have two attributes: visibility and stickiness.\n",
"\n",
"\n",
"### Events\n",
"\n",
"NVIDIA FLARE fires and manages events in the lifecycle of the system. There are two categories of event types: Local Event and Fed Event.\n",
"\n",
"Both client and server have local events for their respective system activities. The client's local event can also be converted to a \"Fed Event,\" which means the event will propagate and fire on the server side.\n"
]
},
{
"cell_type": "markdown",
"id": "2afae41d",
"metadata": {},
"source": [
"## High-Level Concepts\n",
"\n",
"Although understanding these core concepts will enable FLARE users to build powerful federated computing algorithms, some data scientists may prefer higher-level constructs.\n",
"\n",
"NVFLARE also introduced a few concepts to reduce the learning curve.\n",
"\n",
" * FLModel -- higher-level communication data structure\n",
"\n",
"### FLModel\n",
"\n",
"The FLModel structure is a higher-level data structure designed for data scientists. This structure may not be general for common federated computing messaging communication, but it is suitable for federated learning applications.\n",
"\n",
"We define a standard data structure, FLModel, that captures the common attributes needed for exchanging learning results. This is particularly useful when the NVFlare system needs to exchange learning information with external training scripts/systems. The external training script/system only needs to extract the required information from the received FLModel, run local training, and put the results in a new FLModel to be sent back.\n",
"\n",
"Behind the scenes, we will convert the FLModel structure to and from Shareable.\n",
"\n",
"**FLModel**\n",
"\n",
"A standardized data structure for NVFlare to communicate with external systems.\n",
"\n",
"**Parameters:**\n",
"* params_type – type of the parameters. It only describes the “params”. If params_type is None, params need to be None. If params are provided but params_type is not provided, then it will be treated as FULL.\n",
"* params – model parameters, for example, model weights for deep learning.\n",
"* optimizer_params – optimizer parameters. In many cases, the optimizer parameters don’t need to be transferred during FL training.\n",
"* metrics – evaluation metrics such as loss and scores.\n",
"* start_round – the start FL round. A round means a round trip between client/server during training. None for inference.\n",
"* current_round – the current FL round. A round means a round trip between client/server during training. None for inference.\n",
"* total_rounds – total number of FL rounds. A round means a round trip between client/server during training. None for inference.\n",
"* meta – metadata dictionary used to contain any key-value pairs to facilitate the process.\n",
"\n",
"Now with a few concepts, let's take a look at the [system architecture](../03.1_federated_computing_architecture/system_architecture.ipynb).\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "93d3e92a",
"metadata": {},
"source": []
}
],
Expand Down
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

This file was deleted.

Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading

0 comments on commit d20f4cc

Please sign in to comment.