Proposal: Enable Self-Supervised Learning in OpenFL #1316
porteratzo
started this conversation in Ideas
Summary
Introduce Self-Supervised Learning (SSL) algorithms into OpenFL to enable training on unlabeled data. This can be achieved by creating workflows that utilize techniques such as Masked Autoencoders (MAE) or DinoV2. These algorithms can pretrain models on unlabeled data, which can then be fine-tuned on labeled data for specific tasks. The final model is expected to achieve better accuracy compared to models trained solely on labeled data.
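To make the pretrain-then-fine-tune idea concrete, here is a minimal, self-contained sketch of MAE-style pretraining on unlabeled data. The toy model, data, and hyperparameters are illustrative assumptions, not the proposed OpenFL task runner; a real MAE would use a ViT encoder/decoder, but the training signal is the same.

```python
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    """Toy masked autoencoder: encode visible patches, reconstruct all patches."""
    def __init__(self, patch_dim=16, hidden=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(patch_dim, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, patch_dim)

    def forward(self, patches, mask_ratio=0.75):
        # patches: (batch, num_patches, patch_dim); no labels are used anywhere
        mask = torch.rand(patches.shape[:2]) < mask_ratio   # True = masked out
        visible = patches * (~mask).unsqueeze(-1).float()   # zero masked patches
        recon = self.decoder(self.encoder(visible))
        # MAE computes the reconstruction loss only on the masked patches
        return ((recon - patches) ** 2)[mask].mean()

torch.manual_seed(0)
model = TinyMAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
unlabeled = torch.randn(8, 10, 16)  # stand-in for a collaborator's unlabeled data

losses = []
for _ in range(20):
    loss = model(unlabeled)
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())

# After pretraining, the encoder weights would seed the fine-tuning workspace,
# where a task head is trained on the labeled subset.
```

The key property for the proposal is that the loop above consumes only unlabeled tensors, so every collaborator can contribute to it regardless of annotation availability.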
Motivation
Collaborators often face challenges in contributing to federated learning training due to the scarcity of labeled data. Generally, unlabeled data is much more abundant compared to labeled data. However, without a method to train with unlabeled data, this resource remains underutilized. With SSL, we can allow collaborators to contribute to training with their unlabeled data, improving the final accuracy of the trained models.
High-Level Design
The feature will contain two workspaces: one dedicated to SSL pretraining and another for traditional fine-tuning that can utilize the weights from the pre-trained model.
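As a sketch of how the two workspaces might be instantiated with the existing CLI (the template names below are hypothetical; only `fx workspace create --prefix/--template` is current OpenFL CLI):

```shell
# Pretraining workspace: SSL on unlabeled data (template name hypothetical)
fx workspace create --prefix ./ssl_pretrain --template torch_ssl_pretrain

# Fine-tuning workspace: consumes the pretrained weights (name hypothetical)
fx workspace create --prefix ./ssl_finetune --template torch_ssl_finetune
```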
Technical Details
Dataset
We propose using the BraTS2020 dataset, which has already been approved for use at Intel. The dataset can be accessed here.
Pretraining Workspace
- Task Runner: extends `PyTorchTaskRunner` and implements a `Transformers Trainer` for a DinoV2 SSL task for training with the OpenFL TaskRunner API.
- Data Loader: extends `openfl.federated.data.DataLoader`, handles preprocessing (utilizing the entire dataset without labels), and provides the dataloaders needed by the TaskRunner.
- A parameter, `label_uniformity_alpha`, will determine the uniformity of the label distribution among collaborators using Dirichlet Distribution-Based Partitioning.
Fine-Tuning Workspace
- Task Runner: extends `PyTorchTaskRunner` and implements a `Transformers Trainer` for a fine-tuning task for training with the OpenFL TaskRunner API.
- Data Loader: extends `openfl.federated.data.DataLoader`, handles preprocessing (utilizing a labeled subset for training), and provides the dataloaders needed by the TaskRunner.
- A parameter, `label_uniformity_alpha`, will determine the uniformity of the label distribution among collaborators using Dirichlet Distribution-Based Partitioning.
Documentation
The documentation will show how to vary `label_uniformity_alpha` to compare Independent and Identically Distributed (IID) data vs. Non-Independent and Identically Distributed (Non-IID) data, where SSL pretraining excels.
Result Comparison Notebook
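As an illustration of the Dirichlet Distribution-Based Partitioning that `label_uniformity_alpha` would control, here is a minimal NumPy sketch. The helper name and signature are hypothetical, not OpenFL API; it only shows the standard per-class Dirichlet split, where a small alpha yields highly non-IID shards and a large alpha approaches IID.

```python
import numpy as np

def dirichlet_partition(labels, num_collaborators, alpha, seed=0):
    """Split sample indices among collaborators with per-class Dirichlet weights.

    Hypothetical helper mirroring the proposed `label_uniformity_alpha` knob:
    small alpha -> skewed (non-IID) label distributions, large alpha -> near-IID.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    shards = [[] for _ in range(num_collaborators)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.where(labels == cls)[0])
        # fraction of this class assigned to each collaborator
        props = rng.dirichlet(alpha * np.ones(num_collaborators))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for shard, part in zip(shards, np.split(idx, cuts)):
            shard.extend(part.tolist())
    return shards

labels = np.repeat(np.arange(3), 100)  # 3 classes, 100 samples each
shards = dirichlet_partition(labels, num_collaborators=4, alpha=0.1)
```

With `alpha=0.1` most collaborators end up dominated by one or two classes, which is exactly the non-IID regime where SSL pretraining is expected to help.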
API Changes
The new workspace templates will be available through `fx workspace create --template`.
Dependencies
Backward Compatibility
Alternatives Considered
Risks and Mitigations
Open Questions