Proposal: Enable Self-Supervised Learning in OpenFL #1316
porteratzo
started this conversation in Ideas
Summary
Introduce Self-Supervised Learning (SSL) algorithms into OpenFL to enable training on unlabeled data. This can be achieved by creating workflows that utilize techniques such as Masked Autoencoders (MAE) or DinoV2. These algorithms can pretrain models on unlabeled data, which can then be fine-tuned on labeled data for specific tasks. The final model is expected to achieve better accuracy compared to models trained solely on labeled data.
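To make the pretrain-then-fine-tune idea concrete, here is a minimal, self-contained sketch of MAE-style pretraining on unlabeled data. The toy model, data, and hyperparameters are illustrative assumptions, not the proposed OpenFL task runner; a real MAE would use a ViT encoder/decoder, but the training signal is the same.

```python
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    """Toy masked autoencoder: encode visible patches, reconstruct all patches."""
    def __init__(self, patch_dim=16, hidden=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(patch_dim, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, patch_dim)

    def forward(self, patches, mask_ratio=0.75):
        # patches: (batch, num_patches, patch_dim); no labels are used anywhere
        mask = torch.rand(patches.shape[:2]) < mask_ratio   # True = masked out
        visible = patches * (~mask).unsqueeze(-1).float()   # zero masked patches
        recon = self.decoder(self.encoder(visible))
        # MAE computes the reconstruction loss only on the masked patches
        return ((recon - patches) ** 2)[mask].mean()

torch.manual_seed(0)
model = TinyMAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
unlabeled = torch.randn(8, 10, 16)  # stand-in for a collaborator's unlabeled data

losses = []
for _ in range(20):
    loss = model(unlabeled)
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())

# After pretraining, the encoder weights would seed the fine-tuning workspace,
# where a task head is trained on the labeled subset.
```

The key property for the proposal is that the loop above consumes only unlabeled tensors, so every collaborator can contribute to it regardless of annotation availability.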
Motivation
Collaborators often face challenges in contributing to federated learning training due to the scarcity of labeled data. Generally, unlabeled data is much more abundant compared to labeled data. However, without a method to train with unlabeled data, this resource remains underutilized. With SSL, we can allow collaborators to contribute to training with their unlabeled data, improving the final accuracy of the trained models.
High-Level Design
The feature will contain two workspaces: one dedicated to SSL pretraining and another for traditional fine-tuning that can utilize the weights from the pre-trained model.
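As a sketch of how the two workspaces might be instantiated with the existing CLI (the template names below are hypothetical; only `fx workspace create --prefix/--template` is current OpenFL CLI):

```shell
# Pretraining workspace: SSL on unlabeled data (template name hypothetical)
fx workspace create --prefix ./ssl_pretrain --template torch_ssl_pretrain

# Fine-tuning workspace: consumes the pretrained weights (name hypothetical)
fx workspace create --prefix ./ssl_finetune --template torch_ssl_finetune
```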
Technical Details
Dataset
We propose using the BraTS2020 dataset, which has already been approved for use at Intel. The dataset can be accessed here.
Pretraining Workspace
- Task Runner: extends `PyTorchTaskRunner` and implements a `Transformers Trainer` for a DinoV2 SSL task for training with the OpenFL TaskRunner API.
- Data Loader: extends `openfl.federated.data.DataLoader`, handles preprocessing (utilizing the entire dataset without labels), and provides the dataloaders needed by the TaskRunner.
- A parameter, `label_uniformity_alpha`, will determine the uniformity of the label distribution among collaborators using Dirichlet Distribution-Based Partitioning.
Fine-Tuning Workspace
- Task Runner: extends `PyTorchTaskRunner` and implements a `Transformers Trainer` for a fine-tuning task for training with the OpenFL TaskRunner API.
- Data Loader: extends `openfl.federated.data.DataLoader`, handles preprocessing (utilizing a labeled subset for training), and provides the dataloaders needed by the TaskRunner.
- A parameter, `label_uniformity_alpha`, will determine the uniformity of the label distribution among collaborators using Dirichlet Distribution-Based Partitioning.
Documentation
The documentation will show how to vary `label_uniformity_alpha` to compare Independent and Identically Distributed (IID) data vs. Non-Independent and Identically Distributed (Non-IID) data, where SSL pretraining excels.
Result Comparison Notebook
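As an illustration of the Dirichlet Distribution-Based Partitioning that `label_uniformity_alpha` would control, here is a minimal NumPy sketch. The helper name and signature are hypothetical, not OpenFL API; it only shows the standard per-class Dirichlet split, where a small alpha yields highly non-IID shards and a large alpha approaches IID.

```python
import numpy as np

def dirichlet_partition(labels, num_collaborators, alpha, seed=0):
    """Split sample indices among collaborators with per-class Dirichlet weights.

    Hypothetical helper mirroring the proposed `label_uniformity_alpha` knob:
    small alpha -> skewed (non-IID) label distributions, large alpha -> near-IID.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    shards = [[] for _ in range(num_collaborators)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.where(labels == cls)[0])
        # fraction of this class assigned to each collaborator
        props = rng.dirichlet(alpha * np.ones(num_collaborators))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for shard, part in zip(shards, np.split(idx, cuts)):
            shard.extend(part.tolist())
    return shards

labels = np.repeat(np.arange(3), 100)  # 3 classes, 100 samples each
shards = dirichlet_partition(labels, num_collaborators=4, alpha=0.1)
```

With `alpha=0.1` most collaborators end up dominated by one or two classes, which is exactly the non-IID regime where SSL pretraining is expected to help.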
API Changes
The new workspace templates will be available through `fx workspace create --template`.
Dependencies
Backward Compatibility
Alternatives Considered
Risks and Mitigations
Open Questions