[TRACKING] feat: Integrate SwanLab for experiment tracking with online/offline mode and local dashboard support (#218)

---

### Pull Request Description  

This PR introduces **SwanLab**, a lightweight open-source experiment
tracking tool, as a new logging option for the training framework. The
integration provides both online and offline tracking capabilities,
along with a local dashboard for visualizing results. Below is a
detailed overview of the changes and usage instructions:

---

#### **Key Features of SwanLab Integration**

1. **Online and Offline Tracking**:
   - **Online Mode**: Track experiments remotely and store data on
     SwanLab's cloud platform.
   - **Offline Mode**: Use a local dashboard to visualize training logs
     without an internet connection.

2. **Hardware Monitoring**:
   - Automatically tracks GPU usage, power consumption, temperature, and
     other hardware metrics.
   - Supports NVIDIA GPUs and Huawei Ascend NPUs.

3. **Remote Access**:
   - View training progress remotely via the SwanLab web interface or
     mobile app.

4. **Local Dashboard**:
   - Includes an open-source local dashboard for offline visualization of
     training logs.

---

#### **Usage Instructions**

##### **Step 1: Set Up Online Tracking (Optional)**

To use SwanLab's online tracking, log in to the [SwanLab
website](https://swanlab.cn) and obtain your API key from the [Settings
page](https://swanlab.cn/space/~/settings). Then, authenticate using the
following command:

```bash
swanlab login
```

If you prefer offline mode, skip this step.

---

##### **Step 2: Configure SwanLab as the Logger**

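To enable SwanLab as the experiment tracker, include `'swanlab'` in the
`trainer.logger` list of your training command, for example
`trainer.logger=['swanlab']`, or `trainer.logger=['console','swanlab']`
to keep console output as well. The command below uses the latter with
the [Post-train a LLM using PPO with GSM8K
dataset](https://verl.readthedocs.io/en/latest/start/quickstart.html)
workflow: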

```bash
PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
 data.train_files=$HOME/data/gsm8k/train.parquet \
 data.val_files=$HOME/data/gsm8k/test.parquet \
 data.train_batch_size=256 \
 data.val_batch_size=1312 \
 data.max_prompt_length=512 \
 data.max_response_length=256 \
 actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
 actor_rollout_ref.actor.optim.lr=1e-6 \
 actor_rollout_ref.actor.ppo_mini_batch_size=64 \
 actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
 actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
 actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
 actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
 actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
 critic.optim.lr=1e-5 \
 critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
 critic.ppo_micro_batch_size_per_gpu=4 \
 algorithm.kl_ctrl.kl_coef=0.001 \
 trainer.logger=['console','swanlab'] \
 +trainer.val_before_train=False \
 trainer.default_hdfs_dir=null \
 trainer.n_gpus_per_node=1 \
 trainer.nnodes=1 \
 trainer.save_freq=10 \
 trainer.test_freq=10 \
 trainer.total_epochs=15 2>&1 | tee verl_demo.log
```
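
The `trainer.logger` value in the command above is a list of backend names. As a rough sketch of how such a value could be normalized and checked against the `supported_backend` list from this PR's diff (the helper name `normalize_backends` and the error message are hypothetical and not part of verl):

```python
from typing import List, Union

# Backend names copied from the `supported_backend` list in this PR's diff.
SUPPORTED_BACKENDS = ["wandb", "mlflow", "swanlab", "console"]


def normalize_backends(default_backend: Union[str, List[str]]) -> List[str]:
    """Hypothetical helper: accept one backend name or a list and validate it."""
    # A single string such as 'console' is treated as a one-element list.
    if isinstance(default_backend, str):
        default_backend = [default_backend]
    for name in default_backend:
        if name not in SUPPORTED_BACKENDS:
            raise ValueError(
                f"{name!r} is not a supported backend; choose from {SUPPORTED_BACKENDS}"
            )
    return list(default_backend)


print(normalize_backends("console"))
print(normalize_backends(["console", "swanlab"]))
```

Accepting either a string or a list mirrors the `Union[str, List[str]]` annotation on `default_backend` in the diff; the rest of the validation is an assumption made for illustration.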

If you are not logged in, you will be prompted to choose a tracking
mode:

1. **Cloud Mode**: Upload logs to SwanLab's cloud platform.
2. **Cloud-Only Mode**: Upload logs to the cloud but do not save them
locally.
3. **Local Mode**: Save logs locally for offline tracking.

<img width="1325" alt="select"
src="https://github.com/user-attachments/assets/5c55fc45-79a9-4673-ae4e-ea9d0623dd29"
/>

Alternatively, you can configure SwanLab using environment variables:

```bash
export SWANLAB_API_KEY=<your_api_key>    # API key for online tracking; if set, it overwrites any previous login
export SWANLAB_LOG_DIR=<local_log_path>  # Local log directory (default: swanlog)
export SWANLAB_MODE=<mode>               # Tracking mode: cloud (default), cloud-only, local, or disabled
```
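
The fallback values for these variables can be sketched as follows. This is a minimal illustration that mirrors the `os.environ.get` calls in this PR's diff. The function name `resolve_swanlab_settings` and the explicit mode check are assumptions added for clarity. The PR itself passes the mode straight to `swanlab.init` without validating it.

```python
import os
from typing import Mapping, Optional

# Mode names as listed in the environment-variable comments above.
VALID_MODES = ("cloud", "cloud-only", "local", "disabled")


def resolve_swanlab_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper: read SwanLab settings with the defaults used in the diff."""
    env = os.environ if env is None else env
    mode = env.get("SWANLAB_MODE", "cloud")
    # Assumption: reject unknown modes early; the PR does not perform this check.
    if mode not in VALID_MODES:
        raise ValueError(f"Unknown SWANLAB_MODE {mode!r}; expected one of {VALID_MODES}")
    return {
        "api_key": env.get("SWANLAB_API_KEY"),            # None means no explicit login
        "logdir": env.get("SWANLAB_LOG_DIR", "swanlog"),  # default directory in the diff
        "mode": mode,
    }


print(resolve_swanlab_settings({}))
print(resolve_swanlab_settings({"SWANLAB_MODE": "local", "SWANLAB_LOG_DIR": "/tmp/swanlog"}))
```

Passing a plain dictionary instead of `os.environ` keeps the sketch easy to try without touching the real environment.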

---

##### **Step 3: View Training Logs**

After logging in, you will see a confirmation message:

<img width="1415" alt="track"
src="https://github.com/user-attachments/assets/87c4ff2f-c8c4-4e7a-a41e-21afa935cb56"
/>

- **Online Tracking**: View logs on the [SwanLab
website](https://swanlab.cn).

<img width="1900" alt="remote"
src="https://github.com/user-attachments/assets/5b44b9f3-948f-4f93-9873-572bce56daf7"
/>

For more details, refer to the [SwanLab Cloud
Documentation](https://docs.swanlab.cn/guide_cloud/experiment_track/view-result.html).

- **Offline Tracking**: Use the local dashboard to visualize logs:

  ```bash
  swanlab watch
  ```

For advanced configurations, such as setting a custom port, refer to the
[Offline Dashboard
Documentation](https://docs.swanlab.cn/guide_cloud/self_host/offline-board.html)
and [CLI
Documentation](https://docs.swanlab.cn/api/cli-swanlab-watch.html#%E8%AE%BE%E7%BD%AEip%E5%92%8C%E7%AB%AF%E5%8F%A3%E5%8F%B7).

---

#### **Impact**

- Provides a lightweight, flexible, and user-friendly experiment
tracking solution.
- Supports both online and offline use cases, making it suitable for
environments with restricted internet access.
- Enhances hardware monitoring capabilities for better resource
utilization.

---

This PR is ready for review. Feedback and suggestions are welcome!
ShaohonChen authored Feb 7, 2025
1 parent 3140cc2 commit 958a326
Showing 2 changed files with 20 additions and 2 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -45,7 +45,7 @@ verl is fast with:
 - Support model-based reward and function-based reward (verifiable reward)
 - flash-attention, [sequence packing](examples/ppo_trainer/run_qwen2-7b_seq_balance.sh), [long context](examples/ppo_trainer/run_deepseek7b_llm_sp2.sh) support via DeepSpeed Ulysses, [LoRA](examples/sft/gsm8k/run_qwen_05_peft.sh), [Liger-kernel](examples/sft/gsm8k/run_qwen_05_sp2_liger.sh)
 - scales up to 70B models and hundreds of GPUs
-- experiment tracking with wandb and mlflow
+- experiment tracking with wandb, swanlab and mlflow

 ## Upcoming Features
 - Reward model training
20 changes: 19 additions & 1 deletion verl/utils/tracking.py
@@ -22,7 +22,7 @@


 class Tracking(object):
-    supported_backend = ['wandb', 'mlflow', 'console']
+    supported_backend = ["wandb", "mlflow", "swanlab", "console"]

     def __init__(self, project_name, experiment_name, default_backend: Union[str, List[str]] = 'console', config=None):
         if isinstance(default_backend, str):
@@ -47,6 +47,22 @@ def __init__(self, project_name, experiment_name, default_backend: Union[str, Li
             mlflow.log_params(_compute_mlflow_params_from_objects(config))
             self.logger['mlflow'] = _MlflowLoggingAdapter()

+        if "swanlab" in default_backend:
+            import swanlab
+            import os
+
+            SWANLAB_API_KEY = os.environ.get("SWANLAB_API_KEY", None)
+            SWANLAB_LOG_DIR = os.environ.get("SWANLAB_LOG_DIR", "swanlog")
+            SWANLAB_MODE = os.environ.get("SWANLAB_MODE", "cloud")
+            if SWANLAB_API_KEY:
+                swanlab.login(SWANLAB_API_KEY)  # NOTE: previous login information will be overwritten
+            swanlab.init(project=project_name,
+                         experiment_name=experiment_name,
+                         config=config,
+                         logdir=SWANLAB_LOG_DIR,
+                         mode=SWANLAB_MODE)
+            self.logger["swanlab"] = swanlab
+
         if 'console' in default_backend:
             from verl.utils.logger.aggregate_logger import LocalLogger
             self.console_logger = LocalLogger(print_to_console=True)
@@ -60,6 +76,8 @@ def log(self, data, step, backend=None):
     def __del__(self):
         if 'wandb' in self.logger:
             self.logger['wandb'].finish(exit_code=0)
+        if 'swanlab' in self.logger:
+            self.logger['swanlab'].finish()


 class _MlflowLoggingAdapter:
