[TRACKING] feat: Integrate SwanLab for experiment tracking with online/offline mode and local dashboard support (#218)

---

### Pull Request Description

This PR introduces **SwanLab**, a lightweight open-source experiment tracking tool, as a new logging option for the training framework. The integration provides both online and offline tracking capabilities, along with a local dashboard for visualizing results. Below is a detailed overview of the changes and usage instructions:

---

#### **Key Features of SwanLab Integration**

1. **Online and Offline Tracking**:
   - **Online Mode**: Track experiments remotely and store data on SwanLab's cloud platform.
   - **Offline Mode**: Use a local dashboard to visualize training logs without an internet connection.
2. **Hardware Monitoring**:
   - Automatically tracks GPU usage, power consumption, temperature, and other hardware metrics.
   - Supports NVIDIA GPUs and Huawei Ascend NPUs.
3. **Remote Access**:
   - View training progress remotely via the SwanLab web interface or mobile app.
4. **Local Dashboard**:
   - Includes an open-source local dashboard for offline visualization of training logs.

---

#### **Usage Instructions**

##### **Step 1: Set Up Online Tracking (Optional)**

To use SwanLab's online tracking, log in to the [SwanLab website](https://swanlab.cn) and obtain your API key from the [Settings page](https://swanlab.cn/space/~/settings). Then, authenticate using the following command:

```bash
swanlab login
```

If you prefer offline mode, skip this step.

---

##### **Step 2: Configure SwanLab as the Logger**

To enable SwanLab as the experiment tracker, add `trainer.logger=['swanlab']` to your training command.
For example, using the [Post-train a LLM using PPO with GSM8K dataset](https://verl.readthedocs.io/en/latest/start/quickstart.html) workflow:

```bash
PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
 data.train_files=$HOME/data/gsm8k/train.parquet \
 data.val_files=$HOME/data/gsm8k/test.parquet \
 data.train_batch_size=256 \
 data.val_batch_size=1312 \
 data.max_prompt_length=512 \
 data.max_response_length=256 \
 actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
 actor_rollout_ref.actor.optim.lr=1e-6 \
 actor_rollout_ref.actor.ppo_mini_batch_size=64 \
 actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
 actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
 actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
 actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
 actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
 critic.optim.lr=1e-5 \
 critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
 critic.ppo_micro_batch_size_per_gpu=4 \
 algorithm.kl_ctrl.kl_coef=0.001 \
 trainer.logger=['console','swanlab'] \
 +trainer.val_before_train=False \
 trainer.default_hdfs_dir=null \
 trainer.n_gpus_per_node=1 \
 trainer.nnodes=1 \
 trainer.save_freq=10 \
 trainer.test_freq=10 \
 trainer.total_epochs=15 2>&1 | tee verl_demo.log
```

If you are not logged in, you will be prompted to choose a tracking mode:

1. **Cloud Mode**: Upload logs to SwanLab's cloud platform.
2. **Cloud-Only Mode**: Upload logs to the cloud but do not save them locally.
3. **Local Mode**: Save logs locally for offline tracking.

<img width="1325" alt="select" src="https://github.com/user-attachments/assets/5c55fc45-79a9-4673-ae4e-ea9d0623dd29" />

Alternatively, you can configure SwanLab using environment variables:

```bash
export SWANLAB_API_KEY=<your_api_key>   # Set API key for online tracking
export SWANLAB_LOG_DIR=<local_log_path> # Set local log directory
export SWANLAB_MODE=<mode>              # Set tracking mode: cloud (default), cloud-only, local, or disabled
```

---

##### **Step 3: View Training Logs**

After logging in, you will see a confirmation message:

<img width="1415" alt="track" src="https://github.com/user-attachments/assets/87c4ff2f-c8c4-4e7a-a41e-21afa935cb56" />

- **Online Tracking**: View logs on the [SwanLab website](https://swanlab.cn).

  <img width="1900" alt="remote" src="https://github.com/user-attachments/assets/5b44b9f3-948f-4f93-9873-572bce56daf7" />

  For more details, refer to the [SwanLab Cloud Documentation](https://docs.swanlab.cn/guide_cloud/experiment_track/view-result.html).

- **Offline Tracking**: Use the local dashboard to visualize logs:

  ```bash
  swanlab watch
  ```

  For advanced configurations, such as setting a custom port, refer to the [Offline Dashboard Documentation](https://docs.swanlab.cn/guide_cloud/self_host/offline-board.html) and [CLI Documentation](https://docs.swanlab.cn/api/cli-swanlab-watch.html#%E8%AE%BE%E7%BD%AEip%E5%92%8C%E7%AB%AF%E5%8F%A3%E5%8F%B7).

---

#### **Impact**

- Provides a lightweight, flexible, and user-friendly experiment tracking solution.
- Supports both online and offline use cases, making it suitable for environments with restricted internet access.
- Enhances hardware monitoring capabilities for better resource utilization.

---

This PR is ready for review. Feedback and suggestions are welcome!
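
For reference, the environment-variable handling described in Step 2 can be sketched in isolation. This is only an illustration: the helper name `resolve_swanlab_settings` is hypothetical and is not part of verl or SwanLab. Only the three variable names and the defaults (`swanlog` for the log directory, `cloud` for the mode) are taken from the diff to `verl/utils/tracking.py`.

```python
import os
from typing import Dict, Mapping, Optional


def resolve_swanlab_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, Optional[str]]:
    """Hypothetical helper mirroring the defaults used in the diff.

    Passing an explicit mapping instead of reading os.environ directly
    makes the resolution logic easy to check without touching the
    real process environment.
    """
    if env is None:
        env = os.environ
    return {
        # No default: an unset key means "do not call swanlab.login".
        "api_key": env.get("SWANLAB_API_KEY"),
        # Defaults below are the ones that appear in the diff.
        "logdir": env.get("SWANLAB_LOG_DIR", "swanlog"),
        "mode": env.get("SWANLAB_MODE", "cloud"),
    }


if __name__ == "__main__":
    # With nothing set, the defaults from the diff apply.
    print(resolve_swanlab_settings({}))
    # Overriding only the mode leaves the other defaults in place.
    print(resolve_swanlab_settings({"SWANLAB_MODE": "local"}))
```

In the actual change, these values are passed to `swanlab.login` (only when an API key is set) and to `swanlab.init` as `logdir` and `mode`.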
volcengine · Feb 7, 2025 · 958a326
1 parent 3140cc2
Showing 2 changed files with 20 additions and 2 deletions.
diff --git a/README.md b/README.md
@@ -45,7 +45,7 @@ verl is fast with:
   - Support model-based reward and function-based reward (verifiable reward)
 - flash-attention, [sequence packing](examples/ppo_trainer/run_qwen2-7b_seq_balance.sh), [long context](examples/ppo_trainer/run_deepseek7b_llm_sp2.sh) support via DeepSpeed Ulysses, [LoRA](examples/sft/gsm8k/run_qwen_05_peft.sh), [Liger-kernel](examples/sft/gsm8k/run_qwen_05_sp2_liger.sh)
 - scales up to 70B models and hundreds of GPUs
-- experiment tracking with wandb and mlflow
+- experiment tracking with wandb, swanlab and mlflow
 
 ## Upcoming Features
 - Reward model training

diff --git a/verl/utils/tracking.py b/verl/utils/tracking.py
@@ -22,7 +22,7 @@
 
 
 class Tracking(object):
-    supported_backend = ['wandb', 'mlflow', 'console']
+    supported_backend = ["wandb", "mlflow", "swanlab", "console"]
 
     def __init__(self, project_name, experiment_name, default_backend: Union[str, List[str]] = 'console', config=None):
         if isinstance(default_backend, str):
@@ -47,6 +47,22 @@ def __init__(self, project_name, experiment_name, default_backend: Union[str, Li
             mlflow.log_params(_compute_mlflow_params_from_objects(config))
             self.logger['mlflow'] = _MlflowLoggingAdapter()
 
+        if "swanlab" in default_backend:
+            import swanlab
+            import os
+
+            SWANLAB_API_KEY = os.environ.get("SWANLAB_API_KEY", None)
+            SWANLAB_LOG_DIR = os.environ.get("SWANLAB_LOG_DIR", "swanlog")
+            SWANLAB_MODE = os.environ.get("SWANLAB_MODE", "cloud")
+            if SWANLAB_API_KEY:
+                swanlab.login(SWANLAB_API_KEY)  # NOTE: previous login information will be overwritten
+            swanlab.init(project=project_name,
+                         experiment_name=experiment_name,
+                         config=config,
+                         logdir=SWANLAB_LOG_DIR,
+                         mode=SWANLAB_MODE)
+            self.logger["swanlab"] = swanlab
+
         if 'console' in default_backend:
             from verl.utils.logger.aggregate_logger import LocalLogger
             self.console_logger = LocalLogger(print_to_console=True)
@@ -60,6 +76,8 @@ def log(self, data, step, backend=None):
     def __del__(self):
         if 'wandb' in self.logger:
             self.logger['wandb'].finish(exit_code=0)
+        if 'swanlab' in self.logger:
+            self.logger['swanlab'].finish()
 
 
 class _MlflowLoggingAdapter:
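
The diff stores the `swanlab` module itself in `self.logger` and later calls `finish()` on it, so the tracker appears to rely on duck typing: any object that exposes `log` and, optionally, `finish` can act as a backend. The sketch below illustrates that pattern with hypothetical stand-ins (`MiniTracker` and `RecordingBackend` are not verl or SwanLab classes). It assumes that `log` forwards `data` and `step` to each backend as keyword arguments, which the diff does not show.

```python
from typing import Any, Dict, List, Optional, Tuple


class RecordingBackend:
    """Hypothetical stand-in for a backend such as the swanlab module."""

    def __init__(self) -> None:
        self.records: List[Tuple[int, Dict[str, Any]]] = []
        self.finished = False

    def log(self, data: Dict[str, Any], step: int) -> None:
        # Store a copy so later mutation of `data` does not alter history.
        self.records.append((step, dict(data)))

    def finish(self) -> None:
        self.finished = True


class MiniTracker:
    """Hypothetical, simplified dispatcher over named backends."""

    def __init__(self, backends: Dict[str, Any]) -> None:
        self.logger = backends

    def log(self, data: Dict[str, Any], step: int,
            backend: Optional[List[str]] = None) -> None:
        # Send to every backend, or only to those named in `backend`.
        for name, instance in self.logger.items():
            if backend is None or name in backend:
                instance.log(data=data, step=step)

    def close(self) -> None:
        # Call finish() only on backends that provide it.
        for instance in self.logger.values():
            finish = getattr(instance, "finish", None)
            if callable(finish):
                finish()


if __name__ == "__main__":
    swan, console = RecordingBackend(), RecordingBackend()
    tracker = MiniTracker({"swanlab": swan, "console": console})
    tracker.log({"loss": 0.5}, step=1)
    tracker.log({"loss": 0.4}, step=2, backend=["swanlab"])
    tracker.close()
    print(swan.records, console.records)
```

The sketch uses an explicit `close()` rather than `__del__` as a design choice: Python does not guarantee when, or whether, `__del__` runs at interpreter shutdown, so an explicit call makes the point at which runs are finalized predictable.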