[TRACKING] feat: Integrate SwanLab for experiment tracking with online/offline mode and local dashboard support (#218)

---

### Pull Request Description

This PR introduces **SwanLab**, a lightweight open-source experiment tracking tool, as a new logging option for the training framework. The integration provides both online and offline tracking capabilities, along with a local dashboard for visualizing results. Below is a detailed overview of the changes and usage instructions:

---

#### **Key Features of SwanLab Integration**

1. **Online and Offline Tracking**:
   - **Online Mode**: Track experiments remotely and store data on SwanLab's cloud platform.
   - **Offline Mode**: Use a local dashboard to visualize training logs without an internet connection.

2. **Hardware Monitoring**:
   - Automatically tracks GPU usage, power consumption, temperature, and other hardware metrics.
   - Supports NVIDIA GPUs and Huawei Ascend NPUs.

3. **Remote Access**:
   - View training progress remotely via the SwanLab web interface or mobile app.

4. **Local Dashboard**:
   - Includes an open-source local dashboard for offline visualization of training logs.

---

#### **Usage Instructions**

##### **Step 1: Set Up Online Tracking (Optional)**

To use SwanLab's online tracking, log in to the [SwanLab website](https://swanlab.cn) and obtain your API key from the [Settings page](https://swanlab.cn/space/~/settings). Then, authenticate using the following command:

```bash
swanlab login
```

If you prefer offline mode, skip this step.

---

##### **Step 2: Configure SwanLab as the Logger**

To enable SwanLab as the experiment tracker, add `trainer.logger=['swanlab']` to your training command.
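
A note on shell quoting for this override, offered as a minimal sketch rather than as part of the PR: in bash the unquoted form generally works, because the shell strips the inner single quotes and, when no file matches the bracket pattern, passes the word through unchanged. Shells such as zsh reject an unmatched `[...]` pattern by default. Wrapping the whole override in double quotes sidesteps both behaviors. The variable names below are illustrative only.

```bash
# Illustrative only: compare what the training script would receive as its argument.

# Unquoted: the shell removes the inner single quotes and treats [...] as a
# glob pattern (left as-is in bash when no file in the directory matches it).
unquoted=$(printf '%s' trainer.logger=['swanlab'])

# Double-quoted: the override reaches the program exactly as written.
quoted=$(printf '%s' "trainer.logger=['swanlab']")

printf 'unquoted: %s\n' "$unquoted"
printf 'quoted:   %s\n' "$quoted"
```

Hydra-style override parsing, which the training command appears to rely on, generally reads both `[swanlab]` and `['swanlab']` as a one-element list, so either form should select the same logger. The quoted form is simply more portable across shells.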

For example, using the [Post-train a LLM using PPO with GSM8K dataset](https://verl.readthedocs.io/en/latest/start/quickstart.html) workflow:

```bash
PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
 data.train_files=$HOME/data/gsm8k/train.parquet \
 data.val_files=$HOME/data/gsm8k/test.parquet \
 data.train_batch_size=256 \
 data.val_batch_size=1312 \
 data.max_prompt_length=512 \
 data.max_response_length=256 \
 actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
 actor_rollout_ref.actor.optim.lr=1e-6 \
 actor_rollout_ref.actor.ppo_mini_batch_size=64 \
 actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
 actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
 actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
 actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
 actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
 critic.optim.lr=1e-5 \
 critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
 critic.ppo_micro_batch_size_per_gpu=4 \
 algorithm.kl_ctrl.kl_coef=0.001 \
 trainer.logger=['console','swanlab'] \
 +trainer.val_before_train=False \
 trainer.default_hdfs_dir=null \
 trainer.n_gpus_per_node=1 \
 trainer.nnodes=1 \
 trainer.save_freq=10 \
 trainer.test_freq=10 \
 trainer.total_epochs=15 2>&1 | tee verl_demo.log
```

If you are not logged in, you will be prompted to choose a tracking mode:

1. **Cloud Mode**: Upload logs to SwanLab's cloud platform.
2. **Cloud-Only Mode**: Upload logs to the cloud but do not save them locally.
3. **Local Mode**: Save logs locally for offline tracking.

<img width="1325" alt="select" src="https://github.com/user-attachments/assets/5c55fc45-79a9-4673-ae4e-ea9d0623dd29" />

Alternatively, you can configure SwanLab using environment variables:

```bash
export SWANLAB_API_KEY=<your_api_key>   # Set API key for online tracking
export SWANLAB_LOG_DIR=<local_log_path> # Set local log directory
export SWANLAB_MODE=<mode>              # Set tracking mode: cloud (default), cloud-only, local, or disabled
```

---

##### **Step 3: View Training Logs**

After logging in, you will see a confirmation message:

<img width="1415" alt="track" src="https://github.com/user-attachments/assets/87c4ff2f-c8c4-4e7a-a41e-21afa935cb56" />

- **Online Tracking**: View logs on the [SwanLab website](https://swanlab.cn).

  <img width="1900" alt="remote" src="https://github.com/user-attachments/assets/5b44b9f3-948f-4f93-9873-572bce56daf7" />

  For more details, refer to the [SwanLab Cloud Documentation](https://docs.swanlab.cn/guide_cloud/experiment_track/view-result.html).

- **Offline Tracking**: Use the local dashboard to visualize logs:

  ```bash
  swanlab watch
  ```

  For advanced configurations, such as setting a custom port, refer to the [Offline Dashboard Documentation](https://docs.swanlab.cn/guide_cloud/self_host/offline-board.html) and [CLI Documentation](https://docs.swanlab.cn/api/cli-swanlab-watch.html#%E8%AE%BE%E7%BD%AEip%E5%92%8C%E7%AB%AF%E5%8F%A3%E5%8F%B7).

---

#### **Impact**

- Provides a lightweight, flexible, and user-friendly experiment tracking solution.
- Supports both online and offline use cases, making it suitable for environments with restricted internet access.
- Enhances hardware monitoring capabilities for better resource utilization.

---

This PR is ready for review. Feedback and suggestions are welcome!