Contributor: Gengyuan Zhang
Contact: [email protected]
I created this deep learning project template to start new projects quickly, with convenient logging and visualization supported by MLflow and TensorBoard.
It includes:
- Training, testing, and resuming a task
- Saving all checkpoints and artifacts to a local directory, including the git commit version, a copy of the config file, metrics, etc.
- Generating a new experiment folder each time a script is submitted
- Using Distributed Data Parallel to realize one-node multi-GPU training
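
The experiment-folder and artifact-saving features above can be pictured with a small standard-library sketch. It is not the template's actual implementation: the function names `current_git_commit` and `create_experiment_dir`, the `meta.json` file, and the timestamped folder naming are all assumptions made for illustration.

```python
import json
import shutil
import subprocess
from datetime import datetime
from pathlib import Path


def current_git_commit(repo_dir="."):
    """Return the current commit hash, or "unknown" if git is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()
    except (subprocess.CalledProcessError, OSError):
        # Not a git repository, or git is not installed.
        return "unknown"


def create_experiment_dir(root, config_path, name="exp"):
    """Create a fresh experiment folder and snapshot the run context in it.

    Hypothetical helper: the folder layout and file names are illustrative.
    """
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S-%f")
    exp_dir = Path(root) / f"{name}-{stamp}"
    # exist_ok=False guarantees that an earlier run is never overwritten.
    exp_dir.mkdir(parents=True, exist_ok=False)

    # Keep a copy of the config file next to the run's outputs.
    shutil.copy2(config_path, exp_dir / Path(config_path).name)

    # Record the code version so the run can be reproduced later.
    meta = {"git_commit": current_git_commit(), "created": stamp}
    (exp_dir / "meta.json").write_text(json.dumps(meta, indent=2))
    return exp_dir
```

Calling `create_experiment_dir("experiments", "config.yaml")` at the start of each script would give every submission its own folder, with the config copy and commit hash stored alongside any checkpoints written there later.
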
To use the template:
- Define your model so that it inherits from the ConfigModel class
- Define your trainer in main.py
- Start the MLflow server:
    mlflow ui -h 0.0.0.0 -p 5055
- Start the TensorBoard server:
    tensorboard --host 0.0.0.0 --logdir mlruns/{run_id}
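
If you prefer to start both servers from Python, the two commands above can be assembled as argument lists. This is only a sketch: `server_commands` is a hypothetical helper that is not part of the template, and `run_id` must be filled in with the ID of the run you want to inspect.

```python
import shlex


def server_commands(run_id, mlflow_port=5055, host="0.0.0.0"):
    """Build the MLflow and TensorBoard commands shown above as argument lists.

    Hypothetical helper for illustration; it only builds the commands
    and does not start any process.
    """
    mlflow_cmd = ["mlflow", "ui", "-h", host, "-p", str(mlflow_port)]
    tensorboard_cmd = [
        "tensorboard", "--host", host, "--logdir", f"mlruns/{run_id}",
    ]
    return mlflow_cmd, tensorboard_cmd
```

Each list can be passed to `subprocess.Popen` to launch the server in the background, or to `shlex.join` to print the exact command for copying into a terminal.
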