diff --git a/README.ch.md b/README.ch.md
index afbb7798..a9b4e24f 100644
--- a/README.ch.md
+++ b/README.ch.md
@@ -2,184 +2,270 @@
[![PyPI version](https://badge.fury.io/py/lumo.svg)](https://badge.fury.io/py/lumo)
![Python-Test](https://github.com/pytorch-lumo/lumo/actions/workflows/python-test.yml/badge.svg)
-[![license](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://github.com/pytorch-lumo/lumo/blob/master/LICENSE)
+[![license](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://github.com/pytorch-lumo/lumo/blob/master/LICENSE)
+![Python-doc](./images/docstr_coverage_badge.svg)
+
+`lumo` is a streamlined and efficient library that simplifies the management of every component an experiment needs, with a particular focus on improving the experience of deep learning practitioners.
+
+- Experiment management: assign a unique path to each run, distinguish and store different file types; manage code snapshots through git; record everything produced during an experiment to keep it traceable and reproducible.
+- Parameter management: more convenient parameter management than argparse, built on fire.
+- Runtime configuration: configuration management across multi-level scopes.
+- Visualization: an interactive Jupyter dashboard for experiment management based on [Panel](https://panel.holoviz.org/index.html).
+- Extra conveniences for deep learning:
+  - Training: freely extensible training logic built on Trainer, with a complete callback mechanism.
+  - Optimizer: integrated parameter and optimizer construction.
+  - Data: abstraction of the dataset construction process, combination of multiple DataLoaders, etc.
+  - Distributed training: supports multiple acceleration frameworks behind a unified abstraction, so switching between them is easy.
+- More utilities...
+
+![lumo-framework](./images/lumo-intro.png)
+
+# :book: Table of Contents
+
+- :book: [Table of Contents](#book-table-of-contents)
+- :cloud: [Installation](#cloud-installation)
+- :book: [Quick Start](#book-quick-start)
+  - :small_orange_diamond: [Embedding into Existing Projects](#small_orange_diamond-embedding-into-existing-projects)
+  - :small_orange_diamond: [Building from Scratch](#small_orange_diamond-building-from-scratch)
+  - :small_orange_diamond: [Visual Interface](#small_orange_diamond-visual-interface)
+  - :small_orange_diamond: [Re-running Experiments](#small_orange_diamond-re-running-experiments)
+  - :small_orange_diamond: [Backup](#small_orange_diamond-backup)
+- [Full properties](#full-properties)
+- :pencil: [Acknowledgement](#pencil-acknowledgement)
+- :scroll: [License](#scroll-license)
+
+# :cloud: Installation
+
+Install the latest published version that has passed all tests:
+```bash
+pip install -U lumo
+```
-`lumo`:轻量、可扩展、功能解耦合的 Pytorch 实验框架。
-
-lumo 的设计理念:
-
-- 模块解耦合:所有模块可以单独作为您现在使用的框架中的一个插件使用(而不像其他框架几乎耦合在一起)
-- 恰到好处的抽象:和模型相关的细节完全由使用者掌控,lumo 只封装了外部通用逻辑(而不像其他一些框架会代理模型初始化或损失迭代)
-- 覆盖整个生命周期:数据集构建、模型初始化、随机种子、训练/测试...,lumo 为所有步骤提供了功能包或流程简化
-- 极强的可扩展性:从单文件到包含多个领域多个方法的项目,lumo 都可以提供舒适的使用体验。已在两个领域有复现项目的最佳实践示例(见[Related Work](#Related Work))。
-
-# 如何使用
-
-## 安装
-
-从 pypi 或 github 主页安装最新的稳定版本:
+Or install the latest version from the dev1 branch:
```bash
-pip install -U lumo
pip install git+https://github.com/pytorch-lumo/lumo
```
+The experiment dashboard depends on Panel, which needs to be installed separately:
-## 快速开始
+```
+pip install panel
+```
-本节包含 lumo 最常用的几个子功能,帮助使用者快速利用这些功能减少已有项目中的冗余代码,这些功能包括:
+# :book: Quick Start
-- 一些常用功能的更优替代,如 [Params](#参数控制)(arguments 的平替),[Logger](#变量&日志记录)(logging 的平替)
-- 一些训练过程中部份流程的优化,如 [Experiment](#路径管理&版本控制)(提供无重复的实验路径管理、基于 git 的版本控制),[DatasetBuilder](#数据集构建)(更快构建数据集),
+Here are a few classic scenarios:
-### 参数控制
+## :small_orange_diamond: Embedding into Existing Projects
-`argparse` 的更优替代。`Params` 底层依托于 [omegaconf](https://github.com/omry/omegaconf) 和 [fire](https://github.com/google/python-fire)
-,只需要简单的配置,就可以从文件、命令行中读取参数。
+For an existing project, lumo can be embedded quickly as follows:
-直接基于 Params 类定义参数:
+- Import:
```python
-# python main.py --epoch=30 --dataset=100
-from lumo import Params
-
-params = Params()
-params.epoch = 20
-# 集成优化器参数,自带补全提示
-params.optim = params.OPTIM.create_optim('Adam', lr=0.0001, weight_decay=4e-5)
-# 数据集只能从 cifar10/cifar100 中选择,且默认为 cifar10,其他的选择会报错
-params.dataset = params.choice('cifar10', 'cifar100')
+import random
+from lumo import SimpleExperiment, Params, Logger, Meter, Record
+```
-# 从命令行参数中更新
-params.from_args()
-print(params.epoch) # -> 30
-print(params.dataset) # -> cifar100
+- Initialize the Logger and Experiment:
-# 保存到文件
-params.to_json('./config.json')
-params.to_yaml('./config.yaml')
-# 从文件中更新
-params.from_json('./config.json')
-params.from_yaml('./config.yaml')
+```python
+logger = Logger()
+# define and use directly, no conversion needed
+exp = SimpleExperiment(exp_name='my_exp_a')  # manually give each kind of experiment a unique name
+exp.start()
+logger.add_log_dir(exp.mk_ipath())
```
-也可以通过继承、多重继承来嵌套,组合参数。即使在命令行中输入了不存在的参数,Params 也会正常读取。
-
-### 变量&日志记录
+- Initialize parameters (replacing argparse and similar tools):
-`logging` 的更优替代。通过 Meter、Record 和 Logger,可以实现变量的记录和格式化输出。其中:
+```python
+# replaces parameter definitions based on argparse and similar tools
+params = Params()
+params.dataset = params.choice('cifar10', 'cifar100')
+params.alpha = params.arange(default=1, left=0, right=10)
+params.from_args() # python3 train.py --dataset=cifar100 --alpha=0.2
+print(params.to_dict()) # {"dataset": "cifar100", "alpha": 0.2}
+```
-- Meter 记录单次的值
-- Record 以一定规则归约 Meter 实例(如 mean、sum 等)
-- Logger 用于代替 logging,除常用的 info、warn 等方法外,还提供了 inline 方法,可以在屏幕能单行更新(实际中,屏幕打印时间远小于训练时间,因此单行更新带来的时间开销可以忽略不计)。
+- Record parameters and store information during training (replacing manual path management and hand-rolled metric averaging):
```python
-import random
-import time
+# record the experiment parameters
+exp.dump_info('params', params.to_dict())
+print(exp.test_name)  # a unique name is assigned automatically for each run
+
+# a unique path for this run, provided under the experiment's namespace
+# metadata and large binary files are stored separately, which keeps cleanup easy
+params.to_yaml(exp.mk_ipath('params.yaml'))
-from lumo import Record, Meter, Logger
+for i in range(10):
+    # record experiment metrics
+ max_acc = exp.dump_metric('Acc', random.random(), cmp='max')
+ logger.info(f'Max acc {max_acc}')
-log = Logger()
+    # store large/binary files (e.g., model weights)
+ ckpt_fn = exp.mk_bpath('checkpoints', f'model_{i}.ckpt')
+    ...  # save the checkpoint to ckpt_fn
record = Record()
-for idx in range(256):
- meter = Meter()
- meter.last.i = idx
- meter.sum.acc = idx
- meter.mean.loss = random.random()
+for batch in range(10):
+ m = Meter()
+ m.mean.Lall = random.random()
+ m.last.lr = batch
+ record.record(m)
+ logger.info(record)
+
+# End the experiment explicitly and record extra metadata. A hook can also end it automatically when the process exits; exceptions are recorded as well.
+exp.end()
+```
- record.record(meter)
- log.inline(record) # 单行更新
- time.sleep(0.5)
- if idx % 50 == 0:
- log.newline()
- record.clear()
+## :small_orange_diamond: Building from Scratch
-log.info(record)
-```
+If you are starting a deep learning experiment from scratch, lumo can speed up building the code end to end. Below are several examples of training with lumo at different scales:
-### 路径管理&版本控制
+Single file:
-`Experiment` 主要提供路径管理,可以为每一次试验根据实验名、日期、次数等自动提供不一样的保存路径。此外,Experiment 还可以通过 hook
-提供如代码版本管理、元数据记录等功能。在实验中,可以使用其子类 `SimpleExperiment` 实现大部分需求。
+| Example                                      | CoLab | Lines of Code |
+|----------------------------------------------|-------|---------------|
+| [MNIST example](./examples/mnist.py)         |       | 118           |
+| [MocoV2 trains CIFAR10](./examples/moco.py)  |       | 284           |
+| [Multi-GPU training on ImageNet]()           |       |               |
-```python
-from lumo import SimpleExperiment
-from lumo import Params
+Experiment projects:
-pm = Params()
-pm.module = 'example'
-pm.from_args()
+| Project                                                                                                     | Description                                                                                                      |
+|-------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|
+| [image-classification](https://github.com/pytorch-lumo/image-classification)                               | Reproductions of multiple papers on fully supervised, semi-supervised, and self-supervised image classification   |
+| [emotion-recognition-in-conversation](https://github.com/pytorch-lumo/emotion-recognition-in-conversation) | Reproductions of multiple papers on conversational emotion recognition, including multimodal settings             |
-# 注册该次实验,实验名为 `pm.module`
-exp = SimpleExperiment(pm.module)
-# 实验开始,该方法会调用已注册的 ExpHooks,完成代码版本控制等功能。
-exp.start()
+## :small_orange_diamond: Visual Interface
-# 小数据通过 `.test_file()` 获得路径
-fn = exp.test_file('params.json')
-pm.to_json(fn)
+In Jupyter:
-# 大文件通过 `.blob_file()` 获得路径(这是约定,而不是强制,大文件也可以保存到 `.test_file()` 中)
-fn = exp.blob_file('checkpoint.pt')
-with open(fn, 'w') as w:
- w.write('write big data in blob file')
+```python
+from lumo import Watcher
-print(exp.test_root)
-print(exp.get_prop('git')) # see git commit history
-exp.end()
+w = Watcher()
+df = w.load()
+widget = w.panel(df)
+widget.servable()
```
-### 数据集构建
+![Panel](./images/panel-example.png)
-![DatasetBuilder](./images/DatasetBuilder.png)
+Visualize manually filtered experiments:
+![Panel](./images/panel-example2.png)
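+
+`Watcher.load()` returns a pandas `DataFrame`, so the filtering itself is ordinary pandas; a small sketch (`exp_name` is one of the recorded properties listed under Full properties below):
+
+```python
+from lumo import Watcher
+
+w = Watcher()
+df = w.load()
+picked = df[df['exp_name'] == 'my_exp_a']  # ordinary pandas filtering on any recorded property
+w.panel(picked).servable()
+```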
-`DatasetBuilder` 是采用有向无环图思路设计的数据集构建类,该类提供了一个恰当的抽象逻辑,避免了在一个实验里定义多个重复 Datasets 类。
+You can also open the dashboard from the command line to view all current experiments:
-`DatasetBuilder `将数据集的构件划分为输入-输出两阶段,同时提供 `.chain()`(序列格式)和`.zip()`(字典格式) 两种输出方式。
+```
+lumo board [--port, --address, --open]
+```
-```python
-from lumo import DatasetBuilder
-from torchvision.transforms import transforms
-import torch
-
-# Create a mnist-like dummy dataset
-db = (
- DatasetBuilder()
- .add_input("xs", torch.rand(500, 28, 28))
- .add_input("ys", torch.randint(0, 10, (500,)))
- .add_idx('id')
- .add_output("xs", "xs1", transforms.RandomHorizontalFlip())
- .add_output("xs", "xs2", )
- .add_output("ys", "ys")
-)
-# Watch dataset structure
-print(db)
-# Builder(flow={'::idx::': ['id'], 'xs': ['xs1', 'xs2'], 'ys': ['ys']}, sized=True, size=500, iterable=True)
+## :small_orange_diamond: Re-running Experiments
-print(db[0])
-# dict_keys(['id', 'xs1', 'xs2', 'ys'])
+An experiment that failed for whatever reason can, once the problem has been inspected and resolved through the dashboard, be re-run directly by its unique experiment id (test_name), with key parameters reassigned:
+
+```
+lumo rerun 230313.030.57t --device=0
```
-# 更多教程
+## :small_orange_diamond: Backup
-# Related Work
+Back up experiment information to a GitHub issue (based on PyGitHub):
-- [image-classification](https://github.com/pytorch-lumo/image-classification): supervised/semi-supervised/self-supervised/noisy label learning on image-classfication
- field. (suporrted datasets: CIFAR10/CIFAR100/STL-10/SVHN/ImageNet/tinyimagenet)
-- [emotion-recognition-in-conversation](https://github.com/pytorch-lumo/emotion-recognition-in-conversation):Multimodel emotional recognition on conversation. (suporrted datasets: IEMOCAP/MELD/MOSEI)
+```python
+from lumo import Experiment, Watcher
+from lumo import glob
+
+glob['github_access_token'] = 'ghp_*'  # default value for `access_token`; it is recommended to keep the token in the global config `~/.lumorc.json`
+
+w = Watcher()
+df = w.load()
+
+# select a single experiment to back up
+exp = Experiment.from_cache(df.iloc[0].to_dict())
+issue = exp.backup('github', repo='pytorch-lumo/image-classification-private',
+ access_token='ghp_*',
+                   update=True,  # if already backed up, update the previous issue
+                   labels=None,  # optional labels
+ )
+print(issue.number)
+
+# back up in batch, labelling each issue with the experiment's parameters
+issues = df.apply(
+ lambda x: Experiment.from_cache(x.to_dict()).backup(..., labels=[x['params'].get('dataset', '')]),
+ axis=1
+)
+```
+![backup_github](./images/backup_github.png)
-# Acknowledge
+# Full properties
- 一个人维护一个库四年,背后的动力是我持续不断的使用,感谢 lumo 陪我见证我的学术生涯。lumo 确实不一定适合所有人的习惯,但一定最适合我自己。lumo 取自 lumos,这是哈利波特里魔法杖发光的咒语。torch 是火炬,ignite 是点燃,所以 lumo 也向往着发光发热,希望 lumo 给大家带来美好的使用体验。
+```
+{'agent': nan,
+ 'backup': {'23-03-17-003438': {'backend': 'github',
+ 'number': 9,
+ 'repo': '...'},
+ },
+ 'exception': nan,
+ 'execute': {'cwd': '~/Documents/Python/lumo',
+ 'exec_argv': ['~/Documents/Python/lumo/a.py'],
+ 'exec_bin': '~/.pyenv/versions/3.9.16/bin/python3.9',
+ 'exec_file': '~/Documents/Python/lumo/a.py',
+ 'repo': '~/Documents/Python/lumo'},
+ 'exp_name': 'my_exp_a',
+ 'git': {'commit': '1014b6b5',
+ 'dep_hash': 'c93b8c4e340882f55cf0c8e125fa0203',
+ 'repo': '~/Documents/Python/lumo'},
+ 'hooks': {'Diary': {'loaded': True, 'msg': ''},
+ 'FinalReport': {'loaded': True, 'msg': ''},
+ 'GitCommit': {'loaded': True, 'msg': ''},
+ 'LastCmd': {'loaded': True, 'msg': ''},
+ 'LockFile': {'loaded': True, 'msg': ''},
+ 'RecordAbort': {'loaded': True, 'msg': ''}},
+ 'lock': {'accelerate': '0.16.0',
+ 'decorator': '5.1.1',
+ 'fire': '0.5.0',
+ 'hydra': '1.3.2',
+ 'joblib': '1.2.0',
+ 'lumo': '0.15.0',
+ 'numpy': '1.24.2',
+ 'omegaconf': '2.3.0',
+ 'psutil': '5.9.4',
+ 'torch': '1.13.1'},
+ 'note': 'This is a Note',
+ 'params': {'alpha': 1, 'dataset': 'cifar10'},
+ 'paths': {'blob_root': '~/.lumo/blob',
+ 'cache_root': '~/.lumo/cache',
+ 'info_root': '~/.lumo/experiments'},
+ 'pinfo': {'hash': '0af4b77497c85bc5b65ccbdd9ff4ca0f',
+ 'obj': {'argv': ['~/.pyenv/versions/3.9.16/bin/python3.9',
+ '~/Documents/Python/lumo/a.py'],
+ 'pid': 63975,
+ 'pname': 'python3.9',
+ 'pstart': 1678898740.099484},
+ 'pid': 63975},
+ 'progress': {'end': '23-03-16-004542',
+ 'end_code': 0,
+ 'last_edit_time': '23-03-16-004542',
+ 'ratio': 1,
+ 'start': '23-03-16-004542',
+ 'update_from': None},
+ 'tags': [],
+ 'test_name': '230316.000.8ct',
+ 'trainer': nan}
+```
-# License
+# :pencil: Acknowledgement
-Distributed under the GNU General Public License 3.0. See [LICENSE](./LICENSE) for more information.
+Maintained continuously since 2020. Thanks to lumo for accompanying me through my graduate studies.
-# Contact
+# :scroll: License
- - [sailist@outlook.com](mailto:sailist@outlook.com)
+Distributed under the [Apache License Version 2.0](./LICENSE).
diff --git a/README.md b/README.md
index 8d3e8517..78ad25c3 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,5 @@
+[中文](./README.ch.md)
+
# lumo
[![PyPI version](https://badge.fury.io/py/lumo.svg)](https://badge.fury.io/py/lumo)
@@ -5,162 +7,256 @@
[![license](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://github.com/pytorch-lumo/lumo/blob/master/LICENSE)
![Python-doc](./images/docstr_coverage_badge.svg)
-`lumo` is a light-weight library to help construct your experiment code, record your experiment results, especially in the field of deep learning.
-
-
-## Features
-
-`lumo` is designed for reducing difficulty of the frequent code modification in experiments and simplify the redundant code.
-
-At present, `lumo` has these features:
-
- - Simplest code for **Hyperparameter Configuration**、**Dataset Building**、**Module Checkpoint**、**Meter and Log**.
- - Include Git support and random seed management. You can **reset** and **archive** and **reimplement your experiments** by using simple console command.
- - Include a **deep learning experiment code templete**. You can add any experiments with linearly increasing code complexity by using it.
- - The framework follows the design paradigm of **convention over configuration**, the more you follow the convention, the more the framework will do for you.
-
-> Better use Pycharm.
-
-See [document](https://sailist.github.io/lumo/) for details.
-
+`lumo` is a streamlined and efficient library that simplifies the management of all components required for experiments
+and focuses on enhancing the experience of deep learning practitioners.
+
+- **Experimental Management**: **Assign unique id and path** for each run, distinguish and store various file types;
+ manage **code snapshots** through git; **record all information** generated during the experiment to ensure
+ traceability and reproducibility.
+- **Parameter Management:** Provides more convenient **parameter management** than argparse, built on fire.
+- **Runtime Configuration:** Provides configuration management under multi-level scopes.
+- **Visualization:** Provides a Jupyter-compatible interactive dashboard for experiment management based
+  on [Panel](https://panel.holoviz.org/index.html).
+- Additional optimization for deep learning:
+ - **Training:** Provides easily extendable training logic based on Trainer and provides comprehensive callback
+ logic.
+  - **Optimizer:** Integrated parameter and optimizer construction (a minimal sketch follows the figure below).
+ - **Data:** Abstraction of dataset construction process, combination of multiple DataLoaders, etc.
+ - **Distributed Training:** Also supports multiple training acceleration frameworks, unified abstraction, and easy
+ switching at any time.
+- More utilities...
+
+![lumo-framework](./images/lumo-intro.png)
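+
+As a minimal sketch of the integrated parameter and optimizer construction mentioned above (declared here on a plain `Params`, as the earlier documentation did; the bundled examples declare it inside `TrainerParams` subclasses):
+
+```python
+from torch import nn
+from lumo import Params
+
+params = Params()
+params.optim = params.OPTIM.create_optim('SGD', lr=0.06, momentum=0.9, weight_decay=5e-5)
+params.from_args()  # the settings stay overridable from the command line
+
+model = nn.Linear(28 * 28, 10)  # any torch module
+optimizer = params.optim.build(model.parameters())  # build the optimizer from the recorded settings
+```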
+
+# :book: Table of Contents
+
+- :cloud: [Installation](#cloud-installation)
+- :book: [Quick Start](#book-quick-start)
+ - :small_orange_diamond: [Embedding into Existing Projects](#small_orange_diamond-embedding-into-existing-projects)
+ - :small_orange_diamond: [Building from Scratch](#small_orange_diamond-building-from-scratch)
+ - :small_orange_diamond: [Visual Interface](#small_orange_diamond-visual-interface)
+ - :small_orange_diamond: [re-run](#small_orange_diamond-re-run)
+ - :small_orange_diamond: [backup](#small_orange_diamond-backup)
+- :scroll: [License](#scroll-license)
+
+# :cloud: Installation
+
+Install the published and tested version:
-
-## Install
```bash
-pip install lumo
+pip install -U lumo
```
-or
+Or install the latest version from the dev1 branch:
```bash
-git clone https://github.com/sailist/lumo
-
-python setup.py install
+pip install git+https://github.com/pytorch-lumo/lumo@dev1
```
-### test
+The experiment panel depends on Panel, which needs to be installed separately:
```
-python -m pytest # or python3 -m pytest
+pip install panel
```
-> Only a part of code have unit test.
-
+# :book: Quick Start
-## Requirements
+Here are a few classic scenarios:
- - install lumo will automatically install three light other libraries: [fire](https://github.com/google/python-fire), [psutil](https://github.com/giampaolo/psutil), [joblib](https://github.com/joblib/joblib).
- - lumo has mandatory dependencies on `pytorch`, `pandas` and `numpy`, you should manully install these before using lumo since they are usually common-used.
- - lumo has an optional dependency on `GitPython` as a plugin to execute git command, you can run `pip install GitPython` to install it.
-
-```shell
-pip install pandas numpy GitPython
-```
-and then see [pytorch](https://pytorch.org/) to install torch.
+## :small_orange_diamond: Embedding into Existing Projects
+For existing projects, you can quickly embed Lumo by following these steps:
+- Import Lumo and initialize Logger and Experiment:
-## Introduction
+```python
+import random
+from lumo import SimpleExperiment, Params, Logger, Meter, Record
-Unlike other pytorch tools, `lumo` mainly designed for research, there are two core idea of it:
+logger = Logger()
+exp = SimpleExperiment(exp_name='my_exp_a')
+exp.start()
+logger.add_log_dir(exp.mk_ipath())
+```
-1. Reduce repetition of your code.
-2. Make all operations **recordable**, **resumable**, **analyzable**.
+- Initialize parameters:
+```python
+params = Params()
+params.dataset = params.choice('cifar10', 'cifar100')
+params.alpha = params.arange(default=1, left=0, right=10)
+params.from_args() # python3 train.py --dataset=cifar100 --alpha=0.2
+print(params.to_dict()) # {"dataset": "cifar100", "alpha": 0.2}
+```
-Your can click [Tutorial](https://sailist.github.io/lumo/tutorial/) to learn the basic use of this framework. After that, you can view [Cookbook](https://sailist.github.io/lumo/cookbook/) to see some details of this library.
+- Record parameters and store information during training:
-A suggested learning order may be:
+```python
+exp.dump_info('params', params.to_dict())
+print(exp.test_name)
- - Learn highly frequency used module: [Define hyperparameter(Params)](https://sailist.github.io/lumo/params)、[Record variable(Meter)](https://sailist.github.io/lumo/meter)、[Log(Logger)](/lumo/logger)、[Reshape your dataloader(DataBundler)](https://sailist.github.io/lumo/bundler) and their aggregation [Trainer](https://sailist.github.io/lumo/trainer).
- - Learn how to manage/analyse your experiment by [Config](https://sailist.github.io/lumo/exp) and [Experiment](https://sailist.github.io/lumo/exp)
- - Learn how to simple manage random seed by [RndManager](https://sailist.github.io/lumo/rnd) and to create your dataset elegantly by [DatasetBuilder](https://sailist.github.io/lumo/builder)
+params.to_yaml(exp.mk_ipath('params.yaml'))
-After learning above contents, you can view [Cookbook](https://sailist.github.io/lumo/cookbook/) to learn the use of [tempelet code](https://sailist.github.io/lumo/structure) and other [details](https://sailist.github.io/lumo/details).
+for i in range(10):
+ max_acc = exp.dump_metric('Acc', random.random(), cmp='max')
+ logger.info(f'Max acc {max_acc}')
-You can also view another repository [lumo-implement](https://github.com/lumo/lumo-implement) to see a bigger example, it will continuously reimplement papers I interested by using the templete provided in `lumo`.
+ ckpt_fn = exp.mk_bpath('checkpoints', f'model_{i}.ckpt')
+    ...  # save the checkpoint to ckpt_fn
-## Examples
+record = Record()
+for batch in range(10):
+ m = Meter()
+ m.mean.Lall = random.random()
+ m.last.lr = batch
+ record.record(m)
+ logger.info(record)
-Before start, maybe you'd like to see some simple examples to learn what can `lumo` do.
+exp.end()
+```
-### Define hyperparameters
-By use `lumo.frame.Params`, you can define hyperparameters simply. See [Params](https://sailist.github.io/lumo/params) for details.
-```python
-from lumo import Params
-params = Params()
-params.batch_size = 128
-params.from_args() # from command args
+## :small_orange_diamond: Building from Scratch
->>> python ap.py --optim.lr=0.001 --epoch=400 --dataset=cifar10 --k=12
-```
-### Record variable
+If you want to start a new deep learning experiment from scratch, you can use Lumo to accelerate your code development.
+Below are examples of Lumo training at different scales:
-By using `lumo.frame.Meter`, you can record variable and update its average value with as little code as possible. See [Meter](https://sailist.github.io/lumo/meter) for details.
+Single-file training:
-```python
-from lumo import Meter,AvgMeter
-
-am = AvgMeter() # use for record average
-for j in range(500):
- meter = Meter()
- meter.percent(meter.c_) # when print, format 'c' as a percentage
- meter.a = 1
- meter.b = "2"
- meter.c = torch.rand(1)[0]
-
- meter.loss = loss_fn(...)
- meter.rand = torch.rand(2)
- meter.d = [4] # you can record any type of variable
- meter.e = {5: "6"}
-
- am.update(meter) # Update current value in meter. Average value will be calculated automatic by declaration and the type of the variable.
- print(am)
-```
+| Example | CoLab | Lines of Code |
+|---------------------------------------------|-------|---------------|
+| [MNIST example](./examples/mnist.py) | | 118 |
+| [MocoV2 trains CIFAR10](./examples/moco.py) | | 284 |
+| [Multi-GPU training on ImageNet]()          |       |               |
+Experiment projects:
-## Contribute
+| Project | Description |
+|-----------------------------------------------------------------------------------------------------------|-------------------------------|
+| [image-classification](https://github.com/pytorch-lumo/image-classification) | Reproducible code for multiple papers with full supervision, semi-supervision, and self-supervision |
+| [emotion-recognition-in-conversation](https://github.com/pytorch-lumo/emotion-recognition-in-conversation) | Reproducible code for multiple papers on dialogue emotion classification and multimodal dialogue emotion classification |
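+
+The single-file examples above share one core pattern. A condensed sketch of that pattern, adapted from [examples/mnist.py](./examples/mnist.py) with the dataset, model and metrics trimmed down to dummy placeholders, looks roughly like this:
+
+```python
+import torch
+from torch import nn
+from torch.nn import functional as F
+from lumo import DatasetBuilder, DataModule, Trainer, TrainerParams, Meter, callbacks
+
+
+class TinyTrainer(Trainer):
+    def icallbacks(self, params):
+        callbacks.LoggerCallback().hook(self)  # prints metrics during training
+
+    def imodels(self, params):
+        self.model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
+        self.optim = params.optim.build(self.model.parameters())
+        self.to_device()
+
+    def train_step(self, batch, params=None):
+        m = Meter()
+        logits = self.model(batch['xs'])
+        loss = F.cross_entropy(logits, batch['ys'])
+        self.optim.zero_grad()
+        loss.backward()
+        self.optim.step()
+        m.mean.Lall = loss
+        return m
+
+
+ds = (
+    DatasetBuilder()
+        .add_input('xs', torch.rand(512, 1, 28, 28))   # dummy images
+        .add_input('ys', torch.randint(0, 10, (512,)))  # dummy labels
+        .add_output('xs', 'xs')
+        .add_output('ys', 'ys')
+)
+dm = DataModule()
+dm.regist_dataloader(train=ds.DataLoader(batch_size=32))
+
+params = TrainerParams()
+params.epoch = 1
+params.optim = params.OPTIM.create_optim('SGD', lr=0.01, momentum=0.9)
+params.from_args()
+
+TinyTrainer(params, dm=dm).train()
+```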
-`lumo` will be better in the future, but there are still some lack exists currently, including:
+## :small_orange_diamond: Visual Interface
- - **Lack of more detail guide** because of the lacking of developer's energy and time.
- - **Lack more tests**. unit test only covers a part of the code. I hope I fixed all bugs during my using of it, but there is no guarantee of it. The compatibility is also unguaranteed. So, welcome to [issus](https://github.com/sailist/lumo/issues) it if you find it.
- - **Lack of development experience**. So the version number may be confused.
+In jupyter:
-Thanks for all contribution.
+```python
+from lumo import Watcher
+w = Watcher()
+df = w.load()
+widget = w.panel(df)
+widget.servable()
+```
+![Panel](./images/panel-example.png)
-For file read/write and get/set, I designed
+Manually filtered experiments for visualization:
+![Panel](./images/panel-example2.png)
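+
+`Watcher.load()` returns a pandas `DataFrame`, so the filtering itself is ordinary pandas; a small sketch (`exp_name` is one of the recorded properties listed under Full properties below):
+
+```python
+from lumo import Watcher
+
+w = Watcher()
+df = w.load()
+picked = df[df['exp_name'] == 'my_exp_a']  # ordinary pandas filtering on any recorded property
+w.panel(picked).servable()
+```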
-- [Params], to make runtime config get/set/load/dump easily,
-- [globs], a global/local/runtime environment variables manager.
-- [Saver], to help you save/load/manage your checkpoints/models in one class.
+You can directly use the command line:
-For data processing, I designed
+```
+lumo board [--port, --address, --open]
+```
-- [Builder], to hold nearly all dataset formats and special operations by one class,
+## :small_orange_diamond: re-run
-For managing experiments, I designed
+An experiment that failed for whatever reason can be **re-run by its unique experiment ID (test_name)**, and extra
+parameters can be **reassigned and replaced**.
-- [Experiment], which can
- - make you build a suitable directory and file path in one place,
- - make you record lightweight data, and
- - help you make snapshot for your project code (based on git), which can make each result recoverable and
- reproducible
-- [random manager], a cross-lib(random/numpy/pytorch) random seed manager
+```
+lumo rerun 230313.030.57t --device=0
+```
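+
+The `rerun` command is a thin wrapper over the experiment cache; the same can be done programmatically (this mirrors the CLI implementation under `src/lumo/cli`):
+
+```python
+from lumo import Experiment, Watcher
+
+w = Watcher()
+df = w.load()
+row = df[df['test_name'] == '230313.030.57t'].iloc[0]
+exp = Experiment.from_cache(row.to_dict())
+exp.rerun(['--device=0'])
+```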
-For log and meter variables produced during experiment, I designed
+## :small_orange_diamond: backup
-- [Meter] to meter every thing in appropriate format, and
-- [Logger] to log every thing in appropriate format.
+Backing up experiment information to a Github issue (based on PyGitHub):
-Finally, I designed [Trainer] to bundle all module above for deep learning experiment.
+```python
+from lumo import Experiment, Watcher
+from lumo import glob
+
+# Default value for `access_token`. It is recommended to store the access_token in the global configuration `~/.lumorc.json`.
+glob['github_access_token'] = 'ghp_*'
+
+w = Watcher()
+df = w.load()
+
+# Selecting a single experiment for backup
+exp = Experiment.from_cache(df.iloc[0].to_dict())
+issue = exp.backup('github', repo='pytorch-lumo/image-classification-private',
+ access_token='ghp_*',
+ update=True, # If already backed up, overwrite the previous issue
+ labels=None, # Optional labels
+ )
+print(issue.number)
+
+# Batch backup and add labels based on each experiment's parameters
+issues = df.apply(
+ lambda x: Experiment.from_cache(x.to_dict()).backup(..., labels=[x['params'].get('dataset', '')]),
+ axis=1
+)
+```
+![backup_github](./images/backup_github.png)
-As you can see, These modules covered most demandings on deeplearning
+# Full properties
-You can find what you want and click the link to quickly learn HOW TO USE it! All module is designed easy to use, it's
-my principles.
+```
+{'agent': nan,
+ 'backup': {'23-03-17-003438': {'backend': 'github',
+ 'number': 9,
+ 'repo': '...'},
+ },
+ 'exception': nan,
+ 'execute': {'cwd': '~/Documents/Python/lumo',
+ 'exec_argv': ['~/Documents/Python/lumo/a.py'],
+ 'exec_bin': '~/.pyenv/versions/3.9.16/bin/python3.9',
+ 'exec_file': '~/Documents/Python/lumo/a.py',
+ 'repo': '~/Documents/Python/lumo'},
+ 'exp_name': 'my_exp_a',
+ 'git': {'commit': '1014b6b5',
+ 'dep_hash': 'c93b8c4e340882f55cf0c8e125fa0203',
+ 'repo': '~/Documents/Python/lumo'},
+ 'hooks': {'Diary': {'loaded': True, 'msg': ''},
+ 'FinalReport': {'loaded': True, 'msg': ''},
+ 'GitCommit': {'loaded': True, 'msg': ''},
+ 'LastCmd': {'loaded': True, 'msg': ''},
+ 'LockFile': {'loaded': True, 'msg': ''},
+ 'RecordAbort': {'loaded': True, 'msg': ''}},
+ 'lock': {'accelerate': '0.16.0',
+ 'decorator': '5.1.1',
+ 'fire': '0.5.0',
+ 'hydra': '1.3.2',
+ 'joblib': '1.2.0',
+ 'lumo': '0.15.0',
+ 'numpy': '1.24.2',
+ 'omegaconf': '2.3.0',
+ 'psutil': '5.9.4',
+ 'torch': '1.13.1'},
+ 'note': 'This is a Note',
+ 'params': {'alpha': 1, 'dataset': 'cifar10'},
+ 'paths': {'blob_root': '~/.lumo/blob',
+ 'cache_root': '~/.lumo/cache',
+ 'info_root': '~/.lumo/experiments'},
+ 'pinfo': {'hash': '0af4b77497c85bc5b65ccbdd9ff4ca0f',
+ 'obj': {'argv': ['~/.pyenv/versions/3.9.16/bin/python3.9',
+ '~/Documents/Python/lumo/a.py'],
+ 'pid': 63975,
+ 'pname': 'python3.9',
+ 'pstart': 1678898740.099484},
+ 'pid': 63975},
+ 'progress': {'end': '23-03-16-004542',
+ 'end_code': 0,
+ 'last_edit_time': '23-03-16-004542',
+ 'ratio': 1,
+ 'start': '23-03-16-004542',
+ 'update_from': None},
+ 'tags': [],
+ 'test_name': '230316.000.8ct',
+ 'trainer': nan}
+```
+# :scroll: License
+Distributed under the Apache License Version 2.0. See [LICENSE](./LICENSE) for more information.
diff --git a/docstr-coverage.sh b/docstr-coverage.sh
new file mode 100644
index 00000000..3d1a1029
--- /dev/null
+++ b/docstr-coverage.sh
@@ -0,0 +1,9 @@
+docstr-coverage \
+ src/lumo/cli \
+ src/lumo/core \
+ src/lumo/data \
+ src/lumo/exp \
+ src/lumo/proc \
+ src/lumo/trainer \
+ src/lumo/utils
+
diff --git a/examples/data/datamodule.py b/examples/data/datamodule.py
deleted file mode 100644
index 052e65e1..00000000
--- a/examples/data/datamodule.py
+++ /dev/null
@@ -1,35 +0,0 @@
-from lumo import DataModule, ParamsType, TrainStage, DatasetBuilder
-import torch
-
-
-class MyDataModule(DataModule):
- def idataloader(self, params: ParamsType = None, stage: TrainStage = None):
- if stage.is_train():
- print("init train dataloader")
- db = (
- DatasetBuilder()
- .add_input("xs", torch.rand(500, 28, 28))
- .add_input("ys", torch.randint(0, 10, (500,)))
- .add_output("xs", "xs")
- .add_output("ys", "ys")
- )
- loader = db.DataLoader(batch_size=10)
- else:
- print("init test dataloader")
- db = (
- DatasetBuilder()
- .add_input("xs", torch.rand(50, 28, 28))
- .add_input("ys", torch.randint(0, 10, (50,)))
- .add_output("xs", "xs")
- .add_output("ys", "ys")
- )
- loader = db.DataLoader(batch_size=10)
- self.regist_dataloader_with_stage(stage, loader)
-
-
-dm = MyDataModule()
-print(dm._prop)
-print(len(dm.train_dataloader))
-print(dm._prop)
-print(len(dm.test_dataloader))
-print(dm._prop)
diff --git a/examples/imagenet.py b/examples/imagenet.py
new file mode 100644
index 00000000..34fbf3a4
--- /dev/null
+++ b/examples/imagenet.py
@@ -0,0 +1,259 @@
+import sys
+from pathlib import Path
+from typing import Union
+
+import torch
+from PIL import Image
+from torch.utils.data import DataLoader
+import os
+import torch.multiprocessing as mp
+from torchvision.datasets.folder import default_loader
+
+from lumo import DatasetBuilder, MetricType, Trainer, TrainerParams, Meter, callbacks, DataModule
+from torchvision.datasets import FakeData, ImageFolder
+from torchvision import transforms
+from torchvision.models.resnet import resnet18
+from torch import nn
+from lumo.proc.dist import is_dist, is_main
+from torch.nn import functional as F
+from lumo.proc import glob
+from lumo.utils.subprocess import run_command
+
+"""define transforms"""
+
+
+def none(mean, std):
+ return transforms.Compose([
+ transforms.ToTensor(),
+ transforms.Normalize(mean, std)
+ ])
+
+
+def standard(mean, std, resize=None):
+ return transforms.Compose([
+ transforms.RandomResizedCrop(224),
+ transforms.RandomHorizontalFlip(),
+ transforms.ToTensor(),
+ transforms.Normalize(mean, std)
+ ])
+
+
+"""create datasets"""
+
+
+def imagenet(split='train'):
+ """
+ download from https://www.kaggle.com/c/imagenet-object-localization-challenge/overview/description
+ ```
+ mkdir imagenet
+ cd ./imagenet
+ kaggle competitions download -c imagenet-object-localization-challenge
+ unzip imagenet-object-localization-challenge.zip
+ tar -xvf imagenet_object_localization_patched2019.tar.gz
+ ls
+ >>> ILSVRC LOC_synset_mapping.txt LOC_val_solution.csv imagenet_object_localization_patched2019.tar.gz
+ >>> LOC_sample_submission.csv LOC_train_solution.csv imagenet-object-localization-challenge.zip
+ ```
+ """
+ root = glob['imagenet']
+ if split == 'train':
+ file = Path(root).joinpath('ILSVRC', 'ImageSets', 'CLS-LOC', 'train_cls.txt')
+ train_root = os.path.join(root, 'ILSVRC/Data/CLS-LOC/train')
+ with file.open('r') as r:
+ lines = r.readlines()
+ imgs = [line.split(' ')[0] for line in lines]
+ name_cls_map = {name: i for i, name in enumerate(sorted(set([i.split('/')[0] for i in imgs])))}
+ xs = [os.path.join(train_root, f'{i}.JPEG') for i in imgs]
+ ys = [name_cls_map[i.split('/')[0]] for i in imgs]
+ else:
+ file = Path(root).joinpath('LOC_val_solution.csv')
+ val_root = os.path.join(root, 'ILSVRC/Data/CLS-LOC/val')
+
+ with file.open('r') as r:
+ r.readline()
+ lines = r.readlines()
+ lines = [line.split(',') for line in lines]
+ lines = [[img, res.split(' ')[0]] for img, res in lines]
+
+ name_cls_map = {name: i for i, name in enumerate(sorted(set([i[1] for i in lines])))}
+ xs = [os.path.join(val_root, f'{img}.JPEG') for img, _ in lines]
+ ys = [name_cls_map[res] for _, res in lines]
+
+ return list(xs), list(ys)
+
+
+def take_first(item):
+ return item[0]
+
+
+def take_second(item):
+ return item[1]
+
+
+def make_dataset(dummy=False):
+ if dummy:
+ train_dataset = FakeData(1281167, (3, 224, 224), 1000, transforms.ToTensor())
+ val_dataset = FakeData(50000, (3, 224, 224), 1000, transforms.ToTensor())
+ ds = (
+ DatasetBuilder()
+ .add_input('fake', train_dataset)
+ .add_output('fake', 'xs', transform=take_first)
+ .add_output('fake', 'ys', transform=take_second)
+ )
+ test_ds = (
+ DatasetBuilder()
+ .add_input('fake', val_dataset)
+ .add_output('fake', 'xs', transform=take_first)
+ .add_output('fake', 'ys', transform=take_second)
+ )
+ else:
+ train_dataset = ImageFolder(os.path.join(glob['imagenet'], 'train'))
+ val_dataset = ImageFolder(os.path.join(glob['imagenet'], 'val'))
+
+ xs, ys = list(zip(*train_dataset.samples))
+ test_xs, test_ys = list(zip(*val_dataset.samples))
+
+ mean = [0.485, 0.456, 0.406]
+ std = [0.229, 0.224, 0.225]
+
+ ds = (
+ DatasetBuilder()
+                .add_input('xs', xs, transform=default_loader)  # register the sample source, named 'xs'
+                .add_input('ys', ys)  # register the label source, named 'ys'
+                .add_output('xs', 'xs', transform=standard(mean, std))  # add an augmented output 'xs'
+                .add_output('ys', 'ys')  # add the label output
+ )
+
+ print(ds)
+ print(ds[0].keys())
+
+ test_ds = (
+ DatasetBuilder()
+                .add_input('xs', test_xs, transform=default_loader)  # register the sample source, named 'xs'
+                .add_input('ys', test_ys)  # register the label source, named 'ys'
+                .add_output('xs', 'xs', transform=none(mean, std))  # no augmentation for test samples
+                .add_output('ys', 'ys')  # add the label output
+ )
+
+ print(test_ds)
+ print(test_ds[0].keys())
+ return ds, test_ds
+
+
+class LargeParams(TrainerParams):
+ def __init__(self):
+ super().__init__()
+ self.optim = self.OPTIM.create_optim('SGD',
+ lr=0.06,
+ momentum=0.9,
+ weight_decay=5e-5,
+ )
+ self.lr_decay_end = 0.00001
+ self.batch_size = 512
+ self.dummy = False
+ self.multiprocessing_distributed = True
+
+
+ParamsType = LargeParams
+
+
+class LargeModel(nn.Module):
+
+ def __init__(self, feature_dim) -> None:
+ super().__init__()
+ self.backbone = resnet18()
+ in_feature = self.backbone.fc.in_features
+ self.backbone.fc = nn.Identity()
+ self.head = nn.Linear(in_feature, feature_dim, bias=True)
+
+ def forward(self, xs):
+ feature_map = self.backbone(xs)
+ feature = self.head(feature_map)
+ return feature
+
+
+class LargeTrainer(Trainer):
+
+ def icallbacks(self, params: ParamsType):
+ callbacks.LoggerCallback().hook(self)
+
+ def imodels(self, params: ParamsType):
+ self.model = resnet18(num_classes=1000)
+ self.optim = params.optim.build(self.model.parameters())
+
+ self.lr_sche = params.SCHE.Cos(
+ start=params.optim.lr, end=params.lr_decay_end,
+ left=0,
+ right=len(self.train_dataloader) * params.epoch
+ )
+ # manually trigger send_to_device method
+ self.to_device()
+
+ def train_step(self, batch, params: ParamsType = None) -> MetricType:
+ super().train_step(batch, params)
+ m = Meter()
+ xs, ys = batch['xs'], batch['ys']
+ logits = self.model(xs)
+
+ Lall = F.cross_entropy(logits, ys)
+ self.optim.zero_grad()
+ self.accelerate.backward(Lall)
+ self.optim.step()
+
+ # change lr by training epoch
+ cur_lr = self.lr_sche.apply(self.optim, self.eidx)
+
+ with torch.no_grad():
+ m.mean.Lall = Lall
+ m.mean.Ax = torch.eq(logits.argmax(dim=-1), ys).float().mean()
+ m.last.lr = cur_lr
+ return m
+
+ def test_step(self, batch, params: ParamsType = None) -> MetricType:
+ m = Meter()
+ xs, ys = batch['xs'], batch['ys']
+ logits = self.model(xs)
+
+ all_logits = self.accelerate.gather(logits)
+ all_ys = self.accelerate.gather(ys)
+
+ m.test_mean.Ax = torch.eq(all_logits.argmax(dim=-1), all_ys).float()
+ return m
+
+
+def main_worker(device, ngpus_per_node, args):
+ # create datamodule to contain dataloader
+ params = LargeParams()
+ params.device = device
+ params.from_args()
+
+ ds, test_ds = make_dataset(dummy=params.dummy)
+ dl = ds.DataLoader(batch_size=params.batch_size, num_workers=4)
+ test_dl = test_ds.DataLoader(batch_size=params.batch_size, num_workers=4)
+ dm = DataModule()
+ dm.regist_dataloader(train=dl, test=test_dl)
+
+ # with the input of params and dataloader, the initialization of models and optimizers in Trainer,
+ # then the output will be the trained parameters, metrics and logs.
+ trainer = LargeTrainer(params, dm=dm)
+
+ trainer.train() # or trainer.train(dm=dl) if dm are not given above
+ trainer.test() # or trainer.test(dm=dl)
+ trainer.save_last_model()
+
+
+def main():
+ # if params.multiprocessing_distributed and not is_dist():
+ # mp.spawn(main, nprocs=torch.cuda.device_count(), args=(ngpus_per_node, args))
+ # command = ' '.join([
+ # 'accelerate', 'launch',
+ # *sys.argv,
+ # ])
+ # print(command)
+ # run_command(command)
+ # else: # not distributed or in distribution environment
+ pass
+
+
+if __name__ == '__main__':
+ main()
diff --git a/examples/mnist.py b/examples/mnist.py
new file mode 100644
index 00000000..45cc6a67
--- /dev/null
+++ b/examples/mnist.py
@@ -0,0 +1,117 @@
+import torch
+from lumo import DatasetBuilder, MetricType, Trainer, TrainerParams, Meter, callbacks, DataModule
+from lumo.proc.path import cache_dir
+from torchvision.datasets.mnist import MNIST
+from torchvision.transforms import Compose, RandomCrop, Normalize
+from torch import nn
+from torch.nn import functional as F
+
+
+def make_dataset():
+ data = MNIST(root=cache_dir(), train=True, download=True)
+ test_data = MNIST(root=cache_dir(), train=False, download=True)
+
+ mean = torch.mean(data.data / 255.)
+ std = torch.std(data.data / 255.)
+
+ ds = (
+ DatasetBuilder()
+            .add_input('xs', data.data.float().unsqueeze(1))  # register the sample source, named 'xs'
+            .add_input('ys', data.targets)  # register the label source, named 'ys'
+            .add_output('xs', 'xs0', transform=Normalize(mean=(mean,), std=(std,)))  # add a weakly augmented output 'xs0'
+            .add_output('xs', 'xs1',
+                        transform=Compose(
+                            [RandomCrop(28, padding=4), Normalize(mean=(mean,), std=(std,))]))  # add a strongly augmented output 'xs1'
+            .add_output('ys', 'ys')  # add the label output
+ )
+ print(ds)
+ print(ds[0].keys())
+
+ test_ds = (
+ DatasetBuilder()
+            .add_input('xs', test_data.data.float().unsqueeze(1))  # register the sample source, named 'xs'
+            .add_input('ys', test_data.targets)  # register the label source, named 'ys'
+            .add_output('xs', 'xs', transform=Normalize(mean=(mean,), std=(std,)))  # no augmentation for test samples
+            .add_output('ys', 'ys')  # add the label output
+ )
+
+ print(test_ds)
+ print(test_ds[0].keys())
+ return ds, test_ds
+
+
+class MNISTParams(TrainerParams):
+ def __init__(self):
+ super().__init__()
+ self.batch_size = 128
+ self.optim = self.OPTIM.create_optim('SGD', lr=0.0001, momentum=0.9)
+
+
+ParamsType = MNISTParams
+
+
+class MNISTTrainer(Trainer):
+
+ def icallbacks(self, params: ParamsType):
+ super().icallbacks(params)
+ callbacks.LoggerCallback().hook(self)
+
+ def imodels(self, params: ParamsType):
+ super().imodels(params)
+ self.model = nn.Sequential(
+ nn.Flatten(),
+ nn.Linear(28 * 28, 128),
+ nn.ReLU(),
+ nn.Linear(128, 10),
+ )
+ self.optim = params.optim.build(self.model.parameters())
+
+ # manually trigger send_to_device method
+ self.to_device()
+
+ def train_step(self, batch, params: ParamsType = None) -> MetricType:
+ super().train_step(batch, params)
+ m = Meter()
+ eval_xs, xs, ys = batch['xs0'], batch['xs1'], batch['ys']
+ logits = self.model(xs)
+ Lall = F.cross_entropy(logits, ys)
+ self.optim.zero_grad()
+ Lall.backward()
+ self.optim.step()
+ with torch.no_grad():
+ m.mean.Lall = Lall
+ eval_logits = self.model(eval_xs)
+ m.mean.Ax = torch.eq(eval_logits.argmax(dim=-1), ys).float().mean()
+ return m
+
+ def test_step(self, batch, params: ParamsType = None) -> MetricType:
+ m = Meter()
+ xs, ys = batch['xs'], batch['ys']
+ logits = self.model(xs)
+ m.test_mean.Ax = torch.eq(logits.argmax(dim=-1), ys).float()
+ return m
+
+
+def main():
+ ds, test_ds = make_dataset()
+
+ # create datamodule to contain dataloader
+ dl = ds.DataLoader(batch_size=32)
+ test_dl = test_ds.DataLoader(batch_size=32)
+ dm = DataModule()
+ dm.regist_dataloader(train=dl, test=test_dl)
+
+ params = MNISTParams()
+ params.epoch = 10
+ params.from_args()
+ # with the input of params and dataloader, the initialization of models and optimizers in Trainer,
+ # then the output will be the trained parameters, metrics and logs.
+ trainer = MNISTTrainer(params, dm=dm)
+
+ trainer.train() # or trainer.train(dm=dl) if dm are not given above
+ trainer.test() # or trainer.test(dm=dl)
+ trainer.save_last_model()
+
+
+if __name__ == '__main__':
+ main()
diff --git a/examples/moco.py b/examples/moco.py
new file mode 100644
index 00000000..14dff276
--- /dev/null
+++ b/examples/moco.py
@@ -0,0 +1,290 @@
+"""
+refer to
+https://colab.research.google.com/github/facebookresearch/moco/blob/colab-notebook/colab/moco_cifar10_demo.ipynb
+"""
+from typing import Union
+
+import torch
+from PIL import Image
+from torch.utils.data import DataLoader
+
+from lumo import DatasetBuilder, MetricType, Trainer, TrainerParams, Meter, callbacks, DataModule
+from lumo.contrib import EMA, MemoryBank, StorageBank
+
+from lumo.contrib.nn.loss import contrastive_loss2
+from lumo.proc.path import cache_dir
+from torchvision.datasets.cifar import CIFAR10
+from torchvision import transforms
+from torchvision.models.resnet import resnet18
+from torch import nn
+from torch.nn import functional as F
+
+from lumo.utils.device import send_to_device
+
+"""define transforms"""
+
+
+def none(mean, std):
+ return transforms.Compose([
+ transforms.ToTensor(),
+ transforms.Normalize(mean, std)
+ ])
+
+
+def simclr(mean, std):
+ return transforms.Compose([
+ transforms.RandomResizedCrop(32),
+ transforms.RandomHorizontalFlip(),
+ transforms.RandomApply([
+ transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)
+ ], p=0.8),
+ transforms.RandomGrayscale(0.2),
+ transforms.ToTensor(),
+ transforms.Normalize(mean, std)
+ ])
+
+
+"""create datasets"""
+
+
+def make_dataset():
+ data = CIFAR10(root=cache_dir(), train=True, download=True)
+ data.data = [Image.fromarray(img) for img in data.data]
+ test_data = CIFAR10(root=cache_dir(), train=False, download=True)
+
+ mean = [0.4914, 0.4822, 0.4465]
+ std = [0.2023, 0.1994, 0.2010]
+
+ ds = (
+ DatasetBuilder()
+            .add_input('xs', data.data)  # register the sample source, named 'xs'
+            .add_input('ys', data.targets)  # register the label source, named 'ys'
+            .add_output('xs', 'xs0', transform=simclr(mean, std))  # add one augmented view 'xs0'
+            .add_output('xs', 'xs1', transform=simclr(mean, std))  # add a second augmented view 'xs1'
+            .add_output('ys', 'ys')  # add the label output
+ )
+
+ # for knn test
+ memo_ds = (
+ DatasetBuilder()
+            .add_input('xs', data.data)  # register the sample source, named 'xs'
+            .add_input('ys', data.targets)  # register the label source, named 'ys'
+            .add_output('xs', 'xs', transform=none(mean, std))  # plain (non-augmented) output for the feature bank
+            .add_output('ys', 'ys')  # add the label output
+ )
+ print(ds)
+ print(ds[0].keys())
+
+ test_ds = (
+ DatasetBuilder()
+ .add_idx('idx') # add index key for sample
+            .add_input('xs', test_data.data)  # register the sample source, named 'xs'
+            .add_input('ys', test_data.targets)  # register the label source, named 'ys'
+            .add_output('xs', 'xs', transform=none(mean, std))  # no augmentation for test samples
+            .add_output('ys', 'ys')  # add the label output
+ )
+
+ print(test_ds)
+ print(test_ds[0].keys())
+ return ds, memo_ds, test_ds
+
+
+class MocoParams(TrainerParams):
+ def __init__(self):
+ super().__init__()
+ self.optim = self.OPTIM.create_optim('SGD',
+ lr=0.06,
+ momentum=0.9,
+ weight_decay=5e-5,
+ )
+ self.lr_decay_end = 0.00001
+ self.temperature = 0.1
+ self.ema_alpha = 0.99
+ self.feature_dim = 129
+ self.queue_size = 4096
+ self.batch_size = 512
+        self.symmetric = False
+        # knn evaluation settings used in MocoTrainer.test() (defaults follow the referenced MoCo CIFAR10 demo)
+        self.n_classes = 10
+        self.knn_k = 200
+        self.knn_t = 0.1
+
+
+ParamsType = MocoParams
+
+
+class MocoModel(nn.Module):
+
+ def __init__(self, feature_dim) -> None:
+ super().__init__()
+ self.backbone = resnet18()
+ in_feature = self.backbone.fc.in_features
+ self.backbone.fc = nn.Identity()
+ self.head = nn.Linear(in_feature, feature_dim, bias=True)
+
+ def forward(self, xs):
+ feature_map = self.backbone(xs)
+ feature = self.head(feature_map)
+ return feature
+
+
+def knn_predict(feature, feature_bank, feature_labels, classes, knn_k, knn_t):
+ # compute cos similarity between each feature vector and feature bank ---> [B, N]
+ sim_matrix = torch.mm(feature, feature_bank)
+ # [B, K]
+ sim_weight, sim_indices = sim_matrix.topk(k=knn_k, dim=-1)
+ # [B, K]
+ sim_labels = torch.gather(feature_labels.expand(feature.size(0), -1), dim=-1, index=sim_indices)
+ sim_weight = (sim_weight / knn_t).exp()
+
+ # counts for each class
+ one_hot_label = torch.zeros(feature.size(0) * knn_k, classes, device=sim_labels.device)
+ # [B*K, C]
+ one_hot_label = one_hot_label.scatter(dim=-1, index=sim_labels.view(-1, 1), value=1.0)
+ # weighted score ---> [B, C]
+ pred_scores = torch.sum(one_hot_label.view(feature.size(0), -1, classes) * sim_weight.unsqueeze(dim=-1), dim=1)
+
+ pred_labels = pred_scores.argsort(dim=-1, descending=True)
+ return pred_labels
+
+
+class MocoTrainer(Trainer):
+
+ def icallbacks(self, params: ParamsType):
+ callbacks.LoggerCallback().hook(self)
+
+ def imodels(self, params: ParamsType):
+ self.model = MocoModel(params.feature_dim)
+ self.ema_model = EMA(self.model, alpha=params.ema_alpha)
+
+ self.optim = params.optim.build(self.model.parameters())
+
+ self.tensors = StorageBank()
+ self.tensors.register('test_feature', dim=params.feature_dim, k=len(self.dm.test_dataset))
+ self.tensors.register('test_ys', dim=-1, k=len(self.dm.test_dataset), dtype=torch.long)
+
+ self.mem = MemoryBank()
+ # do not need normalize because normalize is applied in contrastive_loss2 function
+ self.mem.register('negative', dim=params.feature_dim, k=params.queue_size)
+ self.mem['negative'] = F.normalize(self.mem['negative'], dim=-1)
+
+ self.lr_sche = params.SCHE.Cos(
+ start=params.optim.lr, end=params.lr_decay_end,
+ left=0,
+ right=len(self.train_dataloader) * params.epoch
+ )
+ # manually trigger send_to_device method
+ self.to_device()
+
+ def train_step(self, batch, params: ParamsType = None) -> MetricType:
+ m = Meter()
+ im_query, im_key, ys = batch['xs0'], batch['xs1'], batch['ys']
+ feat_query = self.model.forward(im_query)
+
+ with torch.no_grad():
+ # shuffle for making use of BN
+ feat_key = self.ema_model.forward(im_key) # keys: NxC
+ feat_key = F.normalize(feat_key, dim=1) # already normalized
+
+ feat_query = F.normalize(feat_query, dim=1)
+
+ if params.symmetric:
+ Lcsa = contrastive_loss2(query=feat_query, key=feat_key,
+ memory=self.mem['negative'],
+ query_neg=False, key_neg=False,
+ temperature=params.temperature,
+ norm=False)
+ Lcsb = contrastive_loss2(query=feat_key, key=feat_query,
+ memory=self.mem['negative'],
+ query_neg=False, key_neg=False,
+ temperature=params.temperature,
+ norm=False)
+ Lcs = Lcsa + Lcsb
+ else:
+
+ Lcs = contrastive_loss2(query=feat_query, key=feat_key.detach(),
+ memory=self.mem['negative'].clone().detach(),
+ query_neg=False, key_neg=False,
+ temperature=params.temperature,
+ norm=False)
+
+ # memory bank
+ with torch.no_grad():
+ if params.symmetric:
+ self.mem.push('negative', torch.cat([feat_query, feat_key], dim=0))
+ else:
+ self.mem.push('negative', feat_key)
+
+ self.optim.zero_grad()
+ self.accelerate.backward(Lcs)
+ self.optim.step()
+ cur_lr = self.lr_sche.apply(self.optim, self.global_steps)
+
+ # metrics
+ with torch.no_grad():
+ m.mean.Lcs = Lcs
+ m.last.lr = cur_lr
+ return m
+
+ def test_step(self, batch, params: ParamsType = None) -> MetricType:
+ idx = batch['idx']
+ xs, ys = batch['xs'], batch['ys']
+ feature = self.model(xs)
+ self.tensors.scatter('test_feature', feature, idx)
+ self.tensors.scatter('test_ys', ys, idx)
+
+ def test(self, dm: Union[DataModule, DataLoader] = None, params: ParamsType = None, limit_step=None):
+ super().test(dm, params, limit_step) # run default test loop
+ self.save_last_model()
+
+ feature_bank = []
+ with torch.no_grad():
+ # generate feature bank
+ for batch in self.dm['memo']:
+ batch = send_to_device(batch, self.device)
+ data, target = batch['xs'], batch['ys']
+ feature = self.model(data)
+ feature = F.normalize(feature, dim=1)
+ feature_bank.append(feature)
+
+ feature_bank = torch.cat(feature_bank, dim=0).t().contiguous()
+ # [N]
+ feature_labels = torch.tensor(self.dm['memo'].dataset.inputs['ys'], device=feature_bank.device)
+ # loop test data to predict the label by weighted knn search
+
+ pred_labels = knn_predict(self.tensors['test_feature'],
+ feature_bank, feature_labels, params.n_classes, params.knn_k, params.knn_t)
+ total_num = pred_labels.shape[0]
+ total_top1 = torch.eq(pred_labels[:, 0], self.tensors['test_ys']).float().sum().item()
+
+ knn_acc = total_top1 / total_num * 100
+
+ max_knn_acc = self.metric.dump_metric('Knn', knn_acc, cmp='max', flush=True)
+ self.logger.info(f'Best Knn Top-1 acc: {max_knn_acc}, current: {knn_acc}')
+
+ if knn_acc >= max_knn_acc:
+ self.save_best_model()
+
+
+def main():
+ ds, memo_ds, test_ds = make_dataset()
+
+ params = MocoParams()
+ params.from_args()
+
+ # create datamodule to contain dataloader
+ dl = ds.DataLoader(batch_size=params.batch_size, num_workers=2)
+ memo_dl = memo_ds.DataLoader(batch_size=params.batch_size, num_workers=2)
+ test_dl = test_ds.DataLoader(batch_size=params.batch_size, num_workers=2)
+ dm = DataModule()
+ dm.regist_dataloader(train=dl,
+ test=test_dl,
+ memo=memo_dl) # add extra dataloader with any name
+
+ # with the input of params and dataloader, the initialization of models and optimizers in Trainer,
+ # then the output will be the trained parameters, metrics and logs.
+ trainer = MocoTrainer(params, dm=dm)
+
+ trainer.train() # or trainer.train(dm=dl) if dm are not given above
+ trainer.test() # or trainer.test(dm=dl)
+ trainer.save_last_model()
+
+
+if __name__ == '__main__':
+ main()
diff --git a/images/backup_github.png b/images/backup_github.png
new file mode 100644
index 00000000..01518629
Binary files /dev/null and b/images/backup_github.png differ
diff --git a/images/lumo-intro.png b/images/lumo-intro.png
new file mode 100644
index 00000000..1199de27
Binary files /dev/null and b/images/lumo-intro.png differ
diff --git a/images/panel-example.png b/images/panel-example.png
new file mode 100644
index 00000000..42f26f61
Binary files /dev/null and b/images/panel-example.png differ
diff --git a/images/panel-example2.png b/images/panel-example2.png
new file mode 100644
index 00000000..c0116eb9
Binary files /dev/null and b/images/panel-example2.png differ
diff --git a/pyproject.toml b/pyproject.toml
index e0133509..189fab03 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -42,6 +42,7 @@ omit = [
'src/lumo/vis/*',
'src/lumo/decorators/*',
'src/lumo/exp/agent.py',
+ 'src/lumo/exp/lazy_panel.py',
'src/lumo/analyse/*',
'src/lumo/sketch/*',
'src/lumo/core/record_backend/*',
diff --git a/setup.py b/setup.py
index a0d9f4c6..10eecd68 100644
--- a/setup.py
+++ b/setup.py
@@ -1,5 +1,4 @@
import re
-from datetime import datetime
from setuptools import setup, find_packages
diff --git a/src/lumo/__init__.py b/src/lumo/__init__.py
index 5893105d..35697cca 100644
--- a/src/lumo/__init__.py
+++ b/src/lumo/__init__.py
@@ -1,11 +1,12 @@
"""
"""
-__version__ = "0.15.0"
+__version__ = "1.0.0"
from .core import Params, ParamsType, MetricType, Meter, Record, TrainStage, BaseParams
+from .proc import glob
from .data import DataLoader, DataModule, DatasetBuilder, LumoDataLoader, CollateBase, DataLoaderSide
-from .exp import SimpleExperiment, Experiment
+from .exp import SimpleExperiment, Experiment, Metric, Watcher, C
from .trainer import Trainer, TrainerParams, callbacks, RndManager
from .utils import Logger
diff --git a/src/lumo/analyse/__init__.py b/src/lumo/analyse/__init__.py
index 7eeb38a8..0af59d33 100644
--- a/src/lumo/analyse/__init__.py
+++ b/src/lumo/analyse/__init__.py
@@ -1,2 +1,5 @@
+import warnings
+
from .collect import collect_table_rows, flatten_dict, flatten_params, flatten_metric
from .condition import C, filter_by_condition
+warnings.warn("lumo.analyse has been deprecated and will be removed soon, please use lumo.exp.Watcher instead.")
\ No newline at end of file
diff --git a/src/lumo/cli/__init__.py b/src/lumo/cli/__init__.py
index 714545ea..e9a39df9 100644
--- a/src/lumo/cli/__init__.py
+++ b/src/lumo/cli/__init__.py
@@ -13,12 +13,17 @@ def rerun(test_name, **kwarg):
Returns:
"""
- from lumo.exp.finder import retrieval_experiment
- exp = retrieval_experiment(test_name)
- if exp is not None:
- exp.rerun([f'--{k}={v}' for k, v in kwarg.items()])
- else:
+ from lumo.exp.watch import Watcher
+ from lumo import Experiment
+ w = Watcher()
+ df = w.load()
+ df = df[df['test_name'] == test_name]
+ if len(df) == 0:
+ print(f'{test_name} not found')
exit(1)
+ else:
+ exp = Experiment.from_cache(df.iloc[0].to_dict())
+ exp.rerun([f'--{k}={v}' for k, v in kwarg.items()])
def note(test_name, description):
@@ -36,7 +41,7 @@ def note(test_name, description):
print(f"Adding note '{description}' to {test_name}")
-def server(port=8080):
+def board(port=11606, address=None, open=True):
"""
Args:
@@ -45,12 +50,16 @@ def server(port=8080):
Returns:
"""
+ from lumo import Watcher
+ w = Watcher()
+ w.panel().show(port=port, address=address, open=open)
print(f"Starting server on port {port}")
def main():
+    """CLI entry point."""
fire.Fire({
'rerun': rerun,
'note': note,
- 'server': server,
+ 'board': board,
})
diff --git a/src/lumo/contrib/__init__.py b/src/lumo/contrib/__init__.py
index a4f53fdd..e4f1925b 100644
--- a/src/lumo/contrib/__init__.py
+++ b/src/lumo/contrib/__init__.py
@@ -3,3 +3,4 @@
"""
from .module.ema import EMA
from .optim.grouper import ParamGrouper
+from .module.memoty_bank import MemoryBank, StorageBank
diff --git a/src/lumo/contrib/accelerate/__init__.py b/src/lumo/contrib/accelerate/__init__.py
deleted file mode 100644
index 59d25ad8..00000000
--- a/src/lumo/contrib/accelerate/__init__.py
+++ /dev/null
@@ -1 +0,0 @@
-from .accelerator import Accelerator
\ No newline at end of file
diff --git a/src/lumo/contrib/accelerate/data_loader.py b/src/lumo/contrib/accelerate/data_loader.py
deleted file mode 100644
index f05ac57f..00000000
--- a/src/lumo/contrib/accelerate/data_loader.py
+++ /dev/null
@@ -1,10 +0,0 @@
-from accelerate import synchronize_rng_states, DistributedType
-from accelerate.data_loader import DataLoaderDispatcher as _DataLoaderDispatcher, DataLoaderShard as _DataLoaderShard
-from accelerate.state import AcceleratorState
-from accelerate.utils import send_to_device
-
-from lumo import LumoDataLoader
-
-
-class DataLoaderDispatcher(_DataLoaderDispatcher):
- pass
diff --git a/src/lumo/contrib/accelerate/utils.py b/src/lumo/contrib/accelerate/utils.py
deleted file mode 100644
index 44d29cc0..00000000
--- a/src/lumo/contrib/accelerate/utils.py
+++ /dev/null
@@ -1,30 +0,0 @@
-from accelerate.utils import recursively_apply
-import torch
-
-
-def send_to_device(tensor, device, non_blocking=False):
- """
- Recursively sends the elements in a nested list/tuple/dictionary of tensors to a given device.
-
- Args:
- tensor (nested list/tuple/dictionary of `torch.Tensor`):
- The data to send to a given device.
- device (`torch.device`):
- The device to send the data to.
-
- Returns:
- The same data structure as `tensor` with all tensors sent to the proper device.
- """
- if not isinstance(device, torch.device):
- device = torch.device(device)
-
- if 'mps' in device.type:
- non_blocking = False
-
- def _send_to_device(t, device):
- return t.to(device, non_blocking=non_blocking)
-
- def _has_to_method(t):
- return hasattr(t, "to")
-
- return recursively_apply(_send_to_device, tensor, device, test_type=_has_to_method)
diff --git a/src/lumo/contrib/module/memoty_bank.py b/src/lumo/contrib/module/memoty_bank.py
index d080c87e..a73dfe59 100644
--- a/src/lumo/contrib/module/memoty_bank.py
+++ b/src/lumo/contrib/module/memoty_bank.py
@@ -1,6 +1,7 @@
-from accelerate.utils import gather
-from lumo.proc.dist import is_dist
+from torch import distributed
+from lumo.proc.dist import is_dist, gather
import torch.distributed
+
from torch import nn
import torch
diff --git a/src/lumo/contrib/scan.py b/src/lumo/contrib/scan.py
new file mode 100644
index 00000000..974fe70a
--- /dev/null
+++ b/src/lumo/contrib/scan.py
@@ -0,0 +1,85 @@
+"""
+基线对比
+"""
+import time
+from itertools import cycle
+from typing import List
+import torch
+
+from lumo import Params, Logger
+import sys
+
+DATE_FLAG = '2023.03.04'
+
+
+class ScanBaseParams(Params):
+
+ def __init__(self):
+ super().__init__()
+ self.gpus = None
+ self.group = None
+ self.interval = 5 # sleep interval between two tests
+ self.skip = 0
+ self.n = -1
+
+
+def format_args(**kwargs):
+ return ' '.join([f'--{k}={v}' for k, v in kwargs.items()])
+
+
+def base_main(pm: ScanBaseParams, files: List[str], dics: List[dict]):
+ assert isinstance(pm, ScanBaseParams)
+ log = Logger()
+ log.use_stdout = False
+ log.add_log_dir(f'./log_{pm.group}')
+
+ base = ("sleep {sleep} ; " +
+ sys.executable +
+ " {file} {kwargs} --device={device}{group} & \n"
+ )
+
+ if not torch.cuda.is_available():
+ gpus = ['cpu']
+ elif pm.gpus is None:
+ gpus = list(range(torch.cuda.device_count()))
+ elif isinstance(pm.gpus, (int, str)):
+ gpus = [torch.device(pm.gpus).index]
+ else:
+ gpus = pm.gpus
+
+    # append None to identify the end of one pass over the devices.
+ gpus.append(None)
+ gpus = cycle(gpus)
+ c = 0
+ cnt = 0
+ for i, (file, kwargs) in enumerate(zip(files, dics)):
+ if pm.skip > cnt:
+ cnt += 1
+ continue
+
+ device = next(gpus)
+ if device is None:
+ # wait until all devices are free.
+ c = 0
+ print('wait', flush=True)
+ device = next(gpus)
+
+ if pm.group is not None:
+ group = f" --group={pm.group} "
+ else:
+ group = ''
+
+ cur = base.format(
+ sleep=c * pm.interval,
+ file=file, kwargs=format_args(**kwargs),
+ device=device, group=group)
+
+ c += 1
+ cnt += 1
+ log.info(cur.strip())
+ print(cur, flush=True, end='')
+
+ if pm.n > 0 and pm.skip + pm.n <= cnt:
+ break
+
+ print('wait', flush=True)
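+
+
+if __name__ == '__main__':
+    # Hypothetical usage sketch (not part of the original module): `base_main` expects
+    # parallel lists of scripts and kwargs dicts and prints one launch command per pair.
+    pm = ScanBaseParams()
+    pm.from_args()
+    scripts = ['train.py', 'train.py']        # hypothetical entry scripts
+    grids = [{'alpha': 0.5}, {'alpha': 1.0}]  # hypothetical hyperparameter sets
+    base_main(pm, scripts, grids)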
diff --git a/src/lumo/core/interp.py b/src/lumo/core/interp.py
index ad7a2c1a..21334b82 100644
--- a/src/lumo/core/interp.py
+++ b/src/lumo/core/interp.py
@@ -281,7 +281,7 @@ def __call__(self, cur):
class Cos(ABCContinuous):
- """one cycle cosine functoin
+    r"""one cycle cosine function
end -> ,---------
/
@@ -309,7 +309,7 @@ def interp(cls, cur, start=0., end=1., left=0., right=1., *args, **kwargs):
class Linear(ABCContinuous):
- """linear schedule
+ r"""linear schedule
^
end | .*
@@ -359,7 +359,7 @@ def interp(cls, cur, start=0., end=1., left=0., right=1., *args, **kwargs):
class Log(ABCContinuous):
- """
+ r"""
quick to slow
end | *
@@ -395,7 +395,7 @@ def interp(cls, cur, start=0., end=1., left=0., right=1., *args, **kwargs):
class Constant(ABCContinuous):
- """
+ r"""
A scheduler representing a constant value
|
constant |--------------
@@ -410,7 +410,7 @@ def __init__(self, value=0.5, *args, **kwargs):
class PeriodCos(ABCPeriod):
- """
+ r"""
periodic cosine schedule
end -> ,-. ,-. ,-. ,-.
@@ -432,7 +432,7 @@ def interp(cls, cur, start=0., end=1., left=0., period=1., *args, **kwargs):
class PeriodHalfCos(ABCPeriod):
- """
+ r"""
half periodic cosine schedule, period is (right-left)
end -> ,- ,- ,- ,-
diff --git a/src/lumo/core/meter.py b/src/lumo/core/meter.py
index 44bc21bd..7c52bc4d 100644
--- a/src/lumo/core/meter.py
+++ b/src/lumo/core/meter.py
@@ -126,9 +126,10 @@ def __setitem__(self, key, value):
stg = self._avg.get(key, None)
isscalar = value.size == 1
- if stg is None:
+ if stg is None: # automatically infer an aggregation (stg) method for the value
dtype = value.dtype.name
+ # sanity check
if self._stage in {'min', 'max'} and not isscalar:
raise ValueError(
f'Only support min/max(a) operator on scalar metrics, but got data of shape {value.shape}.')
@@ -145,9 +146,14 @@ def __setitem__(self, key, value):
self._stage = 'last'
self._avg[key] = self._stage
+ stg = self._stage
if isscalar:
value = value.item()
+ elif stg in {'min', 'max', 'sum'}:
+ value = getattr(np, stg)(value).item()
+ elif stg == 'mean':
+ value = [getattr(np, stg)(value).item(), value.size]
self._rec[key] = value
self._stage = 'default'
@@ -192,6 +198,17 @@ def mean(self):
self._stage = 'mean'
return self
+ @property
+ def test_mean(self):
+ """
+ Sets the aggregation method to 'mean'.
+
+ Returns:
+ The meter itself.
+ """
+ self._stage = 'mean'
+ return self
+
@property
def last(self):
"""
@@ -405,6 +422,10 @@ def __repr__(self):
def update(self, item):
"""Updates the reduction value with a new item."""
self.cur = item
+ count = 1
+ if isinstance(item, list) and len(item) == 2:
+ item, count = item
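+ # non-scalar 'mean' metrics are recorded as [mean_value, element_count] in Meter.__setitem__,
+ # so the running average here is weighted by the original element count.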
+
item = detach(item)
avg = self.gb_method
@@ -419,8 +440,8 @@ def update(self, item):
elif avg in {'mean', 'sum'}:
if len(self.acc) == 0:
self.acc.append(0)
- self.acc[0] = self.acc[0] + item
- self.c += 1
+ self.acc[0] = self.acc[0] + item * count
+ self.c += count
elif avg == 'max':
self._res = max(self.cur, self._res)
elif avg == 'min':
diff --git a/src/lumo/data/builder.py b/src/lumo/data/builder.py
index 1522bc6a..7515aa25 100644
--- a/src/lumo/data/builder.py
+++ b/src/lumo/data/builder.py
@@ -117,7 +117,7 @@ def __init__(self):
self._iter_cache = {}
- def __repr__(self):
+ def __str__(self):
if self.sized:
return f'Builder(flow={pformat(self._outs)}, sized={self.sized}, size={len(self)}, iterable={self.iterable})'
@@ -360,7 +360,7 @@ def scale_to_size(self, size: int):
size (int): the size to scale the dataset builder to.
Returns:
- self (DatasetBuilder): the scaled dataset builder.
+ the scaled dataset builder.
"""
assert isinstance(size, int)
assert 'pseudo_repeat' not in self._prop
@@ -378,7 +378,7 @@ def repeat(self, multiple: int):
multiple (int): the number of times to repeat the dataset builder.
Returns:
- self (DatasetBuilder): the repeated dataset builder.
+ the repeated dataset builder.
"""
assert isinstance(multiple, int)
assert 'pseudo_length' not in self._prop
@@ -424,7 +424,7 @@ def chain(self):
Set the mode of the dataset builder to chain.
Returns:
- self (DatasetBuilder): the dataset builder with the chain mode set.
+ the dataset builder with the chain mode set.
"""
self._prop['mode'] = 'chain'
return self
@@ -434,7 +434,7 @@ def item(self):
Set the mode of the dataset builder to item.
Returns:
- self (DatasetBuilder): the dataset builder with the item mode set.
+ the dataset builder with the item mode set.
"""
self._prop['mode'] = 'item'
return self
@@ -444,7 +444,7 @@ def zip(self):
Set the mode of the dataset builder to zip.
Returns:
- self (DatasetBuilder): the dataset builder with the zip mode set.
+ the dataset builder with the zip mode set.
"""
self._prop['mode'] = 'zip'
return self
@@ -489,7 +489,7 @@ def add_idx(self, name):
name (str): the name of the index.
Returns:
- self (DatasetBuilder): the dataset builder with the index added.
+ the dataset builder with the index added.
"""
outkeys = self._outs.setdefault(f"::idx::", [])
assert name not in self._outkeys, f'Output key {name} duplicated.'
@@ -507,7 +507,7 @@ def add_input(self, name: str, source, transform: SingleValueTransform = None):
transform (SingleValueTransform): the transform to apply to the input source.
Returns:
- self (DatasetBuilder): the dataset builder with the input source added.
+ the dataset builder with the input source added.
Notes:
Iterable object without `__len__` method currently are not well-tested. Be careful to use them in DatasetBuilder.
@@ -529,7 +529,7 @@ def add_input_transform(self, name: str, transform: SingleValueTransform = None)
transform (SingleValueTransform): the transform to add.
Returns:
- self (DatasetBuilder): the dataset builder with the transform added.
+ the dataset builder with the transform added.
"""
assert name in self._data, f'Source {name} should be added.'
warnings.warn('`add` may cause confusion, use set_input_transform ')
@@ -545,7 +545,7 @@ def add_output(self, name: str, outkey: str, transform: SingleValueTransform = N
transform (SingleValueTransform): the transform to apply to the output.
Returns:
- self (DatasetBuilder): the dataset builder with the output added.
+ the dataset builder with the output added.
"""
assert name in self._data, f'Must have data source {name} first.'
@@ -567,7 +567,7 @@ def add_output_transform(self, outkey: str, transform: SingleValueTransform = No
transform (SingleValueTransform): the transform to add.
Returns:
- self (DatasetBuilder): the dataset builder with the transform added.
+ the dataset builder with the transform added.
"""
assert outkey in self._outkeys, f'Output key {outkey} should be added.'
warnings.warn('add may cause confusion, use set_output_transform ')
@@ -581,7 +581,7 @@ def add_global_transform(self, transform: DictTransform):
transform (DictTransform): the global transform to apply to the dataset.
Returns:
- self (DatasetBuilder): the dataset builder with the global transform added.
+ the dataset builder with the global transform added.
"""
self._transforms['::global::'] = transform
return self
@@ -595,7 +595,7 @@ def set_input_transform(self, name, transform: SingleValueTransform = None):
transform (SingleValueTransform): the transform to set.
Returns:
- self (DatasetBuilder): the dataset builder with the transform set.
+ the dataset builder with the transform set.
"""
self._transforms[name] = transform
return self
@@ -609,7 +609,7 @@ def set_output_transform(self, outkey, transform: SingleValueTransform = None):
transform (SingleValueTransform): the transform to set.
Returns:
- self (DatasetBuilder): the dataset builder with the transform set.
+ the dataset builder with the transform set.
"""
self._transforms[f'::{outkey}'] = transform
return self
diff --git a/src/lumo/decorators/process.py b/src/lumo/decorators/process.py
index 5a488127..0e1f7dc6 100644
--- a/src/lumo/decorators/process.py
+++ b/src/lumo/decorators/process.py
@@ -1,5 +1,5 @@
from typing import Callable
-
+from functools import wraps
from lumo.proc.dist import is_main
@@ -18,6 +18,7 @@ def call_on_main_process_wrap(func) -> Callable:
"""
+ @wraps(func)
def inner(*args, **kwargs):
if is_main():
return func(*args, **kwargs)
diff --git a/src/lumo/exp/__init__.py b/src/lumo/exp/__init__.py
index da726c3f..b6de7031 100644
--- a/src/lumo/exp/__init__.py
+++ b/src/lumo/exp/__init__.py
@@ -1 +1,3 @@
from .experiment import SimpleExperiment, Experiment
+from .watch import Watcher, C
+from .metric import Metric
diff --git a/src/lumo/exp/agent.py b/src/lumo/exp/agent.py
new file mode 100644
index 00000000..00f69d6f
--- /dev/null
+++ b/src/lumo/exp/agent.py
@@ -0,0 +1,35 @@
+"""
+Heartbeat mechanism: a small agent process that keeps an experiment's heartbeat updated
+while the monitored training process is alive.
+"""
+import time
+
+import psutil
+from joblib import hash
+
+from lumo.core import BaseParams
+from lumo.utils import safe_io as IO
+from lumo.utils.fmt import strftime
+from lumo.exp import Experiment
+
+
+def wait_pid_stop(info_dir=None):
+ """wait test """
+ params = BaseParams()
+ params.info_dir = info_dir
+ params.from_args()
+
+ exp = Experiment.from_disk(params.info_dir)
+ pid = exp.properties['pinfo'].get('pid')
+ info_dir = exp.info_dir
+ try:
+ while pid is not None and psutil.pid_exists(pid):
+ exp.trigger()
+ exp.dump_info('agent', {'last_edit_time': strftime()})
+
+ time.sleep(10)
+ except:
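+ # exit quietly on any error (including Ctrl-C) once monitoring is no longer possible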
+ pass
+
+
+if __name__ == '__main__':
+ wait_pid_stop()
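+
+# Illustrative launch (an assumption based on the __main__ guard and from_args above):
+#   python -m lumo.exp.agent --info_dir=<path to a test's info dir>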
diff --git a/src/lumo/exp/backup.py b/src/lumo/exp/backup.py
new file mode 100644
index 00000000..2ba86a7d
--- /dev/null
+++ b/src/lumo/exp/backup.py
@@ -0,0 +1,190 @@
+from .experiment import Experiment
+import os
+from pprint import pformat
+from lumo.utils.fmt import strftime
+import re
+
+issue_title = """Test {test_name}"""
+
+summary_template = """
+{note}
+
+|Experiment Name|Test Name|Start| End |
+|---|---|---|---|
+|{exp_name}|{test_name}|{start}|{end}|
+
+```bash
+{command}
+```
+
+## Parameters
+```python
+{params}
+```
+
+## Metrics
+
+|Key|Value|
+|--|--|
+{metrics}
+
+Tensor metrics:
+```
+{non_scalar_metrics}
+```
+
+```
+{extra_info}
+```
+
+## Full properties
+ Click Here
+
+{properties}
+
+{column_name}
+ <% _.each(value, function(vv, key) { %>
+
+ <%= value['exception_content'] %> +
+{time_now}
" + # + # time_component = pn.pane.HTML("current time is") + # pn.state.onload(update_time) + # pn.state.add_periodic_callback(update_time, 1000) # 每1000毫秒执行一次更新时间的操作 + # + # widget = pn.Column(time_component, df_widget) # .servable() + + return df_widget diff --git a/src/lumo/exp/metric.py b/src/lumo/exp/metric.py index abf5f618..56cc574c 100644 --- a/src/lumo/exp/metric.py +++ b/src/lumo/exp/metric.py @@ -4,28 +4,72 @@ class Metric: """ + A class that handles metric values and saving/loading them to/from disk. + + Attributes: + fn (str): The file path of the metric file. + _metric (dict): A dictionary containing the metric values. + persistent (bool): A boolean value indicating whether to save the metric values to disk. """ def __init__(self, metric_fn, persistent=True): + """ + Initializes a new instance of the Metric class. + + Args: + metric_fn (str): The file path of the metric file. + persistent (bool): A boolean value indicating whether to save the metric values to disk. + Default is True. + + Returns: + None. + """ os.makedirs(os.path.dirname(os.path.abspath(metric_fn)), exist_ok=True) self.fn = metric_fn self._metric = {} + self._last = {} if os.path.exists(metric_fn): self._metric = IO.load_pkl(metric_fn) + self.persistent = persistent + @property + def current(self): + return self._last + @property def value(self): """ - A property that returns the metric values of the row. + A property that returns the metric values. Returns: - dict: A dictionary containing the metric values of the row. + dict: A dictionary containing the metric values. """ return self._metric def dump_metric(self, key, value, cmp: str, flush=True, **kwargs): - dic = self.value + """ + Updates the metric value for a given key. + + If the metric value for the given key is not set or the new value is better than the + existing value based on the comparison type specified by cmp, the metric value is updated + with the new value. The function returns the updated value. + + Args: + key (str): The key for the metric value. + value (float): The new metric value. + cmp (str): The type of comparison to use when updating the metric value. Must be 'max' or 'min'. + flush (bool): A boolean value indicating whether to save the updated metric values to disk. + Default is True. + **kwargs: Additional key-value pairs to store with the metric value. + + Returns: + float: The updated metric value. + + Raises: + NotImplementedError: If cmp is not 'max' or 'min'. + """ + dic = self._metric older = dic.setdefault(key, None) update = False @@ -45,16 +89,39 @@ def dump_metric(self, key, value, cmp: str, flush=True, **kwargs): dic[key] = value for kk, vv in kwargs.items(): dic[kk] = vv + else: + value = older + + self._last[key] = value + for kk, vv in kwargs.items(): + self._last[kk] = vv if flush: self.flush() return value def dump_metrics(self, dic: dict, cmp: str): - for k, v in dic.items(): - self.dump_metric(k, v, cmp) + """ + Updates multiple metric values with a dictionary. + + The function calls dump_metric for each key-value pair in the input dictionary and returns + a dictionary containing the updated metric values. + + Args: + dic (dict): A dictionary containing the key-value pairs to update. + cmp (str): The type of comparison to use when updating the metric values. Must be 'max' or 'min'. + + Returns: + dict: A dictionary containing the updated metric values. 
+ """ + res = {k: self.dump_metric(k, v, cmp, flush=False) + for k, v in dic.items()} + self.flush() + return res def flush(self): - """Writes the value of the row to a file.""" + """ + Writes the metric values to a file. + """ if self.persistent: IO.dump_pkl(self.value, self.fn) diff --git a/src/lumo/exp/watch.py b/src/lumo/exp/watch.py index e2dc8675..d853dbf4 100644 --- a/src/lumo/exp/watch.py +++ b/src/lumo/exp/watch.py @@ -1,28 +1,10 @@ -""" -Watcher 可以在运行实验后在 jupyter 或者网页上展示正在运行和已经运行结束的实验(按时间顺序?) -以及可以简化记录实验的烦恼 - -现在的核心痛点是 - - [ ] 所有元信息都有了,但是找不到哪个实验是哪个实验 - - [ ] 同时跑的多个实验有一个失败了,重跑时会混淆,或许需要一种覆盖手段 -> - - > 怎么 rerun? - lumo rerun test_name √ - lumo note html (用 streamlit 之类的生成动态网页) - lumo note cmd (类似 top 的视角,按时间顺序排列) -- > rerun 将已经跑的实验 move - -可以代替 analysis 的作用。主要有 - --> 按照 progress 目录,获取所有的实验 --> 根据获取的实验,按顺序记录 --> 每次只记录 - -""" import numbers +import os import os.path +import re from typing import List, Dict, overload from pprint import pformat -import pandas as pd + from dbrecord import PDict from datetime import datetime from operator import gt, ge, le, lt, eq, ne @@ -31,6 +13,8 @@ from .experiment import Experiment from lumo.utils import safe_io as IO from lumo.utils.fmt import format_timedelta, strptime, strftime +from lumo.proc.tz import timezone +from lumo.proc import glob PID_ROOT = os.path.join(progressroot(), 'pid') HB_ROOT = os.path.join(progressroot(), 'hb') @@ -74,6 +58,53 @@ def not_in_(ser, value): class Condition: + """ + Represents a condition to filter data based on a certain criteria. + + row filter: + ``` + from lumo import C + import pandas as pd + + # create a sample DataFrame + data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'], + 'age': [25, 30, 35, 40, 45], + 'city': [{'index':0}, {'index':1}, {'index':2},{'index':3},{'index':4}]} + df = pd.DataFrame(data) + + # create and apply the condition to filter the DataFrame + filtered_df = (C['age'] >= 35).apply(df) + + # print the filtered DataFrame + print(filtered_df) + ``` + + column edit: + ``` + (C+{'city.index':'cindex'}).apply(df).columns + # Index(['name', 'age', 'city', 'cindex'], dtype='object') + + (-C['city']).apply(df).columns + # Index(['name', 'age'], dtype='object') + + (C-['city']).apply(df).columns + (C-'city').apply(df).columns + # Index(['name', 'age'], dtype='object') + + (C-['city','name']).apply(df).columns + # Index(['age'], dtype='object') + ``` + + pipeline: + ``` + C.pipe(df,[ + (C['age']>35), + C+{'city.index':'cindex'}, + C-['city','name'] + ]) + ``` + """ + def __init__(self, name: str = None, value=None, op=None): self.name = name self.value = value @@ -85,9 +116,35 @@ def __getattr__(self, item): def __getitem__(self, item): return Condition(item) + def __add__(self, other): + c = Condition() + c.op = 'add_column' + c.value = {} + if isinstance(other, str): + c.value[other] = other + elif isinstance(other, dict): + c.value.update(other) + else: + raise NotImplementedError() + return c + + def __sub__(self, other): + c = Condition() + c.name = None + c.op = 'drop_column' + c.value = {} + if isinstance(other, str): + c.value.update({other: None}) + elif isinstance(other, (list, set, dict)): + c.value.update({k: None for k in other}) + else: + raise NotImplementedError() + return c + def __neg__(self): - self.drop = True - return self + c = Condition() + c.op = 'drop_column' + return c def __ge__(self, other): if other is None: @@ -127,47 +184,115 @@ def __lt__(self, other): return self def __repr__(self): - return f'C({self.name} {self.op} {self.value})' + return f'C({self.name}, 
{self.value}, {self.op})' def in_(self, lis): - """condition of `in` operation""" + """ + Sets the condition to evaluate if the value is in a given list. + + Args: + lis (list): the list of values to compare against. + + Returns: + The current instance of the Condition class with the comparison operator and value set. + """ self.op = 'in' self.value = set(lis) return self def not_in_(self, lis): - """condition of `.duplicated(value) == False` operation""" + """ + Sets the condition to evaluate if the value is not in a given list. + + Args: + lis (list): the list of values to compare against. + + Returns: + The current instance of the Condition class with the comparison operator and value set. + """ self.op = 'notin' self.value = set(lis) return self def mask(self, df): + """ + Returns a boolean mask of the given DataFrame based on the condition. + + Args: + df (pd.DataFrame): the DataFrame to evaluate. + + Returns: + A boolean mask of the given DataFrame based on the condition. + """ + import pandas as pd names = self.name.split('.') value = df for i in names: if isinstance(value, pd.DataFrame): value = value[i] else: - value = df.apply(lambda x: x[i]) + value = value.apply(lambda x: x[i]) return mapping[self.op](value, self.value) + def capply(self, df): + """apply operations in column axis""" + import pandas as pd + df = df.reset_index(drop=True) + if self.op == 'drop_column': + if isinstance(self.name, str): + var = [self.name] + elif isinstance(self.value, str): + var = [self.value] + else: + var = list(self.value) + print(var) + print(df.columns) + df = df.drop(var, axis=1) + else: + assert isinstance(self.value, dict) + for name, aim in self.value.items(): + names = name.split('.') + value = df + for i in names: + if isinstance(value, pd.DataFrame): + value = value[i] + else: + value = value.apply(lambda x: x[i]) + df[aim] = value + return df + def apply(self, df): - return df[self.mask(df)] + """Returns a new DataFrame with only the rows that meet the condition.""" + if not self.op.endswith('column'): + return df[self.mask(df)].reset_index(drop=True) + else: + return self.capply(df) + + def pipe(self, df, conditions: List['Condition']): + """Applies a list of conditions to a DataFrame using the pipe method.""" + filtered_df = df + for condition in conditions: + filtered_df = condition.apply(filtered_df) # .reset_index(drop=True) + return filtered_df C = Condition() class Watcher: - """List and watch experiments with time order + """ + A class for listing and watching experiments with time order and caching test information in 'metrics/{key}: {pformat(self._prop['params'][key], width=10, indent=2, compact=True)}
""")
-
- def sep(self):
- return self.wid.Output(layout={'border': '1px solid black'})
-
- def id_flag(self):
- return self.wid.VBox([
- self._widgets['exp_name'],
- self._widgets['test_name'],
- ])
-
- def key_params(self):
- return self.wid.VBox([
- *self._params_widgets.values()
- ])
-
- def editable(self):
- return self.wid.VBox([
- self._widgets['note'],
- self.sep(),
- self._widgets['tags'],
- ])
-
- def time(self):
- now = datetime.now()
- start = strptime(datestr=self._prop['progress']['start'])
- end = strptime(datestr=self._prop['progress']['start'])
- return self.wid.VBox([
- self.wid.HTML(f"""Start at: {format_timedelta(now - start)}"""),
- self.wid.HTML(f"""End at: {format_timedelta(now - end)}"""),
- ])
-
- def widget_dict(self):
- return {
- 'id_flag': self.id_flag(),
- 'time': self.time(),
- 'editable': self.editable(),
- 'params': self.key_params(),
- }
-
- def widget(self):
- params = self.key_params()
- params = [
- self.sep(),
- params,
- ]
-
- hbox = self.wid.HBox([
- self.id_flag(),
- self.time(),
- self._widgets['metrics'],
- self.editable(),
- self.key_params(),
- ])
-
- return hbox
-
- @classmethod
- def from_experiment(cls, exp: Experiment):
- tags = exp.properties.get('tags', [])
- try:
- tags = set(tags)
- except:
- tags = set()
- return cls(
- exp_name=exp.exp_name,
- test_name=exp.test_name,
- progress=exp.properties.get('progress', {}),
- params=exp['params'],
- metrics=exp.metric.value,
- note=exp.properties.get('note', ''),
- tags=tags,
- exp=exp,
- )
+ Returns:
+ True if the string is a valid test name, False otherwise.
+ """
+ return re.search(r'^\d{6}\.\d{3}\.[a-z\d]{2}t$', test_name) is not None
diff --git a/src/lumo/proc/dist.py b/src/lumo/proc/dist.py
index f960bffc..30030190 100644
--- a/src/lumo/proc/dist.py
+++ b/src/lumo/proc/dist.py
@@ -1,4 +1,5 @@
import os
+import torch
from torch import distributed as dist
@@ -75,3 +76,22 @@ def is_main():
"""
return local_rank() <= 0
+
+
+def gather(tensor):
+ """
+ Gather the tensor data across all processes in the distributed setting.
+
+ Args:
+ tensor (torch.Tensor): The tensor to be gathered.
+
+ Returns:
+ The gathered tensor data.
+ """
+ if dist.is_initialized():
+ if tensor.ndim == 0:
+ tensor = tensor.clone()[None]
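+ # all_gather requires every rank to contribute a tensor of the same shape,
+ # so 0-dim tensors are promoted to shape (1,) before gathering.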
+ output_tensors = [tensor.clone() for _ in range(torch.distributed.get_world_size())]
+ torch.distributed.all_gather(output_tensors, tensor)
+ return torch.cat(output_tensors, dim=0)
+ return tensor
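+
+# Illustrative use (hypothetical variable names; assumes each rank holds a comparable scalar):
+#   acc = torch.tensor(correct / total, device=device)
+#   global_acc = gather(acc).mean().item()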
diff --git a/src/lumo/proc/path.py b/src/lumo/proc/path.py
index 8cff01b1..55547bac 100644
--- a/src/lumo/proc/path.py
+++ b/src/lumo/proc/path.py
@@ -112,6 +112,7 @@ def metricroot():
def dbroot():
+ """Root path to store experiment information. Default is `~/.lumo/database`"""
DB_ROOT = glob.get('db_root', None)
if DB_ROOT:
res = DB_ROOT
diff --git a/src/lumo/proc/tz.py b/src/lumo/proc/tz.py
new file mode 100644
index 00000000..d3eb7d7b
--- /dev/null
+++ b/src/lumo/proc/tz.py
@@ -0,0 +1,12 @@
+from .config import glob
+import pytz
+
+
+def timezone():
+ """
+ Get the timezone from the global configuration, or default to 'Asia/Shanghai'.
+
+ Returns:
+ The timezone.
+ """
+ return pytz.timezone(glob.get('timezone', 'Asia/Shanghai'))
diff --git a/src/lumo/sketch/vis/parser.py b/src/lumo/sketch/vis/parser.py
index b1f94068..7d47aea8 100644
--- a/src/lumo/sketch/vis/parser.py
+++ b/src/lumo/sketch/vis/parser.py
@@ -23,25 +23,25 @@ class Step:
step: int
-def find_metric_fron_test_root(test_root):
- test_root = finder.retrieval_test_root(test_root)
- if test_root is None:
- return False, {}
-
- exp = Experiment.from_disk(test_root)
- if exp.has_prop('tensorboard_args'):
- tb = exp.properties.get('tensorboard_args')
- metrics = parse_fron_tensorboard(tb['log_dir'])
- elif exp.has_prop('logger_args'):
- tb = exp.properties.get('logger_args')
- metrics = parse_from_log(tb['log_dir'])
- else:
- fs = [i for i in os.listdir(exp.test_root)]
- if len([f for f in fs if f.endswith('.log')]) > 0:
- metrics = parse_from_log(os.path.join(exp.test_root, fs[0]))
- else:
- metrics = {}
- return True, metrics
+# def find_metric_fron_test_root(test_root):
+# test_root = finder.retrieval_test_root(test_root)
+# if test_root is None:
+# return False, {}
+#
+# exp = Experiment.from_disk(test_root)
+# if exp.has_prop('tensorboard_args'):
+# tb = exp.properties.get('tensorboard_args')
+# metrics = parse_fron_tensorboard(tb['log_dir'])
+# elif exp.has_prop('logger_args'):
+# tb = exp.properties.get('logger_args')
+# metrics = parse_from_log(tb['log_dir'])
+# else:
+# fs = [i for i in os.listdir(exp.test_root)]
+# if len([f for f in fs if f.endswith('.log')]) > 0:
+# metrics = parse_from_log(os.path.join(exp.test_root, fs[0]))
+# else:
+# metrics = {}
+# return True, metrics
def parse_from_log(log) -> Dict[str, List[Step]]:
diff --git a/src/lumo/trainer/accelerator.py b/src/lumo/trainer/accelerator.py
new file mode 100644
index 00000000..72de3996
--- /dev/null
+++ b/src/lumo/trainer/accelerator.py
@@ -0,0 +1,297 @@
+import warnings
+from torch import nn
+import torch
+from torch import distributed
+from torch.utils.data import DataLoader
+from lumo.data.loader import DataLoaderSide
+from lumo.proc.dist import gather
+
+
+class Accelerator:
+ """
+ A class to define the interface for various types of accelerator.
+
+ Attributes:
+ _prop (dict): A dictionary of keyword arguments.
+
+ Methods:
+ device: A property method to get the device.
+ set_device: A method to set the device.
+ prepare_data_loader: A method to prepare the data loader.
+ prepare_model: A method to prepare the model.
+ prepare_optimizer: A method to prepare the optimizer.
+ unwrap_model: A method to unwrap the model.
+ prepare: A method to prepare the inputs for training.
+ wait_for_everyone: A method to wait for all processes to synchronize.
+ gather: A method to gather the tensor data.
+ backward: A method to compute the gradients using backpropagation.
+ """
+
+ def __init__(self, **kwargs):
+ """
+ Initialize the class with a dictionary of keyword arguments.
+ """
+ self._prop = kwargs
+
+ @property
+ def device(self) -> torch.device:
+ """
+ Get the device.
+ """
+ return self._prop.get('device', None)
+
+ def set_device(self, device: torch.device):
+ """
+ Set the device.
+
+ Args:
+ device (torch.device): The device to be set.
+ """
+ assert isinstance(device, torch.device)
+ self._prop['device'] = device
+
+ def prepare_data_loader(self, dataloader):
+ """
+ Prepare the data loader.
+
+ Args:
+ dataloader: The data loader.
+
+ Returns:
+ The prepared data loader.
+ """
+ return dataloader
+
+ def prepare_model(self, model: torch.nn.Module):
+ """
+ Prepare the model.
+
+ Args:
+ model (torch.nn.Module): The model.
+
+ Returns:
+ The prepared model.
+ """
+ return model.to(self.device)
+
+ def prepare_optimizer(self, optimizer: torch.optim.Optimizer):
+ """
+ Prepare the optimizer.
+
+ Args:
+ optimizer (torch.optim.Optimizer): The optimizer.
+
+ Returns:
+ The prepared optimizer.
+ """
+ return optimizer
+
+ def unwrap_model(self, model):
+ """
+ Unwrap the model.
+
+ Args:
+ model: The model.
+
+ Returns:
+ The unwrapped model.
+ """
+ return model
+
+ def prepare(self, *args):
+ """
+ Prepare the inputs for training.
+
+ Args:
+ *args: The inputs.
+
+ Returns:
+ The prepared inputs.
+ """
+ res = []
+ for item in args:
+ if isinstance(item, nn.Module):
+ res.append(self.prepare_model(item))
+ elif isinstance(item, (DataLoader, DataLoaderSide)):
+ res.append(self.prepare_data_loader(item))
+ elif isinstance(item, torch.optim.Optimizer):
+ res.append(self.prepare_optimizer(item))
+ else:
+ raise NotImplementedError()
+ return res
+
+ def wait_for_everyone(self):
+ """
+ Wait for all processes to synchronize.
+ """
+ torch.distributed.barrier()
+
+ def gather(self, tensor: torch.Tensor):
+ """
+ Gather the tensor data.
+
+ Args:
+ tensor (torch.Tensor): The tensor to be gathered.
+
+ Returns:
+ The gathered tensor data.
+ """
+ return gather(tensor)
+
+ def backward(self, loss: torch.Tensor, **kwargs):
+ """
+ Compute the gradients using backpropagation.
+
+ Args:
+ loss (torch.Tensor): The loss tensor.
+ **kwargs: The additional keyword arguments.
+ """
+ loss.backward(**kwargs)
+
+class HugAccelerator(Accelerator):
+ """
+ A class to define the interface for Hugging Face accelerator.
+
+ Methods:
+ set_device: A method to set the device.
+ prepare_data_loader: A method to prepare the data loader.
+ prepare_model: A method to prepare the model.
+ prepare_optimizer: A method to prepare the optimizer.
+ unwrap_model: A method to unwrap the model.
+ prepare: A method to prepare the inputs for training.
+ wait_for_everyone: A method to wait for all processes to synchronize.
+ gather: A method to gather the tensor data.
+ backward: A method to compute the gradients using backpropagation.
+ """
+
+ def __init__(self, **kwargs):
+ """
+ Initialize the class with a dictionary of keyword arguments.
+ """
+ super().__init__(**kwargs)
+ from .backend.accelerator import Accelerator
+ self._backbone = Accelerator()
+
+ @property
+ def device(self):
+ """
+ Get the device.
+ """
+ return self._backbone.device
+
+ def set_device(self, device: torch.device):
+ """
+ Set the device.
+
+ Args:
+ device (torch.device): The device to be set.
+ """
+ assert isinstance(device, torch.device)
+ self._backbone.state.device = device
+
+ def prepare_data_loader(self, loader):
+ """
+ Prepare the data loader.
+
+ Args:
+ loader: The data loader.
+
+ Returns:
+ The prepared data loader.
+ """
+ from accelerate.data_loader import DataLoaderShard, DataLoaderDispatcher
+ if isinstance(loader, (DataLoaderShard, DataLoaderDispatcher)):
+ warnings.warn('Preparing the same DataLoader twice; check your code.')
+ return loader
+ return self._backbone.prepare_data_loader(loader)
+
+ def prepare_model(self, model):
+ """
+ Prepare the model.
+
+ Args:
+ model: The model.
+
+ Returns:
+ The prepared model.
+ """
+ return self._backbone.prepare_model(model)
+
+ def prepare_optimizer(self, optimizer):
+ """
+ Prepare the optimizer.
+
+ Args:
+ optimizer: The optimizer.
+
+ Returns:
+ The prepared optimizer.
+ """
+ return self._backbone.prepare_optimizer(optimizer)
+
+ def unwrap_model(self, model):
+ """
+ Unwrap the model.
+
+ Args:
+ model: The model.
+
+ Returns:
+ The unwrapped model.
+ """
+ return self._backbone.unwrap_model(model)
+
+ def prepare(self, *args):
+ """
+ Prepare the inputs for training.
+
+ Args:
+ *args: The inputs.
+
+ Returns:
+ The prepared inputs.
+ """
+ return self._backbone.prepare(*args)
+
+ def wait_for_everyone(self):
+ """
+ Wait for all processes to synchronize.
+ """
+ self._backbone.wait_for_everyone()
+
+ def gather(self, tensor):
+ """
+ Gather the tensor data.
+
+ Args:
+ tensor: The tensor to be gathered.
+
+ Returns:
+ The gathered tensor data.
+ """
+ return self._backbone.gather(tensor)
+
+ def backward(self, loss: torch.Tensor, **kwargs):
+ """
+ Compute the gradients using backpropagation.
+
+ Args:
+ loss (torch.Tensor): The loss tensor.
+ **kwargs: The additional keyword arguments.
+ """
+ self._backbone.backward(loss, **kwargs)
+
+
+register = {
+
+ 'none': Accelerator,
+ 'accelerator': HugAccelerator,
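+ # placeholders: 'deepspeed' and 'horovod' backends are not implemented yet,
+ # so selecting them in get_accelerator will fail.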
+ 'deepspeed': None,
+ 'horovod': None,
+}
+
+
+def get_accelerator(name: str, **kwargs) -> Accelerator:
+ """Get the accelerator for the specified name."""
+ assert name in register, ', '.join(register.keys())
+ return register[name](**kwargs)
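+
+# Minimal usage sketch (hypothetical variable names; Trainer wires this up itself):
+#   acc = get_accelerator('accelerator')   # or 'none' for the plain single-process path
+#   acc.set_device(torch.device('cuda:0'))
+#   model, loader, optimizer = acc.prepare(model, loader, optimizer)
+#   acc.backward(loss)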
diff --git a/src/lumo/trainer/backend/__init__.py b/src/lumo/trainer/backend/__init__.py
new file mode 100644
index 00000000..e69de29b
diff --git a/src/lumo/contrib/accelerate/accelerator.py b/src/lumo/trainer/backend/accelerator.py
similarity index 50%
rename from src/lumo/contrib/accelerate/accelerator.py
rename to src/lumo/trainer/backend/accelerator.py
index 79d4d3b5..5c64a757 100644
--- a/src/lumo/contrib/accelerate/accelerator.py
+++ b/src/lumo/trainer/backend/accelerator.py
@@ -4,13 +4,23 @@
class Accelerator(_Accelerator):
"""
- Accelerator instance for distributed training (on multi-GPU, TPU) or mixed precision training.
+ Accelerator instance for distributed training (on multi-GPU, TPU) or mixed precision training.
- This Accelerator subclass is use for `Trainer`, the only difference is that
- the device of data will be controlled by Trainer rather than Accelerator.
- """
+ This Accelerator subclass is used for `Trainer`. The only difference is that
+ the device of data will be controlled by `Trainer` rather than `Accelerator`.
+ """
+
- def prepare_data_loader(self, data_loader):
+ def prepare_data_loader(self, data_loader, **kwargs):
+ """
+ Prepare the data loader.
+
+ Args:
+ data_loader: The data loader.
+ **kwargs: Additional keyword arguments.
+
+ Returns:
+ The prepared data loader.
+ """
return prepare_data_loader(
data_loader,
None, # None instead of self.device,
diff --git a/src/lumo/trainer/base.py b/src/lumo/trainer/base.py
index bab315fe..26a5c068 100644
--- a/src/lumo/trainer/base.py
+++ b/src/lumo/trainer/base.py
@@ -134,7 +134,7 @@ def inner(dm=None, params=None, *args, **kwargs):
process_loader = getattr(self, 'process_loader', None)
if process_loader is not None:
process_loader(dm, TrainStage.create_from_str(func.__name__))
- func(*args, **kwargs)
+ func(dm, params, *args, **kwargs)
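+ # forward dm/params so the wrapped train/test/evaluate receives its original arguments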
return inner
diff --git a/src/lumo/trainer/callbacks.py b/src/lumo/trainer/callbacks.py
index 2f71b228..dad473ef 100644
--- a/src/lumo/trainer/callbacks.py
+++ b/src/lumo/trainer/callbacks.py
@@ -9,7 +9,7 @@
from datetime import datetime
from functools import wraps
from typing import NewType, Any, Optional, Dict, Union
-
+from lumo.proc.tz import timezone
import psutil
from torch.utils.data import DataLoader
@@ -353,6 +353,7 @@ def __init__(self, step_frequence=3, break_in=1000):
self.step = step_frequence
file = tempfile.TemporaryFile('w')
self.temp = file
+ self.cur_tqdm = None
def on_imodels_end(self, trainer: Trainer, func, params: ParamsType, result: Any, *args, **kwargs):
super().on_imodels_end(trainer, func, params, result, *args, **kwargs)
@@ -404,12 +405,14 @@ def update(self, trainer: Trainer):
TrainStage.train in self.stage and ((trainer.idx + 1) == self.stage[TrainStage.train])):
trainer.logger.inline(self.cur_tqdm.full_str())
trainer.logger.newline()
+ trainer.exp.trigger()
def flush(self, trainer: Trainer):
"""Flush"""
self.c = 0
- trainer.logger.inline(self.cur_tqdm)
- trainer.logger.newline()
+ if self.cur_tqdm is not None:
+ trainer.logger.inline(self.cur_tqdm)
+ trainer.logger.newline()
def on_train_epoch_begin(self, trainer: Trainer, func, params: ParamsType, *args, **kwargs):
super().on_train_epoch_begin(trainer, func, params, *args, **kwargs)
@@ -707,7 +710,7 @@ def request(self, data: dict):
task = self.executor.submit(self.req.post, self.url, json={'data': data,
'type': 'timeevent',
'from': 'lumo.RemoteCallback',
- 'datetime': datetime.now().isoformat()})
+ 'datetime': datetime.now(timezone()).isoformat()})
self.submits.append(task)
def on_hooked(self, source: Trainer, params: ParamsType):
@@ -805,7 +808,7 @@ class SkipWhenParamsEq(TrainCallback, InitialCallback):
def on_hooked(self, source: Trainer, params: ParamsType):
super().on_hooked(source, params)
from dbrecord import PDict
- from lumo.exp.finder import is_test_root
+ from lumo.exp.watch import is_test_root
self.fn = source.exp.mk_rpath('contrib', 'params_key.sqlite')
olds = PDict(self.fn)
diff --git a/src/lumo/trainer/components.py b/src/lumo/trainer/components.py
index 37d5bf7e..30dc078c 100644
--- a/src/lumo/trainer/components.py
+++ b/src/lumo/trainer/components.py
@@ -15,7 +15,6 @@ def log_dir(self):
@property
def params_fn(self):
res = self.mk_ipath('params.yaml')
- self.dump_string('params.yaml', res)
return res
@property
diff --git a/src/lumo/trainer/trainer.py b/src/lumo/trainer/trainer.py
index 27338f92..c8307ff1 100644
--- a/src/lumo/trainer/trainer.py
+++ b/src/lumo/trainer/trainer.py
@@ -1,38 +1,29 @@
import bisect
import os
-import sys
import warnings
-from datetime import datetime
from functools import lru_cache
from typing import Union, Dict, Any, Optional, Sequence, Mapping, Callable
import numpy as np
import torch
-from accelerate import DistributedDataParallelKwargs
-from accelerate.data_loader import DataLoaderDispatcher, DataLoaderShard
+from .accelerator import get_accelerator
from torch import nn
from torch.optim import Optimizer
from torch.utils.data import DataLoader
-import json
-from lumo.contrib.accelerate import Accelerator
-from lumo.contrib.accelerate.utils import send_to_device
+from lumo.utils.device import send_to_device
from lumo.core import TrainStage, Record, MetricType, Meter
from lumo.core.disk import TableRow, Metrics
from lumo.data import DataModule
from lumo.data.loader import DataLoaderType, DataLoaderSide
from lumo.proc import dist
from lumo.proc import glob
+from lumo.utils import safe_io as IO
from lumo.trainer.rnd import RndManager
from lumo.utils.logger import Logger
from .base import _BaseTrainer
from .components import TrainerExperiment, TrainerParams
from .saver import Saver
-# overwrite send_to_device to resolve https://github.com/pytorch/pytorch/issues/83015
-# from accelerate import Accelerator
-# from accelerate.utils import send_to_device
-from ..utils.fmt import strftime
-
ParamsType = TrainerParams
@@ -72,7 +63,7 @@ def __init_subclass__(cls, **kwargs):
raise TypeError(
f"Can't instantiate abstract class {cls.__name__} directly, please create a subclass of it.")
- def __init__(self, params: ParamsType, dm: DataModule = None):
+ def __init__(self, params: ParamsType, dm: DataModule = None, accelerator=None):
if dm is None:
dm = DataModule(params)
else:
@@ -85,30 +76,37 @@ def __init__(self, params: ParamsType, dm: DataModule = None):
self._saver = None
self.params.iparams()
- self.exp = TrainerExperiment(self.generate_exp_name())
+ self.exp = TrainerExperiment(exp_name=self.generate_exp_name())
self._database = TableRow(self.exp.mk_ipath('metric.pkl'), persistent=self.is_main)
self.metric_board = Metrics(self.exp.mk_bpath('board.sqlite'), persistent=self.is_main)
self.metric = self.exp.metric
- self.exp.dump_info('metric_board', self.metric_board.fpath)
- self.exp.dump_info('table_row', self._database.fpath)
+ # self.exp.dump_info('table_row', self._database.fpath)
self.rnd = RndManager()
self.train_epoch_toggle = False
self.train_toggle = False
device = params.get('device', None) if not self.is_dist else None
+ # self.accelerate = Accelerator(kwargs_handlers=[
+ # DistributedDataParallelKwargs(find_unused_parameters=params.get('find_unused_parameters', False))
+ # ])
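+ # backend selected via the global config key 'accelerator'
+ # (default: the HuggingFace accelerate wrapper registered in get_accelerator)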
+ accelerator = glob.get('accelerator', 'accelerator')
+ self.accelerate = get_accelerator(accelerator)
- self.accelerate = Accelerator(kwargs_handlers=[
- DistributedDataParallelKwargs(find_unused_parameters=params.get('find_unused_parameters', False))
- ])
-
- if self.accelerate.state.distributed_type == self.accelerate.state.distributed_type.NO:
- self.accelerate.state.device = torch.device(device)
+ self.accelerate.set_device(torch.device(device))
if dist.is_main():
self.params.to_yaml(self.exp.params_fn)
+ params_hash = self.params.hash()
+ self.exp.dump_info('trainer', {
+ 'params_meta': {
+ 'fn': self.exp.params_fn,
+ 'hash': params_hash
+ },
+ 'board_fn': self.metric_board.fpath
+ }, append=True)
self.exp.dump_info('params', self.params.to_dict())
self.set_global_steps(0)
@@ -169,7 +167,9 @@ def logger(self):
self._logger.debug('Enable debug log.')
if self.is_main:
fn = self._logger.add_log_dir(self.exp.log_dir)
- self.exp.dump_info('logger_args', {'log_dir': fn})
+ self.exp.dump_info('trainer', {
+ 'logger_fn': fn
+ }, append=True)
return self._logger
@@ -516,10 +516,10 @@ def to_device(self, item: Optional[Union[nn.Module, torch.Tensor, Sequence, Mapp
def on_trainer_exception(self, func: Callable, exception: BaseException):
"""Updates database with error information when an exception occurs during training."""
- self.exp.dump_info('exception', dict(end=strftime(),
- finished=False,
- error=str(exception),
- trainer_frame=str(func)))
+ # self.exp.dump_info('exception', dict(end=strftime(),
+ # finished=False,
+ # error=str(exception),
+ # trainer_frame=str(func)), append=True)
@property
def is_initialized(self):
@@ -541,15 +541,15 @@ def initialize(self):
return
self.exp.start()
- params_hash = self.params.hash()
- self.exp.dump_string('params_hash', params_hash)
-
self.icallbacks(self.params)
self.set_property('initial.callbacks', True)
self.imodels(self.params)
self.set_property('initial.model', True)
self.set_property('initial', True)
+ self.logger.info('Use Experiment')
+ self.logger.info(self.exp)
+
def stop_train(self):
"""Toggle to stop train."""
self.train_toggle = True
@@ -566,17 +566,7 @@ def prepare_dataloader(self, loader: DataLoaderType, stage: TrainStage = None):
:param stage:
:return:
"""
- if isinstance(loader, (DataLoaderShard, DataLoaderDispatcher)):
- warnings.warn('Duplicated prepare a same DataLoader twice, check your code.')
- return loader
-
- split_batches = self.params.get('split_batches', None)
- if stage is not None and not stage.is_train():
- split_batches = True
-
- """do not change original loader stage"""
if isinstance(loader, DataLoader):
- self.accelerate.split_batches = split_batches
loader = self.accelerate.prepare_data_loader(loader)
elif isinstance(loader, DataLoaderSide):
loader = loader.copy()
@@ -599,6 +589,8 @@ def train(self, dm: Union[DataModule, DataLoaderType] = None, params: ParamsType
ValueError: If no data loader is available for training.
"""
+ self.change_stage(TrainStage.train)
+
loader = self.select_loader(dm)
if not loader:
loader = self.train_dataloader
@@ -612,7 +604,6 @@ def train(self, dm: Union[DataModule, DataLoaderType] = None, params: ParamsType
for eidx in range(params.epoch):
# update training progress
- self.exp.dump_train_eidx(eidx, params.epoch)
self.set_epoch_idx(eidx)
# train loop
@@ -631,7 +622,8 @@ def train(self, dm: Union[DataModule, DataLoaderType] = None, params: ParamsType
self.set_property('early_stop', f'meet limit_global_steps {limit_global_steps}')
break
- # update when train finished
+ self.exp.dump_train_eidx(eidx, params.epoch)
+
self.exp.end()
return self._prop
@@ -805,8 +797,7 @@ def change_stage(self, stage: TrainStage):
else:
v.eval()
- @classmethod
- def select_loader(cls, dm=None):
+ def select_loader(self, dm=None, stage=None):
"""
Selects the appropriate loader based on the given data module.
@@ -819,7 +810,8 @@ def select_loader(cls, dm=None):
loader = None
if dm:
if isinstance(dm, DataModule):
- loader = dm.train_dataloader
+ loader = dm.get_loader_with_stage(stage=self.trainstage)
+ # loader = dm.train_dataloader
elif isinstance(dm, DataLoader) or isinstance(dm, DataLoaderSide):
loader = dm
else:
@@ -1012,28 +1004,63 @@ def wait_for_everyone(self):
self.accelerate.wait_for_everyone()
def save_best_model(self):
+ """
+ Saves the best model checkpoint and metadata.
+
+ If the current process is the main process, saves the best model checkpoint as 'best_model.ckpt'
+ and its metadata as 'best_model.json'. If not, saves the checkpoint and metadata with the process rank
+ appended to the filename, e.g., 'best_model-