diff --git a/README.ch.md b/README.ch.md
index afbb7798..a9b4e24f 100644
--- a/README.ch.md
+++ b/README.ch.md
@@ -2,184 +2,270 @@
 [![PyPI version](https://badge.fury.io/py/lumo.svg)](https://badge.fury.io/py/lumo)
 ![Python-Test](https://github.com/pytorch-lumo/lumo/actions/workflows/python-test.yml/badge.svg)
 [![license](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://github.com/pytorch-lumo/lumo/blob/master/LICENSE)
+![Python-doc](./images/docstr_coverage_badge.svg)
+
+`lumo` 是一个精简高效的库,简化了实验所需的所有组件的管理,并特别关注增强深度学习实践者的体验。
+
+- 实验管理:为每次运行分配唯一路径,区分不同类型的文件并存储;通过 git 管理代码快照;记录实验中产生的一切信息,保障可回溯、可复现
+- 参数管理:基于 fire 提供比 argparse 更便捷的参数管理
+- 运行时配置:提供多级作用域下的配置管理
+- 可视化:基于 [Panel](https://panel.holoviz.org/index.html) 提供可交互的 jupyter 实验管理面板
+- 为深度学习提供额外的优化
+  - 训练:基于 Trainer 提供可任意扩展的训练逻辑简化,并提供完善的回调逻辑
+  - 优化器:参数与优化器构建一体化
+  - 数据:数据集构建流程抽象、组合多个 DataLoader 等
+  - 分布式训练:同样支持多种训练加速框架,统一抽象,方便随时切换
+- 更多工具类...
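+其中「运行时配置:多级作用域」可以用一个纯 Python 草图直观理解:配置查询本质上是一条「运行时覆盖 > 用户全局配置 > 内置默认值」的查找链。以下仅为概念示意(键名与取值均为假设,并非 lumo 的真实实现):

```python
from collections import ChainMap

# 三级作用域:运行时覆盖 > 用户全局配置(如 ~/.lumorc.json)> 库内置默认值
defaults = {'timezone': 'UTC', 'github_access_token': None}  # 假设的内置默认值
user_rc = {'github_access_token': 'ghp_xxx'}                 # 假设的用户配置内容
runtime = {}                                                 # 本次运行的临时覆盖

conf = ChainMap(runtime, user_rc, defaults)
print(conf['github_access_token'])  # ghp_xxx,取自用户配置

runtime['github_access_token'] = 'ghp_yyy'  # 运行时覆盖优先级最高
print(conf['github_access_token'])  # ghp_yyy
```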
+
+![lumo-framework](./images/lumo-intro.png)
+
+# :book: 目录
+
+- :book: [目录](#book-目录)
+- :cloud: [安装](#cloud-安装)
+- :book: [快速开始](#book-快速开始)
+  - :small_orange_diamond: [已有项目嵌入](#small_orange_diamond-已有项目嵌入)
+  - :small_orange_diamond: [从零开始](#small_orange_diamond-从零开始)
+  - :small_orange_diamond: [可视化界面](#small_orange_diamond-可视化界面)
+  - :small_orange_diamond: [复现实验](#small_orange_diamond-复现实验)
+  - :small_orange_diamond: [备份](#small_orange_diamond-备份)
+- [More](#more)
+- :pencil: [Acknowledge](#pencil-acknowledge)
+- :scroll: [License](#scroll-license)
+
+# :cloud: 安装
+
+安装已发布、通过了所有测试的版本:
+```bash
+pip install -U lumo
+```
-`lumo`:轻量、可扩展、功能解耦合的 Pytorch 实验框架。
-
-lumo 的设计理念:
-
-- 模块解耦合:所有模块可以单独作为您现在使用的框架中的一个插件使用(而不像其他框架几乎耦合在一起)
-- 恰到好处的抽象:和模型相关的细节完全由使用者掌控,lumo 只封装了外部通用逻辑(而不像其他一些框架会代理模型初始化或损失迭代)
-- 覆盖整个生命周期:数据集构建、模型初始化、随机种子、训练/测试...,lumo 为所有步骤提供了功能包或流程简化
-- 极强的可扩展性:从单文件到包含多个领域多个方法的项目,lumo 都可以提供舒适的使用体验。已在两个领域有复现项目的最佳实践示例(见[Related Work](#Related Work))。
-
-# 如何使用
-
-## 安装
-
-从 pypi 或 github 主页安装最新的稳定版本:
+或从 dev1 分支安装最新版本:
```bash
-pip install -U lumo
-pip install git+https://github.com/pytorch-lumo/lumo
+pip install git+https://github.com/pytorch-lumo/lumo@dev1
```
+实验面板依赖于 panel,需要额外安装:
-
-## 快速开始
+
+```
+pip install panel
+```
-本节包含 lumo 最常用的几个子功能,帮助使用者快速利用这些功能减少已有项目中的冗余代码,这些功能包括:
+# :book: 快速开始
-- 一些常用功能的更优替代,如 [Params](#参数控制)(arguments 的平替),[Logger](#变量&日志记录)(logging 的平替)
-- 一些训练过程中部份流程的优化,如 [Experiment](#路径管理&版本控制)(提供无重复的实验路径管理、基于 git 的版本控制),[DatasetBuilder](#数据集构建)(更快构建数据集),
+以下是几个经典场景:
-### 参数控制
+## :small_orange_diamond: 已有项目嵌入
-`argparse` 的更优替代。`Params` 底层依托于 [omegaconf](https://github.com/omry/omegaconf) 和 [fire](https://github.com/google/python-fire)
-,只需要简单的配置,就可以从文件、命令行中读取参数。
+对已有项目,可以通过以下方式快速嵌入:
-直接基于 Params 类定义参数:
+- 引入
```python
-# python main.py --epoch=30 --dataset=100
-from lumo import Params
-
-params = Params()
-params.epoch = 20
-# 集成优化器参数,自带补全提示
-params.optim = params.OPTIM.create_optim('Adam', lr=0.0001, weight_decay=4e-5)
-# 数据集只能从 cifar10/cifar100 中选择,且默认为 cifar10,其他的选择会报错
-params.dataset = 
params.choice('cifar10', 'cifar100') +import random +from lumo import SimpleExperiment, Params, Logger, Meter, Record +``` -# 从命令行参数中更新 -params.from_args() -print(params.epoch) # -> 30 -print(params.dataset) # -> cifar100 +- 初始化 Logger 和 Experiment -# 保存到文件 -params.to_json('./config.json') -params.to_yaml('./config.yaml') -# 从文件中更新 -params.from_json('./config.json') -params.from_yaml('./config.yaml') +```python +logger = Logger() +# 定义及使用,无需转换 +exp = SimpleExperiment(exp_name='my_exp_a') # 为每种实验手动定义唯一名称 +exp.start() +logger.add_log_dir(exp.mk_ipath()) ``` -也可以通过继承、多重继承来嵌套,组合参数。即使在命令行中输入了不存在的参数,Params 也会正常读取。 - -### 变量&日志记录 +- 初始化参数(代替 argparse 等) -`logging` 的更优替代。通过 Meter、Record 和 Logger,可以实现变量的记录和格式化输出。其中: +```python +# 替换基于 argparse 等的参数定义方法 +params = Params() +params.dataset = params.choice('cifar10', 'cifar100') +params.alpha = params.arange(default=1, left=0, right=10) +params.from_args() # python3 train.py --dataset=cifar100 --alpha=0.2 +print(params.to_dict()) # {"dataset": "cifar100", "alpha": 0.2} +``` -- Meter 记录单次的值 -- Record 以一定规则归约 Meter 实例(如 mean、sum 等) -- Logger 用于代替 logging,除常用的 info、warn 等方法外,还提供了 inline 方法,可以在屏幕能单行更新(实际中,屏幕打印时间远小于训练时间,因此单行更新带来的时间开销可以忽略不计)。 +- 在训练过程中记录参数、存储信息(代替手动管理路径、自己维护 AvgItem) ```python -import random -import time +# 记录实验参数 +exp.dump_info('params', params.to_dict()) +print(exp.test_name) # 为每次实验自动分配唯一名称 + +# 基于命名空间提供本次实验的唯一路径 +# 元数据和二进制大文件分离,方便清理 +params.to_yaml(exp.mk_ipath('params.yaml')) -from lumo import Record, Meter, Logger +for i in range(10): + # 记录实验指标 + max_acc = exp.dump_metric('Acc', random.random(), cmp='max') + logger.info(f'Max acc {max_acc}') -log = Logger() + # 存储大文件/二进制文件(如模型权重) + ckpt_fn = exp.mk_bpath('checkpoints', f'model_{i}.ckpt') + ... 
# 省略:基于 ckpt_fn 保存模型权重等大文件

 record = Record()
-for idx in range(256):
-    meter = Meter()
-    meter.last.i = idx
-    meter.sum.acc = idx
-    meter.mean.loss = random.random()
+for batch in range(10):
+    m = Meter()
+    m.mean.Lall = random.random()
+    m.last.lr = batch
+    record.record(m)
+    logger.info(record)
+
+# 主动结束实验,补充元信息。也可以在进程结束后由 hook 自动结束,支持针对异常的记录
+exp.end()
+```

-    record.record(meter)
-    log.inline(record)  # 单行更新
-    time.sleep(0.5)
-    if idx % 50 == 0:
-        log.newline()
-        record.clear()
+## :small_orange_diamond: 从零开始

-log.info(record)
-```
+如果从零开始一个新的深度学习实验,可以使用 lumo 全方位地加速代码构建。下面提供了多个不同规模下使用 lumo 训练的示例:

-### 路径管理&版本控制
+单文件:

-`Experiment` 主要提供路径管理,可以为每一次试验根据实验名、日期、次数等自动提供不一样的保存路径。此外,Experiment 还可以通过 hook
-提供如代码版本管理、元数据记录等功能。在实验中,可以使用其子类 `SimpleExperiment` 实现大部分需求。
+| 示例 | CoLab | 代码行数 |
+|----------------------------------------|-------|------|
+| [MNIST 示例](./examples/mnist.py) | | 118 |
+| [MocoV2 训练 CIFAR10](./examples/moco.py) | | 284 |
+| [多卡训练 ImageNet]() | | |

```python
-from lumo import SimpleExperiment
-from lumo import Params
+实验项目:

-pm = Params()
-pm.module = 'example'
-pm.from_args()
+| 项目 | 说明 |
+|-----------------------------------------------------------------------------------------------------------|-------------------------------|
+| [image-classification](https://github.com/pytorch-lumo/image-classification) | 集成了全监督、半监督、自监督的多个论文的复现代码 |
+| [emotion-recognition-in-conversation](https://github.com/pytorch-lumo/emotion-recognition-in-conversation) | 集成了对话情感分类、多模态对话情感分类的多个论文的复现代码 |

-# 注册该次实验,实验名为 `pm.module`
-exp = SimpleExperiment(pm.module)
-# 实验开始,该方法会调用已注册的 ExpHooks,完成代码版本控制等功能。
-exp.start()
+## :small_orange_diamond: 可视化界面

-# 小数据通过 `.test_file()` 获得路径
-fn = exp.test_file('params.json')
-pm.to_json(fn)
+在 jupyter 中:

-# 大文件通过 `.blob_file()` 获得路径(这是约定,而不是强制,大文件也可以保存到 `.test_file()` 中)
-fn = exp.blob_file('checkpoint.pt')
-with open(fn, 'w') as w:
-    w.write('write big data in blob file')
+```python
+from lumo import Watcher

-print(exp.test_root)
-print(exp.get_prop('git')) # see git commit history -exp.end() +w = Watcher() +df = w.load() +widget = w.panel(df) +widget.servable() ``` -### 数据集构建 +![Panel](./images/panel-example.png) -![DatasetBuilder](./images/DatasetBuilder.png) +可视化手动筛选后的实验: +![Panel](./images/panel-example2.png) -`DatasetBuilder` 是采用有向无环图思路设计的数据集构建类,该类提供了一个恰当的抽象逻辑,避免了在一个实验里定义多个重复 Datasets 类。 +可以直接使用命令行打开页面查看当前所有实验: -`DatasetBuilder `将数据集的构件划分为输入-输出两阶段,同时提供 `.chain()`(序列格式)和`.zip()`(字典格式) 两种输出方式。 +``` +lumo board [--port, --address, --open] +``` -```python -from lumo import DatasetBuilder -from torchvision.transforms import transforms -import torch - -# Create a mnist-like dummy dataset -db = ( - DatasetBuilder() - .add_input("xs", torch.rand(500, 28, 28)) - .add_input("ys", torch.randint(0, 10, (500,))) - .add_idx('id') - .add_output("xs", "xs1", transforms.RandomHorizontalFlip()) - .add_output("xs", "xs2", ) - .add_output("ys", "ys") -) -# Watch dataset structure -print(db) -# Builder(flow={'::idx::': ['id'], 'xs': ['xs1', 'xs2'], 'ys': ['ys']}, sized=True, size=500, iterable=True) +## :small_orange_diamond: 复现实验 -print(db[0]) -# dict_keys(['id', 'xs1', 'xs2', 'ys']) +对因为个中原因失败的实验,在藉由可视化界面观察并解决后,可以通过唯一实验 Id (test_name) 直接重跑,并对关键参数重新赋值: + +``` +lumo rerun 230313.030.57t --device=0 ``` -# 更多教程 +## :small_orange_diamond: 备份 -# Related Work +记录实验信息到 Github issue (基于 PyGitHub): -- [image-classification](https://github.com/pytorch-lumo/image-classification): supervised/semi-supervised/self-supervised/noisy label learning on image-classfication - field. (suporrted datasets: CIFAR10/CIFAR100/STL-10/SVHN/ImageNet/tinyimagenet) -- [emotion-recognition-in-conversation](https://github.com/pytorch-lumo/emotion-recognition-in-conversation):Multimodel emotional recognition on conversation. 
(suporrted datasets: IEMOCAP/MELD/MOSEI) +```python +from lumo import Experiment, Watcher +from lumo import glob + +glob['github_access_token'] = 'ghp_*' # `access_token` 的默认值,建议将 access_token 写在全局配置 `~/.lumorc.json` 中 + +w = Watcher() +df = w.load() + +# 选择单个实验备份 +exp = Experiment.from_cache(df.iloc[0].to_dict()) +issue = exp.backup('github', repo='pytorch-lumo/image-classification-private', + access_token='ghp_*', + update=True, # 如果已备份,则覆盖更新之前的 issue + labels=None, # 可选标签 + ) +print(issue.number) + +# 批量备份,并且基于每个实验的参数添加标签 +issues = df.apply( + lambda x: Experiment.from_cache(x.to_dict()).backup(..., labels=[x['params'].get('dataset', '')]), + axis=1 +) +``` +![backup_github](./images/backup_github.png) -# Acknowledge +# Full properties - 一个人维护一个库四年,背后的动力是我持续不断的使用,感谢 lumo 陪我见证我的学术生涯。lumo 确实不一定适合所有人的习惯,但一定最适合我自己。lumo 取自 lumos,这是哈利波特里魔法杖发光的咒语。torch 是火炬,ignite 是点燃,所以 lumo 也向往着发光发热,希望 lumo 给大家带来美好的使用体验。 +``` +{'agent': nan, + 'backup': {'23-03-17-003438': {'backend': 'github', + 'number': 9, + 'repo': '...'}, + }, + 'exception': nan, + 'execute': {'cwd': '~/Documents/Python/lumo', + 'exec_argv': ['~/Documents/Python/lumo/a.py'], + 'exec_bin': '~/.pyenv/versions/3.9.16/bin/python3.9', + 'exec_file': '~/Documents/Python/lumo/a.py', + 'repo': '~/Documents/Python/lumo'}, + 'exp_name': 'my_exp_a', + 'git': {'commit': '1014b6b5', + 'dep_hash': 'c93b8c4e340882f55cf0c8e125fa0203', + 'repo': '~/Documents/Python/lumo'}, + 'hooks': {'Diary': {'loaded': True, 'msg': ''}, + 'FinalReport': {'loaded': True, 'msg': ''}, + 'GitCommit': {'loaded': True, 'msg': ''}, + 'LastCmd': {'loaded': True, 'msg': ''}, + 'LockFile': {'loaded': True, 'msg': ''}, + 'RecordAbort': {'loaded': True, 'msg': ''}}, + 'lock': {'accelerate': '0.16.0', + 'decorator': '5.1.1', + 'fire': '0.5.0', + 'hydra': '1.3.2', + 'joblib': '1.2.0', + 'lumo': '0.15.0', + 'numpy': '1.24.2', + 'omegaconf': '2.3.0', + 'psutil': '5.9.4', + 'torch': '1.13.1'}, + 'note': 'This is a Note', + 'params': {'alpha': 1, 'dataset': 
'cifar10'}, + 'paths': {'blob_root': '~/.lumo/blob', + 'cache_root': '~/.lumo/cache', + 'info_root': '~/.lumo/experiments'}, + 'pinfo': {'hash': '0af4b77497c85bc5b65ccbdd9ff4ca0f', + 'obj': {'argv': ['~/.pyenv/versions/3.9.16/bin/python3.9', + '~/Documents/Python/lumo/a.py'], + 'pid': 63975, + 'pname': 'python3.9', + 'pstart': 1678898740.099484}, + 'pid': 63975}, + 'progress': {'end': '23-03-16-004542', + 'end_code': 0, + 'last_edit_time': '23-03-16-004542', + 'ratio': 1, + 'start': '23-03-16-004542', + 'update_from': None}, + 'tags': [], + 'test_name': '230316.000.8ct', + 'trainer': nan} +``` -# License +# :pencil: Acknowledge -Distributed under the GNU General Public License 3.0. See [LICENSE](./LICENSE) for more information. +从 2020 年维护至今。感谢 lumo 陪我见证我的研究生生涯。 -# Contact +# :scroll: License - - [sailist@outlook.com](mailto:sailist@outlook.com) +采用 [Apache License Version 2.0](./LICENSE) 协议分发。 diff --git a/README.md b/README.md index 8d3e8517..78ad25c3 100644 --- a/README.md +++ b/README.md @@ -1,3 +1,5 @@ +[中文](./README.ch.md) + # lumo [![PyPI version](https://badge.fury.io/py/lumo.svg)](https://badge.fury.io/py/lumo) @@ -5,162 +7,256 @@ [![license](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://github.com/Lightning-AI/lightning/blob/master/LICENSE) ![Python-doc](./images/docstr_coverage_badge.svg) -`lumo` is a light-weight library to help construct your experiment code, record your experiment results, especially in the field of deep learning. - - -## Features - -`lumo` is designed for reducing difficulty of the frequent code modification in experiments and simplify the redundant code. - -At present, `lumo` has these features: - - - Simplest code for **Hyperparameter Configuration**、**Dataset Building**、**Module Checkpoint**、**Meter and Log**. - - Include Git support and random seed management. You can **reset** and **archive** and **reimplement your experiments** by using simple console command. 
- Include a **deep learning experiment code templete**. You can add any experiments with linearly increasing code complexity by using it.
- The framework follows the design paradigm of **convention over configuration**, the more you follow the convention, the more the framework will do for you.
-
-> Better use Pycharm.
-
-See [document](https://sailist.github.io/lumo/) for details.
-
+`lumo` is a streamlined and efficient library that simplifies the management of all components required for experiments
+and focuses on enhancing the experience of deep learning practitioners.
+
+- **Experiment Management:** **Assigns a unique id and path** to each run, distinguishes and stores various file types;
+  manages **code snapshots** through git; **records all information** generated during the experiment to ensure
+  traceability and reproducibility.
+- **Parameter Management:** Provides more convenient **parameter management** than argparse, based on fire.
+- **Runtime Configuration:** Provides configuration management under multi-level scopes.
+- **Visualization:** Provides a Jupyter-compatible interactive dashboard for experiment management based
+  on [Panel](https://panel.holoviz.org/index.html).
+- Additional optimizations for deep learning:
+  - **Training:** Provides easily extendable training logic based on Trainer, with comprehensive callback
+    logic.
+  - **Optimizer:** Integrated parameter and optimizer construction.
+  - **Data:** Abstraction of the dataset construction process, combination of multiple DataLoaders, etc.
+  - **Distributed Training:** Also supports multiple training acceleration frameworks, with a unified abstraction
+    that makes switching easy at any time.
+- More utilities...
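+The "unique id and path for each run" bookkeeping above can be pictured with a small stdlib sketch. This is a conceptual illustration only — the id format and directory names here are made up, not lumo's actual implementation:

```python
import pathlib
import tempfile
import time
import uuid

def new_run(root, exp_name):
    """Allocate a unique run id plus separate metadata/blob directories (conceptual sketch)."""
    test_name = f"{time.strftime('%y%m%d')}.{uuid.uuid4().hex[:8]}"   # unique run id
    info_dir = pathlib.Path(root, 'experiments', exp_name, test_name)  # small metadata files
    blob_dir = pathlib.Path(root, 'blob', exp_name, test_name)         # large binaries, e.g. checkpoints
    info_dir.mkdir(parents=True)
    blob_dir.mkdir(parents=True)
    return test_name, info_dir, blob_dir

root = tempfile.mkdtemp()
run_a = new_run(root, 'my_exp_a')
run_b = new_run(root, 'my_exp_a')
print(run_a[0] != run_b[0])  # each run of the same experiment gets its own id and directories
```

+Separating small metadata from large blobs, as lumo does, makes it cheap to keep the former forever while periodically cleaning the latter.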
+
+![lumo-framework](./images/lumo-intro.png)
+
+# :book: Table of Contents
+
+- :cloud: [Installation](#cloud-installation)
+- :book: [Quick Start](#book-quick-start)
+  - :small_orange_diamond: [Embedding into Existing Projects](#small_orange_diamond-embedding-into-existing-projects)
+  - :small_orange_diamond: [Building from Scratch](#small_orange_diamond-building-from-scratch)
+  - :small_orange_diamond: [Visual Interface](#small_orange_diamond-visual-interface)
+  - :small_orange_diamond: [re-run](#small_orange_diamond-re-run)
+  - :small_orange_diamond: [backup](#small_orange_diamond-backup)
+- :scroll: [License](#scroll-license)
+
+# :cloud: Installation
+
+Install the published and tested version:
-
-## Install
```bash
-pip install lumo
+pip install -U lumo
```
-or
+Or install the latest version from the dev1 branch:
```bash
-git clone https://github.com/sailist/lumo
-
-python setup.py install
+pip install git+https://github.com/pytorch-lumo/lumo@dev1
```
-### test
+The experiment panel depends on Panel, which needs to be installed separately:
```
-python -m pytest # or python3 -m pytest
+pip install panel
```
-> Only a part of code have unit test.
-
+# :book: Quick Start
-## Requirements
+Here are a few classic scenarios:
- - install lumo will automatically install three light other libraries: [fire](https://github.com/google/python-fire), [psutil](https://github.com/giampaolo/psutil), [joblib](https://github.com/joblib/joblib).
- - lumo has mandatory dependencies on `pytorch`, `pandas` and `numpy`, you should manully install these before using lumo since they are usually common-used.
- - lumo has an optional dependency on `GitPython` as a plugin to execute git command, you can run `pip install GitPython` to install it.
-
-```shell
-pip install pandas numpy GitPython
-```
-and then see [pytorch](https://pytorch.org/) to install torch.
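+A useful mental model for the `Params.from_args()` calls shown in these scenarios: fire-style CLI handling maps `--key=value` flags onto typed fields. The following is only a rough stdlib approximation of that idea, not fire's or lumo's real parsing logic:

```python
import ast

def parse_overrides(argv):
    """Map --key=value flags onto a dict, guessing value types roughly the way fire does."""
    out = {}
    for arg in argv:
        if arg.startswith('--') and '=' in arg:
            key, raw = arg[2:].split('=', 1)
            try:
                out[key] = ast.literal_eval(raw)  # numbers, bools, lists...
            except (ValueError, SyntaxError):
                out[key] = raw                    # fall back to a plain string
    return out

print(parse_overrides(['--dataset=cifar100', '--alpha=0.2']))
# {'dataset': 'cifar100', 'alpha': 0.2}
```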
+## :small_orange_diamond: Embedding into Existing Projects +For existing projects, you can quickly embed Lumo by following these steps: +- Import Lumo and initialize Logger and Experiment: -## Introduction +```python +import random +from lumo import SimpleExperiment, Params, Logger, Meter, Record -Unlike other pytorch tools, `lumo` mainly designed for research, there are two core idea of it: +logger = Logger() +exp = SimpleExperiment(exp_name='my_exp_a') +exp.start() +logger.add_log_dir(exp.mk_ipath()) +``` -1. Reduce repetition of your code. -2. Make all operations **recordable**, **resumable**, **analyzable**. +- Initialize parameters: +```python +params = Params() +params.dataset = params.choice('cifar10', 'cifar100') +params.alpha = params.arange(default=1, left=0, right=10) +params.from_args() # python3 train.py --dataset=cifar100 --alpha=0.2 +print(params.to_dict()) # {"dataset": "cifar100", "alpha": 0.2} +``` -Your can click [Tutorial](https://sailist.github.io/lumo/tutorial/) to learn the basic use of this framework. After that, you can view [Cookbook](https://sailist.github.io/lumo/cookbook/) to see some details of this library. +- Record parameters and store information during training: -A suggested learning order may be: +```python +exp.dump_info('params', params.to_dict()) +print(exp.test_name) - - Learn highly frequency used module: [Define hyperparameter(Params)](https://sailist.github.io/lumo/params)、[Record variable(Meter)](https://sailist.github.io/lumo/meter)、[Log(Logger)](/lumo/logger)、[Reshape your dataloader(DataBundler)](https://sailist.github.io/lumo/bundler) and their aggregation [Trainer](https://sailist.github.io/lumo/trainer). 
- - Learn how to manage/analyse your experiment by [Config](https://sailist.github.io/lumo/exp) and [Experiment](https://sailist.github.io/lumo/exp) - - Learn how to simple manage random seed by [RndManager](https://sailist.github.io/lumo/rnd) and to create your dataset elegantly by [DatasetBuilder](https://sailist.github.io/lumo/builder) +params.to_yaml(exp.mk_ipath('params.yaml')) -After learning above contents, you can view [Cookbook](https://sailist.github.io/lumo/cookbook/) to learn the use of [tempelet code](https://sailist.github.io/lumo/structure) and other [details](https://sailist.github.io/lumo/details). +for i in range(10): + max_acc = exp.dump_metric('Acc', random.random(), cmp='max') + logger.info(f'Max acc {max_acc}') -You can also view another repository [lumo-implement](https://github.com/lumo/lumo-implement) to see a bigger example, it will continuously reimplement papers I interested by using the templete provided in `lumo`. + ckpt_fn = exp.mk_bpath('checkpoints', f'model_{i}.ckpt') + ... # save code given ckpt_fn -## Examples +record = Record() +for batch in range(10): + m = Meter() + m.mean.Lall = random.random() + m.last.lr = batch + record.record(m) + logger.info(record) -Before start, maybe you'd like to see some simple examples to learn what can `lumo` do. +exp.end() +``` -### Define hyperparameters -By use `lumo.frame.Params`, you can define hyperparameters simply. See [Params](https://sailist.github.io/lumo/params) for details. -```python -from lumo import Params -params = Params() -params.batch_size = 128 -params.from_args() # from command args +## :small_orange_diamond: Building from Scratch ->>> python ap.py --optim.lr=0.001 --epoch=400 --dataset=cifar10 --k=12 -``` -### Record variable +If you want to start a new deep learning experiment from scratch, you can use Lumo to accelerate your code development. 
+Below are examples of lumo training at different scales:
-By using `lumo.frame.Meter`, you can record variable and update its average value with as little code as possible. See [Meter](https://sailist.github.io/lumo/meter) for details.
+Single-file training:
-```python
-from lumo import Meter,AvgMeter
-
-am = AvgMeter() # use for record average
-for j in range(500):
-    meter = Meter()
-    meter.percent(meter.c_) # when print, format 'c' as a percentage
-    meter.a = 1
-    meter.b = "2"
-    meter.c = torch.rand(1)[0]
-
-    meter.loss = loss_fn(...)
-    meter.rand = torch.rand(2)
-    meter.d = [4] # you can record any type of variable
-    meter.e = {5: "6"}
-
-    am.update(meter) # Update current value in meter. Average value will be calculated automatic by declaration and the type of the variable.
-    print(am)
-```
+| Example | CoLab | Lines of Code |
+|---------------------------------------------|-------|---------------|
+| [MNIST example](./examples/mnist.py) | | 118 |
+| [MocoV2 trains CIFAR10](./examples/moco.py) | | 284 |
+| [Multi-GPU training ImageNet]() | | |
+
+Experiment projects:
-## Contribute
+| Project | Description |
+|-----------------------------------------------------------------------------------------------------------|-------------------------------|
-`lumo` will be better in the future, but there are still some lack exists currently, including:
+| [image-classification](https://github.com/pytorch-lumo/image-classification) | Reproducible code for multiple papers on fully supervised, semi-supervised, and self-supervised learning |
+| [emotion-recognition-in-conversation](https://github.com/pytorch-lumo/emotion-recognition-in-conversation) | Reproducible code for multiple papers on dialogue emotion classification and multimodal dialogue emotion classification |
+
+## :small_orange_diamond: Visual Interface
- - **Lack of more detail guide** because of the lacking of developer's energy and time.
- - **Lack more tests**. unit test only covers a part of the code.
I hope I fixed all bugs during my using of it, but there is no guarantee of it. The compatibility is also unguaranteed. So, welcome to [issus](https://github.com/sailist/lumo/issues) it if you find it.
- - **Lack of development experience**. So the version number may be confused.
+In Jupyter:
-
-Thanks for all contribution.
```python
+from lumo import Watcher
+
+w = Watcher()
+df = w.load()
+widget = w.panel(df)
+widget.servable()
+```
+![Panel](./images/panel-example.png)
-For file read/write and get/set, I designed
+Manually filtered experiments for visualization:
+![Panel](./images/panel-example2.png)
-- [Params], to make runtime config get/set/load/dump easily,
-- [globs], a global/local/runtime environment variables manager.
-- [Saver], to help you save/load/manage your checkpoints/models in one class.
+You can open the dashboard directly from the command line to view all current experiments:
-For data processing, I designed
+```
+lumo board [--port, --address, --open]
+```
-- [Builder], to hold nearly all dataset formats and special operations by one class,
+## :small_orange_diamond: re-run
-For managing experiments, I designed
+Experiments that failed for any reason can be **re-run using the unique experiment ID (test_name)**, and extra
+parameters can be **reassigned and replaced**.
-- [Experiment], which can
-  - make you build a suitable directory and file path in one place,
-  - make you record lightweight data, and
-  - help you make snapshot for your project code (based on git), which can make each result recoverable and
-    reproducible
-- [random manager], a cross-lib(random/numpy/pytorch) random seed manager
+```
+lumo rerun 230313.030.57t --device=0
+```
-For log and meter variables produced during experiment, I designed
+## :small_orange_diamond: backup
-- [Meter] to meter every thing in appropriate format, and
-- [Logger] to log every thing in appropriate format.
+Backing up experiment information to a Github issue (based on PyGitHub): -Finally, I designed [Trainer] to bundle all module above for deep learning experiment. +```python +from lumo import Experiment, Watcher +from lumo import glob + +glob[ + 'github_access_token'] = 'ghp_*' # Default value for `access_token`. It is recommended to store the access_token in the global configuration `~/.lumorc.json`. + +w = Watcher() +df = w.load() + +# Selecting a single experiment for backup +exp = Experiment.from_cache(df.iloc[0].to_dict()) +issue = exp.backup('github', repo='pytorch-lumo/image-classification-private', + access_token='ghp_*', + update=True, # If already backed up, overwrite the previous issue + labels=None, # Optional labels + ) +print(issue.number) + +# Batch backup and add labels based on each experiment's parameters +issues = df.apply( + lambda x: Experiment.from_cache(x.to_dict()).backup(..., labels=[x['params'].get('dataset', '')]), + axis=1 +) +``` +![backup_github](./images/backup_github.png) -As you can see, These modules covered most demandings on deeplearning +# Full properties -You can find what you want and click the link to quickly learn HOW TO USE it! All module is designed easy to use, it's -my principles. 
+``` +{'agent': nan, + 'backup': {'23-03-17-003438': {'backend': 'github', + 'number': 9, + 'repo': '...'}, + }, + 'exception': nan, + 'execute': {'cwd': '~/Documents/Python/lumo', + 'exec_argv': ['~/Documents/Python/lumo/a.py'], + 'exec_bin': '~/.pyenv/versions/3.9.16/bin/python3.9', + 'exec_file': '~/Documents/Python/lumo/a.py', + 'repo': '~/Documents/Python/lumo'}, + 'exp_name': 'my_exp_a', + 'git': {'commit': '1014b6b5', + 'dep_hash': 'c93b8c4e340882f55cf0c8e125fa0203', + 'repo': '~/Documents/Python/lumo'}, + 'hooks': {'Diary': {'loaded': True, 'msg': ''}, + 'FinalReport': {'loaded': True, 'msg': ''}, + 'GitCommit': {'loaded': True, 'msg': ''}, + 'LastCmd': {'loaded': True, 'msg': ''}, + 'LockFile': {'loaded': True, 'msg': ''}, + 'RecordAbort': {'loaded': True, 'msg': ''}}, + 'lock': {'accelerate': '0.16.0', + 'decorator': '5.1.1', + 'fire': '0.5.0', + 'hydra': '1.3.2', + 'joblib': '1.2.0', + 'lumo': '0.15.0', + 'numpy': '1.24.2', + 'omegaconf': '2.3.0', + 'psutil': '5.9.4', + 'torch': '1.13.1'}, + 'note': 'This is a Note', + 'params': {'alpha': 1, 'dataset': 'cifar10'}, + 'paths': {'blob_root': '~/.lumo/blob', + 'cache_root': '~/.lumo/cache', + 'info_root': '~/.lumo/experiments'}, + 'pinfo': {'hash': '0af4b77497c85bc5b65ccbdd9ff4ca0f', + 'obj': {'argv': ['~/.pyenv/versions/3.9.16/bin/python3.9', + '~/Documents/Python/lumo/a.py'], + 'pid': 63975, + 'pname': 'python3.9', + 'pstart': 1678898740.099484}, + 'pid': 63975}, + 'progress': {'end': '23-03-16-004542', + 'end_code': 0, + 'last_edit_time': '23-03-16-004542', + 'ratio': 1, + 'start': '23-03-16-004542', + 'update_from': None}, + 'tags': [], + 'test_name': '230316.000.8ct', + 'trainer': nan} +``` +# :scroll: License +Distributed under the Apache License Version 2.0. See [LICENSE](./LICENSE) for more information. 
diff --git a/docstr-coverage.sh b/docstr-coverage.sh new file mode 100644 index 00000000..3d1a1029 --- /dev/null +++ b/docstr-coverage.sh @@ -0,0 +1,9 @@ +docstr-coverage \ + src/lumo/cli \ + src/lumo/core \ + src/lumo/data \ + src/lumo/exp \ + src/lumo/proc \ + src/lumo/trainer \ + src/lumo/utils + diff --git a/examples/data/datamodule.py b/examples/data/datamodule.py deleted file mode 100644 index 052e65e1..00000000 --- a/examples/data/datamodule.py +++ /dev/null @@ -1,35 +0,0 @@ -from lumo import DataModule, ParamsType, TrainStage, DatasetBuilder -import torch - - -class MyDataModule(DataModule): - def idataloader(self, params: ParamsType = None, stage: TrainStage = None): - if stage.is_train(): - print("init train dataloader") - db = ( - DatasetBuilder() - .add_input("xs", torch.rand(500, 28, 28)) - .add_input("ys", torch.randint(0, 10, (500,))) - .add_output("xs", "xs") - .add_output("ys", "ys") - ) - loader = db.DataLoader(batch_size=10) - else: - print("init test dataloader") - db = ( - DatasetBuilder() - .add_input("xs", torch.rand(50, 28, 28)) - .add_input("ys", torch.randint(0, 10, (50,))) - .add_output("xs", "xs") - .add_output("ys", "ys") - ) - loader = db.DataLoader(batch_size=10) - self.regist_dataloader_with_stage(stage, loader) - - -dm = MyDataModule() -print(dm._prop) -print(len(dm.train_dataloader)) -print(dm._prop) -print(len(dm.test_dataloader)) -print(dm._prop) diff --git a/examples/imagenet.py b/examples/imagenet.py new file mode 100644 index 00000000..34fbf3a4 --- /dev/null +++ b/examples/imagenet.py @@ -0,0 +1,259 @@ +import sys +from pathlib import Path +from typing import Union + +import torch +from PIL import Image +from torch.utils.data import DataLoader +import os +import torch.multiprocessing as mp +from torchvision.datasets.folder import default_loader + +from lumo import DatasetBuilder, MetricType, Trainer, TrainerParams, Meter, callbacks, DataModule +from torchvision.datasets import FakeData, ImageFolder +from torchvision import 
transforms +from torchvision.models.resnet import resnet18 +from torch import nn +from lumo.proc.dist import is_dist, is_main +from torch.nn import functional as F +from lumo.proc import glob +from lumo.utils.subprocess import run_command + +"""define transforms""" + + +def none(mean, std): + return transforms.Compose([ + transforms.ToTensor(), + transforms.Normalize(mean, std) + ]) + + +def standard(mean, std, resize=None): + return transforms.Compose([ + transforms.RandomResizedCrop(224), + transforms.RandomHorizontalFlip(), + transforms.ToTensor(), + transforms.Normalize(mean, std) + ]) + + +"""create datasets""" + + +def imagenet(split='train'): + """ + download from https://www.kaggle.com/c/imagenet-object-localization-challenge/overview/description + ``` + mkdir imagenet + cd ./imagenet + kaggle competitions download -c imagenet-object-localization-challenge + unzip imagenet-object-localization-challenge.zip + tar -xvf imagenet_object_localization_patched2019.tar.gz + ls + >>> ILSVRC LOC_synset_mapping.txt LOC_val_solution.csv imagenet_object_localization_patched2019.tar.gz + >>> LOC_sample_submission.csv LOC_train_solution.csv imagenet-object-localization-challenge.zip + ``` + """ + root = glob['imagenet'] + if split == 'train': + file = Path(root).joinpath('ILSVRC', 'ImageSets', 'CLS-LOC', 'train_cls.txt') + train_root = os.path.join(root, 'ILSVRC/Data/CLS-LOC/train') + with file.open('r') as r: + lines = r.readlines() + imgs = [line.split(' ')[0] for line in lines] + name_cls_map = {name: i for i, name in enumerate(sorted(set([i.split('/')[0] for i in imgs])))} + xs = [os.path.join(train_root, f'{i}.JPEG') for i in imgs] + ys = [name_cls_map[i.split('/')[0]] for i in imgs] + else: + file = Path(root).joinpath('LOC_val_solution.csv') + val_root = os.path.join(root, 'ILSVRC/Data/CLS-LOC/val') + + with file.open('r') as r: + r.readline() + lines = r.readlines() + lines = [line.split(',') for line in lines] + lines = [[img, res.split(' ')[0]] for img, res in 
lines] + + name_cls_map = {name: i for i, name in enumerate(sorted(set([i[1] for i in lines])))} + xs = [os.path.join(val_root, f'{img}.JPEG') for img, _ in lines] + ys = [name_cls_map[res] for _, res in lines] + + return list(xs), list(ys) + + +def take_first(item): + return item[0] + + +def take_second(item): + return item[1] + + +def make_dataset(dummy=False): + if dummy: + train_dataset = FakeData(1281167, (3, 224, 224), 1000, transforms.ToTensor()) + val_dataset = FakeData(50000, (3, 224, 224), 1000, transforms.ToTensor()) + ds = ( + DatasetBuilder() + .add_input('fake', train_dataset) + .add_output('fake', 'xs', transform=take_first) + .add_output('fake', 'ys', transform=take_second) + ) + test_ds = ( + DatasetBuilder() + .add_input('fake', val_dataset) + .add_output('fake', 'xs', transform=take_first) + .add_output('fake', 'ys', transform=take_second) + ) + else: + train_dataset = ImageFolder(os.path.join(glob['imagenet'], 'train')) + val_dataset = ImageFolder(os.path.join(glob['imagenet'], 'val')) + + xs, ys = list(zip(*train_dataset.samples)) + test_xs, test_ys = list(zip(*val_dataset.samples)) + + mean = [0.485, 0.456, 0.406] + std = [0.229, 0.224, 0.225] + + ds = ( + DatasetBuilder() + .add_input('xs', xs, transform=default_loader) # 注册样本来源,命名为 'xs' + .add_input('ys', ys) # 注册标签来源,命名为 'ys' + .add_output('xs', 'xs', transform=standard(mean, std)) # 添加一个弱增广输出 'xs0' + .add_output('ys', 'ys') # 添加标签输出 + ) + + print(ds) + print(ds[0].keys()) + + test_ds = ( + DatasetBuilder() + .add_input('xs', test_xs, transform=default_loader) # 注册样本来源,命名为 'xs' + .add_input('ys', test_ys) # 注册标签来源,命名为 'ys' + .add_output('xs', 'xs', transform=none(mean, std)) # 测试样本不使用增广 + .add_output('ys', 'ys') # 添加标签输出 + ) + + print(test_ds) + print(test_ds[0].keys()) + return ds, test_ds + + +class LargeParams(TrainerParams): + def __init__(self): + super().__init__() + self.optim = self.OPTIM.create_optim('SGD', + lr=0.06, + momentum=0.9, + weight_decay=5e-5, + ) + self.lr_decay_end = 
0.00001 + self.batch_size = 512 + self.dummy = False + self.multiprocessing_distributed = True + + +ParamsType = LargeParams + + +class LargeModel(nn.Module): + + def __init__(self, feature_dim) -> None: + super().__init__() + self.backbone = resnet18() + in_feature = self.backbone.fc.in_features + self.backbone.fc = nn.Identity() + self.head = nn.Linear(in_feature, feature_dim, bias=True) + + def forward(self, xs): + feature_map = self.backbone(xs) + feature = self.head(feature_map) + return feature + + +class LargeTrainer(Trainer): + + def icallbacks(self, params: ParamsType): + callbacks.LoggerCallback().hook(self) + + def imodels(self, params: ParamsType): + self.model = resnet18(num_classes=1000) + self.optim = params.optim.build(self.model.parameters()) + + self.lr_sche = params.SCHE.Cos( + start=params.optim.lr, end=params.lr_decay_end, + left=0, + right=len(self.train_dataloader) * params.epoch + ) + # manually trigger send_to_device method + self.to_device() + + def train_step(self, batch, params: ParamsType = None) -> MetricType: + super().train_step(batch, params) + m = Meter() + xs, ys = batch['xs'], batch['ys'] + logits = self.model(xs) + + Lall = F.cross_entropy(logits, ys) + self.optim.zero_grad() + self.accelerate.backward(Lall) + self.optim.step() + + # advance the lr schedule by global step to match the schedule length set in imodels + cur_lr = self.lr_sche.apply(self.optim, self.global_steps) + + with torch.no_grad(): + m.mean.Lall = Lall + m.mean.Ax = torch.eq(logits.argmax(dim=-1), ys).float().mean() + m.last.lr = cur_lr + return m + + def test_step(self, batch, params: ParamsType = None) -> MetricType: + m = Meter() + xs, ys = batch['xs'], batch['ys'] + logits = self.model(xs) + + all_logits = self.accelerate.gather(logits) + all_ys = self.accelerate.gather(ys) + + m.test_mean.Ax = torch.eq(all_logits.argmax(dim=-1), all_ys).float() + return m + + +def main_worker(device, ngpus_per_node, args): + # create datamodule to contain dataloader + params = LargeParams() + params.device = device + 
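The `params.SCHE.Cos` schedule built in `imodels` above anneals the learning rate from `optim.lr` down to `lr_decay_end` over `len(train_dataloader) * epoch` steps. As a standalone sketch of that interpolation (assuming lumo's `Cos` follows the usual one-cycle cosine formula; `cos_interp` is an illustrative helper, not lumo's API):

```python
import math

def cos_interp(cur, start=0.06, end=1e-5, left=0, right=1000):
    """One-cycle cosine interpolation from `start` to `end` over [left, right]."""
    if cur <= left:
        return start
    if cur >= right:
        return end
    ratio = (cur - left) / (right - left)
    # (1 + cos(pi * ratio)) / 2 goes from 1 down to 0 as ratio goes 0 -> 1
    return end + (start - end) * (1 + math.cos(math.pi * ratio)) / 2

print(cos_interp(0))     # start value, 0.06
print(cos_interp(500))   # midpoint, (start + end) / 2
print(cos_interp(1000))  # end value, 1e-5
```

At `left` the value equals `start`, at `right` it equals `end`, and in between it follows half a cosine wave, so the decay is slow at both ends and fastest in the middle.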
params.from_args() + + ds, test_ds = make_dataset(dummy=params.dummy) + dl = ds.DataLoader(batch_size=params.batch_size, num_workers=4) + test_dl = test_ds.DataLoader(batch_size=params.batch_size, num_workers=4) + dm = DataModule() + dm.regist_dataloader(train=dl, test=test_dl) + + # Given the params and dataloaders, the Trainer initializes the models and optimizers; + # the outputs are the trained parameters, metrics and logs. + trainer = LargeTrainer(params, dm=dm) + + trainer.train() # or trainer.train(dm=dl) if dm is not given above + trainer.test() # or trainer.test(dm=dl) + trainer.save_last_model() + + +def main(): + # A minimal completion of the launch logic sketched in the original comments: + # re-launch this script through `accelerate` unless we are already inside a + # distributed environment. + import sys + if is_dist(): + main_worker(None, torch.cuda.device_count(), None) + else: + command = ' '.join([ + 'accelerate', 'launch', + *sys.argv, + ]) + print(command) + run_command(command) + + +if __name__ == '__main__': + main() diff --git a/examples/mnist.py b/examples/mnist.py new file mode 100644 index 00000000..45cc6a67 --- /dev/null +++ b/examples/mnist.py @@ -0,0 +1,117 @@ +import torch +from lumo import DatasetBuilder, MetricType, Trainer, TrainerParams, Meter, callbacks, DataModule +from lumo.proc.path import cache_dir +from torchvision.datasets.mnist import MNIST +from torchvision.transforms import Compose, RandomCrop, Normalize +from torch import nn +from torch.nn import functional as F + + +def make_dataset(): + data = MNIST(root=cache_dir(), train=True, download=True) + test_data = MNIST(root=cache_dir(), train=False, download=True) + + mean = torch.mean(data.data / 255.) + std = torch.std(data.data / 255.) 
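The MNIST example derives `mean` and `std` from the data itself on the `[0, 1]` scale before handing them to `Normalize`. The effect is easy to check in plain Python (a toy 1-D stand-in for the per-pixel computation; note that `torch.std` defaults to the unbiased estimator, while the sketch below uses the population form):

```python
def mean_std(values):
    """Population mean and standard deviation, as used for input normalization."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, var ** 0.5

pixels = [v / 255. for v in [0, 64, 128, 192, 255]]  # scale raw bytes to [0, 1] first
m, s = mean_std(pixels)
normalized = [(v - m) / s for v in pixels]

# after (x - mean) / std, the sample has zero mean and unit std
nm, ns = mean_std(normalized)  # nm ~ 0, ns ~ 1
```

The stats must be computed on the same scale the transform later sees; mixing raw `0..255` values with `[0, 1]`-scale statistics would make `Normalize` produce wildly off-center inputs.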
+ + ds = ( + DatasetBuilder() + .add_input('xs', data.data.float().div(255.).unsqueeze(1)) # register the sample source, named 'xs' (scaled to [0, 1] to match the mean/std above) + .add_input('ys', data.targets) # register the label source, named 'ys' + .add_output('xs', 'xs0', transform=Normalize(mean=(mean,), std=(std,))) # add a weakly-augmented output 'xs0' + .add_output('xs', 'xs1', + transform=Compose( + [RandomCrop(28, padding=4), Normalize(mean=(mean,), std=(std,))])) # add a strongly-augmented output 'xs1' + .add_output('ys', 'ys') # add the label output + ) + print(ds) + print(ds[0].keys()) + + test_ds = ( + DatasetBuilder() + .add_input('xs', test_data.data.float().div(255.).unsqueeze(1)) # register the sample source, named 'xs' + .add_input('ys', test_data.targets) # register the label source, named 'ys' + .add_output('xs', 'xs', transform=Normalize(mean=(mean,), std=(std,))) # no augmentation for test samples + .add_output('ys', 'ys') # add the label output + ) + + print(test_ds) + print(test_ds[0].keys()) + return ds, test_ds + + +class MNISTParams(TrainerParams): + def __init__(self): + super().__init__() + self.batch_size = 128 + self.optim = self.OPTIM.create_optim('SGD', lr=0.0001, momentum=0.9) + + +ParamsType = MNISTParams + + +class MNISTTrainer(Trainer): + + def icallbacks(self, params: ParamsType): + super().icallbacks(params) + callbacks.LoggerCallback().hook(self) + + def imodels(self, params: ParamsType): + super().imodels(params) + self.model = nn.Sequential( + nn.Flatten(), + nn.Linear(28 * 28, 128), + nn.ReLU(), + nn.Linear(128, 10), + ) + self.optim = params.optim.build(self.model.parameters()) + + # manually trigger send_to_device method + self.to_device() + + def train_step(self, batch, params: ParamsType = None) -> MetricType: + super().train_step(batch, params) + m = Meter() + eval_xs, xs, ys = batch['xs0'], batch['xs1'], batch['ys'] + logits = self.model(xs) + Lall = F.cross_entropy(logits, ys) + self.optim.zero_grad() + Lall.backward() + self.optim.step() + with torch.no_grad(): + m.mean.Lall = Lall + eval_logits = self.model(eval_xs) + m.mean.Ax = torch.eq(eval_logits.argmax(dim=-1), ys).float().mean() + return m + + def test_step(self, batch, params: ParamsType = None) -> 
MetricType: + m = Meter() + xs, ys = batch['xs'], batch['ys'] + logits = self.model(xs) + m.test_mean.Ax = torch.eq(logits.argmax(dim=-1), ys).float() + return m + + +def main(): + ds, test_ds = make_dataset() + + params = MNISTParams() + params.epoch = 10 + params.from_args() + + # create a datamodule to contain the dataloaders + dl = ds.DataLoader(batch_size=params.batch_size) + test_dl = test_ds.DataLoader(batch_size=params.batch_size) + dm = DataModule() + dm.regist_dataloader(train=dl, test=test_dl) + + # Given the params and dataloaders, the Trainer initializes the models and optimizers; + # the outputs are the trained parameters, metrics and logs. + trainer = MNISTTrainer(params, dm=dm) + + trainer.train() # or trainer.train(dm=dl) if dm is not given above + trainer.test() # or trainer.test(dm=dl) + trainer.save_last_model() + + +if __name__ == '__main__': + main() diff --git a/examples/moco.py b/examples/moco.py new file mode 100644 index 00000000..14dff276 --- /dev/null +++ b/examples/moco.py @@ -0,0 +1,290 @@ +""" +refer to +https://colab.research.google.com/github/facebookresearch/moco/blob/colab-notebook/colab/moco_cifar10_demo.ipynb +""" +from typing import Union + +import torch +from PIL import Image +from torch.utils.data import DataLoader + +from lumo import DatasetBuilder, MetricType, Trainer, TrainerParams, Meter, callbacks, DataModule +from lumo.contrib import EMA, MemoryBank, StorageBank + +from lumo.contrib.nn.loss import contrastive_loss2 +from lumo.proc.path import cache_dir +from torchvision.datasets.cifar import CIFAR10 +from torchvision import transforms +from torchvision.models.resnet import resnet18 +from torch import nn +from torch.nn import functional as F + +from lumo.utils.device import send_to_device + +"""define transforms""" + + +def none(mean, std): + return transforms.Compose([ + transforms.ToTensor(), + transforms.Normalize(mean, std) + ]) + + +def simclr(mean, std): + return transforms.Compose([ + transforms.RandomResizedCrop(32), + 
transforms.RandomHorizontalFlip(), + transforms.RandomApply([ + transforms.ColorJitter(0.4, 0.4, 0.4, 0.1) + ], p=0.8), + transforms.RandomGrayscale(0.2), + transforms.ToTensor(), + transforms.Normalize(mean, std) + ]) + + +"""create datasets""" + + +def make_dataset(): + data = CIFAR10(root=cache_dir(), train=True, download=True) + data.data = [Image.fromarray(img) for img in data.data] + test_data = CIFAR10(root=cache_dir(), train=False, download=True) + + mean = [0.4914, 0.4822, 0.4465] + std = [0.2023, 0.1994, 0.2010] + + ds = ( + DatasetBuilder() + .add_input('xs', data.data) # register the sample source, named 'xs' + .add_input('ys', data.targets) # register the label source, named 'ys' + .add_output('xs', 'xs0', transform=simclr(mean, std)) # add the first augmented view 'xs0' + .add_output('xs', 'xs1', transform=simclr(mean, std)) # add the second augmented view 'xs1' + .add_output('ys', 'ys') # add the label output + ) + + # for knn test + memo_ds = ( + DatasetBuilder() + .add_input('xs', data.data) # register the sample source, named 'xs' + .add_input('ys', data.targets) # register the label source, named 'ys' + .add_output('xs', 'xs', transform=none(mean, std)) # no augmentation for the memory-bank samples + .add_output('ys', 'ys') # add the label output + ) + print(ds) + print(ds[0].keys()) + + test_ds = ( + DatasetBuilder() + .add_idx('idx') # add index key for sample + .add_input('xs', test_data.data) # register the sample source, named 'xs' + .add_input('ys', test_data.targets) # register the label source, named 'ys' + .add_output('xs', 'xs', transform=none(mean, std)) # no augmentation for test samples + .add_output('ys', 'ys') # add the label output + ) + + print(test_ds) + print(test_ds[0].keys()) + return ds, memo_ds, test_ds + + +class MocoParams(TrainerParams): + def __init__(self): + super().__init__() + self.optim = self.OPTIM.create_optim('SGD', + lr=0.06, + momentum=0.9, + weight_decay=5e-5, + ) + self.lr_decay_end = 0.00001 + self.temperature = 0.1 + self.ema_alpha = 0.99 + self.feature_dim = 129 + self.queue_size = 4096 + self.batch_size = 512 + self.symmetric = False + # kNN evaluation settings used in test(); defaults follow the MoCo CIFAR-10 demo + self.n_classes = 10 + self.knn_k = 200 + self.knn_t = 0.1 + + +ParamsType = MocoParams + + +class MocoModel(nn.Module): + + def __init__(self, feature_dim) -> None: + super().__init__() + 
self.backbone = resnet18() + in_feature = self.backbone.fc.in_features + self.backbone.fc = nn.Identity() + self.head = nn.Linear(in_feature, feature_dim, bias=True) + + def forward(self, xs): + feature_map = self.backbone(xs) + feature = self.head(feature_map) + return feature + + +def knn_predict(feature, feature_bank, feature_labels, classes, knn_k, knn_t): + # compute cos similarity between each feature vector and feature bank ---> [B, N] + sim_matrix = torch.mm(feature, feature_bank) + # [B, K] + sim_weight, sim_indices = sim_matrix.topk(k=knn_k, dim=-1) + # [B, K] + sim_labels = torch.gather(feature_labels.expand(feature.size(0), -1), dim=-1, index=sim_indices) + sim_weight = (sim_weight / knn_t).exp() + + # counts for each class + one_hot_label = torch.zeros(feature.size(0) * knn_k, classes, device=sim_labels.device) + # [B*K, C] + one_hot_label = one_hot_label.scatter(dim=-1, index=sim_labels.view(-1, 1), value=1.0) + # weighted score ---> [B, C] + pred_scores = torch.sum(one_hot_label.view(feature.size(0), -1, classes) * sim_weight.unsqueeze(dim=-1), dim=1) + + pred_labels = pred_scores.argsort(dim=-1, descending=True) + return pred_labels + + +class MocoTrainer(Trainer): + + def icallbacks(self, params: ParamsType): + callbacks.LoggerCallback().hook(self) + + def imodels(self, params: ParamsType): + self.model = MocoModel(params.feature_dim) + self.ema_model = EMA(self.model, alpha=params.ema_alpha) + + self.optim = params.optim.build(self.model.parameters()) + + self.tensors = StorageBank() + self.tensors.register('test_feature', dim=params.feature_dim, k=len(self.dm.test_dataset)) + self.tensors.register('test_ys', dim=-1, k=len(self.dm.test_dataset), dtype=torch.long) + + self.mem = MemoryBank() + # do not need normalize because normalize is applied in contrastive_loss2 function + self.mem.register('negative', dim=params.feature_dim, k=params.queue_size) + self.mem['negative'] = F.normalize(self.mem['negative'], dim=-1) + + self.lr_sche = 
params.SCHE.Cos( + start=params.optim.lr, end=params.lr_decay_end, + left=0, + right=len(self.train_dataloader) * params.epoch + ) + # manually trigger send_to_device method + self.to_device() + + def train_step(self, batch, params: ParamsType = None) -> MetricType: + m = Meter() + im_query, im_key, ys = batch['xs0'], batch['xs1'], batch['ys'] + feat_query = self.model.forward(im_query) + + with torch.no_grad(): + # NOTE: the original MoCo demo shuffles keys here to make use of BN; omitted in this example + feat_key = self.ema_model.forward(im_key) # keys: NxC + feat_key = F.normalize(feat_key, dim=1) # normalize keys + + feat_query = F.normalize(feat_query, dim=1) + + if params.symmetric: + Lcsa = contrastive_loss2(query=feat_query, key=feat_key, + memory=self.mem['negative'], + query_neg=False, key_neg=False, + temperature=params.temperature, + norm=False) + Lcsb = contrastive_loss2(query=feat_key, key=feat_query, + memory=self.mem['negative'], + query_neg=False, key_neg=False, + temperature=params.temperature, + norm=False) + Lcs = Lcsa + Lcsb + else: + + Lcs = contrastive_loss2(query=feat_query, key=feat_key.detach(), + memory=self.mem['negative'].clone().detach(), + query_neg=False, key_neg=False, + temperature=params.temperature, + norm=False) + + # memory bank + with torch.no_grad(): + if params.symmetric: + self.mem.push('negative', torch.cat([feat_query, feat_key], dim=0)) + else: + self.mem.push('negative', feat_key) + + self.optim.zero_grad() + self.accelerate.backward(Lcs) + self.optim.step() + cur_lr = self.lr_sche.apply(self.optim, self.global_steps) + + # metrics + with torch.no_grad(): + m.mean.Lcs = Lcs + m.last.lr = cur_lr + return m + + def test_step(self, batch, params: ParamsType = None) -> MetricType: + idx = batch['idx'] + xs, ys = batch['xs'], batch['ys'] + feature = self.model(xs) + self.tensors.scatter('test_feature', feature, idx) + self.tensors.scatter('test_ys', ys, idx) + + def test(self, dm: Union[DataModule, DataLoader] = None, params: ParamsType = None, limit_step=None): + super().test(dm, 
params, limit_step) # run default test loop + self.save_last_model() + + feature_bank = [] + with torch.no_grad(): + # generate feature bank + for batch in self.dm['memo']: + batch = send_to_device(batch, self.device) + data, target = batch['xs'], batch['ys'] + feature = self.model(data) + feature = F.normalize(feature, dim=1) + feature_bank.append(feature) + + feature_bank = torch.cat(feature_bank, dim=0).t().contiguous() + # [N] + feature_labels = torch.tensor(self.dm['memo'].dataset.inputs['ys'], device=feature_bank.device) + # loop test data to predict the label by weighted knn search + + pred_labels = knn_predict(self.tensors['test_feature'], + feature_bank, feature_labels, params.n_classes, params.knn_k, params.knn_t) + total_num = pred_labels.shape[0] + total_top1 = torch.eq(pred_labels[:, 0], self.tensors['test_ys']).float().sum().item() + + knn_acc = total_top1 / total_num * 100 + + max_knn_acc = self.metric.dump_metric('Knn', knn_acc, cmp='max', flush=True) + self.logger.info(f'Best Knn Top-1 acc: {max_knn_acc}, current: {knn_acc}') + + if knn_acc >= max_knn_acc: + self.save_best_model() + + +def main(): + ds, memo_ds, test_ds = make_dataset() + + params = MocoParams() + params.from_args() + + # create datamodule to contain dataloader + dl = ds.DataLoader(batch_size=params.batch_size, num_workers=2) + memo_dl = memo_ds.DataLoader(batch_size=params.batch_size, num_workers=2) + test_dl = test_ds.DataLoader(batch_size=params.batch_size, num_workers=2) + dm = DataModule() + dm.regist_dataloader(train=dl, + test=test_dl, + memo=memo_dl) # add extra dataloader with any name + + # with the input of params and dataloader, the initialization of models and optimizers in Trainer, + # then the output will be the trained parameters, metrics and logs. 
+ trainer = MocoTrainer(params, dm=dm) + + trainer.train() # or trainer.train(dm=dl) if dm are not given above + trainer.test() # or trainer.test(dm=dl) + trainer.save_last_model() + + +if __name__ == '__main__': + main() diff --git a/images/backup_github.png b/images/backup_github.png new file mode 100644 index 00000000..01518629 Binary files /dev/null and b/images/backup_github.png differ diff --git a/images/lumo-intro.png b/images/lumo-intro.png new file mode 100644 index 00000000..1199de27 Binary files /dev/null and b/images/lumo-intro.png differ diff --git a/images/panel-example.png b/images/panel-example.png new file mode 100644 index 00000000..42f26f61 Binary files /dev/null and b/images/panel-example.png differ diff --git a/images/panel-example2.png b/images/panel-example2.png new file mode 100644 index 00000000..c0116eb9 Binary files /dev/null and b/images/panel-example2.png differ diff --git a/pyproject.toml b/pyproject.toml index e0133509..189fab03 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -42,6 +42,7 @@ omit = [ 'src/lumo/vis/*', 'src/lumo/decorators/*', 'src/lumo/exp/agent.py', + 'src/lumo/exp/lazy_panel.py', 'src/lumo/analyse/*', 'src/lumo/sketch/*', 'src/lumo/core/record_backend/*', diff --git a/setup.py b/setup.py index a0d9f4c6..10eecd68 100644 --- a/setup.py +++ b/setup.py @@ -1,5 +1,4 @@ import re -from datetime import datetime from setuptools import setup, find_packages diff --git a/src/lumo/__init__.py b/src/lumo/__init__.py index 5893105d..35697cca 100644 --- a/src/lumo/__init__.py +++ b/src/lumo/__init__.py @@ -1,11 +1,12 @@ """ """ -__version__ = "0.15.0" +__version__ = "1.0.0" from .core import Params, ParamsType, MetricType, Meter, Record, TrainStage, BaseParams +from .proc import glob from .data import DataLoader, DataModule, DatasetBuilder, LumoDataLoader, CollateBase, DataLoaderSide -from .exp import SimpleExperiment, Experiment +from .exp import SimpleExperiment, Experiment, Metric, Watcher, C from .trainer import Trainer, 
TrainerParams, callbacks, RndManager from .utils import Logger diff --git a/src/lumo/analyse/__init__.py b/src/lumo/analyse/__init__.py index 7eeb38a8..0af59d33 100644 --- a/src/lumo/analyse/__init__.py +++ b/src/lumo/analyse/__init__.py @@ -1,2 +1,5 @@ +import warnings + from .collect import collect_table_rows, flatten_dict, flatten_params, flatten_metric from .condition import C, filter_by_condition +warnings.warn("lumo.analyse has been deprecated and will be removed soon, please use lumo.exp.Watcher instead.") \ No newline at end of file diff --git a/src/lumo/cli/__init__.py b/src/lumo/cli/__init__.py index 714545ea..e9a39df9 100644 --- a/src/lumo/cli/__init__.py +++ b/src/lumo/cli/__init__.py @@ -13,12 +13,17 @@ def rerun(test_name, **kwarg): Returns: """ - from lumo.exp.finder import retrieval_experiment - exp = retrieval_experiment(test_name) - if exp is not None: - exp.rerun([f'--{k}={v}' for k, v in kwarg.items()]) - else: + from lumo.exp.watch import Watcher + from lumo import Experiment + w = Watcher() + df = w.load() + df = df[df['test_name'] == test_name] + if len(df) == 0: + print(f'{test_name} not found') exit(1) + else: + exp = Experiment.from_cache(df.iloc[0].to_dict()) + exp.rerun([f'--{k}={v}' for k, v in kwarg.items()]) def note(test_name, description): @@ -36,7 +41,7 @@ def note(test_name, description): print(f"Adding note '{description}' to {test_name}") -def server(port=8080): +def board(port=11606, address=None, open=True): """ Args: @@ -45,12 +50,16 @@ def server(port=8080): Returns: """ + from lumo import Watcher + w = Watcher() + w.panel().show(port=port, address=address, open=open) print(f"Starting server on port {port}") def main(): + """the entry""" fire.Fire({ 'rerun': rerun, 'note': note, - 'server': server, + 'board': board, }) diff --git a/src/lumo/contrib/__init__.py b/src/lumo/contrib/__init__.py index a4f53fdd..e4f1925b 100644 --- a/src/lumo/contrib/__init__.py +++ b/src/lumo/contrib/__init__.py @@ -3,3 +3,4 @@ """ from 
.module.ema import EMA from .optim.grouper import ParamGrouper +from .module.memoty_bank import MemoryBank, StorageBank diff --git a/src/lumo/contrib/accelerate/__init__.py b/src/lumo/contrib/accelerate/__init__.py deleted file mode 100644 index 59d25ad8..00000000 --- a/src/lumo/contrib/accelerate/__init__.py +++ /dev/null @@ -1 +0,0 @@ -from .accelerator import Accelerator \ No newline at end of file diff --git a/src/lumo/contrib/accelerate/data_loader.py b/src/lumo/contrib/accelerate/data_loader.py deleted file mode 100644 index f05ac57f..00000000 --- a/src/lumo/contrib/accelerate/data_loader.py +++ /dev/null @@ -1,10 +0,0 @@ -from accelerate import synchronize_rng_states, DistributedType -from accelerate.data_loader import DataLoaderDispatcher as _DataLoaderDispatcher, DataLoaderShard as _DataLoaderShard -from accelerate.state import AcceleratorState -from accelerate.utils import send_to_device - -from lumo import LumoDataLoader - - -class DataLoaderDispatcher(_DataLoaderDispatcher): - pass diff --git a/src/lumo/contrib/accelerate/utils.py b/src/lumo/contrib/accelerate/utils.py deleted file mode 100644 index 44d29cc0..00000000 --- a/src/lumo/contrib/accelerate/utils.py +++ /dev/null @@ -1,30 +0,0 @@ -from accelerate.utils import recursively_apply -import torch - - -def send_to_device(tensor, device, non_blocking=False): - """ - Recursively sends the elements in a nested list/tuple/dictionary of tensors to a given device. - - Args: - tensor (nested list/tuple/dictionary of `torch.Tensor`): - The data to send to a given device. - device (`torch.device`): - The device to send the data to. - - Returns: - The same data structure as `tensor` with all tensors sent to the proper device. 
- """ - if not isinstance(device, torch.device): - device = torch.device(device) - - if 'mps' in device.type: - non_blocking = False - - def _send_to_device(t, device): - return t.to(device, non_blocking=non_blocking) - - def _has_to_method(t): - return hasattr(t, "to") - - return recursively_apply(_send_to_device, tensor, device, test_type=_has_to_method) diff --git a/src/lumo/contrib/module/memoty_bank.py b/src/lumo/contrib/module/memoty_bank.py index d080c87e..a73dfe59 100644 --- a/src/lumo/contrib/module/memoty_bank.py +++ b/src/lumo/contrib/module/memoty_bank.py @@ -1,6 +1,7 @@ -from accelerate.utils import gather -from lumo.proc.dist import is_dist +from torch import distributed +from lumo.proc.dist import is_dist, gather import torch.distributed + from torch import nn import torch diff --git a/src/lumo/contrib/scan.py b/src/lumo/contrib/scan.py new file mode 100644 index 00000000..974fe70a --- /dev/null +++ b/src/lumo/contrib/scan.py @@ -0,0 +1,85 @@ +""" +基线对比 +""" +import time +from itertools import cycle +from typing import List +import torch + +from lumo import Params, Logger +import sys + +DATE_FLAG = '2023.03.04' + + +class ScanBaseParams(Params): + + def __init__(self): + super().__init__() + self.gpus = None + self.group = None + self.interval = 5 # sleep interval between two tests + self.skip = 0 + self.n = -1 + + +def format_args(**kwargs): + return ' '.join([f'--{k}={v}' for k, v in kwargs.items()]) + + +def base_main(pm: ScanBaseParams, files: List[str], dics: List[dict]): + assert isinstance(pm, ScanBaseParams) + log = Logger() + log.use_stdout = False + log.add_log_dir(f'./log_{pm.group}') + + base = ("sleep {sleep} ; " + + sys.executable + + " {file} {kwargs} --device={device}{group} & \n" + ) + + if not torch.cuda.is_available(): + gpus = ['cpu'] + elif pm.gpus is None: + gpus = list(range(torch.cuda.device_count())) + elif isinstance(pm.gpus, (int, str)): + gpus = [torch.device(pm.gpus).index] + else: + gpus = pm.gpus + + # append None to 
identify the loop end. + gpus.append(None) + gpus = cycle(gpus) + c = 0 + cnt = 0 + for i, (file, kwargs) in enumerate(zip(files, dics)): + if pm.skip > cnt: + cnt += 1 + continue + + device = next(gpus) + if device is None: + # wait until all devices are free. + c = 0 + print('wait', flush=True) + device = next(gpus) + + if pm.group is not None: + group = f" --group={pm.group} " + else: + group = '' + + cur = base.format( + sleep=c * pm.interval, + file=file, kwargs=format_args(**kwargs), + device=device, group=group) + + c += 1 + cnt += 1 + log.info(cur.strip()) + print(cur, flush=True, end='') + + if pm.n > 0 and pm.skip + pm.n <= cnt: + break + + print('wait', flush=True) diff --git a/src/lumo/core/interp.py b/src/lumo/core/interp.py index ad7a2c1a..21334b82 100644 --- a/src/lumo/core/interp.py +++ b/src/lumo/core/interp.py @@ -281,7 +281,7 @@ def __call__(self, cur): class Cos(ABCContinuous): - """one cycle cosine functoin + r"""one cycle cosine function end -> ,--------- / @@ -309,7 +309,7 @@ def interp(cls, cur, start=0., end=1., left=0., right=1., *args, **kwargs): class Linear(ABCContinuous): - """linear schedule + r"""linear schedule ^ end | .* @@ -359,7 +359,7 @@ def interp(cls, cur, start=0., end=1., left=0., right=1., *args, **kwargs): class Log(ABCContinuous): - """ + r""" quick to slow end | * @@ -395,7 +395,7 @@ def interp(cls, cur, start=0., end=1., left=0., right=1., *args, **kwargs): class Constant(ABCContinuous): - """ + r""" A scheduler representing a constant value | constant |-------------- @@ -410,7 +410,7 @@ def __init__(self, value=0.5, *args, **kwargs): class PeriodCos(ABCPeriod): - """ + r""" periodic cosine schedule end -> ,-. ,-. ,-. ,-. 
@@ -432,7 +432,7 @@ def interp(cls, cur, start=0., end=1., left=0., period=1., *args, **kwargs): class PeriodHalfCos(ABCPeriod): - """ + r""" half periodic cosine schedule, period is (right-left) end -> ,- ,- ,- ,- diff --git a/src/lumo/core/meter.py b/src/lumo/core/meter.py index 44bc21bd..7c52bc4d 100644 --- a/src/lumo/core/meter.py +++ b/src/lumo/core/meter.py @@ -126,9 +126,10 @@ def __setitem__(self, key, value): stg = self._avg.get(key, None) isscalar = value.size == 1 - if stg is None: + if stg is None: # Auto infer a stg method for the value dtype = value.dtype.name + # sanity check if self._stage in {'min', 'max'} and not isscalar: raise ValueError( f'Only support min/max(a) operator on scalar metrics, but got data of shape {value.shape}.') @@ -145,9 +146,14 @@ def __setitem__(self, key, value): self._stage = 'last' self._avg[key] = self._stage + stg = self._stage if isscalar: value = value.item() + elif stg in {'min', 'max', 'sum'}: + value = getattr(np, stg)(value).item() + elif stg == 'mean': + value = [getattr(np, stg)(value).item(), value.size] self._rec[key] = value self._stage = 'default' @@ -192,6 +198,17 @@ def mean(self): self._stage = 'mean' return self + @property + def test_mean(self): + """ + Sets the aggregation method to 'mean'. + + Returns: + The meter itself. 
+ """ + self._stage = 'mean' + return self + @property def last(self): """ @@ -405,6 +422,10 @@ def __repr__(self): def update(self, item): """Updates the reduction value with a new item.""" self.cur = item + count = 1 + if isinstance(item, list) and len(item) == 2: + item, count = item + item = detach(item) avg = self.gb_method @@ -419,8 +440,8 @@ def update(self, item): elif avg in {'mean', 'sum'}: if len(self.acc) == 0: self.acc.append(0) - self.acc[0] = self.acc[0] + item - self.c += 1 + self.acc[0] = self.acc[0] + item * count + self.c += count elif avg == 'max': self._res = max(self.cur, self._res) elif avg == 'min': diff --git a/src/lumo/data/builder.py b/src/lumo/data/builder.py index 1522bc6a..7515aa25 100644 --- a/src/lumo/data/builder.py +++ b/src/lumo/data/builder.py @@ -117,7 +117,7 @@ def __init__(self): self._iter_cache = {} - def __repr__(self): + def __str__(self): if self.sized: return f'Builder(flow={pformat(self._outs)}, sized={self.sized}, size={len(self)}, iterable={self.iterable})' @@ -360,7 +360,7 @@ def scale_to_size(self, size: int): size (int): the size to scale the dataset builder to. Returns: - self (DatasetBuilder): the scaled dataset builder. + the scaled dataset builder. """ assert isinstance(size, int) assert 'pseudo_repeat' not in self._prop @@ -378,7 +378,7 @@ def repeat(self, multiple: int): multiple (int): the number of times to repeat the dataset builder. Returns: - self (DatasetBuilder): the repeated dataset builder. + the repeated dataset builder. """ assert isinstance(multiple, int) assert 'pseudo_length' not in self._prop @@ -424,7 +424,7 @@ def chain(self): Set the mode of the dataset builder to chain. Returns: - self (DatasetBuilder): the dataset builder with the chain mode set. + the dataset builder with the chain mode set. """ self._prop['mode'] = 'chain' return self @@ -434,7 +434,7 @@ def item(self): Set the mode of the dataset builder to item. 
Returns: - self (DatasetBuilder): the dataset builder with the item mode set. + the dataset builder with the item mode set. """ self._prop['mode'] = 'item' return self @@ -444,7 +444,7 @@ def zip(self): Set the mode of the dataset builder to zip. Returns: - self (DatasetBuilder): the dataset builder with the zip mode set. + the dataset builder with the zip mode set. """ self._prop['mode'] = 'zip' return self @@ -489,7 +489,7 @@ def add_idx(self, name): name (str): the name of the index. Returns: - self (DatasetBuilder): the dataset builder with the index added. + the dataset builder with the index added. """ outkeys = self._outs.setdefault(f"::idx::", []) assert name not in self._outkeys, f'Output key {name} duplicated.' @@ -507,7 +507,7 @@ def add_input(self, name: str, source, transform: SingleValueTransform = None): transform (SingleValueTransform): the transform to apply to the input source. Returns: - self (DatasetBuilder): the dataset builder with the input source added. + the dataset builder with the input source added. Notes: Iterable object without `__len__` method currently are not well-tested. Be careful to use them in DatasetBuilder. @@ -529,7 +529,7 @@ def add_input_transform(self, name: str, transform: SingleValueTransform = None) transform (SingleValueTransform): the transform to add. Returns: - self (DatasetBuilder): the dataset builder with the transform added. + the dataset builder with the transform added. """ assert name in self._data, f'Source {name} should be added.' warnings.warn('`add` may cause confusion, use set_input_transform ') @@ -545,7 +545,7 @@ def add_output(self, name: str, outkey: str, transform: SingleValueTransform = N transform (SingleValueTransform): the transform to apply to the output. Returns: - self (DatasetBuilder): the dataset builder with the output added. + the dataset builder with the output added. """ assert name in self._data, f'Must have data source {name} first.' 
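The `add_input`/`add_output` pairing documented in these docstrings maps named sources onto output keys, each with an optional transform, and every builder method returns the builder so calls can be chained. A toy illustration of that flow (this is not lumo's implementation, just the concept):

```python
class ToyBuilder:
    """Minimal input/output flow: sources are registered once, outputs re-use them."""

    def __init__(self):
        self._data = {}   # source name -> sequence
        self._outs = []   # (source name, output key, transform)

    def add_input(self, name, source):
        self._data[name] = source
        return self  # chainable, like DatasetBuilder

    def add_output(self, name, outkey, transform=None):
        assert name in self._data, f'Must have data source {name} first.'
        self._outs.append((name, outkey, transform))
        return self

    def __getitem__(self, idx):
        # one sample is a dict: each registered output pulls from its source,
        # applying its transform if one was given
        return {outkey: t(self._data[name][idx]) if t else self._data[name][idx]
                for name, outkey, t in self._outs}

ds = (ToyBuilder()
      .add_input('xs', [1, 2, 3])
      .add_input('ys', [0, 1, 0])
      .add_output('xs', 'xs0', transform=lambda x: x * 10)  # one source, two outputs
      .add_output('xs', 'xs1')
      .add_output('ys', 'ys'))
print(ds[1])  # {'xs0': 20, 'xs1': 2, 'ys': 1}
```

One source feeding several differently-transformed outputs is exactly the pattern the examples above use to produce multiple augmented views of the same image.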
@@ -567,7 +567,7 @@ def add_output_transform(self, outkey: str, transform: SingleValueTransform = No transform (SingleValueTransform): the transform to add. Returns: - self (DatasetBuilder): the dataset builder with the transform added. + the dataset builder with the transform added. """ assert outkey in self._outkeys, f'Output key {outkey} should be added.' warnings.warn('add may cause confusion, use set_output_transform ') @@ -581,7 +581,7 @@ def add_global_transform(self, transform: DictTransform): transform (DictTransform): the global transform to apply to the dataset. Returns: - self (DatasetBuilder): the dataset builder with the global transform added. + the dataset builder with the global transform added. """ self._transforms['::global::'] = transform return self @@ -595,7 +595,7 @@ def set_input_transform(self, name, transform: SingleValueTransform = None): transform (SingleValueTransform): the transform to set. Returns: - self (DatasetBuilder): the dataset builder with the transform set. + the dataset builder with the transform set. """ self._transforms[name] = transform return self @@ -609,7 +609,7 @@ def set_output_transform(self, outkey, transform: SingleValueTransform = None): transform (SingleValueTransform): the transform to set. Returns: - self (DatasetBuilder): the dataset builder with the transform set. + the dataset builder with the transform set. 
""" self._transforms[f'::{outkey}'] = transform return self diff --git a/src/lumo/decorators/process.py b/src/lumo/decorators/process.py index 5a488127..0e1f7dc6 100644 --- a/src/lumo/decorators/process.py +++ b/src/lumo/decorators/process.py @@ -1,5 +1,5 @@ from typing import Callable - +from functools import wraps from lumo.proc.dist import is_main @@ -18,6 +18,7 @@ def call_on_main_process_wrap(func) -> Callable: """ + @wraps(func) def inner(*args, **kwargs): if is_main(): return func(*args, **kwargs) diff --git a/src/lumo/exp/__init__.py b/src/lumo/exp/__init__.py index da726c3f..b6de7031 100644 --- a/src/lumo/exp/__init__.py +++ b/src/lumo/exp/__init__.py @@ -1 +1,3 @@ from .experiment import SimpleExperiment, Experiment +from .watch import Watcher, C +from .metric import Metric diff --git a/src/lumo/exp/agent.py b/src/lumo/exp/agent.py new file mode 100644 index 00000000..00f69d6f --- /dev/null +++ b/src/lumo/exp/agent.py @@ -0,0 +1,35 @@ +""" +heartbeat mechanism +""" +import time + +import psutil +from joblib import hash + +from lumo.core import BaseParams +from lumo.utils import safe_io as IO +from lumo.utils.fmt import strftime +from lumo.exp import Experiment + + +def wait_pid_stop(info_dir=None): + """wait test """ + params = BaseParams() + params.info_dir = info_dir + params.from_args() + + exp = Experiment.from_disk(params.info_dir) + pid = exp.properties['pinfo'].get('pid') + info_dir = exp.info_dir + try: + while pid is not None and psutil.pid_exists(pid): + exp.trigger() + exp.dump_info('agent', {'last_edit_time': strftime()}) + + time.sleep(10) + except: + pass + + +if __name__ == '__main__': + wait_pid_stop() diff --git a/src/lumo/exp/backup.py b/src/lumo/exp/backup.py new file mode 100644 index 00000000..2ba86a7d --- /dev/null +++ b/src/lumo/exp/backup.py @@ -0,0 +1,190 @@ +from .experiment import Experiment +import os +from pprint import pformat +from lumo.utils.fmt import strftime +import re + +issue_title = """Test {test_name}""" + 
+summary_template = """ +{note} + +|Experiment Name|Test Name|Start| End | +|---|---|---|---| +|{exp_name}|{test_name}|{start}|{end}| + +```bash +{command} +``` + +## Parameters +```python +{params} +``` + +## Metrics + +|Key|Value| +|--|--| +{metrics} + +Tensor metrics: +``` +{non_scalar_metrics} +``` + +``` +{extra_info} +``` + +## Full properties +
+<details>
+<summary>Click Here</summary>
+
+{properties}
+
+</details>
+""" + + +def make_summary(exp: Experiment, **kwargs): + """make a summary of experiment, can be used for backup or write report.""" + properties = exp.properties.copy() + + execute = properties['execute'] + params = properties.get('params', {}) + + metrics = {} + non_scalar_metrics = {} + for k, v in exp.metric.value.items(): + if isinstance(v, (int, float)): + metrics[k] = f'{v:.5f}' + else: + non_scalar_metrics[k] = v + if len(metrics) == 0: + metrics['null'] = 'null' + + progress = properties['progress'] + + # make markdown string + working_dir_str = execute['cwd'] + + common_prefix = os.path.commonprefix([execute['exec_argv'][0], execute['cwd']]) + file = execute['exec_argv'][0][len(common_prefix):].lstrip('/') + bin = os.path.basename(execute['exec_bin']) + command_str = ' '.join( + [bin, file, *execute['exec_argv'][1:]]) + params_str = pformat(params) + metrics_str = '\n'.join(f'|{k}|{v}|' for k, v in metrics.items()) + non_scalar_metrics_str = '\n'.join(f'{k}\n{pformat(v)}\n' for k, v in non_scalar_metrics.items()) + properties_str = pformat(properties) + + start_str = progress.get('start', '') + end_str = progress.get('end', '') + extra_info_str = '\n'.join(f'{k}: {v}' for k, v in kwargs.items()) + + body = summary_template.format( + # working_dir=working_dir_str, + exp_name=exp.exp_name, + note=exp.load_note(), + test_name=exp.test_name, + start=start_str, + end=end_str, + # interval=interval_str, + extra_info=extra_info_str, + command=command_str, + params=params_str, + metrics=metrics_str, + non_scalar_metrics=non_scalar_metrics_str, + properties=properties_str, + ) + + try: + username = os.getlogin() + except OSError: + pass + + # hide privacy path, only show relative path + home_with_name = os.path.expanduser('~') + body = re.sub("""'[^']*\.lumo""", '.lumo', body) + body = body.replace(home_with_name, '~') + + return body + + +def backup_github_issue(exp: Experiment, repo: str, access_token: str, + labels: list = None, update: bool = True, + 
**kwargs): + """backup ipath content and bpath content(optional) to github""" + try: + from github import Github + except ImportError as e: + raise ImportError('backup by github need PyGithub to be installed, try `pip install PyGithub`') from e + + g = Github(login_or_token=access_token) + repo_obj = g.get_repo(repo) + + title = issue_title.format(test_name=exp.test_name) + + issue = None + if update: + number = -1 + old_backup = exp.properties.get('backup', {}) + for k, v in sorted(list(old_backup.items())): + if v['backend'] == 'github' and v['repo'] == repo: + number = v['number'] + + if number > 0: + issue = repo_obj.get_issue(number) + + if issue is None: + issue = repo_obj.create_issue(title=title) + exp.dump_info('backup', {strftime(): {'backend': 'github', 'number': issue.number, 'repo': repo}}, append=True) + + body = make_summary(exp, **kwargs) + + issue.edit(state='closed', body=body) + if labels is not None: + issue.edit(labels=labels) + + return issue + + # There is no way to upload files by GitHub Api + # filter backuped file sizes + # files = [] + # for root, dirs, fs in os.walk(exp.mk_ipath()): + # for f in fs: + # absf = os.path.join(root, f) + # file_size = os.path.getsize(absf) / (1024 * 1024) # Mb + # if file_size > size_limit: + # continue + # files.append(absf) + # + # for root, dirs, fs in os.walk(exp.mk_bpath()): + # for f in fs: + # absf = os.path.join(root, f) + # file_size = os.path.getsize(absf) / (1024 * 1024) # Mb + # if file_size > size_limit: + # continue + # files.append(absf) + + # repo.create_git_blob() + # {files} + # issue.create_comment() + + +def backup_ssh(exp: Experiment, host, username, root, size_limit): + """compress backup zip files to target server with replacement""" + pass + + +def backup_local(exp: Experiment, target: str): + """backup in local dist""" + pass + + +backup_regist = { + 'github': backup_github_issue, + 'remote': backup_ssh, + 'local': backup_local, +} diff --git a/src/lumo/exp/base.py 
b/src/lumo/exp/base.py index fbfb056f..4fd231a9 100644 --- a/src/lumo/exp/base.py +++ b/src/lumo/exp/base.py @@ -88,7 +88,7 @@ def on_newpath(self, exp, *args, **kwargs): """ - def __repr__(self): + def __str__(self): """Return a string representation of the hook. Returns: diff --git a/src/lumo/exp/experiment.py b/src/lumo/exp/experiment.py index 358e9c3a..0e128a97 100644 --- a/src/lumo/exp/experiment.py +++ b/src/lumo/exp/experiment.py @@ -9,7 +9,7 @@ import sys import time import traceback -from typing import Any, List +from typing import Any, List, overload from functools import wraps from lumo.decorators.process import call_on_main_process_wrap from lumo.proc import glob @@ -19,6 +19,7 @@ from lumo.utils import safe_io as io from lumo.utils.fmt import can_be_filename, strftime from lumo.utils.logger import Logger +from lumo.utils.subprocess import run_command from .base import BaseExpHook from ..proc.pid import pid_hash, runtime_pid_obj from .metric import Metric @@ -91,7 +92,7 @@ class Experiment: ENV_TEST_NAME_KEY = 'LUMO_EXP_TEST_NAME' - def __init__(self, exp_name: str, test_name=None, paths=None): + def __init__(self, exp_name: str = None, test_name=None, paths=None, info_dir=None): """ Initializes a new instance of the Experiment class. @@ -102,28 +103,34 @@ def __init__(self, exp_name: str, test_name=None, paths=None): Raises: ValueError: If the experiment name is not a legal filename. 
""" - if not can_be_filename(exp_name): - raise ValueError(f'Experiment name should be a ligal filename(bettor only contain letter or underline),' - f'but got {exp_name}.') - - self._prop = {} - self._prop['exp_name'] = exp_name - if test_name is None: - test_name = os.environ.get(Experiment.ENV_TEST_NAME_KEY, None) - self._prop['test_name'] = test_name - if paths is None: - paths = {} - self._prop['paths'] = paths + if info_dir is not None: + exp = self.__class__.from_disk(info_dir=info_dir) + self._prop = exp._prop + self._hooks = exp._hooks + self._metric = exp._metric + else: + assert exp_name is not None + if not can_be_filename(exp_name): + raise ValueError(f'Experiment name should be a ligal filename(bettor only contain letter or underline),' + f'but got {exp_name}.') + self._prop = {'exp_name': exp_name} + if test_name is None: + test_name = os.environ.get(Experiment.ENV_TEST_NAME_KEY, None) + self._prop['test_name'] = test_name + if paths is None: + paths = {} + self._prop['paths'] = paths + self._prop['note'] = '' - self._hooks = {} - self._metric = None + self._hooks = {} + self._metric = None # wrap self.dump_string = self._trigger_change(self.dump_string) self.dump_note = self._trigger_change(self.dump_note) self.dump_info = self._trigger_change(self.dump_info) + self.trigger = self._trigger_change(self.trigger) - self.add_exit_hook(self.end) self.logger = Logger() def __getitem__(self, item): @@ -183,7 +190,7 @@ def __repr__(self): Returns: str: A string representation of the Experiment object. """ - return f'{self.exp_name}->({self.test_name})' + return f'{self.__class__.__name__}(info_dir="{self.info_dir})"' def __str__(self): """ @@ -192,7 +199,7 @@ def __str__(self): Returns: str: A string representation of the Experiment object. 
""" - return self.__repr__() + return f'{self.__class__.__name__}(info_dir={self.info_dir})' def _repr_html_(self): """Return a html representation for a particular DataFrame.""" @@ -366,13 +373,31 @@ def exec_argv(self): except: return [] + def trigger(self): + """trigger the disk change""" + pass + def _trigger_change(self, func): - # test_root update some files + """ + Decorator function that updates the heartbeat file before executing the decorated function. + + The heartbeat file indicates that a change has occurred in the experiment directory. + + Args: + func: the function to be decorated. + + Returns: + A decorated function. + """ + + # test_root update some files @wraps(func) def inner(*args, **kwargs): + """wrap function""" fn = self.heartbeat_fn io.dump_text(self.info_dir, fn) - func(*args, **kwargs) + io.dump_text(self.info_dir, self.pid_fn) + return func(*args, **kwargs) return inner @@ -437,9 +462,9 @@ def dump_progress(self, ratio: float, update_from=None): update_from: The process from which the progress update came from. """ res = {'ratio': max(min(ratio, 1), 0)} + res['last_edit_time'] = strftime() if update_from is None: res['update_from'] = update_from - res['last_edit_time'] = strftime() self.dump_info('progress', res, append=True) def dump_info(self, key: str, info: Any, append=False): @@ -479,15 +504,39 @@ def load_info(self, key: str): return {} def load_note(self): + """ + Loads the contents of the note file, if it exists. + + Returns: + A string representing the contents of the note file, or an empty string if the file does not exist. + """ fn = self.mk_ipath('note.md') if os.path.exists(fn): return io.load_text(fn) return '' def dump_tags(self, *tags): + """ + Dumps the experiment's tags to the info file. + + Args: + *tags: a variable-length argument list of tags to be added to the experiment. + + Returns: + None. + """ self.dump_info('tags', tags) def dump_note(self, note: str): + """ + Dumps the contents of the note to the note file. 
+ + Args: + note: a string representing the contents of the note. + + Returns: + None. + """ fn = self.mk_ipath('note.md') self.set_prop('note', note) io.dump_text(note, fn) @@ -521,9 +570,15 @@ def load_string(self, key: str): return io.load_text(fn) def dump_metric(self, key, value, cmp: str, flush=True, **kwargs): + """ + See Metric for details. + """ return self.metric.dump_metric(key, value, cmp, flush, **kwargs) def dump_metrics(self, dic: dict, cmp: str): + """ + See Metric for details. + """ return self.metric.dump_metrics(dic, cmp) @property @@ -574,7 +629,17 @@ def blob_dir(self): os.makedirs(d, exist_ok=True) return d - def _mk_path(self, *path: str, is_dir) -> str: + def _mk_path(self, *path: str, is_dir: bool) -> str: + """ + Helper method to create a directory path if it does not exist and return the path. + + Args: + *path: tuple of path strings to be joined. + is_dir: boolean flag indicating whether the path is a directory. + + Returns: + str: the full path created. + """ path = os.path.join(*path) if is_dir: os.makedirs(path, exist_ok=True) @@ -582,16 +647,56 @@ def _mk_path(self, *path: str, is_dir) -> str: os.makedirs(os.path.dirname(path), exist_ok=True) return path - def mk_ipath(self, *path, is_dir=False): + def mk_ipath(self, *path: str, is_dir: bool = False) -> str: + """ + Creates a directory path within the experiment's info directory. + + Args: + *path: tuple of path strings to be joined. + is_dir: boolean flag indicating whether the path is a directory. Default is False. + + Returns: + str: the full path created. + """ return self._mk_path(self.info_dir, *path, is_dir=is_dir) - def mk_cpath(self, *path, is_dir=False): + def mk_cpath(self, *path: str, is_dir: bool = False) -> str: + """ + Creates a directory path within the experiment's cache directory. + + Args: + *path: tuple of path strings to be joined. + is_dir: boolean flag indicating whether the path is a directory. Default is False. + + Returns: + str: the full path created. 
+ """ return self._mk_path(self.cache_dir, *path, is_dir=is_dir) - def mk_bpath(self, *path, is_dir=False): + def mk_bpath(self, *path: str, is_dir: bool = False) -> str: + """ + Creates a directory path within the experiment's blob directory. + + Args: + *path: tuple of path strings to be joined. + is_dir: boolean flag indicating whether the path is a directory. Default is False. + + Returns: + str: the full path created. + """ return self._mk_path(self.blob_dir, *path, is_dir=is_dir) - def mk_rpath(self, *path, is_dir=False): + def mk_rpath(self, *path: str, is_dir: bool = False) -> str: + """ + Creates a directory path within the user's home directory. + + Args: + *path: tuple of path strings to be joined. + is_dir: boolean flag indicating whether the path is a directory. Default is False. + + Returns: + str: the full path created. + """ return self._mk_path(libhome(), *path, is_dir=is_dir) @classmethod @@ -605,10 +710,10 @@ def rerun(self, arg_list: List[str]): new_test_name = self._create_test_name(self.exp_dir) new_exp = Experiment(self.exp_name, test_name=new_test_name) self.dump_info('deprecated', {'rerun_at': {new_exp.test_name: True}}, append=True) - old_rerun_info = self.properties.get('rerun', None) + old_rerun_info = self.properties.get('rerun', {}) count = 1 - if old_rerun_info is not None: - count += old_rerun_info['count'] + if isinstance(old_rerun_info, dict): + count += old_rerun_info.get('repeat', 0) new_exp.dump_info('rerun', {'from': self.test_name, 'repeat': count}) from lumo.utils.subprocess import run_command old_exec = self.properties['execute'] @@ -639,11 +744,15 @@ def initial(self): 'obj': runtime_pid_obj(), }) + self.dump_tags() + # register start - self.dump_info('progress', {'start': strftime(), 'finished': False}, append=True) + self.dump_info('progress', {'start': strftime()}, append=True) self.dump_progress(0) # register progress - io.dump_text(self.info_dir, self.pid_fn) + self.proc = run_command(f'python3 -m lumo.exp.agent 
--info_dir={self.info_dir}', non_block=True)
+
+        self.add_exit_hook(self.end)

     @call_on_main_process_wrap
     def start(self):
@@ -653,7 +762,6 @@
         if self.properties.get('progress', None) is not None:
             return
         self.initial()
-        self.set_prop('start', True)
         for hook in self._hooks.values():  # type: BaseExpHook
             hook.on_start(self)
         return self
@@ -668,20 +776,20 @@
             *args: Additional arguments to pass to the end hooks.
             **extra: Additional keyword arguments to pass to the end hooks.
         """
-        if not self.is_alive:
-            return
-        if not self.properties.get('progress', None) is None:
+        if self.properties.get('progress', None) is None:
             return

         if self.properties['progress'].get('end', False):
             return

-        self.set_prop('end', True)
         if end_code == 0:
             self.dump_progress(1)

-        self.dump_info('progress', {'end': strftime(), 'finished': end_code == 0}, append=True)
+        self.dump_info('progress', {'end': strftime(), 'end_code': end_code}, append=True)
         for hook in self._hooks.values():  # type: BaseExpHook
             hook.on_end(self, end_code=end_code, *args, **extra)
+
+        if getattr(self, 'proc', None) is not None:
+            self.proc.terminate()
         return self

     @call_on_main_process_wrap
@@ -717,12 +824,24 @@
     def add_exit_hook(self, func):
         import atexit

         def exp_func():
             """Function executed before process exit."""
-            func(self)
+            func()

         atexit.register(exp_func)

     @classmethod
     def from_cache(cls, dic: dict):
+        """
+        Creates an Experiment object from a cached dictionary.
+
+        The cached dictionary should have the same format as the one returned by the Experiment.to_cache() method.
+
+        Args:
+            cls: the Experiment class.
+            dic: a dictionary with the cached Experiment data.
+
+        Returns:
+            An Experiment object.
+ """ paths = dic.pop('paths', {}) _ = dic.pop('metrics') self = cls(exp_name=dic['exp_name'], test_name=dic['test_name'], paths=paths) @@ -730,12 +849,12 @@ def from_cache(cls, dic: dict): return self @classmethod - def from_disk(cls, path): + def from_disk(cls, info_dir): """ Creates an Experiment object from a test root directory on disk. Args: - path (str): The path to the test root directory. + info_dir (str): The path to the test root directory. Returns: Experiment: An Experiment object created from the test root directory. @@ -743,13 +862,13 @@ def from_disk(cls, path): Raises: ValueError: If the path is not a valid test root directory. """ - from .finder import is_test_root - if not is_test_root(path): - raise ValueError(f'{path} is not a valid test_root') - path = os.path.abspath(path) - exp_dir = os.path.dirname(path) + from lumo.exp.watch import is_test_root + if not is_test_root(info_dir): + raise ValueError(f'{info_dir} is not a valid test_root') + info_dir = os.path.abspath(info_dir) + exp_dir = os.path.dirname(info_dir) - paths_fn = os.path.join(path, 'info', f'paths.json') + paths_fn = os.path.join(info_dir, 'info', f'paths.json') if os.path.exists(paths_fn): try: paths = io.load_json(paths_fn) @@ -758,7 +877,7 @@ def from_disk(cls, path): else: paths = {} - self = cls(os.path.basename(exp_dir), test_name=os.path.basename(path), paths=paths) + self = cls(os.path.basename(exp_dir), test_name=os.path.basename(info_dir), paths=paths) # load prop for f in os.listdir(self.mk_ipath('info', is_dir=True)): @@ -774,18 +893,31 @@ def from_disk(cls, path): return self def cache(self): + """Cache information of current test.""" return { **self.properties, 'metrics': self.metric.value, } def dict(self): + """Get full information of current test, including dynamic status.""" return { **self.properties, 'is_alive': self.is_alive, 'metrics': self.metric.value, } + def backup(self, backend: str = 'github', **kwargs): + """ + Backup this experiment into the given 
target. Currently only GitHub is supported; you can implement your own backend
+        using the information the Experiment provides.
+        """
+        from .backup import backup_regist
+        if backend == 'github':
+            kwargs.setdefault('access_token', glob['github_access_token'])
+
+        return backup_regist[backend](exp=self, **kwargs)
+

 class SimpleExperiment(Experiment):
     """
@@ -793,8 +925,8 @@ class SimpleExperiment(Experiment):
     execute before and after the experiment.
     """

-    def __init__(self, exp_name: str, root=None):
-        super().__init__(exp_name, root)
+    def __init__(self, exp_name: str = None, test_name=None, paths=None, info_dir=None):
+        super().__init__(exp_name, test_name, paths, info_dir)
         from . import exphook
         self.set_hook(exphook.LastCmd())
         self.set_hook(exphook.LockFile())
diff --git a/src/lumo/exp/exphook.py b/src/lumo/exp/exphook.py
index 68500fbb..f0c25d04 100644
--- a/src/lumo/exp/exphook.py
+++ b/src/lumo/exp/exphook.py
@@ -89,11 +89,13 @@ def exc_end(self, exc_type, exc_val, exc_tb):
         import traceback
         res = traceback.format_exception(exc_type, exc_val, exc_tb)
         res = [i for i in res if 'in _newfunc' not in i]
-        self.exp.dump_string('exception', "".join(res))
-        self.exp.end(
-            end_code=1,
-            exc_type=traceback.format_exception_only(exc_type, exc_val)[-1].strip()
-        )
+
+        self.exp.dump_info('exception', {
+            'exception_type': exc_type.__name__,
+            'exception_content': "".join(res)
+        })
+
+        self.exp.end(end_code=1)


 # class TimeMonitor(ExpHook):
@@ -196,11 +198,9 @@ def on_start(self, exp: Experiment, *args, **kwargs):
         'joblib',
         'fire',
         'psutil',
-        'accelerate',
         'hydra',
         'omegaconf',
         'decorator',
-        'numpy',
         'torch',
     )
diff --git a/src/lumo/exp/finder.py b/src/lumo/exp/finder.py
index 53989174..dec377f6 100644
--- a/src/lumo/exp/finder.py
+++ b/src/lumo/exp/finder.py
@@ -11,9 +11,9 @@
 import os
 from typing import List, Dict, Any

+from lumo.exp.watch import is_test_root, is_test_name
 from lumo.proc.path import libhome, exproot, metricroot
 from lumo.utils.fmt import indent_print
-from 
lumo.utils import re from . import Experiment @@ -112,33 +112,6 @@ def find_path_from_test_name(test_name: str) -> str: return None -def is_test_name(test_name: str) -> bool: - """ - Determines if the specified string is a valid test name. - - Args: - test_name: The string to check. - - Returns: - True if the string is a valid test name, False otherwise. - """ - return re.search(r'^\d{6}\.\d{3}\.[a-z\d]{2}t$', test_name) is not None - - -def is_test_root(path: str) -> bool: - """ - Determines if the specified path is a valid test root. - - Args: - path: The path to check. - - Returns: - True if the path is a valid test root, False otherwise. - """ - test_name = os.path.basename(path.rstrip('/')) - return is_test_name(test_name) - - def retrieval_test_root(test_flag: str) -> str: """ Returns the test root directory for the specified test name or test root. diff --git a/src/lumo/exp/lazy_panel.py b/src/lumo/exp/lazy_panel.py new file mode 100644 index 00000000..7fe49ae3 --- /dev/null +++ b/src/lumo/exp/lazy_panel.py @@ -0,0 +1,258 @@ +import numbers + +try: + import panel as pn +except ImportError as e: + raise ImportError('The experiment panel is supported by panel, ' + 'you should use it by `pip install panel` first.') from e + +import pandas as pd +from panel.models.tabulator import TableEditEvent + +import numpy as np +from bokeh.models.widgets.tables import HTMLTemplateFormatter +from lumo import Experiment + +css = ''' +.tabulator-tableholder { + height: fit-content !important; +} + +.tabulator-row .tabulator-cell { + overflow: visible !important; + vertical-align: top; + min-height: 20px; + height: fit-content !important; +} + +.bk { + height: fit-content !important; +} + +.tabulator .tabulator-col-resize-handle { + height: fit-content !important; + +} + +.tabulator-cell .tabulator-editable { + width: fit-content !important; + +} + +''' + +pn.extension('tabulator', raw_css=[css], css_files=[pn.io.resources.CSS_URLS['font-awesome']]) + + +def 
FoldDictFormatter(column_name):
+    """unfold dict in summary tag"""
+    base_template = """
+    <details>
+    <summary>{column_name}</summary>
+    <% _.each(value, function(vv, key) { %>
+    <br/>&nbsp;&nbsp;• <%= key %>: <%= vv %>
+    <% }); %>
+    </details>
+    """
+    return HTMLTemplateFormatter(template=base_template.replace('{column_name}', column_name))
+
+
+def DictFormatter(column_name):
+    """unfold dict in list tag"""
+    base_template = """
+    <% _.each(value, function(vv, key) { %>
+    <br/>&nbsp;&nbsp;• <%= key %>: <%= vv %>
+    <% }); %>
+    """
+    return HTMLTemplateFormatter(template=base_template)
+
+
+class ExceptionFormatter:
+    """format exception"""
+    base_template = """
+    <details>
+    <summary><%= value['exception_type'] %></summary>
+    <pre><%= value['exception_content'] %></pre>
+    </details>
+    """
+
+
+progress_formatter = HTMLTemplateFormatter(
+    template="""
    + """ +) + +tabulator_formatters = { + 'metrics': DictFormatter(column_name='metrics'), + 'progress_': FoldDictFormatter(column_name='progress'), + 'progress': progress_formatter, + 'params': FoldDictFormatter(column_name='params'), + 'exception': HTMLTemplateFormatter(template=ExceptionFormatter.base_template), +} + +tabulator_editors = { + 'exp_name': None, + 'test_name': None, + 'params': None, + 'metrics': None, + 'progress': None, + 'progress_': None, + 'exception': None, + # 'note': {'type': 'list', 'valuesLookup': True}, + # 'tags': {'type': 'list', 'valuesLookup': True}, +} + + +def drop_nonscalar_metric(dic): + """only keep scalar metrics""" + if not isinstance(dic, dict): + return + + for k, v in list(dic.items()): + if not isinstance(v, numbers.Number): + dic.pop(k) + return dic + + +def reformat_progress(dic): + """make progress""" + if not isinstance(dic, dict): + return { + 'ratio': '100%', + 'color': 'yellow', + } + + ratio = dic.get('ratio', 0) + end_code = dic.get('end_code', None) + + # normal end -> end_code == 0 -> green + # interrupt by exception -> end_code > 0 -> red + # running -> end_code is None -> blue + # killed without any signal ( in 5 minute ) -> end_code is None -> blue + # killed and dumped by watcher -> end_code == -10 -> red + + if end_code is None: + color = 'blue' + elif ratio == 1: + color = 'green' + elif end_code == 0: + color = 'green' + elif end_code > 0 or end_code < 0: + color = 'red' + if ratio == 0: + ratio = 0.01 + else: + color = 'black' + + return { + 'ratio': f'{ratio:2%}', + 'color': color, + } + + +def make_experiment_tabular(df: pd.DataFrame): + """ + {'start': '23-03-14-224651', + 'finished': False, + 'ratio': 0, + 'last_edit_time': '23-03-14-224651', + 'update_from': None} + + Args: + df: + + Returns: + + """ + df = df.reset_index(drop=True) + # process exception + if 'exception' not in df.columns: + df['exception'] = {} + else: + df['exception'].fillna({}) + + if 'tags' not in df.columns: + df['tags'] = 
np.empty((len(df.index), 0)).tolist()
+
+    df['metrics'] = df['metrics'].apply(drop_nonscalar_metric)
+
+    # df['progress_'] = df['progress']
+    df['progress'] = df['progress'].apply(reformat_progress)
+
+    # ratio = 1
+    # ratio < 1, no-end-code ->
+
+    extra_columns = set(df.columns) - {'git', 'paths', 'pinfo',
+                                       'execute', 'lock', 'params.yaml',
+                                       'params_hash', 'table_row', 'hooks', 'logger_args',
+                                       'agent', 'trainer', 'progress_',
+                                       'start', 'end',
+                                       'tensorboard_args',
+                                       'state',
+                                       'metric_board'}
+    top_columns = [
+        'exp_name', 'test_name',
+        'progress',
+        'exception',
+        'params', 'metrics',
+        'note', 'tags',
+    ]
+
+    extra_columns = extra_columns - set(top_columns)
+    [tabulator_editors.setdefault(k, None) for k in extra_columns]
+
+    df = df[top_columns + list(extra_columns)]
+
+    def on_cell_change(e: TableEditEvent):
+        """on edit cell"""
+        nonlocal df
+
+        if e.column == 'note':
+            Experiment.from_cache(df.iloc[e.row].to_dict()).dump_note(e.value)
+        else:
+            tags = [i.strip() for i in e.value.split(',')]
+            Experiment.from_cache(df.iloc[e.row].to_dict()).dump_tags(tags)
+
+    df_widget = pn.widgets.Tabulator(
+        df,
+        groupby=['exp_name'],
+        # hidden_columns=[],
+        pagination='local',
+        formatters=tabulator_formatters,
+        editors=tabulator_editors,
+        selection=[],
+        show_index=False,
+        configuration={
+            'clipboard': True,
+            'tooltip': True,
+            # 'rowHeight': 100,
+            # 'columnDefaults': {
+            #     'headerSort': False,
+            # },
+        }
+    )
+
+    # def on_cell_click(e):
+    #     nonlocal enable_update
+    #     if e.column in {'note', 'tags'}:
+    #         enable_update = False
+
+    # enable_update = True
+    # df_widget.on_click(on_cell_click)
+    df_widget.on_edit(on_cell_change)
+    #
+    # def update_time():
+    #     if enable_update:
+    #         time_now = dt.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
+    #         time_component.object = f"{time_now}"
+    #
+    # time_component = pn.pane.HTML("current time is")
+    # pn.state.onload(update_time)
+    # pn.state.add_periodic_callback(update_time, 1000)  # update the displayed time every 1000 ms
+    #
+    # widget = pn.Column(time_component, df_widget)  # .servable()
+
+    return df_widget
diff --git a/src/lumo/exp/metric.py b/src/lumo/exp/metric.py
index abf5f618..56cc574c 100644
--- a/src/lumo/exp/metric.py
+++ b/src/lumo/exp/metric.py
@@ -4,28 +4,72 @@
 class Metric:
     """
+    A class that handles metric values and saving/loading them to/from disk.
+
+    Attributes:
+        fn (str): The file path of the metric file.
+        _metric (dict): A dictionary containing the metric values.
+        persistent (bool): A boolean value indicating whether to save the metric values to disk.
     """

     def __init__(self, metric_fn, persistent=True):
+        """
+        Initializes a new instance of the Metric class.
+
+        Args:
+            metric_fn (str): The file path of the metric file.
+            persistent (bool): A boolean value indicating whether to save the metric values to disk.
+                Default is True.
+
+        Returns:
+            None.
+        """
         os.makedirs(os.path.dirname(os.path.abspath(metric_fn)), exist_ok=True)
         self.fn = metric_fn
         self._metric = {}
+        self._last = {}
         if os.path.exists(metric_fn):
             self._metric = IO.load_pkl(metric_fn)
+
         self.persistent = persistent

+    @property
+    def current(self):
+        return self._last
+
     @property
     def value(self):
         """
-        A property that returns the metric values of the row.
+        A property that returns the metric values.

         Returns:
-            dict: A dictionary containing the metric values of the row.
+            dict: A dictionary containing the metric values.
         """
         return self._metric

     def dump_metric(self, key, value, cmp: str, flush=True, **kwargs):
-        dic = self.value
+        """
+        Updates the metric value for a given key.
+
+        If the metric value for the given key is not set or the new value is better than the
+        existing value based on the comparison type specified by cmp, the metric value is updated
+        with the new value. The function returns the updated value.
+ + Args: + key (str): The key for the metric value. + value (float): The new metric value. + cmp (str): The type of comparison to use when updating the metric value. Must be 'max' or 'min'. + flush (bool): A boolean value indicating whether to save the updated metric values to disk. + Default is True. + **kwargs: Additional key-value pairs to store with the metric value. + + Returns: + float: The updated metric value. + + Raises: + NotImplementedError: If cmp is not 'max' or 'min'. + """ + dic = self._metric older = dic.setdefault(key, None) update = False @@ -45,16 +89,39 @@ def dump_metric(self, key, value, cmp: str, flush=True, **kwargs): dic[key] = value for kk, vv in kwargs.items(): dic[kk] = vv + else: + value = older + + self._last[key] = value + for kk, vv in kwargs.items(): + self._last[kk] = vv if flush: self.flush() return value def dump_metrics(self, dic: dict, cmp: str): - for k, v in dic.items(): - self.dump_metric(k, v, cmp) + """ + Updates multiple metric values with a dictionary. + + The function calls dump_metric for each key-value pair in the input dictionary and returns + a dictionary containing the updated metric values. + + Args: + dic (dict): A dictionary containing the key-value pairs to update. + cmp (str): The type of comparison to use when updating the metric values. Must be 'max' or 'min'. + + Returns: + dict: A dictionary containing the updated metric values. + """ + res = {k: self.dump_metric(k, v, cmp, flush=False) + for k, v in dic.items()} + self.flush() + return res def flush(self): - """Writes the value of the row to a file.""" + """ + Writes the metric values to a file. + """ if self.persistent: IO.dump_pkl(self.value, self.fn) diff --git a/src/lumo/exp/watch.py b/src/lumo/exp/watch.py index e2dc8675..d853dbf4 100644 --- a/src/lumo/exp/watch.py +++ b/src/lumo/exp/watch.py @@ -1,28 +1,10 @@ -""" -Watcher 可以在运行实验后在 jupyter 或者网页上展示正在运行和已经运行结束的实验(按时间顺序?) 
-以及可以简化记录实验的烦恼 - -现在的核心痛点是 - - [ ] 所有元信息都有了,但是找不到哪个实验是哪个实验 - - [ ] 同时跑的多个实验有一个失败了,重跑时会混淆,或许需要一种覆盖手段 -> - - > 怎么 rerun? - lumo rerun test_name √ - lumo note html (用 streamlit 之类的生成动态网页) - lumo note cmd (类似 top 的视角,按时间顺序排列) -- > rerun 将已经跑的实验 move - -可以代替 analysis 的作用。主要有 - --> 按照 progress 目录,获取所有的实验 --> 根据获取的实验,按顺序记录 --> 每次只记录 - -""" import numbers +import os import os.path +import re from typing import List, Dict, overload from pprint import pformat -import pandas as pd + from dbrecord import PDict from datetime import datetime from operator import gt, ge, le, lt, eq, ne @@ -31,6 +13,8 @@ from .experiment import Experiment from lumo.utils import safe_io as IO from lumo.utils.fmt import format_timedelta, strptime, strftime +from lumo.proc.tz import timezone +from lumo.proc import glob PID_ROOT = os.path.join(progressroot(), 'pid') HB_ROOT = os.path.join(progressroot(), 'hb') @@ -74,6 +58,53 @@ def not_in_(ser, value): class Condition: + """ + Represents a condition to filter data based on a certain criteria. 
+
+    row filter:
+    ```
+    from lumo import C
+    import pandas as pd
+
+    # create a sample DataFrame
+    data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
+            'age': [25, 30, 35, 40, 45],
+            'city': [{'index':0}, {'index':1}, {'index':2},{'index':3},{'index':4}]}
+    df = pd.DataFrame(data)
+
+    # create and apply the condition to filter the DataFrame
+    filtered_df = (C['age'] >= 35).apply(df)
+
+    # print the filtered DataFrame
+    print(filtered_df)
+    ```
+
+    column edit:
+    ```
+    (C+{'city.index':'cindex'}).apply(df).columns
+    # Index(['name', 'age', 'city', 'cindex'], dtype='object')
+
+    (-C['city']).apply(df).columns
+    # Index(['name', 'age'], dtype='object')
+
+    (C-['city']).apply(df).columns
+    (C-'city').apply(df).columns
+    # Index(['name', 'age'], dtype='object')
+
+    (C-['city','name']).apply(df).columns
+    # Index(['age'], dtype='object')
+    ```
+
+    pipeline:
+    ```
+    C.pipe(df,[
+        (C['age']>35),
+        C+{'city.index':'cindex'},
+        C-['city','name']
+    ])
+    ```
+    """
+
     def __init__(self, name: str = None, value=None, op=None):
         self.name = name
         self.value = value
@@ -85,9 +116,35 @@ def __getattr__(self, item):
     def __getitem__(self, item):
         return Condition(item)

+    def __add__(self, other):
+        c = Condition()
+        c.op = 'add_column'
+        c.value = {}
+        if isinstance(other, str):
+            c.value[other] = other
+        elif isinstance(other, dict):
+            c.value.update(other)
+        else:
+            raise NotImplementedError()
+        return c
+
+    def __sub__(self, other):
+        c = Condition()
+        c.name = None
+        c.op = 'drop_column'
+        c.value = {}
+        if isinstance(other, str):
+            c.value.update({other: None})
+        elif isinstance(other, (list, set, dict)):
+            c.value.update({k: None for k in other})
+        else:
+            raise NotImplementedError()
+        return c
+
     def __neg__(self):
-        self.drop = True
-        return self
+        c = Condition(self.name)
+        c.op = 'drop_column'
+        return c

     def __ge__(self, other):
         if other is None:
@@ -127,47 +184,115 @@ def __lt__(self, other):
         return self

     def __repr__(self):
-        return f'C({self.name} {self.op} {self.value})'
+        
return f'C({self.name}, {self.value}, {self.op})' def in_(self, lis): - """condition of `in` operation""" + """ + Sets the condition to evaluate if the value is in a given list. + + Args: + lis (list): the list of values to compare against. + + Returns: + The current instance of the Condition class with the comparison operator and value set. + """ self.op = 'in' self.value = set(lis) return self def not_in_(self, lis): - """condition of `.duplicated(value) == False` operation""" + """ + Sets the condition to evaluate if the value is not in a given list. + + Args: + lis (list): the list of values to compare against. + + Returns: + The current instance of the Condition class with the comparison operator and value set. + """ self.op = 'notin' self.value = set(lis) return self def mask(self, df): + """ + Returns a boolean mask of the given DataFrame based on the condition. + + Args: + df (pd.DataFrame): the DataFrame to evaluate. + + Returns: + A boolean mask of the given DataFrame based on the condition. 
+        """
+        import pandas as pd
         names = self.name.split('.')
         value = df
         for i in names:
             if isinstance(value, pd.DataFrame):
                 value = value[i]
             else:
-                value = df.apply(lambda x: x[i])
+                value = value.apply(lambda x: x[i])
 
         return mapping[self.op](value, self.value)
 
+    def capply(self, df):
+        """apply operations in column axis"""
+        import pandas as pd
+        df = df.reset_index(drop=True)
+        if self.op == 'drop_column':
+            if isinstance(self.name, str):
+                var = [self.name]
+            elif isinstance(self.value, str):
+                var = [self.value]
+            else:
+                var = list(self.value)
+            df = df.drop(var, axis=1)
+        else:
+            assert isinstance(self.value, dict)
+            for name, aim in self.value.items():
+                names = name.split('.')
+                value = df
+                for i in names:
+                    if isinstance(value, pd.DataFrame):
+                        value = value[i]
+                    else:
+                        value = value.apply(lambda x: x[i])
+                df[aim] = value
+        return df
+
     def apply(self, df):
-        return df[self.mask(df)]
+        """Returns a new DataFrame with only the rows that meet the condition."""
+        if not self.op.endswith('column'):
+            return df[self.mask(df)].reset_index(drop=True)
+        else:
+            return self.capply(df)
+
+    def pipe(self, df, conditions: List['Condition']):
+        """Applies a list of conditions to a DataFrame in order."""
+        filtered_df = df
+        for condition in conditions:
+            filtered_df = condition.apply(filtered_df)
+        return filtered_df
 
 
 C = Condition()
 
 
 class Watcher:
-    """List and watch experiments with time order
+    """
+    A class for listing and watching experiments with time order and caching
+    test information in 'metrics/.sqlite'.
 
-    Cache test_information in
-    metrics/.sqlite
+    Attributes:
+        exp_root (str): The root directory to search for experiments.
+        hb_root (str): The root directory to search for heartbeat files.
+        pid_root (str): The root directory to search for PID files.
+        db_root (str): The root directory to store the experiment databases.
""" def __init__(self, exp_root=None, hb_root=None, pid_root=None, db_root=None): if exp_root is None: - exp_root = os.path.join(exproot(), 'hb') + exp_root = exproot() if hb_root is None: hb_root = os.path.join(cache_dir(), 'heartbeat') @@ -182,11 +307,19 @@ def __init__(self, exp_root=None, hb_root=None, pid_root=None, db_root=None): self.hb_root = hb_root self.pid_root = pid_root - def load(self): - res = {} + def retrieve(self, test_name=None) -> Experiment: + """retrieve the Experiment of given test name""" + df = self.load() + res = df[df['test_name'] == test_name] + if len(res) > 0: + return Experiment.from_cache(res.iloc[0].to_dict()) + return None + + def update(self): + """Diff & Update""" updates = {} if not os.path.exists(self.hb_root): - return pd.DataFrame() + return {} for root, dirs, fs in os.walk(self.hb_root): if root == self.hb_root: continue @@ -199,338 +332,169 @@ def load(self): updates.setdefault(exp.exp_name, []).append(exp.cache()) except KeyboardInterrupt as e: raise e - except: + except Exception as e: + print(e) continue + finally: + os.remove(hb_file) for exp_name, tests in updates.items(): dic = PDict(os.path.join(self.db_root, f'{exp_name}.sqlite')) - for test_name, test_prop in dic.items(): - res[test_name] = test_prop for test in tests: dic[test['test_name']] = test - res[test['test_name']] = test dic.flush() + return updates - df = pd.DataFrame(res.values()) - df = df.sort_values(['exp_name', 'test_name']) - return df.reset_index(drop=True) + def fullupdate(self): + """re-update all experiment from original experiment directories""" + updates = {} + for root, dirs, fs in os.walk(self.exp_root): + for f in dirs: + if is_test_name(f): + info_dir = os.path.join(root, f) + try: + exp = Experiment.from_disk(info_dir) + updates.setdefault(exp.exp_name, []).append(exp.cache()) + except KeyboardInterrupt as e: + raise e + except Exception as e: + print(e) + continue - def progress(self, is_alive=True): - """return the alive process""" + 
for exp_name, tests in updates.items():
+            dic = PDict(os.path.join(self.db_root, f'{exp_name}.sqlite'))
+            dic.clear()
+
+            for test in tests:
+                dic[test['test_name']] = test
+            dic.flush()
+
+        return updates
+
+    def load(self, with_pandas=True):
+        """
+        Loads the experiment information from heartbeat files and the experiment databases.
+
+        Args:
+            with_pandas (bool, optional): whether to return the experiment information as a pandas DataFrame.
+                Defaults to True.
+
+        Returns:
+            If with_pandas is True, returns a pandas DataFrame containing the experiment information
+            sorted by experiment name and test name. Otherwise, returns a dictionary containing the
+            experiment information.
+        """
+        res = {}
+        updates = self.update()
+
+        def valid_row(dic):
+            """is a valid row?"""
+            return isinstance(dic, dict) and 'test_name' in dic
+
+        for dic_fn in os.listdir(self.db_root):
+            if not dic_fn.endswith('sqlite'):
+                continue
+            dic = PDict(os.path.join(self.db_root, dic_fn))
+            exp_name = os.path.splitext(dic_fn)[0]
+
+            for test_name, test_prop in dic.items():
+                if valid_row(test_prop):
+                    res[test_name] = test_prop
+
+            if exp_name in updates:
+                for test in updates[exp_name]:
+                    if valid_row(test):
+                        res[test['test_name']] = test
+                dic.flush()
+
+        if with_pandas:
+            try:
+                import pandas as pd
+            except ImportError as e:
+                raise ImportError(
+                    'with_pandas=True requires pandas to be installed, '
+                    'use `pip install pandas` or call `.load(with_pandas=False)`') from e
+
+            df = pd.DataFrame(res.values())
+            df = df.sort_values(['exp_name', 'test_name'])
+            return df.reset_index(drop=True)
+        else:
+            return res
+
+    def progress(self, with_pandas=True):
+        """
+        Returns the currently alive experiments.
+
+        Args:
+            with_pandas (bool, optional): whether to return the result as a pandas DataFrame.
+                Defaults to True.
+
+        Returns:
+            A pandas DataFrame (or a list of dicts when with_pandas=False) containing the information of alive experiments.
+ """ res = [] for root, dirs, fs in os.walk(self.pid_root): for f in fs: if not f.endswith('.pid'): continue try: - test_root = IO.load_text(os.path.join(root, f)) + pid_f = os.path.join(root, f) + test_root = IO.load_text(pid_f) exp = Experiment.from_disk(test_root) - if exp.is_alive == is_alive: + + if exp.is_alive: res.append(exp.dict()) + elif exp.properties['progress'].get('end_code', None) is None: + if (datetime.timestamp(datetime.now()) - os.stat(pid_f).st_mtime) < glob.get( + 'ALIVE_SECONDS', 60 * 5): # powered by exp/agent.py + res.append(exp.dict()) + else: + exp.dump_info('progress', + { + 'end': strftime(), 'end_code': -10, + 'msg': 'ended by watcher'} + , append=True + ) + os.remove(pid_f) except: continue - return pd.DataFrame(res) - - def interactive(self): - """interactive, mark, label, note in ipython environment.""" - pass + if with_pandas: + import pandas as pd + return pd.DataFrame(res) + else: + return res def server(self): """simple server which make you note your experiments""" pass - def list_all(self, exp_root=None, limit=100) -> Dict[str, List[Experiment]]: - """ - Returns a dictionary of all experiments under exp_root directory. - - Args: - exp_root: The root directory to search for experiments. Default is None, which uses the default experiment root directory. - - Returns: - A dictionary of all experiments, where the keys are the names of the experiments and the values are lists of corresponding Experiment objects. 
-        """
+    def panel(self, df=None):
+        """Create a dashboard powered by Panel; requires the `panel` library to be installed first."""
+        from .lazy_panel import make_experiment_tabular
+        if df is None:
+            df = self.load()
+        widget = make_experiment_tabular(df)
+        return widget
 
-    def widget(self,
-               is_finished: bool = None,
-               is_alive: bool = None,
-               time_filter: list = None,
-               params_filter: list = None,
-               metric_filter: list = None
-               ):
-        assert params_filter is None or isinstance(params_filter, list)
-        assert metric_filter is None or isinstance(metric_filter, list)
-        from ipywidgets import widgets, interact, Label
-        from IPython.display import display
-
-        def make_row(dic: dict):
-            exp = Experiment.from_cache(dic.copy())
 
+def is_test_root(path: str) -> bool:
+    """
+    Determines if the specified path is a valid test root.
 
-            def on_note_update(sender):
-                exp.dump_note(sender['new'])
+    Args:
+        path: The path to check.
 
-            def on_tag_update(sender):
-                exp.dump_tags(*sender['new'])
+    Returns:
+        True if the path is a valid test root, False otherwise.
+    """
+    test_name = os.path.basename(path.rstrip('/'))
+    return is_test_name(test_name)
 
-            note_ui = widgets.Textarea(dic['note'])
-            note_ui.continuous_update = False
-            note_ui.observe(on_note_update, names='value', type='change')
 
+def is_test_name(test_name: str) -> bool:
+    """
+    Determines if the specified string is a valid test name.
- tags = dic.get('tags', []) - try: - tags = list(tags) - except: - tags = [] - tag_ui = widgets.TagsInput(value=tags) - tag_ui.observe(on_tag_update, names='value', type='change') - - now = datetime.now() - start = strptime(datestr=dic['progress']['start']) - end = strptime(datestr=dic['progress']['last_edit_time']) - - human = widgets.VBox([ - - ]) - return [ - widgets.Label(dic['exp_name']), - widgets.Label(dic['test_name']), - widgets.Label(f"""{strftime('%y-%m-%d %H:%M:%S', dateobj=start)}"""), - widgets.Label(f"""{strftime('%y-%m-%d %H:%M:%S', dateobj=end)}"""), - widgets.HTML('\n'.join([ - f'{k}: {v}' - for k, v in dic['metrics'].items() - if isinstance(v, numbers.Number) - ])), - widgets.HBox([note_ui, - tag_ui, ]) - - ] - - test_status = widgets.RadioButtons(options=['full', 'running', 'failed', 'succeed', 'finished']) - start_filter = widgets.DatetimePicker() - end_filter = widgets.DatetimePicker() - - def status_filter(sender): - print(sender) - make() - - test_status.observe(status_filter, names='value', type='change') - - # display() - - @interact - def make( - status=widgets.RadioButtons(options=['full', 'running', 'failed', 'succeed', 'finished']), - start=widgets.DatetimePicker(), - end=widgets.DatetimePicker(), - ): - if status == 'running': - df = self.progress() - elif status == 'finished': - df = self.progress(is_alive=False) - else: - df = self.load() - if status == 'succeed': - df = df[df['progress'].apply(lambda x: x['finished'])] - elif status == 'failed': - df = df[df['exception'].isna() == False] - - if start: - df = df.pipe( - lambda x: x[x['progress'].apply(lambda y: strptime(datestr=y['start'])) > start] - ) - if end: - df = df.pipe( - lambda x: x[x['progress'].apply(lambda y: strptime(datestr=y['end'])) < end] - ) - - if params_filter is not None: - df_params = df['params'] - masks = None - for condition in params_filter: - mask = condition.mask(df_params) - if masks is None: - masks = mask - else: - masks *= mask - df = df[masks] - - 
if metric_filter is not None: - df_params = df['metrics'] - masks = None - for condition in metric_filter: - mask = condition.mask(df_params) - if masks is None: - masks = mask - else: - masks *= mask - df = df[masks] - - exps = df.to_dict(orient='records') - # grid = widgets.GridspecLayout(len(exps) + 1, 7) - - children = [ - widgets.Label('exp_name'), - widgets.Label('test_name'), - widgets.Label('start'), - widgets.Label('end'), - widgets.Label('metrics'), - widgets.Label('note & tags'), - ] - # grid[0, 0] = widgets.Label('Meta') - # grid[0, 1] = widgets.Label('Metrics') - # grid[0, 2] = widgets.Label('Notes') - for i, exp in enumerate(exps, start=1): - row = make_row(exp) - children.extend(row) - # display(widgets.HBox(row)) - # for j, item in enumerate(row): - # grid[i, j] = item - - grid = widgets.GridBox(children=children, - - layout=widgets.Layout( - width='100%', - grid_template_columns=' '.join(['auto'] * 5) + ' auto', - # grid_template_rows='80px auto 80px', - grid_gap='5px 10px') - ) - display( - widgets.HTML(""" - - """), - grid, - - ) - - # return display( - # widgets.HTML(styles['row-radio']), - # widgets.HTML(""" - # - # """), - # grid, clear=True) - - -class ExperimentWidget: - @overload - def __init__(self, exp_name, test_name, - progress: dict, - params: dict, metrics: dict, note: str, tags: set, exp: Experiment): - pass + Args: + test_name: The string to check. 
- def __init__(self, **kwargs): - from ipywidgets import widgets - self.wid = widgets - self.exp = kwargs.pop('exp') # type: Experiment - self._prop = kwargs - - self._widgets = { - 'exp_name': widgets.HTML(self._prop['exp_name']), - 'test_name': widgets.HTML(self._prop['test_name']), - 'metrics': widgets.VBox( - [widgets.HTML(f'{k}: {v}') for k, v in self._prop['metrics'].items() if - isinstance(v, numbers.Number)]), - } - - self._params_widgets = {} - - note_ui = widgets.Textarea(self._prop['note']) - - note_ui.continuous_update = False - note_ui.observe(self.on_note_update, names='value', type='change') - self._widgets['note'] = note_ui - - tag_ui = widgets.TagsInput(value=list(self._prop['tags'])) - self._widgets['tags'] = tag_ui - tag_ui.observe(self.on_tag_update, names='value', type='change') - - def on_note_update(self, sender): - self.exp.dump_note(sender['new']) - - def on_tag_update(self, sender): - self.exp.dump_tags(*sender['new']) - - def set_key_params(self, keys: list): - self._params_widgets.clear() - for key in keys: - self._params_widgets[key] = self.wid.HTML( - f"""{key}: {pformat(self._prop['params'][key], width=10, indent=2, compact=True)}""") - - def sep(self): - return self.wid.Output(layout={'border': '1px solid black'}) - - def id_flag(self): - return self.wid.VBox([ - self._widgets['exp_name'], - self._widgets['test_name'], - ]) - - def key_params(self): - return self.wid.VBox([ - *self._params_widgets.values() - ]) - - def editable(self): - return self.wid.VBox([ - self._widgets['note'], - self.sep(), - self._widgets['tags'], - ]) - - def time(self): - now = datetime.now() - start = strptime(datestr=self._prop['progress']['start']) - end = strptime(datestr=self._prop['progress']['start']) - return self.wid.VBox([ - self.wid.HTML(f"""Start at: {format_timedelta(now - start)}"""), - self.wid.HTML(f"""End at: {format_timedelta(now - end)}"""), - ]) - - def widget_dict(self): - return { - 'id_flag': self.id_flag(), - 'time': self.time(), - 
'editable': self.editable(), - 'params': self.key_params(), - } - - def widget(self): - params = self.key_params() - params = [ - self.sep(), - params, - ] - - hbox = self.wid.HBox([ - self.id_flag(), - self.time(), - self._widgets['metrics'], - self.editable(), - self.key_params(), - ]) - - return hbox - - @classmethod - def from_experiment(cls, exp: Experiment): - tags = exp.properties.get('tags', []) - try: - tags = set(tags) - except: - tags = set() - return cls( - exp_name=exp.exp_name, - test_name=exp.test_name, - progress=exp.properties.get('progress', {}), - params=exp['params'], - metrics=exp.metric.value, - note=exp.properties.get('note', ''), - tags=tags, - exp=exp, - ) + Returns: + True if the string is a valid test name, False otherwise. + """ + return re.search(r'^\d{6}\.\d{3}\.[a-z\d]{2}t$', test_name) is not None diff --git a/src/lumo/proc/dist.py b/src/lumo/proc/dist.py index f960bffc..30030190 100644 --- a/src/lumo/proc/dist.py +++ b/src/lumo/proc/dist.py @@ -1,4 +1,5 @@ import os +import torch from torch import distributed as dist @@ -75,3 +76,22 @@ def is_main(): """ return local_rank() <= 0 + + +def gather(tensor): + """ + Gather the tensor data across all processes in the distributed setting. + + Args: + tensor (torch.Tensor): The tensor to be gathered. + + Returns: + The gathered tensor data. + """ + if dist.is_initialized(): + if tensor.ndim == 0: + tensor = tensor.clone()[None] + output_tensors = [tensor.clone() for _ in range(torch.distributed.get_world_size())] + torch.distributed.all_gather(output_tensors, tensor) + return torch.cat(output_tensors, dim=0) + return tensor diff --git a/src/lumo/proc/path.py b/src/lumo/proc/path.py index 8cff01b1..55547bac 100644 --- a/src/lumo/proc/path.py +++ b/src/lumo/proc/path.py @@ -112,6 +112,7 @@ def metricroot(): def dbroot(): + """Root path to store experiment information. 
Default is `~/.lumo/database`""" DB_ROOT = glob.get('db_root', None) if DB_ROOT: res = DB_ROOT diff --git a/src/lumo/proc/tz.py b/src/lumo/proc/tz.py new file mode 100644 index 00000000..d3eb7d7b --- /dev/null +++ b/src/lumo/proc/tz.py @@ -0,0 +1,12 @@ +from .config import glob +import pytz + + +def timezone(): + """ + Get the timezone from the global configuration, or default to 'Asia/Shanghai'. + + Returns: + The timezone. + """ + return pytz.timezone(glob.get('timezone', 'Asia/Shanghai')) diff --git a/src/lumo/sketch/vis/parser.py b/src/lumo/sketch/vis/parser.py index b1f94068..7d47aea8 100644 --- a/src/lumo/sketch/vis/parser.py +++ b/src/lumo/sketch/vis/parser.py @@ -23,25 +23,25 @@ class Step: step: int -def find_metric_fron_test_root(test_root): - test_root = finder.retrieval_test_root(test_root) - if test_root is None: - return False, {} - - exp = Experiment.from_disk(test_root) - if exp.has_prop('tensorboard_args'): - tb = exp.properties.get('tensorboard_args') - metrics = parse_fron_tensorboard(tb['log_dir']) - elif exp.has_prop('logger_args'): - tb = exp.properties.get('logger_args') - metrics = parse_from_log(tb['log_dir']) - else: - fs = [i for i in os.listdir(exp.test_root)] - if len([f for f in fs if f.endswith('.log')]) > 0: - metrics = parse_from_log(os.path.join(exp.test_root, fs[0])) - else: - metrics = {} - return True, metrics +# def find_metric_fron_test_root(test_root): +# test_root = finder.retrieval_test_root(test_root) +# if test_root is None: +# return False, {} +# +# exp = Experiment.from_disk(test_root) +# if exp.has_prop('tensorboard_args'): +# tb = exp.properties.get('tensorboard_args') +# metrics = parse_fron_tensorboard(tb['log_dir']) +# elif exp.has_prop('logger_args'): +# tb = exp.properties.get('logger_args') +# metrics = parse_from_log(tb['log_dir']) +# else: +# fs = [i for i in os.listdir(exp.test_root)] +# if len([f for f in fs if f.endswith('.log')]) > 0: +# metrics = parse_from_log(os.path.join(exp.test_root, fs[0])) +# else: 
+# metrics = {} +# return True, metrics def parse_from_log(log) -> Dict[str, List[Step]]: diff --git a/src/lumo/trainer/accelerator.py b/src/lumo/trainer/accelerator.py new file mode 100644 index 00000000..72de3996 --- /dev/null +++ b/src/lumo/trainer/accelerator.py @@ -0,0 +1,297 @@ +import warnings +from torch import nn +import torch +from torch import distributed +from torch.utils.data import DataLoader +from lumo.data.loader import DataLoaderSide +from lumo.proc.dist import gather + + +class Accelerator: + """ + A class to define the interface for various types of accelerator. + + Attributes: + _prop (dict): A dictionary of keyword arguments. + + Methods: + device: A property method to get the device. + set_device: A method to set the device. + prepare_data_loader: A method to prepare the data loader. + prepare_model: A method to prepare the model. + prepare_optimizer: A method to prepare the optimizer. + unwrap_model: A method to unwrap the model. + prepare: A method to prepare the inputs for training. + wait_for_everyone: A method to wait for all processes to synchronize. + gather: A method to gather the tensor data. + backward: A method to compute the gradients using backpropagation. + """ + + def __init__(self, **kwargs): + """ + Initialize the class with a dictionary of keyword arguments. + """ + self._prop = kwargs + + @property + def device(self) -> torch.device: + """ + Get the device. + """ + return self._prop.get('device', None) + + def set_device(self, device: torch.device): + """ + Set the device. + + Args: + device (torch.device): The device to be set. + """ + assert isinstance(device, torch.device) + self._prop['device'] = device + + def prepare_data_loader(self, dataloader): + """ + Prepare the data loader. + + Args: + dataloader: The data loader. + + Returns: + The prepared data loader. + """ + return dataloader + + def prepare_model(self, model: torch.nn.Module): + """ + Prepare the model. + + Args: + model (torch.nn.Module): The model. 
+ + Returns: + The prepared model. + """ + return model.to(self.device) + + def prepare_optimizer(self, optimizer: torch.optim.Optimizer): + """ + Prepare the optimizer. + + Args: + optimizer (torch.optim.Optimizer): The optimizer. + + Returns: + The prepared optimizer. + """ + return optimizer + + def unwrap_model(self, model): + """ + Unwrap the model. + + Args: + model: The model. + + Returns: + The unwrapped model. + """ + return model + + def prepare(self, *args): + """ + Prepare the inputs for training. + + Args: + *args: The inputs. + + Returns: + The prepared inputs. + """ + res = [] + for item in args: + if isinstance(item, nn.Module): + res.append(self.prepare_model(item)) + elif isinstance(item, (DataLoader, DataLoaderSide)): + res.append(self.prepare_data_loader(item)) + elif isinstance(item, torch.optim.Optimizer): + res.append(self.prepare_optimizer(item)) + else: + raise NotImplementedError() + return res + + def wait_for_everyone(self): + """ + Wait for all processes to synchronize. + """ + torch.distributed.barrier() + + def gather(self, tensor: torch.Tensor): + """ + Gather the tensor data. + + Args: + tensor (torch.Tensor): The tensor to be gathered. + + Returns: + The gathered tensor data. + """ + return gather(tensor) + + def backward(self, loss: torch.Tensor, **kwargs): + """ + Compute the gradients using backpropagation. + + Args: + loss (torch.Tensor): The loss tensor. + **kwargs: The additional keyword arguments. + """ + loss.backward(**kwargs) + +class HugAccelerator(Accelerator): + """ + A class to define the interface for Hugging Face accelerator. + + Methods: + set_device: A method to set the device. + prepare_data_loader: A method to prepare the data loader. + prepare_model: A method to prepare the model. + prepare_optimizer: A method to prepare the optimizer. + unwrap_model: A method to unwrap the model. + prepare: A method to prepare the inputs for training. + wait_for_everyone: A method to wait for all processes to synchronize. 
+ gather: A method to gather the tensor data. + backward: A method to compute the gradients using backpropagation. + """ + + def __init__(self, **kwargs): + """ + Initialize the class with a dictionary of keyword arguments. + """ + super().__init__(**kwargs) + from .backend.accelerator import Accelerator + self._backbone = Accelerator() + + @property + def device(self): + """ + Get the device. + """ + return self._backbone.device + + def set_device(self, device: torch.device): + """ + Set the device. + + Args: + device (torch.device): The device to be set. + """ + assert isinstance(device, torch.device) + self._backbone.state.device = device + + def prepare_data_loader(self, loader): + """ + Prepare the data loader. + + Args: + loader: The data loader. + + Returns: + The prepared data loader. + """ + from accelerate.data_loader import DataLoaderShard, DataLoaderDispatcher + if isinstance(loader, (DataLoaderShard, DataLoaderDispatcher)): + warnings.warn('Duplicated prepare a same DataLoader twice, check your code.') + return loader + return self._backbone.prepare_data_loader(loader) + + def prepare_model(self, model): + """ + Prepare the model. + + Args: + model: The model. + + Returns: + The prepared model. + """ + return self._backbone.prepare_model(model) + + def prepare_optimizer(self, optimizer): + """ + Prepare the optimizer. + + Args: + optimizer: The optimizer. + + Returns: + The prepared optimizer. + """ + return self._backbone.prepare_optimizer(optimizer) + + def unwrap_model(self, model): + """ + Unwrap the model. + + Args: + model: The model. + + Returns: + The unwrapped model. + """ + return self._backbone.unwrap_model(model) + + def prepare(self, *args): + """ + Prepare the inputs for training. + + Args: + *args: The inputs. + + Returns: + The prepared inputs. + """ + return self._backbone.prepare(*args) + + def wait_for_everyone(self): + """ + Wait for all processes to synchronize. 
+        """
+        self._backbone.wait_for_everyone()
+
+    def gather(self, tensor):
+        """
+        Gather the tensor data.
+
+        Args:
+            tensor: The tensor to be gathered.
+
+        Returns:
+            The gathered tensor data.
+        """
+        return self._backbone.gather(tensor)
+
+    def backward(self, loss: torch.Tensor, **kwargs):
+        """
+        Compute the gradients using backpropagation.
+
+        Args:
+            loss (torch.Tensor): The loss tensor.
+            **kwargs: The additional keyword arguments.
+        """
+        self._backbone.backward(loss, **kwargs)
+
+
+register = {
+    'none': Accelerator,
+    'accelerator': HugAccelerator,
+    'deepspeed': None,  # not implemented yet
+    'horovod': None,  # not implemented yet
+}
+
+
+def get_accelerator(name: str, **kwargs) -> Accelerator:
+    """Get the accelerator for the specified name."""
+    assert name in register, ', '.join(register.keys())
+    if register[name] is None:
+        raise NotImplementedError(f'Accelerator backend "{name}" is registered but not supported yet.')
+    return register[name](**kwargs)
diff --git a/src/lumo/trainer/backend/__init__.py b/src/lumo/trainer/backend/__init__.py
new file mode 100644
index 00000000..e69de29b
diff --git a/src/lumo/contrib/accelerate/accelerator.py b/src/lumo/trainer/backend/accelerator.py
similarity index 50%
rename from src/lumo/contrib/accelerate/accelerator.py
rename to src/lumo/trainer/backend/accelerator.py
index 79d4d3b5..5c64a757 100644
--- a/src/lumo/contrib/accelerate/accelerator.py
+++ b/src/lumo/trainer/backend/accelerator.py
@@ -4,13 +4,23 @@
 class Accelerator(_Accelerator):
     """
-    Accelerator instance for distributed training (on multi-GPU, TPU) or mixed precision training.
+    Accelerator instance for distributed training (on multi-GPU, TPU) or mixed precision training.
 
-    This Accelerator subclass is use for `Trainer`, the only difference is that
-    the device of data will be controlled by Trainer rather than Accelerator.
-    """
+    This Accelerator subclass is used for `Trainer`. The only difference is that
+    the device of data will be controlled by `Trainer` rather than `Accelerator`.
+    """
+
+    def prepare_data_loader(self, data_loader, **kwargs):
+        """
+        Prepare the data loader.
+ + Args: + data_loader: The data loader. + **kwargs: Additional keyword arguments. - def prepare_data_loader(self, data_loader): + Returns: + The prepared data loader. + """ return prepare_data_loader( data_loader, None, # None instead of self.device, diff --git a/src/lumo/trainer/base.py b/src/lumo/trainer/base.py index bab315fe..26a5c068 100644 --- a/src/lumo/trainer/base.py +++ b/src/lumo/trainer/base.py @@ -134,7 +134,7 @@ def inner(dm=None, params=None, *args, **kwargs): process_loader = getattr(self, 'process_loader', None) if process_loader is not None: process_loader(dm, TrainStage.create_from_str(func.__name__)) - func(*args, **kwargs) + func(dm, params, *args, **kwargs) return inner diff --git a/src/lumo/trainer/callbacks.py b/src/lumo/trainer/callbacks.py index 2f71b228..dad473ef 100644 --- a/src/lumo/trainer/callbacks.py +++ b/src/lumo/trainer/callbacks.py @@ -9,7 +9,7 @@ from datetime import datetime from functools import wraps from typing import NewType, Any, Optional, Dict, Union - +from lumo.proc.tz import timezone import psutil from torch.utils.data import DataLoader @@ -353,6 +353,7 @@ def __init__(self, step_frequence=3, break_in=1000): self.step = step_frequence file = tempfile.TemporaryFile('w') self.temp = file + self.cur_tqdm = None def on_imodels_end(self, trainer: Trainer, func, params: ParamsType, result: Any, *args, **kwargs): super().on_imodels_end(trainer, func, params, result, *args, **kwargs) @@ -404,12 +405,14 @@ def update(self, trainer: Trainer): TrainStage.train in self.stage and ((trainer.idx + 1) == self.stage[TrainStage.train])): trainer.logger.inline(self.cur_tqdm.full_str()) trainer.logger.newline() + trainer.exp.trigger() def flush(self, trainer: Trainer): """Flush""" self.c = 0 - trainer.logger.inline(self.cur_tqdm) - trainer.logger.newline() + if self.cur_tqdm is not None: + trainer.logger.inline(self.cur_tqdm) + trainer.logger.newline() def on_train_epoch_begin(self, trainer: Trainer, func, params: ParamsType, *args, 
**kwargs): super().on_train_epoch_begin(trainer, func, params, *args, **kwargs) @@ -707,7 +710,7 @@ def request(self, data: dict): task = self.executor.submit(self.req.post, self.url, json={'data': data, 'type': 'timeevent', 'from': 'lumo.RemoteCallback', - 'datetime': datetime.now().isoformat()}) + 'datetime': datetime.now(timezone()).isoformat()}) self.submits.append(task) def on_hooked(self, source: Trainer, params: ParamsType): @@ -805,7 +808,7 @@ class SkipWhenParamsEq(TrainCallback, InitialCallback): def on_hooked(self, source: Trainer, params: ParamsType): super().on_hooked(source, params) from dbrecord import PDict - from lumo.exp.finder import is_test_root + from lumo.exp.watch import is_test_root self.fn = source.exp.mk_rpath('contrib', 'params_key.sqlite') olds = PDict(self.fn) diff --git a/src/lumo/trainer/components.py b/src/lumo/trainer/components.py index 37d5bf7e..30dc078c 100644 --- a/src/lumo/trainer/components.py +++ b/src/lumo/trainer/components.py @@ -15,7 +15,6 @@ def log_dir(self): @property def params_fn(self): res = self.mk_ipath('params.yaml') - self.dump_string('params.yaml', res) return res @property diff --git a/src/lumo/trainer/trainer.py b/src/lumo/trainer/trainer.py index 27338f92..c8307ff1 100644 --- a/src/lumo/trainer/trainer.py +++ b/src/lumo/trainer/trainer.py @@ -1,38 +1,29 @@ import bisect import os -import sys import warnings -from datetime import datetime from functools import lru_cache from typing import Union, Dict, Any, Optional, Sequence, Mapping, Callable import numpy as np import torch -from accelerate import DistributedDataParallelKwargs -from accelerate.data_loader import DataLoaderDispatcher, DataLoaderShard +from .accelerator import get_accelerator from torch import nn from torch.optim import Optimizer from torch.utils.data import DataLoader -import json -from lumo.contrib.accelerate import Accelerator -from lumo.contrib.accelerate.utils import send_to_device +from lumo.utils.device import send_to_device from 
 from lumo.core import TrainStage, Record, MetricType, Meter
 from lumo.core.disk import TableRow, Metrics
 from lumo.data import DataModule
 from lumo.data.loader import DataLoaderType, DataLoaderSide
 from lumo.proc import dist
 from lumo.proc import glob
+from lumo.utils import safe_io as IO
 from lumo.trainer.rnd import RndManager
 from lumo.utils.logger import Logger
 from .base import _BaseTrainer
 from .components import TrainerExperiment, TrainerParams
 from .saver import Saver
-# overwrite send_to_device to resolve https://github.com/pytorch/pytorch/issues/83015
-# from accelerate import Accelerator
-# from accelerate.utils import send_to_device
-from ..utils.fmt import strftime
-
 ParamsType = TrainerParams
@@ -72,7 +63,7 @@ def __init_subclass__(cls, **kwargs):
             raise TypeError(
                 f"Can't instantiate abstract class {cls.__name__} directly, please create a subclass of it.")

-    def __init__(self, params: ParamsType, dm: DataModule = None):
+    def __init__(self, params: ParamsType, dm: DataModule = None, accelerator=None):
         if dm is None:
             dm = DataModule(params)
         else:
@@ -85,30 +76,37 @@ def __init__(self, params: ParamsType, dm: DataModule = None):
         self._saver = None

         self.params.iparams()
-        self.exp = TrainerExperiment(self.generate_exp_name())
+        self.exp = TrainerExperiment(exp_name=self.generate_exp_name())
         self._database = TableRow(self.exp.mk_ipath('metric.pkl'), persistent=self.is_main)
         self.metric_board = Metrics(self.exp.mk_bpath('board.sqlite'), persistent=self.is_main)
         self.metric = self.exp.metric
-        self.exp.dump_info('metric_board', self.metric_board.fpath)
-        self.exp.dump_info('table_row', self._database.fpath)
+        # self.exp.dump_info('table_row', self._database.fpath)
         self.rnd = RndManager()

         self.train_epoch_toggle = False
         self.train_toggle = False

         device = params.get('device', None) if not self.is_dist else None
+        # self.accelerate = Accelerator(kwargs_handlers=[
+        #     DistributedDataParallelKwargs(find_unused_parameters=params.get('find_unused_parameters', False))
+        # ])
+        accelerator = glob.get('accelerator', 'accelerator')
+        self.accelerate = get_accelerator(accelerator)
-        self.accelerate = Accelerator(kwargs_handlers=[
-            DistributedDataParallelKwargs(find_unused_parameters=params.get('find_unused_parameters', False))
-        ])
-
-        if self.accelerate.state.distributed_type == self.accelerate.state.distributed_type.NO:
-            self.accelerate.state.device = torch.device(device)
+        self.accelerate.set_device(torch.device(device))

         if dist.is_main():
             self.params.to_yaml(self.exp.params_fn)
+            params_hash = self.params.hash()
+            self.exp.dump_info('trainer', {
+                'params_meta': {
+                    'fn': self.exp.params_fn,
+                    'hash': params_hash
+                },
+                'board_fn': self.metric_board.fpath
+            }, append=True)
             self.exp.dump_info('params', self.params.to_dict())

         self.set_global_steps(0)
@@ -169,7 +167,9 @@ def logger(self):
             self._logger.debug('Enable debug log.')
             if self.is_main:
                 fn = self._logger.add_log_dir(self.exp.log_dir)
-                self.exp.dump_info('logger_args', {'log_dir': fn})
+                self.exp.dump_info('trainer', {
+                    'logger_fn': fn
+                }, append=True)
         return self._logger
@@ -516,10 +516,10 @@ def to_device(self, item: Optional[Union[nn.Module, torch.Tensor, Sequence, Mapp

     def on_trainer_exception(self, func: Callable, exception: BaseException):
         """Updates database with error information when an exception occurs during training."""
-        self.exp.dump_info('exception', dict(end=strftime(),
-                                             finished=False,
-                                             error=str(exception),
-                                             trainer_frame=str(func)))
+        # self.exp.dump_info('exception', dict(end=strftime(),
+        #                                      finished=False,
+        #                                      error=str(exception),
+        #                                      trainer_frame=str(func)), append=True)

     @property
     def is_initialized(self):
@@ -541,15 +541,15 @@ def initialize(self):
             return
         self.exp.start()

-        params_hash = self.params.hash()
-        self.exp.dump_string('params_hash', params_hash)
-
         self.icallbacks(self.params)
         self.set_property('initial.callbacks', True)
         self.imodels(self.params)
         self.set_property('initial.model', True)
         self.set_property('initial', True)

+        self.logger.info('Use Experiment')
+        self.logger.info(self.exp)
+
     def stop_train(self):
         """Toggle to stop train."""
         self.train_toggle = True
@@ -566,17 +566,7 @@ def prepare_dataloader(self, loader: DataLoaderType, stage: TrainStage = None):
         :param stage:
         :return:
         """
-        if isinstance(loader, (DataLoaderShard, DataLoaderDispatcher)):
-            warnings.warn('Duplicated prepare a same DataLoader twice, check your code.')
-            return loader
-
-        split_batches = self.params.get('split_batches', None)
-        if stage is not None and not stage.is_train():
-            split_batches = True
-
-        """do not change original loader stage"""
         if isinstance(loader, DataLoader):
-            self.accelerate.split_batches = split_batches
             loader = self.accelerate.prepare_data_loader(loader)
         elif isinstance(loader, DataLoaderSide):
             loader = loader.copy()
@@ -599,6 +589,8 @@ def train(self, dm: Union[DataModule, DataLoaderType] = None, params: ParamsType
                 ValueError: If no data loader is available for training.
         """
+        self.change_stage(TrainStage.train)
+
         loader = self.select_loader(dm)
         if not loader:
             loader = self.train_dataloader
@@ -612,7 +604,6 @@ def train(self, dm: Union[DataModule, DataLoaderType] = None, params: ParamsType

         for eidx in range(params.epoch):
             # update training progress
-            self.exp.dump_train_eidx(eidx, params.epoch)
             self.set_epoch_idx(eidx)

             # train loop
@@ -631,7 +622,8 @@ def train(self, dm: Union[DataModule, DataLoaderType] = None, params: ParamsType
                 self.set_property('early_stop', f'meet limit_global_steps {limit_global_steps}')
                 break

-        # update when train finished
+            self.exp.dump_train_eidx(eidx, params.epoch)
+
         self.exp.end()
         return self._prop
@@ -805,8 +797,7 @@ def change_stage(self, stage: TrainStage):
             else:
                 v.eval()

-    @classmethod
-    def select_loader(cls, dm=None):
+    def select_loader(self, dm=None, stage=None):
         """
         Selects the appropriate loader based on the given data module.
@@ -819,7 +810,8 @@ def select_loader(cls, dm=None):
         loader = None
         if dm:
             if isinstance(dm, DataModule):
-                loader = dm.train_dataloader
+                loader = dm.get_loader_with_stage(stage=self.trainstage)
+                # loader = dm.train_dataloader
             elif isinstance(dm, DataLoader) or isinstance(dm, DataLoaderSide):
                 loader = dm
             else:
@@ -1012,28 +1004,63 @@ def wait_for_everyone(self):
         self.accelerate.wait_for_everyone()

     def save_best_model(self):
+        """
+        Saves the best model checkpoint and metadata.
+
+        If the current process is the main process, saves the best model checkpoint as 'best_model.ckpt'
+        and its metadata as 'best_model.json'. If not, saves the checkpoint and metadata with the process rank
+        appended to the filename, e.g., 'best_model-<local_rank>.ckpt' and 'best_model-<local_rank>.json'.
+
+        The saved metadata includes the global training steps and the value of the experiment's best metric.
+
+        Args:
+            self: the Trainer instance.
+
+        Returns:
+            None.
+        """
         if self.is_main:
             file = self.exp.mk_bpath('models', 'best_model.ckpt')
             file_info = self.exp.mk_bpath('models', 'best_model.json')
         else:
             file = self.exp.mk_bpath('models', f'best_model-{self.local_rank}.ckpt')
             file_info = self.exp.mk_bpath('models', f'best_model-{self.local_rank}.json')
+
         torch.save(self.state_dict(), file)
-        with open(file_info, 'w') as w:
-            w.write(json.dumps({'global_steps': self.global_steps, 'metric': self.exp.metric.value}))
+        res = {'global_steps': self.global_steps, 'metric': self.exp.metric.value}
+        IO.dump_json(IO.filter_unserializable_values(res), file_info)
+
         self.logger.info(f'saved best model at {file}')
         self.wait_for_everyone()

     def save_last_model(self):
+        """
+        Saves the last model checkpoint and metadata.
+
+        If the current process is the main process, saves the last model checkpoint as 'last_model.ckpt'
+        and its metadata as 'last_model.json'. If not, saves the checkpoint and metadata with the process rank
+        appended to the filename, e.g., 'last_model-<local_rank>.ckpt' and 'last_model-<local_rank>.json'.
+
+        The saved metadata includes the global training steps and the value of the experiment's best metric.
+
+        Args:
+            self: the Trainer instance.
+
+        Returns:
+            None.
+        """
         if self.is_main:
             file = self.exp.mk_bpath('models', 'last_model.ckpt')
             file_info = self.exp.mk_bpath('models', 'last_model.json')
         else:
             file = self.exp.mk_bpath('models', f'last_model-{self.local_rank}.ckpt')
             file_info = self.exp.mk_bpath('models', f'last_model-{self.local_rank}.json')
+
         torch.save(self.state_dict(), file)
-        with open(file_info, 'w') as w:
-            w.write(json.dumps({'global_steps': self.global_steps, 'metric': self.exp.metric.value}))
+
+        res = {'global_steps': self.global_steps, 'metric': self.exp.metric.current}
+        IO.dump_json(IO.filter_unserializable_values(res), file_info)
+
         self.logger.info(f'saved last model at {file}')
         self.wait_for_everyone()
diff --git a/src/lumo/utils/device.py b/src/lumo/utils/device.py
new file mode 100644
index 00000000..ef6ee945
--- /dev/null
+++ b/src/lumo/utils/device.py
@@ -0,0 +1,105 @@
+from typing import Any, Mapping
+import torch
+
+
+def honor_type(obj, generator):
+    """
+    Cast a generator to the same type as obj (list, tuple, or namedtuple)
+    """
+    try:
+        return type(obj)(generator)
+    except TypeError:
+        # Some objects may not be able to instantiate from a generator directly
+        return type(obj)(*list(generator))
+
+
+def is_torch_tensor(tensor):
+    """
+    Check if the input is a PyTorch tensor.
+
+    Args:
+        tensor: The input tensor.
+
+    Returns:
+        A boolean indicating if the input is a PyTorch tensor.
+    """
+    return isinstance(tensor, torch.Tensor)
+
+
+def recursively_apply(func, data, *args, test_type=is_torch_tensor, error_on_other_type=False, **kwargs):
+    """
+    Recursively apply a function on a data structure that is a nested list/tuple/dictionary of a given base type.
+
+    Args:
+        func (`callable`):
+            The function to recursively apply.
+        data (nested list/tuple/dictionary of objects satisfying `test_type`):
+            The data on which to apply `func`
+        *args:
+            Positional arguments that will be passed to `func` when applied on the unpacked data.
+        test_type (`callable`, *optional*, defaults to `is_torch_tensor`):
+            A predicate deciding whether `func` should be applied to a given object.
+        error_on_other_type (`bool`, *optional*, defaults to `False`):
+            Whether to raise an error if, after unpacking `data`, we encounter an object that does not
+            satisfy `test_type`. If `False`, the function will leave such objects unchanged.
+        **kwargs:
+            Keyword arguments that will be passed to `func` when applied on the unpacked data.
+
+    Returns:
+        The same data structure as `data` with `func` applied to every object satisfying `test_type`.
+    """
+    if isinstance(data, (tuple, list)):
+        return honor_type(
+            data,
+            (
+                recursively_apply(
+                    func, o, *args, test_type=test_type, error_on_other_type=error_on_other_type, **kwargs
+                )
+                for o in data
+            ),
+        )
+    elif isinstance(data, Mapping):
+        return type(data)(
+            {
+                k: recursively_apply(
+                    func, v, *args, test_type=test_type, error_on_other_type=error_on_other_type, **kwargs
+                )
+                for k, v in data.items()
+            }
+        )
+    elif test_type(data):
+        return func(data, *args, **kwargs)
+    elif error_on_other_type:
+        raise TypeError(
+            f"Can't apply {func.__name__} on object of type {type(data)}, only of nested list/tuple/dicts of objects "
+            f"that satisfy {test_type.__name__}."
+        )
+    return data
+
+
+def send_to_device(tensor, device, non_blocking=False):
+    """
+    Recursively sends the elements in a nested list/tuple/dictionary of tensors to a given device.
+
+    Args:
+        tensor (nested list/tuple/dictionary of `torch.Tensor`):
+            The data to send to a given device.
+        device (`torch.device`):
+            The device to send the data to.
+
+    Returns:
+        The same data structure as `tensor` with all tensors sent to the proper device.
+ """ + if not isinstance(device, torch.device): + device = torch.device(device) + + if 'mps' in device.type: + non_blocking = False + + def _send_to_device(t, device): + return t.to(device, non_blocking=non_blocking) + + def _has_to_method(t): + return hasattr(t, "to") + + return recursively_apply(_send_to_device, tensor, device, test_type=_has_to_method) diff --git a/src/lumo/utils/fmt.py b/src/lumo/utils/fmt.py index 7ba7897c..8a527119 100644 --- a/src/lumo/utils/fmt.py +++ b/src/lumo/utils/fmt.py @@ -6,6 +6,7 @@ import numpy as np import torch +from lumo.proc.tz import timezone from . import re @@ -42,7 +43,7 @@ def strftime(fmt='%y-%m-%d-%H%M%S', dateobj: datetime = None): """get current date with formatted""" if dateobj is not None: return dateobj.strftime(fmt) - return datetime.now().strftime(fmt) + return datetime.now(timezone()).strftime(fmt) def strptime(datestr: str = None, fmt='%y-%m-%d-%H%M%S', ): @@ -72,7 +73,15 @@ def indent_print(text, indent=' '): def format_second(sec: int) -> str: - """Formats a duration given in seconds into a human-readable string.""" + """ + Formats a duration given in seconds into a human-readable string. + + Args: + sec: the duration in seconds. + + Returns: + A human-readable string representing the duration. + """ sec, ms = divmod(sec, 1) if sec > 60: min, sec = divmod(sec, 60) @@ -90,5 +99,14 @@ def format_second(sec: int) -> str: return fmt -def format_timedelta(td: timedelta): +def format_timedelta(td: timedelta) -> str: + """ + Formats a timedelta object into a human-readable string. + + Args: + td: a timedelta object. + + Returns: + A human-readable string representing the timedelta. 
+ """ return format_second(td.total_seconds()) diff --git a/src/lumo/utils/logger.py b/src/lumo/utils/logger.py index 04c3e4f2..893db2e2 100644 --- a/src/lumo/utils/logger.py +++ b/src/lumo/utils/logger.py @@ -128,7 +128,7 @@ def _format(self, *values, inline=False, fix=0, raw=False, append=False, level=V return None, None if self.adddate and not raw: - cur_date = datetime.now().strftime(self.datefmt) + cur_date = strftime(self.datefmt) else: cur_date = "" diff --git a/src/lumo/utils/repository.py b/src/lumo/utils/repository.py index 6466ab74..fce12f8b 100644 --- a/src/lumo/utils/repository.py +++ b/src/lumo/utils/repository.py @@ -55,7 +55,7 @@ class branch: def __init__(self, repo: Repo, branch: str): self.repo = repo - self.lock = Lock(f'{hash(repo.git_dir)}_{branch}') + self.lock = Lock(f'{hash(repo.git_dir)}_lumo_auto_commit') self.lock.abtain() self.old_branch = self.repo.head.reference self.branch = branch diff --git a/src/lumo/utils/safe_io.py b/src/lumo/utils/safe_io.py index b87d5881..064404c9 100644 --- a/src/lumo/utils/safe_io.py +++ b/src/lumo/utils/safe_io.py @@ -16,6 +16,31 @@ load_nd = load_nd +def filter_unserializable_values(dic): + """ + Recursively filter out unserializable values in a dictionary. + + Args: + dic (dict): The dictionary to be filtered. + + Returns: + The filtered dictionary. + """ + for key, value in list(dic.items()): + if isinstance(value, dict): + filter_unserializable_values(value) + elif isinstance(value, list): + for i in range(len(value)): + if isinstance(value[i], dict): + filter_unserializable_values(value[i]) + elif not json.dumps(value[i], default=lambda x: None): + value[i] = None + elif json.dumps(value, default=lambda x: None) == 'null': + dic[key] = None + dic = {key: value for key, value in dic.items() if value is not None} + return dic + + def dump_json(obj, fn): """ Dumps the given object to a JSON file at the given file path. 
@@ -27,8 +52,11 @@ def dump_json(obj, fn):
     Notes:
         The JSON data will be written with an indentation of 2 spaces.
     """
-    with open(fn, 'w', encoding='utf-8') as w:
-        json.dump(obj, w, indent=2)
+    try:
+        with open(fn, 'w', encoding='utf-8') as w:
+            json.dump(obj, w, indent=2)
+    except TypeError as e:
+        raise TypeError(str(obj)) from e


 def dump_yaml(obj, fn):
diff --git a/src/lumo/utils/subprocess.py b/src/lumo/utils/subprocess.py
index 5ed5e9f2..ee456e13 100644
--- a/src/lumo/utils/subprocess.py
+++ b/src/lumo/utils/subprocess.py
@@ -4,11 +4,41 @@
 import signal


-def run_command(command, cwd=None, env=None):
+def consume(p: subprocess.Popen):
+    """
+    Consume the standard output and standard error of the specified process.
+
+    Args:
+        p (subprocess.Popen): The process to consume the output from.
+    """
+    for stream in [p.stdout, p.stderr]:
+        while True:
+            line = stream.readline().decode('utf-8')
+            if not line:
+                break
+            print(line, end='')
+
+
+def run_command(command: str, cwd=None, env=None, non_block=False):
+    """
+    Executes a command in the shell and captures its standard output and standard error.
+
+    Args:
+        command (str): A string representing the command to execute in the shell.
+        cwd (str): A string representing the working directory to execute the command in. Default is None.
+        env (dict): A dictionary representing the environment variables to set for the command. Default is None.
+        non_block (bool): A flag to indicate whether to run the command in a non-blocking manner. Default is False.
+
+    Returns:
+        The return code of the executed command.
+ """ proc = subprocess.Popen(command, cwd=cwd, env=env, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE) + if non_block: + return proc + try: while proc.poll() is None: # Wait for output from the process diff --git a/tests/exp/test_condition.py b/tests/exp/test_condition.py new file mode 100644 index 00000000..81d830ce --- /dev/null +++ b/tests/exp/test_condition.py @@ -0,0 +1,49 @@ +import pandas as pd +from lumo import C + +# create a sample DataFrame +data = {'name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'], + 'age': [25, 30, 35, 40, 45], + 'city': [{'index': 0}, {'index': 1}, {'index': 2}, {'index': 3}, {'index': 4}]} +df = pd.DataFrame(data) + + +def test_row_filter(): + # create and apply the condition to filter the DataFrame + filtered_df = (C['age'] >= 35).apply(df) + # assert that only the rows with age >= 35 are in the filtered DataFrame + assert filtered_df.shape == (3, 3) + assert filtered_df['age'].min() >= 35 + + +def test_column_edit_add(): + # test the column edit operation C+{'city.index':'cindex'} + assert (C + {'city.index': 'cindex'}).op == 'add_column' + new_df = (C + {'city.index': 'cindex'}).apply(df) + assert 'cindex' in new_df.columns + assert new_df.columns.tolist() == ['name', 'age', 'city', 'cindex'] + + # test the column edit operation C-['city'] + new_df = (C - ['city']).apply(df) + assert 'city' not in new_df.columns + assert new_df.columns.tolist() == ['name', 'age'] + + # test the column edit operation C-['city','name'] + new_df = (C - ['city', 'name']).apply(df) + assert 'city' not in new_df.columns + assert 'name' not in new_df.columns + assert new_df.columns.tolist() == ['age'] + + +def test_pipeline(): + # test the pipeline operation + filtered_df = C.pipe(df, [ + (C['age'] > 35), + C + {'city.index': 'cindex'}, + C - ['city', 'name'] + ]) + assert filtered_df.shape == (2, 2) + assert 'name' not in filtered_df.columns + assert 'city' not in filtered_df.columns + assert 'age' in filtered_df.columns + assert 
'cindex' in filtered_df.columns diff --git a/tests/exp/test_metric.py b/tests/exp/test_metric.py new file mode 100644 index 00000000..ca42c54f --- /dev/null +++ b/tests/exp/test_metric.py @@ -0,0 +1,49 @@ +import os +import pytest +import tempfile + +from lumo.exp.metric import Metric + + +@pytest.fixture +def metric(): + with tempfile.TemporaryDirectory() as tmpdir: + metric_fn = os.path.join(tmpdir, "test_metric.pkl") + yield Metric(metric_fn) + + +def test_value(metric): + assert isinstance(metric.value, dict) + + +def test_dump_metric_max(metric): + key = "test_key_max" + values = [2, 4, 1, 7, 5] + expected_value = max(values) + cur = values[0] + for value in values: + max_val = metric.dump_metric(key, value, "max") + cur = max(value, cur) + assert max_val == cur + + assert metric.value[key] == expected_value + + +def test_dump_metric(metric): + key = "test_key" + value = 10 + metric.dump_metric(key, value, "max") + assert metric.value[key] == value + + +def test_dump_metrics(metric): + dic = {"key1": 1, "key2": 2, "key3": 3} + cmp = "min" + result = metric.dump_metrics(dic, cmp) + assert result == {"key1": 1, "key2": 2, "key3": 3} + + +def test_flush(metric): + metric.flush() + assert os.path.exists(metric.fn) + os.remove(metric.fn) diff --git a/tests/trainer/test_meter.py b/tests/trainer/test_meter.py index edf8faea..2956eb01 100644 --- a/tests/trainer/test_meter.py +++ b/tests/trainer/test_meter.py @@ -9,11 +9,10 @@ def test_ReduceItem(): item.update(i) assert item.res == np.mean(range(200)[-ReduceItem.SLIDE_WINDOW_SIZE:], dtype=float).item() - item = ReduceItem([1, 2, 3], 'last') + item = ReduceItem(3, 'last') for i in range(4): - res = [j for j in range(i)] - item.update(res) - assert item.res == res + item.update(i) + assert item.res == i rand = [np.random.rand(4) for i in range(4)] avg = ReduceItem(rand[0], 'sum') @@ -63,4 +62,13 @@ def test_avgmeter(): def test_meter(): - pass + from lumo import Meter, Record + import numpy as np + rnd = 
np.random.rand(1000) + r = Record() + m = Meter() + m.mean.rnd = rnd[:500] + r.record(m) + m.mean.rnd = rnd[500:] + r.record(m) + assert np.isclose(r.agg()['rnd'], rnd.mean()) diff --git a/tests/trainer/test_skip.py b/tests/trainer/test_skip.py index 966af2ac..2fa2c278 100644 --- a/tests/trainer/test_skip.py +++ b/tests/trainer/test_skip.py @@ -57,9 +57,8 @@ def to_device(self, item: Optional[Union[nn.Module, torch.Tensor, Sequence, Mapp device_args_kwargs=None): return super().to_device(item, device_args_kwargs) - def prepare_dataloader(self, stage: TrainStage, dataloader=None): - # assert self.context == 'prepare_dataloader' - return super().prepare_dataloader(stage, dataloader) + def prepare_dataloader(self, loader: DataLoaderType, stage: TrainStage = None): + return super().prepare_dataloader(loader, stage) def initialize(self): if not self.is_initialized: diff --git a/tests/trainer/test_trainer.py b/tests/trainer/test_trainer.py index 38c015cd..a75fa948 100644 --- a/tests/trainer/test_trainer.py +++ b/tests/trainer/test_trainer.py @@ -72,9 +72,8 @@ def to_device(self, item: Optional[Union[nn.Module, torch.Tensor, Sequence, Mapp device_args_kwargs=None): return super().to_device(item, device_args_kwargs) - def prepare_dataloader(self, stage: TrainStage, dataloader=None): - # assert self.context == 'prepare_dataloader' - return super().prepare_dataloader(stage, dataloader) + def prepare_dataloader(self, loader: DataLoaderType, stage: TrainStage = None): + return super().prepare_dataloader(loader, stage) def initialize(self): if not self.is_initialized: diff --git a/tests/utils/test_io.py b/tests/utils/test_io.py index fc52be61..c45e8f29 100644 --- a/tests/utils/test_io.py +++ b/tests/utils/test_io.py @@ -1,6 +1,6 @@ import yaml from lumo.utils.safe_io import * - +import os def test_dump_json(tmpdir): # Test that dump_json creates a valid JSON file
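The new `src/lumo/utils/device.py` module in this diff is built around a generic recursive traversal. Below is a minimal, torch-free sketch of that pattern so it can be read in isolation; the tensor-specific parts (`is_torch_tensor`, `.to(device)`) are dropped, and the `test_type` predicate here is illustrative, not part of lumo's API:

```python
from typing import Mapping


def honor_type(obj, generator):
    """Rebuild a list/tuple/namedtuple of the same type as obj from a generator."""
    try:
        return type(obj)(generator)
    except TypeError:
        # namedtuples reject a single iterable argument; unpack instead
        return type(obj)(*list(generator))


def recursively_apply(func, data, test_type=lambda x: isinstance(x, (int, float))):
    """Apply func to every leaf of a nested list/tuple/dict that passes test_type."""
    if isinstance(data, (tuple, list)):
        return honor_type(data, (recursively_apply(func, o, test_type) for o in data))
    if isinstance(data, Mapping):
        return type(data)({k: recursively_apply(func, v, test_type) for k, v in data.items()})
    if test_type(data):
        return func(data)
    return data  # leave non-matching leaves untouched


batch = {'x': [1, 2, (3, 4)], 'meta': 'keep-me'}
doubled = recursively_apply(lambda t: t * 2, batch)
print(doubled)  # {'x': [2, 4, (6, 8)], 'meta': 'keep-me'}
```

`send_to_device` is then just this traversal with `func` set to a `.to(device)` call and `test_type` set to "has a `.to` method", which is why it tolerates arbitrary nesting of batches.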
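The `non_block` flag added to `run_command` changes the return type: with `non_block=True` the `Popen` handle comes back immediately, otherwise the call blocks and returns the exit code. A simplified sketch of that contract, assuming a POSIX shell (the real implementation also streams output and handles signals, which is omitted here):

```python
import subprocess


def run_command(command: str, cwd=None, env=None, non_block=False):
    """Run a shell command; return the exit code, or the Popen handle if non_block."""
    proc = subprocess.Popen(command, cwd=cwd, env=env, shell=True,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if non_block:
        # caller is responsible for consuming output and waiting
        return proc
    proc.communicate()  # drain stdout/stderr to avoid pipe deadlock
    return proc.returncode


print(run_command('exit 0'))  # 0
handle = run_command('exit 3', non_block=True)
print(handle.wait())  # 3
```

Draining the pipes before `wait` matters: a child that fills its stdout pipe while the parent only polls `returncode` will block forever, which is also why the diff adds a separate `consume` helper for the non-blocking path.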
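`save_best_model` and `save_last_model` now pass their metadata through `filter_unserializable_values` before `dump_json`. The intent is to strip anything `json.dumps` cannot handle so checkpoint metadata never crashes the save path; the sketch below (an illustrative helper, not lumo's function, which instead marks offending values as `None` in place and filters the top level) shows the same idea:

```python
import json


def keep_json_serializable(dic):
    """Return a copy of dic with values that json.dumps would reject removed."""
    out = {}
    for key, value in dic.items():
        if isinstance(value, dict):
            out[key] = keep_json_serializable(value)
            continue
        try:
            json.dumps(value)
            out[key] = value
        except TypeError:
            pass  # e.g. a tensor, a device object, an open file handle
    return out


meta = {'global_steps': 100, 'metric': 0.97, 'extra': {'handle': object()}}
print(keep_json_serializable(meta))  # {'global_steps': 100, 'metric': 0.97, 'extra': {}}
```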