Image save w/o compression from pil/base64; clean util code; TSV doc (#47)

* Add image save without compression from pil or base64; clean up util code; doc on TSV

pj-ms authored Oct 14, 2022
1 parent 7ff3709 commit b446462
Showing 25 changed files with 633 additions and 314 deletions.
27 changes: 27 additions & 0 deletions COCO_DATA_FORMAT.md
@@ -109,3 +109,30 @@ Here is one example of the json file for the image regression task, where the "t
]
}
```

## MultiTask dataset

A multitask dataset is one where a single set of images carries multiple sets of annotations, one per task, and each task can be any of the single-task types mentioned above.

For example, a set of people images can be annotated with several attributes: gender (classification: {male, female, other}), height (regression: 0-300 cm), person location (detection: {x, y, w, h}), etc.

Representing this kind of dataset is simple: create one independent coco file for each task:

```
people_dataset/
train_images/
...
test_images/
...
train_images.zip
test_images.zip
train_coco_gender.json
test_coco_gender.json
train_coco_height.json
test_coco_height.json
train_coco_location.json
test_coco_location.json
```
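
Each of those files is an ordinary single-task coco file over the same set of images. As an illustration only (the authoritative field names are in the single-task examples earlier in this doc), a hypothetical `train_coco_gender.json` might look like:

```json
{
    "images": [{"id": 1, "width": 224.0, "height": 224.0, "file_name": "train_images.zip@1.jpg"},
               {"id": 2, "width": 224.0, "height": 224.0, "file_name": "train_images.zip@2.jpg"}],
    "annotations": [{"id": 1, "image_id": 1, "category_id": 1},
                    {"id": 2, "image_id": 2, "category_id": 2}],
    "categories": [{"id": 1, "name": "male"}, {"id": 2, "name": "female"}, {"id": 3, "name": "other"}]
}
```
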
48 changes: 47 additions & 1 deletion DATA_PREPARATION.md
@@ -73,7 +73,9 @@ After everything is done, remember to run the commands below to do a final check

## Host/manage datasets on cloud/disk

The `DatasetHub` class manages access to multiple datasets, either from local disk or cloud, using the dataset name and version information. It takes a dataset registration json file, which contains the meta information of each dataset. For each dataset, there is a corresponding entry in the json.

Below are two examples of single-task datasets:

```{json}
[
@@ -112,4 +114,48 @@ After everything is done, remember to run the commands below to do a final check
]
```

Below is an example of a multitask dataset:

```json
[
    {
        "name": "people-dataset",
        "description": "people dataset including gender, height, and location information",
        "version": 1,
        "format": "coco",
        "type": "multitask",
        "root_folder": "multitask/people_dataset",
        "tasks": {
            "gender": {
                "type": "classification_multiclass",
                "train": {
                    "index_path": "train_coco_gender.json", "files_for_local_usage": [ "train_images.zip" ]
                },
                "test": {
                    "index_path": "test_coco_gender.json", "files_for_local_usage": [ "test_images.zip" ]
                }
            },
            "height": {
                "type": "image_regression",
                "train": {
                    "index_path": "train_coco_height.json", "files_for_local_usage": [ "train_images.zip" ]
                },
                "test": {
                    "index_path": "test_coco_height.json", "files_for_local_usage": [ "test_images.zip" ]
                }
            },
            "location": {
                "type": "object_detection",
                "train": {
                    "index_path": "train_coco_location.json", "files_for_local_usage": [ "train_images.zip" ]
                },
                "test": {
                    "index_path": "test_coco_location.json", "files_for_local_usage": [ "test_images.zip" ]
                }
            }
        }
    }
]
```

Check the usage code example in [`README.md`](README.md).
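
For quick reference, below is a minimal sketch of loading the dataset registered above, based on how the `dataset_check` command in this repo uses `DatasetHub`; the registration file path is hypothetical.

```python
import pathlib

from vision_datasets import DatasetHub, Usages

# Hypothetical local path holding the registration json shown above.
data_reg_json = pathlib.Path('dataset_registration.json').read_text()
dataset_hub = DatasetHub(data_reg_json)

# container_sas can be None when the dataset files already exist under local_dir.
dataset = dataset_hub.create_manifest_dataset(
    container_sas=None,
    local_dir='./people_dataset',
    name='people-dataset',
    version=1,
    usage=Usages.TRAIN_PURPOSE)
# `dataset` is a ManifestDataset; see README.md for consuming it, e.g. for PyTorch training.
```
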
5 changes: 5 additions & 0 deletions README.md
@@ -130,3 +130,8 @@ When data exists on local disk, `blob_container_sas` can be `None`.
### Training with PyTorch

Training with PyTorch is easy. After instantiating a `ManifestDataset`, simply pass it to `vision_datasets.pytorch.torch_dataset.TorchDataset` together with the `transform`, and you are good to go with the PyTorch DataLoader for training.
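
A minimal sketch of that flow is below; the torchvision-style `transform` and the `TorchDataset(dataset, transform=...)` keyword are assumptions based on the description above, so check the class for the exact signature it expects.

```python
from torch.utils.data import DataLoader
from torchvision import transforms

from vision_datasets.pytorch.torch_dataset import TorchDataset

# A ManifestDataset created via DatasetHub, see DATA_PREPARATION.md.
manifest_dataset = ...

transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
torch_dataset = TorchDataset(manifest_dataset, transform=transform)  # assumed keyword name for the transform
data_loader = DataLoader(torch_dataset, batch_size=32, shuffle=True)

for batch in data_loader:
    ...  # unpack images/targets from the batch and run a training step
```
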


## Helpful commands

A few commands come with this repo once it is installed, such as TSV <=> COCO conversion, dataset check and download, detection => classification dataset conversion, and so on. Check [`UTIL_COMMANDS.md`](./UTIL_COMMANDS.md) for details.
60 changes: 60 additions & 0 deletions TSV_FORMAT.md
@@ -0,0 +1,60 @@
# Introduction

TSV format uses one or multiple `*.tsv` files to store both the image annotations and the image data, where the
- **first column** is the image_id, which can be a string, int, or any unique identifier
- **second column** is the annotation
- **third column** is the [base64-encoded](https://en.wikipedia.org/wiki/Base64) string of the image data.

This repo does not support consuming the TSV format directly, but we provide tools for converting between TSV and coco format for a limited set of tasks, such as

```bash
vision_convert_to_tsv {dataset_name} -r {dataset_registry_json} -k {data storage url} -f {local_dir} [-u Usages]
```

For more commands and help related to TSV, check out [UTIL_COMMANDS.md](./UTIL_COMMANDS.md).
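
As a concrete illustration of the three-column layout (not part of this repo's API), here is a minimal sketch of decoding one row:

```python
import base64
import io
import json

from PIL import Image


def parse_tsv_row(row: str):
    """Split one TSV row into (image_id, annotations, PIL image)."""
    image_id, annotation_str, b64_image = row.rstrip('\n').split('\t')
    annotations = json.loads(annotation_str)  # e.g. [{"class": "dog"}]
    image = Image.open(io.BytesIO(base64.b64decode(b64_image)))
    return image_id, annotations, image
```
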

# Task-wise Format

## Image Classification

Below is an example of multiclass classification:

```
1 [{"class": "dog"}] /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA.....
2 [{"class": "cat"}] /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA.....
3 [{"class": "wolff"}] /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA.....
...
```

For multilabel classification:

```
1 [{"class": "dog"}, {"class": "canidae"}, {"class": "pet"}] /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA.....
2 [{"class": "cat"}, {"class": "Felidae"}, {"class": "pet"}] /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA.....
3 [{"class": "wolff"}, {"class": "canidae"}}] /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA.....
...
```

## Object Detection

The format for object detection is very similar to multilabel classification, with an additional field `rect: [left, top, right, bottom]`:

```
1 [{"class": "dog", "rect": [10, 10, 100, 100]}, {"class": "cat", "rect": [10, 10, 100, 100]}] /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA.....
2 [{"class": "cat", "rect": [10, 20, 250, 100]}] /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA.....
3 [{"class": "wolff", "rect": [100, 200, 250, 1000]}] /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA.....
...
```


## Image Caption


The format for image captions is straightforward:

```
1 [{"caption": "dog playing with a cat"}] /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA.....
2 [{"caption": "dog eating food"}] /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA.....
3 [{"caption": "wolff sitting in snow"] /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA.....
...
```
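
Conversely, a row for a custom dataset can be assembled by base64-encoding the raw image bytes. Below is a minimal sketch, independent of this repo's converters; the file path is hypothetical.

```python
import base64
import json
import pathlib


def to_tsv_row(image_id: str, annotations: list, image_path: str) -> str:
    """Compose one TSV row: image_id, json annotations, base64-encoded image bytes."""
    b64_image = base64.b64encode(pathlib.Path(image_path).read_bytes()).decode('utf-8')
    return '\t'.join([image_id, json.dumps(annotations), b64_image])


# Example usage with a hypothetical local image file.
row = to_tsv_row('1', [{'caption': 'dog playing with a cat'}], 'dog_and_cat.jpg')
```
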
10 changes: 10 additions & 0 deletions UTIL_COMMANDS.md
@@ -0,0 +1,10 @@
# Helpful commands

- `vision_download`: download the dataset files to local disk for consumption; the data can also be downloaded/converted to TSV directly.
- `vision_check_dataset`: check whether a dataset or [coco json + images] is problematic.
- `vision_convert_to_tsv`: convert a dataset or [coco json + images] to TSV format; currently only classification, object detection, and caption tasks are supported. The TSV format doc can be found at [`TSV_FORMAT.md`](./TSV_FORMAT.md).
- `vision_convert_tsv_to_coco`: convert a TSV file to [coco json + images].
- `vision_convert_od_to_ic`: convert a detection dataset to a classification dataset (with or without augmentations).
- `vision_merge_datasets`: merge multiple datasets into one.

For each command, run `command -h` for more details.
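
For example, a typical dataset check might look like the following; the flag names are taken from the `dataset_check` command in this commit and the conversion example in [`TSV_FORMAT.md`](./TSV_FORMAT.md), and the paths are hypothetical, so run `vision_check_dataset -h` to confirm.

```bash
# Validate a registered dataset that was downloaded to ./people_dataset.
vision_check_dataset people-dataset -r dataset_registration.json -f ./people_dataset --quick_check

# Validate a single [coco json + images] pair by giving the coco file and the data type.
vision_check_dataset my-dataset -c ./train_coco.json -t classification_multiclass
```
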
1 change: 1 addition & 0 deletions requirements.txt
@@ -4,3 +4,4 @@ requests>=2.23.0
tenacity>=6.2.0
tqdm
torch>=1.7.1
matplotlib
16 changes: 10 additions & 6 deletions setup.py
@@ -1,7 +1,7 @@
import setuptools
from os import path

VERSION = '0.2.19'
VERSION = '0.2.20'

# Get the long description from the README file
here = path.abspath(path.dirname(__file__))
@@ -25,7 +25,8 @@
'Pillow>=6.2.2',
'requests>=2.23.0',
'tenacity>=6.2.0',
'tqdm'
'tqdm',
'matplotlib'
],
classifiers=[
'Development Status :: 4 - Beta',
@@ -39,13 +40,16 @@
],
extras_require={'run': ['torch>=1.6.0', 'torchvision>=0.7.0']},
entry_points={
'console_scripts': ['vision_download=vision_datasets.commands.download:main',
'vision_check_dataset=vision_datasets.commands.dataset_check:main',
'vision_od_to_ic=vision_datasets.commands.converter_od_to_ic:main',
'console_scripts': ['vision_download=vision_datasets.commands.download_dataset:main',
'vision_merge_datasets=vision_datasets.commands.merge_datasets:main',

'vision_check_dataset=vision_datasets.commands.check_dataset:main',
'vision_convert_od_to_ic=vision_datasets.commands.converter_od_to_ic:main',
'vision_convert_local_dir_ic_data=vision_datasets.commands.convert_local_dir_ic_data:main',
'vision_convert_to_tsv=vision_datasets.commands.converter_to_tsv:main',
'vision_convert_tsv_to_coco=vision_datasets.commands.converter_tsv_to_coco:main',

'vision_tsv_to_iris=vision_datasets.commands.converter_tsv_to_iris:main',
'vision_tsv_to_coco=vision_datasets.commands.converter_tsv_to_coco:main',
'vision_yolo_to_iris=vision_datasets.commands.converter_yolo_darknet_to_iris:main',
'vision_gen_metafile=vision_datasets.commands.generate_image_meta_info:main']
})
29 changes: 29 additions & 0 deletions tests/test_utils.py
@@ -0,0 +1,29 @@
import tempfile
import unittest
import pathlib
import numpy
from PIL import Image, ImageChops
from vision_datasets.commands.utils import Base64Utils


class TestBase64Utils(unittest.TestCase):
@staticmethod
def _create_rand_img(size=(100, 100)) -> Image.Image:
assert len(size) == 2

imarray = numpy.random.rand(size[0], size[1], 3) * 255
im = Image.fromarray(imarray.astype('uint8')).convert('RGB')
return im

def test_b64_to_file_loses_no_info(self):
with tempfile.TemporaryDirectory() as temp_dir:
temp_dir = pathlib.Path(temp_dir)
img_filepath_1 = temp_dir / 'temp_1.jpg'
TestBase64Utils._create_rand_img().save(img_filepath_1)

img_filepath_2 = temp_dir / 'temp_2.jpg'
b64str = Base64Utils.file_to_b64_str(img_filepath_1)
Base64Utils.b64_str_to_file(b64str, img_filepath_2)
img1 = Image.open(img_filepath_1)
img2 = Image.open(img_filepath_2)
assert not ImageChops.difference(img1, img2).getbbox()
3 changes: 2 additions & 1 deletion vision_datasets/__init__.py
@@ -1,5 +1,6 @@
from .common import DatasetRegistry, Usages, DatasetTypes, CocoManifestAdaptor, IrisManifestAdaptor, DatasetManifest, DatasetInfo, ManifestDataset, BaseDatasetInfo, MultiTaskDatasetInfo
from .resources import DatasetHub
from .commands import Base64Utils

__all__ = ['DatasetRegistry', 'Usages', 'DatasetTypes', 'CocoManifestAdaptor', 'IrisManifestAdaptor', 'DatasetManifest', 'DatasetInfo', 'ManifestDataset', 'BaseDatasetInfo', 'MultiTaskDatasetInfo',
'DatasetHub']
'DatasetHub', 'Base64Utils']
3 changes: 3 additions & 0 deletions vision_datasets/commands/__init__.py
@@ -0,0 +1,3 @@
from .utils import Base64Utils

__all__ = ['Base64Utils']
@@ -1,13 +1,15 @@
"""
Check if a dataset is prepared well to be consumed by this pkg
"""

import argparse
import logging
import os.path
import pathlib
import random
import json
from tqdm import tqdm
from vision_datasets import DatasetRegistry, Usages, DatasetHub, DatasetTypes, ManifestDataset
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)
from vision_datasets import DatasetHub, DatasetTypes, ManifestDataset
from .utils import add_args_to_locate_dataset, get_or_generate_data_reg_json_and_usages, set_up_cmd_logger

logger = set_up_cmd_logger(__name__)


def show_dataset_stats(dataset):
@@ -43,7 +45,8 @@ def check_images(dataset: ManifestDataset, err_msg_file: pathlib.Path):
file_not_found_list.append(str(e))

if file_not_found_list:
err_msg_file.write_text('\n'.join(file_not_found_list))
logger.info(f'Errors => {err_msg_file.as_posix()}')
err_msg_file.write_text('\n'.join(file_not_found_list), encoding='utf-8')


def classification_detection_check(dataset: ManifestDataset):
@@ -77,71 +80,37 @@ def classification_detection_check(dataset: ManifestDataset):
logger.info(str(stats))


def _generate_reg_json(name, type, coco_path):
data_info = [
{
'name': name,
'version': 1,
'type': type,
'format': 'coco',
'root_folder': '',
'train': {
'index_path': coco_path.name
}
}
]

return json.dumps(data_info)


def main():
parser = argparse.ArgumentParser('Check if a dataset is valid')
parser.add_argument('name', type=str, help='Dataset name.')
parser.add_argument('--reg_json', '-r', type=pathlib.Path, default=None, help='dataset registration json file path.', required=False)
parser.add_argument('--version', '-v', type=int, help='Dataset version.', default=None)

parser.add_argument('--coco_json', '-c', type=pathlib.Path, default=None, help='Single coco json file to check.', required=False)
parser.add_argument('--data_type', '-t', type=str, default=None, help='Type of data.', choices=DatasetTypes.VALID_TYPES, required=False)

parser.add_argument('--blob_container', '-k', type=str, help='Blob container (sas) url', required=False)
parser.add_argument('--folder_to_check', '-f', type=pathlib.Path, required=False, help='Check the dataset in this folder.')
parser = argparse.ArgumentParser('Check if a dataset is valid for pkg to consume.')
add_args_to_locate_dataset(parser)
parser.add_argument('--quick_check', '-q', action='store_true', default=False, help='Randomly check a few data samples from the dataset.')

args = parser.parse_args()
prefix = logging_prefix(args.name, args.version)

if args.reg_json:
usages = [Usages.TRAIN_PURPOSE, Usages.VAL_PURPOSE, Usages.TEST_PURPOSE]
data_reg_json = args.reg_json.read_text()
else:
assert args.coco_json, '--coco_json not provided'
assert args.data_type, '--data_type not provided'
usages = [Usages.TRAIN_PURPOSE]
data_reg_json = _generate_reg_json(args.name, args.data_type, args.coco_json)
data_reg_json, usages = get_or_generate_data_reg_json_and_usages(args)
dataset_hub = DatasetHub(data_reg_json)
dataset_info = dataset_hub.dataset_registry.get_dataset_info(args.name, args.version)

dataset_info = DatasetRegistry(data_reg_json).get_dataset_info(args.name, args.version)
if not dataset_info:
logger.error(f'{prefix} dataset does not exist.')
return
else:
logger.info(f'{prefix} dataset found in registration file.')

vision_datasets = DatasetHub(data_reg_json)
if args.blob_container and args.local_dir:
args.local_dir.mkdir(parents=True, exist_ok=True)

for usage in usages:
logger.info(f'{prefix} Check dataset with usage: {usage}.')
if args.folder_to_check and not args.folder_to_check.exists():
os.mkdir(args.folder_to_check)

# if args.folder_to_check is none, then this check will directly try to access data from azure blob. Images must be present in uncompressed folder on azure blob.
dataset = vision_datasets.create_manifest_dataset(container_sas=args.blob_container, local_dir=args.folder_to_check, name=dataset_info.name, version=args.version, usage=usage)
# if args.local_dir is none, then this check will directly try to access data from azure blob. Images must be present in uncompressed folder on azure blob.
dataset = dataset_hub.create_manifest_dataset(container_sas=args.blob_container, local_dir=args.local_dir, name=dataset_info.name, version=args.version, usage=usage)
if dataset:
err_msg_file = pathlib.Path(f'{args.name}_{usage}_errors.txt')
if args.quick_check:
quick_check_images(dataset)
else:
check_images(dataset, err_msg_file)

if args.data_type in [DatasetTypes.IC_MULTICLASS, DatasetTypes.IC_MULTILABEL, DatasetTypes.OD]:
classification_detection_check(dataset)
else: