Image save w/o compression from pil/base64; clean util code; TSV doc (#47)

* Add image save without compression from pil or base64; clean up util code; doc on TSV

pj-ms authored Oct 14, 2022
1 parent 7ff3709 commit b446462
Showing 25 changed files with 633 additions and 314 deletions.
27 changes: 27 additions & 0 deletions COCO_DATA_FORMAT.md
@@ -109,3 +109,30 @@ Here is one example of the json file for the image regression task, where the "t
]
}
```

## MultiTask dataset

A multitask dataset is one where a single set of images carries multiple sets of annotations, one per task, and each task can be any of the single-task types mentioned above.

For example, a set of people images can be annotated with several attributes: gender (classification: {male, female, other}), height (regression: 0-300 cm), person location (detection: {x, y, w, h}), etc.

Representing this kind of dataset is simple: create one independent coco file for each task:

```
people_dataset/
train_images/
...
test_images/
...
train_images.zip
test_images.zip
train_coco_gender.json
test_coco_gender.json
train_coco_height.json
test_coco_height.json
train_coco_location.json
test_coco_location.json
```
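
Each of those files is an ordinary single-task coco file over the same set of images. As an illustration only (the authoritative field names are in the single-task examples earlier in this doc), a hypothetical `train_coco_gender.json` might look like:

```json
{
    "images": [{"id": 1, "width": 224.0, "height": 224.0, "file_name": "train_images.zip@1.jpg"},
               {"id": 2, "width": 224.0, "height": 224.0, "file_name": "train_images.zip@2.jpg"}],
    "annotations": [{"id": 1, "image_id": 1, "category_id": 1},
                    {"id": 2, "image_id": 2, "category_id": 2}],
    "categories": [{"id": 1, "name": "male"}, {"id": 2, "name": "female"}, {"id": 3, "name": "other"}]
}
```
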
48 changes: 47 additions & 1 deletion DATA_PREPARATION.md
@@ -73,7 +73,9 @@ After everything is done, remember to run the commands below to do a final check

## Host/manage datasets on cloud/disk

The `DatasetHub` class manages access to multiple datasets, either from local disk or cloud, using the dataset name and version information. It takes a dataset registration json file, which contains the meta information of each dataset. For each dataset, there is a corresponding entry in the json.

Below are two examples of single-task datasets:

```{json}
[
@@ -112,4 +114,48 @@ After everything is done, remember to run the commands below to do a final check
]
```

Below is an example of a multitask dataset:

```json
[
    {
        "name": "people-dataset",
        "description": "people dataset including gender, height, and location information",
        "version": 1,
        "format": "coco",
        "type": "multitask",
        "root_folder": "multitask/people_dataset",
        "tasks": {
            "gender": {
                "type": "classification_multiclass",
                "train": {
                    "index_path": "train_coco_gender.json", "files_for_local_usage": [ "train_images.zip" ]
                },
                "test": {
                    "index_path": "test_coco_gender.json", "files_for_local_usage": [ "test_images.zip" ]
                }
            },
            "height": {
                "type": "image_regression",
                "train": {
                    "index_path": "train_coco_height.json", "files_for_local_usage": [ "train_images.zip" ]
                },
                "test": {
                    "index_path": "test_coco_height.json", "files_for_local_usage": [ "test_images.zip" ]
                }
            },
            "location": {
                "type": "object_detection",
                "train": {
                    "index_path": "train_coco_location.json", "files_for_local_usage": [ "train_images.zip" ]
                },
                "test": {
                    "index_path": "test_coco_location.json", "files_for_local_usage": [ "test_images.zip" ]
                }
            }
        }
    }
]
```

Check the usage code example in [`README.md`](README.md).
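
For quick reference, below is a minimal sketch of loading the dataset registered above, based on how the `dataset_check` command in this repo uses `DatasetHub`; the registration file path is hypothetical.

```python
import pathlib

from vision_datasets import DatasetHub, Usages

# Hypothetical local path holding the registration json shown above.
data_reg_json = pathlib.Path('dataset_registration.json').read_text()
dataset_hub = DatasetHub(data_reg_json)

# container_sas can be None when the dataset files already exist under local_dir.
dataset = dataset_hub.create_manifest_dataset(
    container_sas=None,
    local_dir='./people_dataset',
    name='people-dataset',
    version=1,
    usage=Usages.TRAIN_PURPOSE)
# `dataset` is a ManifestDataset; see README.md for consuming it, e.g. for PyTorch training.
```
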
5 changes: 5 additions & 0 deletions README.md
@@ -130,3 +130,8 @@ When data exists on local disk, `blob_container_sas` can be `None`.
### Training with PyTorch

Training with PyTorch is easy. After instantiating a `ManifestDataset`, simply pass it to `vision_datasets.pytorch.torch_dataset.TorchDataset` together with the `transform`, and you are good to go with the PyTorch DataLoader for training.
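
A minimal sketch of that flow is below; the torchvision-style `transform` and the `TorchDataset(dataset, transform=...)` keyword are assumptions based on the description above, so check the class for the exact signature it expects.

```python
from torch.utils.data import DataLoader
from torchvision import transforms

from vision_datasets.pytorch.torch_dataset import TorchDataset

# A ManifestDataset created via DatasetHub, see DATA_PREPARATION.md.
manifest_dataset = ...

transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
torch_dataset = TorchDataset(manifest_dataset, transform=transform)  # assumed keyword name for the transform
data_loader = DataLoader(torch_dataset, batch_size=32, shuffle=True)

for batch in data_loader:
    ...  # unpack images/targets from the batch and run a training step
```
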


## Helpful commands

A few commands come with this repo once it is installed, such as TSV <=> COCO conversion, dataset check and download, detection => classification dataset conversion, and so on. Check [`UTIL_COMMANDS.md`](./UTIL_COMMANDS.md) for details.
60 changes: 60 additions & 0 deletions TSV_FORMAT.md
@@ -0,0 +1,60 @@
# Introduction

TSV format uses one or multiple `*.tsv` files to store both the image annotations and the image data, where the
- **first column** is the image_id, which can be a string, int, or any unique identifier
- **second column** is the annotation
- **third column** is the [base64-encoded](https://en.wikipedia.org/wiki/Base64) string of the image data.

This repo does not support consuming the TSV format directly, but we provide tools for converting between TSV and coco format for a limited set of tasks, such as

```bash
vision_convert_to_tsv {dataset_name} -r {dataset_registry_json} -k {data storage url} -f {local_dir} [-u Usages]
```

For more commands and help related to TSV, check out [UTIL_COMMANDS.md](./UTIL_COMMANDS.md).
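
As a concrete illustration of the three-column layout (not part of this repo's API), here is a minimal sketch of decoding one row:

```python
import base64
import io
import json

from PIL import Image


def parse_tsv_row(row: str):
    """Split one TSV row into (image_id, annotations, PIL image)."""
    image_id, annotation_str, b64_image = row.rstrip('\n').split('\t')
    annotations = json.loads(annotation_str)  # e.g. [{"class": "dog"}]
    image = Image.open(io.BytesIO(base64.b64decode(b64_image)))
    return image_id, annotations, image
```
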

# Task-wise Format

## Image Classification

Below is an example of multiclass classification:

```
1 [{"class": "dog"}] /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA.....
2 [{"class": "cat"}] /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA.....
3 [{"class": "wolff"}] /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA.....
...
```

For multilabel classification:

```
1 [{"class": "dog"}, {"class": "canidae"}, {"class": "pet"}] /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA.....
2 [{"class": "cat"}, {"class": "Felidae"}, {"class": "pet"}] /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA.....
3 [{"class": "wolff"}, {"class": "canidae"}}] /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA.....
...
```

## Object Detection

The format for object detection is very similar to multilabel classification, with an additional field `rect: [left, top, right, bottom]`:

```
1 [{"class": "dog", "rect": [10, 10, 100, 100]}, {"class": "cat", "rect": [10, 10, 100, 100]}] /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA.....
2 [{"class": "cat", "rect": [10, 20, 250, 100]}] /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA.....
3 [{"class": "wolff", "rect": [100, 200, 250, 1000]}] /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA.....
...
```


## Image Caption


The format for image captions is straightforward:

```
1 [{"caption": "dog playing with a cat"}] /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA.....
2 [{"caption": "dog eating food"}] /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA.....
3 [{"caption": "wolff sitting in snow"] /9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAA.....
...
```
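
Conversely, a row for a custom dataset can be assembled by base64-encoding the raw image bytes. Below is a minimal sketch, independent of this repo's converters; the file path is hypothetical.

```python
import base64
import json
import pathlib


def to_tsv_row(image_id: str, annotations: list, image_path: str) -> str:
    """Compose one TSV row: image_id, json annotations, base64-encoded image bytes."""
    b64_image = base64.b64encode(pathlib.Path(image_path).read_bytes()).decode('utf-8')
    return '\t'.join([image_id, json.dumps(annotations), b64_image])


# Example usage with a hypothetical local image file.
row = to_tsv_row('1', [{'caption': 'dog playing with a cat'}], 'dog_and_cat.jpg')
```
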
10 changes: 10 additions & 0 deletions UTIL_COMMANDS.md
@@ -0,0 +1,10 @@
# Helpful commands

- `vision_download`: download the dataset files to local disk for consumption; the data can also be downloaded/converted to TSV directly.
- `vision_check_dataset`: check whether a dataset or [coco json + images] is problematic.
- `vision_convert_to_tsv`: convert a dataset or [coco json + images] to TSV format; currently only classification, object detection, and caption tasks are supported. The TSV format doc can be found at [`TSV_FORMAT.md`](./TSV_FORMAT.md).
- `vision_convert_tsv_to_coco`: convert a TSV file to [coco json + images].
- `vision_convert_od_to_ic`: convert a detection dataset to a classification dataset (with or without augmentations).
- `vision_merge_datasets`: merge multiple datasets into one.

For each command, run `command -h` for more details.
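
For example, a typical dataset check might look like the following; the flag names are taken from the `dataset_check` command in this commit and the conversion example in [`TSV_FORMAT.md`](./TSV_FORMAT.md), and the paths are hypothetical, so run `vision_check_dataset -h` to confirm.

```bash
# Validate a registered dataset that was downloaded to ./people_dataset.
vision_check_dataset people-dataset -r dataset_registration.json -f ./people_dataset --quick_check

# Validate a single [coco json + images] pair by giving the coco file and the data type.
vision_check_dataset my-dataset -c ./train_coco.json -t classification_multiclass
```
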
1 change: 1 addition & 0 deletions requirements.txt
@@ -4,3 +4,4 @@ requests>=2.23.0
tenacity>=6.2.0
tqdm
torch>=1.7.1
matplotlib
16 changes: 10 additions & 6 deletions setup.py
@@ -1,7 +1,7 @@
import setuptools
from os import path

VERSION = '0.2.19'
VERSION = '0.2.20'

# Get the long description from the README file
here = path.abspath(path.dirname(__file__))
@@ -25,7 +25,8 @@
'Pillow>=6.2.2',
'requests>=2.23.0',
'tenacity>=6.2.0',
'tqdm'
'tqdm',
'matplotlib'
],
classifiers=[
'Development Status :: 4 - Beta',
@@ -39,13 +40,16 @@
],
extras_require={'run': ['torch>=1.6.0', 'torchvision>=0.7.0']},
entry_points={
'console_scripts': ['vision_download=vision_datasets.commands.download:main',
'vision_check_dataset=vision_datasets.commands.dataset_check:main',
'vision_od_to_ic=vision_datasets.commands.converter_od_to_ic:main',
'console_scripts': ['vision_download=vision_datasets.commands.download_dataset:main',
'vision_merge_datasets=vision_datasets.commands.merge_datasets:main',

'vision_check_dataset=vision_datasets.commands.check_dataset:main',
'vision_convert_od_to_ic=vision_datasets.commands.converter_od_to_ic:main',
'vision_convert_local_dir_ic_data=vision_datasets.commands.convert_local_dir_ic_data:main',
'vision_convert_to_tsv=vision_datasets.commands.converter_to_tsv:main',
'vision_convert_tsv_to_coco=vision_datasets.commands.converter_tsv_to_coco:main',

'vision_tsv_to_iris=vision_datasets.commands.converter_tsv_to_iris:main',
'vision_tsv_to_coco=vision_datasets.commands.converter_tsv_to_coco:main',
'vision_yolo_to_iris=vision_datasets.commands.converter_yolo_darknet_to_iris:main',
'vision_gen_metafile=vision_datasets.commands.generate_image_meta_info:main']
})
29 changes: 29 additions & 0 deletions tests/test_utils.py
@@ -0,0 +1,29 @@
import tempfile
import unittest
import pathlib
import numpy
from PIL import Image, ImageChops
from vision_datasets.commands.utils import Base64Utils


class TestBase64Utils(unittest.TestCase):
@staticmethod
def _create_rand_img(size=(100, 100)) -> Image.Image:
assert len(size) == 2

imarray = numpy.random.rand(size[0], size[1], 3) * 255
im = Image.fromarray(imarray.astype('uint8')).convert('RGB')
return im

def test_b64_to_file_loses_no_info(self):
with tempfile.TemporaryDirectory() as temp_dir:
temp_dir = pathlib.Path(temp_dir)
img_filepath_1 = temp_dir / 'temp_1.jpg'
TestBase64Utils._create_rand_img().save(img_filepath_1)

img_filepath_2 = temp_dir / 'temp_2.jpg'
b64str = Base64Utils.file_to_b64_str(img_filepath_1)
Base64Utils.b64_str_to_file(b64str, img_filepath_2)
img1 = Image.open(img_filepath_1)
img2 = Image.open(img_filepath_2)
assert not ImageChops.difference(img1, img2).getbbox()
3 changes: 2 additions & 1 deletion vision_datasets/__init__.py
@@ -1,5 +1,6 @@
from .common import DatasetRegistry, Usages, DatasetTypes, CocoManifestAdaptor, IrisManifestAdaptor, DatasetManifest, DatasetInfo, ManifestDataset, BaseDatasetInfo, MultiTaskDatasetInfo
from .resources import DatasetHub
from .commands import Base64Utils

__all__ = ['DatasetRegistry', 'Usages', 'DatasetTypes', 'CocoManifestAdaptor', 'IrisManifestAdaptor', 'DatasetManifest', 'DatasetInfo', 'ManifestDataset', 'BaseDatasetInfo', 'MultiTaskDatasetInfo',
'DatasetHub']
'DatasetHub', 'Base64Utils']
3 changes: 3 additions & 0 deletions vision_datasets/commands/__init__.py
@@ -0,0 +1,3 @@
from .utils import Base64Utils

__all__ = ['Base64Utils']
@@ -1,13 +1,15 @@
"""
Check if a dataset is prepared well to be consumed by this pkg
"""

import argparse
import logging
import os.path
import pathlib
import random
import json
from tqdm import tqdm
from vision_datasets import DatasetRegistry, Usages, DatasetHub, DatasetTypes, ManifestDataset
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)
from vision_datasets import DatasetHub, DatasetTypes, ManifestDataset
from .utils import add_args_to_locate_dataset, get_or_generate_data_reg_json_and_usages, set_up_cmd_logger

logger = set_up_cmd_logger(__name__)


def show_dataset_stats(dataset):
@@ -43,7 +45,8 @@ def check_images(dataset: ManifestDataset, err_msg_file: pathlib.Path):
file_not_found_list.append(str(e))

if file_not_found_list:
err_msg_file.write_text('\n'.join(file_not_found_list))
logger.info(f'Errors => {err_msg_file.as_posix()}')
err_msg_file.write_text('\n'.join(file_not_found_list), encoding='utf-8')


def classification_detection_check(dataset: ManifestDataset):
@@ -77,71 +80,37 @@ def classification_detection_check(dataset: ManifestDataset):
logger.info(str(stats))


def _generate_reg_json(name, type, coco_path):
data_info = [
{
'name': name,
'version': 1,
'type': type,
'format': 'coco',
'root_folder': '',
'train': {
'index_path': coco_path.name
}
}
]

return json.dumps(data_info)


def main():
parser = argparse.ArgumentParser('Check if a dataset is valid')
parser.add_argument('name', type=str, help='Dataset name.')
parser.add_argument('--reg_json', '-r', type=pathlib.Path, default=None, help='dataset registration json file path.', required=False)
parser.add_argument('--version', '-v', type=int, help='Dataset version.', default=None)

parser.add_argument('--coco_json', '-c', type=pathlib.Path, default=None, help='Single coco json file to check.', required=False)
parser.add_argument('--data_type', '-t', type=str, default=None, help='Type of data.', choices=DatasetTypes.VALID_TYPES, required=False)

parser.add_argument('--blob_container', '-k', type=str, help='Blob container (sas) url', required=False)
parser.add_argument('--folder_to_check', '-f', type=pathlib.Path, required=False, help='Check the dataset in this folder.')
parser = argparse.ArgumentParser('Check if a dataset is valid for pkg to consume.')
add_args_to_locate_dataset(parser)
parser.add_argument('--quick_check', '-q', action='store_true', default=False, help='Randomly check a few data samples from the dataset.')

args = parser.parse_args()
prefix = logging_prefix(args.name, args.version)

if args.reg_json:
usages = [Usages.TRAIN_PURPOSE, Usages.VAL_PURPOSE, Usages.TEST_PURPOSE]
data_reg_json = args.reg_json.read_text()
else:
assert args.coco_json, '--coco_json not provided'
assert args.data_type, '--data_type not provided'
usages = [Usages.TRAIN_PURPOSE]
data_reg_json = _generate_reg_json(args.name, args.data_type, args.coco_json)
data_reg_json, usages = get_or_generate_data_reg_json_and_usages(args)
dataset_hub = DatasetHub(data_reg_json)
dataset_info = dataset_hub.dataset_registry.get_dataset_info(args.name, args.version)

dataset_info = DatasetRegistry(data_reg_json).get_dataset_info(args.name, args.version)
if not dataset_info:
logger.error(f'{prefix} dataset does not exist.')
return
else:
logger.info(f'{prefix} dataset found in registration file.')

vision_datasets = DatasetHub(data_reg_json)
if args.blob_container and args.local_dir:
args.local_dir.mkdir(parents=True, exist_ok=True)

for usage in usages:
logger.info(f'{prefix} Check dataset with usage: {usage}.')
if args.folder_to_check and not args.folder_to_check.exists():
os.mkdir(args.folder_to_check)

# if args.folder_to_check is none, then this check will directly try to access data from azure blob. Images must be present in uncompressed folder on azure blob.
dataset = vision_datasets.create_manifest_dataset(container_sas=args.blob_container, local_dir=args.folder_to_check, name=dataset_info.name, version=args.version, usage=usage)
# if args.local_dir is none, then this check will directly try to access data from azure blob. Images must be present in uncompressed folder on azure blob.
dataset = dataset_hub.create_manifest_dataset(container_sas=args.blob_container, local_dir=args.local_dir, name=dataset_info.name, version=args.version, usage=usage)
if dataset:
err_msg_file = pathlib.Path(f'{args.name}_{usage}_errors.txt')
if args.quick_check:
quick_check_images(dataset)
else:
check_images(dataset, err_msg_file)

if args.data_type in [DatasetTypes.IC_MULTICLASS, DatasetTypes.IC_MULTILABEL, DatasetTypes.OD]:
classification_detection_check(dataset)
else: