Commit
add bbox check to check dataset script (#54)
* add bbox check to check dataset script

* fix doc
pj-ms authored Apr 11, 2023
1 parent 0195e0c commit 6a99d8c
Showing 5 changed files with 77 additions and 37 deletions.
40 changes: 25 additions & 15 deletions COCO_DATA_FORMAT.md
@@ -6,7 +6,7 @@ In coco, we use `file_name` and `zip_file` to construct the file_path in `ImageD

Here is one example of the train.json, val.json, or test.json in the `DatasetInfo` above. Note that the `"id"` for `images`, `annotations` and `categories` should be consecutive integers, **starting from 1**. Note that our lib might work with ids starting from 0, but many tools like [CVAT](https://github.com/openvinotoolkit/cvat/issues/2085) and the official [COCOAPI](https://github.com/cocodataset/cocoapi/issues/507) will fail.
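A minimal sketch of such an id check (a hypothetical helper, assuming the annotation file is a local JSON file; not part of this library) is shown here, before the example below:

```python
import json

def check_consecutive_ids(coco_json_path):
    # Hypothetical helper: verify that ids in each section run 1..N with no gaps.
    with open(coco_json_path, encoding='utf-8') as f:
        coco = json.load(f)
    for section in ('images', 'annotations', 'categories'):
        ids = sorted(item['id'] for item in coco.get(section, []))
        if ids != list(range(1, len(ids) + 1)):
            raise ValueError(f'"{section}" ids are not consecutive integers starting from 1')
```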

``` {json}
```json
{
"images": [{"id": 1, "width": 224.0, "height": 224.0, "file_name": "train_images/siberian-kitten.jpg", "zip_file": "train_images.zip"},
{"id": 2, "width": 224.0, "height": 224.0, "file_name": "train_images/kitten 3.jpg", "zip_file": "train_images.zip"}],
@@ -22,7 +22,7 @@ Here is one example of the train.json, val.json, or test.json in the `DatasetInf

## Object detection

``` {json}
```json
{
"images": [{"id": 1, "width": 224.0, "height": 224.0, "file_name": "train_images/siberian-kitten.jpg", "zip_file": "train_images.zip"},
{"id": 2, "width": 224.0, "height": 224.0, "file_name": "train_images/kitten 3.jpg", "zip_file": "train_images.zip"}],
@@ -46,7 +46,7 @@ Note that

Here is one example of the json file for the image caption task.

``` {json}
```json
{
"images": [{"id": 1, "file_name": "train_images/honda.jpg", "zip_file": "train_images.zip"},
{"id": 2, "file_name": "train_images/kitchen.jpg", "zip_file": "train_images.zip"}],
@@ -62,7 +62,7 @@ Here is one example of the json file for image caption task.

Here is one example of the json file for the image text matching task. `match: 1` indicates that the image and text match.

``` {json}
```json
{
"images": [{"id": 1, "file_name": "train_images/honda.jpg", "zip_file": "train_images.zip"},
{"id": 2, "file_name": "train_images/kitchen.jpg", "zip_file": "train_images.zip"}],
@@ -84,7 +84,7 @@ Here is one example of the json file for image matting task. The "label" in the

Specifically, **only** image files are supported for the label files. The ground truth image should be a one-channel image (i.e. `PIL.Image` mode "L", instead of "RGB") with the same width and height as the image file. Refer to the images in [tests/image_matting_test_data.zip](tests/image_matting_test_data.zip) as an example.
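A quick way to verify this requirement is sketched below (a hypothetical check, assuming Pillow is installed and both files are readable locally; not part of this library):

```python
from PIL import Image

def check_matting_label(image_path, label_path):
    # Hypothetical helper: the label must be single-channel ("L") and match the image size.
    with Image.open(image_path) as img, Image.open(label_path) as label:
        if label.mode != 'L':
            raise ValueError(f'label {label_path} has mode {label.mode}, expected "L"')
        if label.size != img.size:
            raise ValueError(f'label size {label.size} does not match image size {img.size}')
```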

``` {json}
```json
{
"images": [{"id": 1, "file_name": "train_images/image/test_1.jpg", "zip_file": "train_images.zip"},
{"id": 2, "file_name": "train_images/image/test_2.jpg", "zip_file": "train_images.zip"}],
@@ -99,7 +99,7 @@ Specifically, **only** image files are supported for the label files. The ground

Here is one example of the json file for the image regression task, where the "target" in the "annotations" field is a real-valued number (e.g. a score, an age, etc.). Note that each image should only have one regression target (i.e. there should be exactly one annotation for each image).

``` {json}
```json
{
"images": [{"id": 1, "width": 224.0, "height": 224.0, "file_name": "train_images/image_1.jpg", "zip_file": "train_images.zip"},
{"id": 2, "width": 224.0, "height": 224.0, "file_name": "train_images/image_2.jpg", "zip_file": "train_images.zip"}],
@@ -111,28 +111,38 @@ Here is one example of the json file for the image regression task, where the "t
```

## Image retrieval
This is an example of a JSON file for the image retrieval task. This format is similar to the image_caption and image_text_matching dataset formats, as it contains text associated with images. However, there are some important differences:

1. The file may contain an optional "categories" section, which defines both a category name and an optional super category. This allows for a hierarchical structure in the data, which cannot be achieved with only the query field.
2. Unlike the other two dataset formats, the annotations in this file contain a "query" field rather than a "text" or "caption" field. Each image is associated with a query as well as a category_id.
This task represents data of images retrieved by text queries.

The category_id can provide additional information related to the nature of the image. For example, an image can belong to a group of images (called a supercategory, e.g. "race") and within that group fall into a subgroup (e.g. "white" or "black"). Overall, this format allows for more complex and nuanced associations between images and text than other formats, due to the hierarchical structure provided by the category section.
```json
{
"images": [
{"id": 1, "zip_file": "test1.zip", "file_name": "test/0/image_1.jpg"},
{"id": 2, "zip_file": "test2.zip", "file_name": "test/1/image_2.jpg"}
],
"annotations": [
{"image_id": 1, "id": 1, "query": "Men eating a banana."},
{"image_id": 2, "id": 2, "query": "An apple on the desk."}
]
}
```

The retrieved images might come with additional classification data in the annotation field, mixed with the query annotations. This might change in the future, as the same result can be achieved with the multitask dataset concept: one task solely for image retrieval and another solely for classification.


``` {json}
```json
{
"images": [
{"id": 1, "zip_file": "test1.zip", "file_name": "test/0/image_1.jpg"},
{"id": 2, "zip_file": "test2.zip", "file_name": "test/1/image_2.jpg"}
],
"categories": [
{"id": 1, "name": "white", "supercategory": "race"},
{"id": 2, "name": "black", "supercategory": "race"}
{"id": 1, "name": "banana", "supercategory": "fruit"},
{"id": 2, "name": "apple", "supercategory": "fruit"}
],
"annotations": [
{"image_id": 1, "id": 1, "category_id": 1, "query": "european men giving a speech"},
{"image_id": 2, "id": 2, "category_id": 2, "query": "african-american men giving a speech"}
{"image_id": 1, "id": 1, "category_id": 1, "query": "Men eating a banana."},
{"image_id": 2, "id": 2, "category_id": 2, "query": "An apple on the desk."}
]
}
```
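To illustrate how the hierarchical categories relate to the queries, here is a minimal sketch (a hypothetical snippet, not part of the library API; the file name `train.json` is assumed) that resolves each annotation's `category_id` against the example above:

```python
import json

# Hypothetical walkthrough of the retrieval example above.
with open('train.json', encoding='utf-8') as f:  # assumed file name
    coco = json.load(f)

categories_by_id = {c['id']: c for c in coco.get('categories', [])}
for ann in coco['annotations']:
    cate = categories_by_id.get(ann.get('category_id'))
    print(ann['image_id'], ann['query'],
          cate['name'] if cate else None,
          cate.get('supercategory') if cate else None)
    # e.g. 1 Men eating a banana. banana fruit
```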
2 changes: 1 addition & 1 deletion README.md
@@ -16,7 +16,7 @@ Currently, seven `basic` types of data are supported:
- `image_text_matching`: each image is associated with a collection of texts describing the image, and whether each text description matches the image or not.
- `image_matting`: each image has a pixel-wise annotation, where each pixel is labeled as 'foreground' or 'background'.
- `image_regression`: each image is labeled with a real-valued numeric regression target.
- `image_retrieval`: each image is labeled with a number of text queries describing the image. optionally an image is associated with one label.
- `image_retrieval`: each image is labeled with a number of text queries describing the image. Optionally, an image is associated with one label.

`multitask` type is a composition type, where one set of images has multiple sets of annotations available for different tasks, where each task can be of any basic type.

2 changes: 1 addition & 1 deletion setup.py
@@ -1,7 +1,7 @@
import setuptools
from os import path

VERSION = '0.2.26'
VERSION = '0.2.27'

# Get the long description from the README file
here = path.abspath(path.dirname(__file__))
46 changes: 37 additions & 9 deletions vision_datasets/commands/check_dataset.py
@@ -35,7 +35,7 @@ def quick_check_images(dataset: ManifestDataset):
show_img(dataset[idx])


def check_images(dataset: ManifestDataset, err_msg_file: pathlib.Path):
def check_images(dataset: ManifestDataset):
show_dataset_stats(dataset)
file_not_found_list = []
for i in tqdm(range(len(dataset)), 'Checking image access..'):
@@ -45,18 +45,42 @@ def check_images(dataset: ManifestDataset, err_msg_file: pathlib.Path):
file_not_found_list.append(str(e))

if file_not_found_list:
logger.info(f'Errors => {err_msg_file.as_posix()}')
err_msg_file.write_text('\n'.join(file_not_found_list), encoding='utf-8')
return ['Files not accessible: ' + (', '.join(file_not_found_list))]

return []


def _is_integer(bbox):
return all([isinstance(x, int) or (isinstance(x, float) and x.is_integer()) for x in bbox])


def check_box(bbox, img_w, img_h):
if len(bbox) != 4 or not _is_integer(bbox):
return False

l, t, r, b = bbox
return l >= 0 and t >= 0 and l < r and t < b and r <= img_w and b <= img_h
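# Illustrative only, not part of this commit: check_box expects an absolute
# [left, top, right, bottom] box in pixels, so for a 224x224 image:
#   check_box([0, 0, 224, 224], 224, 224)   # True  - full-image box
#   check_box([10, 20, 5, 30], 224, 224)    # False - left is not < right
#   check_box([0.5, 0, 10, 10], 224, 224)   # False - non-integer coordinate
#   check_box([0, 0, 10, 300], 224, 224)    # False - bottom exceeds image height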


def classification_detection_check(dataset: ManifestDataset):
errors = []
n_imgs_by_class = {x: 0 for x in range(len(dataset.labels))}
for sample in dataset.dataset_manifest.images:
for sample_idx, sample in enumerate(dataset.dataset_manifest.images):
labels = sample.labels
c_ids = set([label[0] if dataset.dataset_info.type == DatasetTypes.OD else label for label in labels])
for c_id in c_ids:
n_imgs_by_class[c_id] += 1

if dataset.dataset_info.type == DatasetTypes.OD:
w, h = sample.width, sample.height
if not w or not h or w < 0 or h < 0:
errors.append(f'Image {sample_idx} has invalid width or height: {w}, {h}')
continue

for box_id, box in enumerate(labels):
if not check_box(box[1:], w, h):
errors.append(f'Image {sample_idx}, box {box_id} is invalid: {box}\n')

c_id_with_max_images = max(n_imgs_by_class, key=n_imgs_by_class.get)
c_id_with_min_images = min(n_imgs_by_class, key=n_imgs_by_class.get)
mean_images = sum(n_imgs_by_class.values()) / len(n_imgs_by_class)
@@ -79,6 +103,8 @@ def classification_detection_check(dataset: ManifestDataset):
plt.show()
logger.info(str(stats))

return errors


def main():
parser = argparse.ArgumentParser('Check if a dataset is valid for pkg to consume.')
@@ -103,16 +129,18 @@ def main():
logger.info(f'{prefix} Check dataset with usage: {usage}.')

# if args.local_dir is none, then this check will directly try to access data from azure blob. Images must be present in uncompressed folder on azure blob.
dataset = dataset_hub.create_manifest_dataset(container_sas=args.blob_container, local_dir=args.local_dir, name=dataset_info.name, version=args.version, usage=usage)
dataset = dataset_hub.create_manifest_dataset(container_sas=args.blob_container, local_dir=args.local_dir, name=dataset_info.name, version=args.version, usage=usage, coordinates='absolute')
if dataset:
err_msg_file = pathlib.Path(f'{args.name}_{usage}_errors.txt')
errors = []
if args.data_type in [DatasetTypes.IC_MULTICLASS, DatasetTypes.IC_MULTILABEL, DatasetTypes.OD]:
errors.extend(classification_detection_check(dataset))

if args.quick_check:
quick_check_images(dataset)
else:
check_images(dataset, err_msg_file)

if args.data_type in [DatasetTypes.IC_MULTICLASS, DatasetTypes.IC_MULTILABEL, DatasetTypes.OD]:
classification_detection_check(dataset)
errors.extend(check_images(dataset))
err_msg_file.write_text('\n'.join(errors), encoding='utf-8')
else:
logger.info(f'{prefix} No split for {usage} available.')

24 changes: 13 additions & 11 deletions vision_datasets/common/data_manifest.py
@@ -97,7 +97,7 @@ def __init__(self, id, img_path, width, height, labels, label_file_paths=None, l
height (int): image height
labels (list or dict):
classification: [c_id] for multiclass, [c_id1, c_id2, ...] for multilabel;
detection: [[c_id, left, top, right, bottom], ...];
detection: [[c_id, left, top, right, bottom], ...] (absolute coordinates);
image_caption: [caption1, caption2, ...];
image_text_matching: [(text1, match (0 or 1), text2, match (0 or 1), ...)];
multitask: dict[task, labels];
@@ -121,7 +121,7 @@ def __init__(self, id, img_path, width, height, labels, label_file_paths=None, l
def labels(self):
if self._labels:
return self._labels
elif self.label_file_paths:
elif self.label_file_paths: # lazy load only for image matting
file_reader = FileReader()
self._labels = []
for label_file_path in self.label_file_paths:
@@ -150,7 +150,7 @@ def __init__(self, images: List[ImageDataManifest], labelmap, data_type):
data_type (str or dict) : data type, or data type by task name
"""
assert data_type != DatasetTypes.MULTITASK, 'For multitask, data_type should be a dict mapping task name to concrete data type.'
assert data_type and data_type != DatasetTypes.MULTITASK, 'For multitask, data_type should be a dict mapping task name to concrete data type.'

if isinstance(labelmap, dict):
assert isinstance(data_type, dict), 'labelmap being a dict indicating this is a multitask dataset, however the data_type is not a dict.'
@@ -167,6 +167,7 @@ def create_dataset_manifest(dataset_info, usage: str, container_sas_or_root_dir:

if dataset_info.data_format == Formats.IRIS:
return IrisManifestAdaptor.create_dataset_manifest(dataset_info, usage, container_sas_or_root_dir)

if dataset_info.data_format == Formats.COCO:
container_sas_or_root_dir = _construct_full_url_or_path_generator(container_sas_or_root_dir, dataset_info.root_folder)('')
if dataset_info.type == DatasetTypes.MULTITASK:
@@ -828,15 +829,16 @@ def create_dataset_manifest(coco_file_path_or_url: Union[str, dict, pathlib.Path

file_reader.close()

def get_file_path(info_dict: dict, file_name):
def append_zip_prefix_if_needed(info_dict: dict, file_name):
zip_prefix = info_dict.get('zip_file', '')
if zip_prefix:
zip_prefix += '@'

return get_full_sas_or_path(zip_prefix + file_name)

images_by_id = {img['id']: ImageDataManifest(img['id'], get_file_path(img, img['file_name']), img.get('width'), img.get('height'), [], {}) for img in coco_manifest['images']}
images_by_id = {img['id']: ImageDataManifest(img['id'], append_zip_prefix_if_needed(img, img['file_name']), img.get('width'), img.get('height'), [], {}) for img in coco_manifest['images']}
process_labels_without_categories = None

if data_type == DatasetTypes.IMCAP:
def process_labels_without_categories(image):
image.labels.append(annotation['caption'])
@@ -846,7 +848,7 @@ def process_labels_without_categories(image):
elif data_type == DatasetTypes.IMAGE_MATTING:
def process_labels_without_categories(image):
image.label_file_paths = image.label_file_paths or []
image.label_file_paths.append(get_file_path(annotation, annotation['label']))
image.label_file_paths.append(append_zip_prefix_if_needed(annotation, annotation['label']))
elif data_type == DatasetTypes.IMAGE_REGRESSION:
def process_labels_without_categories(image):
assert len(image.labels) == 0, f"There should be exactly one label per image for image_regression datasets, but image with id {annotation['image_id']} has more than one"
@@ -861,12 +863,12 @@ def process_labels_without_categories(image):
images = [x for x in images_by_id.values()]
return DatasetManifest(images, None, data_type)

supercategory_field_in_categories = False
if len(coco_manifest['categories']) > 0 and 'supercategory' in coco_manifest['categories'][0]:
supercategory_field_in_categories = True
supercategory_field_in_categories = len(coco_manifest['categories']) > 0 and 'supercategory' in coco_manifest['categories'][0]
if supercategory_field_in_categories:
cate_id_name = [(cate['id'], cate['name'], cate['supercategory']) for cate in coco_manifest['categories']]
else:
cate_id_name = [(cate['id'], cate['name']) for cate in coco_manifest['categories']]

cate_id_name.sort(key=lambda x: x[0])
label_id_to_pos = {x[0]: i for i, x in enumerate(cate_id_name)}
if supercategory_field_in_categories:
@@ -882,8 +884,8 @@ def process_labels_without_categories(image):
img = images_by_id[annotation['image_id']]
if 'bbox' in annotation:
bbox = annotation['bbox']
if bbox_format == BBoxFormat.LTWH:
bbox = [bbox[0], bbox[1], bbox[0] + bbox[2], bbox[1] + bbox[3]]
bbox = bbox if bbox_format == BBoxFormat.LTRB else [bbox[0], bbox[1], bbox[0] + bbox[2], bbox[1] + bbox[3]]
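# Illustrative only, not part of this commit: an LTWH box [10, 20, 30, 40]
# (x, y, width, height) becomes the LTRB box [10, 20, 40, 60] (left, top, right, bottom).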

label = [c_id] + bbox
img.labels_extra_info['iscrowd'] = img.labels_extra_info.get('iscrowd', [])
img.labels_extra_info['iscrowd'].append(annotation.get('iscrowd', 0))
