upload readme and figs

ViTAE-Transformer · May 2, 2023 · 86d269f
1 parent 7b610a8
commit 86d269f
Showing 7 changed files with 179 additions and 2 deletions.
diff --git a/README.md b/README.md
@@ -1,2 +1,179 @@
-# SAMText
-The official repo for the technical report "Scalable Mask Annotation for Video Text Spotting"
+<h1 align="center">[arXiv 2023] Scalable Mask Annotation for Video Text Spotting</h1>
+<p align="center">
+<h4 align="center">This is the official repository of the paper <a href="https://xxxx.com">Scalable Mask Annotation for Video Text Spotting</a>.</h4>
+<h5 align="center"><em>Haibin He, Jing Zhang, Mengyang Xu, Juhua Liu, Bo Du, Dacheng Tao</em></h5>
+<p align="center">
+  <a href="#news">News</a> |
+  <a href="#abstract">Abstract</a> |
+  <a href="#method">Method</a> |
+  <a href="#usage">Usage</a> |
+  <a href="#results">Results</a> |
+  <a href="#statement">Statement</a>
+</p>
+
+
+
+
+# News
+
+***02/05/2023***
+
+- The paper has been posted on arXiv! The code will be made publicly available once it is cleaned up.
+
+- Relevant Project: 
+
+  > [**DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer** ](https://arxiv.org/abs/2207.04491) | [Code](https://github.com/ymy-k/DPText-DETR)
+  >
+  > Maoyuan Ye, Jing Zhang, Shanshan Zhao, Juhua Liu, Bo Du, Dacheng Tao
+  >
+  > [**DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting** ](https://arxiv.org/pdf/2211.10772v3) | [Code](https://github.com/ViTAE-Transformer/DeepSolo)
+
+  Other applications of [ViTAE](https://github.com/ViTAE-Transformer/ViTAE-Transformer) include: [ViTPose](https://github.com/ViTAE-Transformer/ViTPose) | [Remote Sensing](https://github.com/ViTAE-Transformer/ViTAE-Transformer-Remote-Sensing) | [Matting](https://github.com/ViTAE-Transformer/ViTAE-Transformer-Matting) | [VSA](https://github.com/ViTAE-Transformer/ViTAE-VSA) | [Video Object Segmentation](https://github.com/ViTAE-Transformer/VOS-LLB)
+
+# Abstract
+
+<p align="left">Video text spotting refers to localizing, recognizing, and tracking textual elements
+such as captions, logos, license plates, signs, and other forms of text within consecutive
+video frames. However, current datasets available for this task rely on
+quadrilateral ground truth annotations, which may result in including excessive
+background content and inaccurate text boundaries. Furthermore, methods trained
+on these datasets often produce prediction results in the form of quadrilateral boxes,
+which limits their ability to handle complex scenarios such as dense or curved text.
+To address these issues, we propose a scalable mask annotation pipeline called
+SAMText for video text spotting. SAMText leverages the <a href="https://arxiv.org/abs/2304.02643">SAM</a> model to
+generate mask annotations for scene text images or video frames at scale. Using
+SAMText, we have created a large-scale dataset, SAMText-9M, that contains over
+2,400 video clips sourced from existing datasets and over 9 million mask annotations.
+We have also conducted a thorough statistical analysis of the generated
+masks and their quality, identifying several research topics that could be further
+explored based on this dataset. 
+
+
+
+
+# Method
+<figure>
+<img src="figs/opening.png">
+<figcaption align = "center"><b>Figure 1: Overview of the SAMText pipeline that builds upon the <a href="https://arxiv.org/abs/2304.02643">SAM</a>   approach to generate
+mask annotations for scene text images or video frames at scale. The input bounding box may be
+sourced from existing annotations or derived from a scene text detection model.</b></figcaption>
+</figure>
+
+
+
+
+
+# Usage
+The code and models will be released soon.
+
+
+
+# Results
+## The Quality of Generated Masks
+
+<figure>
+<img src="figs/figure3.png">
+<figcaption align = "center"><b>Figure 3: The distribution of IoU between the generated
+masks and ground truth masks in the COCO-Text
+training dataset (<a href="https://arxiv.org/abs/1601.07140">COCO-Text V2</a>).
+ </b></figcaption>
+</figure>
+
+To evaluate the performance of SAMText, we
+select the COCO-Text training dataset [25] as it
+provides ground truth mask annotations for text
+instances. Specifically, we randomly sample
+10% of the training data and calculate the IoU
+between the masks generated by SAMText and
+their corresponding ground truth masks. Our
+findings show that SAMText has high accuracy,
+with an average IoU of 0.70. The histogram of
+IoU scores is shown in Figure 3. Notably, the
+majority of IoU scores are centered around 0.75,
+suggesting that SAMText performs well on most
+text instances.
+
+
+
+
+
+## Visualization of Generated Masks
+
+
+
+<figure>
+<img src="figs/figure2.jpg">
+<figcaption align = "center"><b>Figure 2: Some visualization results of the generated masks in five datasets using the SAMText
+pipeline. The top row shows the scene text frames while the bottom row shows the generated masks.
+ </b></figcaption>
+</figure>
+
+In Figure 2, we show some visualization results of the generated masks in five datasets using the
+SAMText pipeline. The top row shows the scene text frames while the bottom row shows the
+generated masks. As can be seen, the generated masks possess fewer background components and
+align more precisely with the text boundaries than the bounding boxes. As a result, the generated
+mask annotations facilitate conducting more comprehensive research on this dataset, e.g., video text
+segmentation and video text spotting using mask annotations.
+
+
+
+
+
+
+## Dataset Statistics and Analysis
+### The size distribution.
+
+<figure>
+<img src="figs/figure4.png">
+<figcaption align = "center"><b>Figure 4: (a) The mask size distributions of the ICDAR15, RoadText-1k, LSVDT, and DSText datasets.
+Masks exceeding 10,000 pixels are excluded from the statistics. (b) The mask size distribution of
+the BOVText dataset. Masks exceeding 80,000 pixels are excluded from the statistics.
+ </b></figcaption>
+</figure>
+
+
+
+### The IoU and CoV distribution.
+
+<figure>
+<img src="figs/figure5.png">
+<figcaption align = "center"><b>Figure 5: (a) The distribution of IoU between the generated masks and ground truth bounding boxes
+in each dataset. (b) The CoV distribution of mask size changes for the same individual in consecutive
+frames in all five datasets. CoV scores exceeding 1.0 are excluded from the statistics.
+ </b></figcaption>
+</figure>
+
+
+
+### The spatial distribution.
+
+<figure>
+<img src="figs/figure6.png">
+<figcaption align = "center"><b>Figure 6: Visualization of the heatmaps that depict the spatial distribution of the generated masks in
+the five video text spotting datasets employed to establish SAMText-9M.
+ </b></figcaption>
+</figure>
+
+
+
+# Statement
+
+This project is for research purposes only. For any other questions, please contact [[email protected]](mailto:[email protected]).
+
+
+
+## Citation
+
+If you find SAMText helpful, please consider giving this repo a star :star: and citing:
+
+```
+@inproceedings{SAMText,
+  title={Scalable Mask Annotation for Video Text Spotting},
+  author={Haibin He and Jing Zhang and Mengyang Xu and Juhua Liu and Bo Du and Dacheng Tao},
+  booktitle={arxiv},
+  year={2023}
+}
+```
+
+
+
diff --git a/figs/figure2.jpg b/figs/figure2.jpg
diff --git a/figs/figure3.png b/figs/figure3.png
diff --git a/figs/figure4.png b/figs/figure4.png
diff --git a/figs/figure5.png b/figs/figure5.png
diff --git a/figs/figure6.png b/figs/figure6.png
diff --git a/figs/opening.png b/figs/opening.png