Object detection has expanded from a fixed set of categories to an open vocabulary. Moving forward, a complete intelligent vision system also needs to understand objects at a finer granularity: their parts. In this work, we propose a detector that predicts both open-vocabulary objects and their part segmentations. This ability comes from two designs:
- We train the detector on a joint dataset of part-level, object-level and image-level data.
- We parse a novel object into its parts via its dense semantic correspondence with a base object (a minimal sketch of this idea follows below).
[arXiv](https://arxiv.org/abs/2305.11173)
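To make the second design concrete, here is a minimal sketch, not the authors' implementation, of transferring part labels from a base object to a novel object by nearest-neighbor matching of dense features (e.g., from a self-supervised ViT such as DINO, cf. dino-vit-features). The function name `transfer_part_labels` and the tensor shapes are assumptions for illustration only.

```python
# Hypothetical sketch: part parsing via dense semantic correspondence.
import torch
import torch.nn.functional as F

def transfer_part_labels(base_feats, novel_feats, base_part_masks):
    """base_feats:      (C, Hb, Wb) dense features of a base-category object
       novel_feats:     (C, Hn, Wn) dense features of a novel-category object
       base_part_masks: (P, Hb, Wb) binary masks for the P parts of the base object
       returns:         (Hn, Wn) part-id map for the novel object"""
    C, Hb, Wb = base_feats.shape
    _, Hn, Wn = novel_feats.shape
    # L2-normalize per-pixel features so the dot product is cosine similarity.
    base = F.normalize(base_feats.reshape(C, -1), dim=0)    # (C, Hb*Wb)
    novel = F.normalize(novel_feats.reshape(C, -1), dim=0)  # (C, Hn*Wn)
    # Similarity between every novel pixel and every base pixel.
    sim = novel.t() @ base                                  # (Hn*Wn, Hb*Wb)
    nn_idx = sim.argmax(dim=1)                              # best-matching base pixel
    # Which part does each base pixel belong to?
    part_ids = base_part_masks.reshape(-1, Hb * Wb).float().argmax(dim=0)
    # Each novel pixel inherits the part label of its matched base pixel.
    return part_ids[nn_idx].reshape(Hn, Wn)

# Toy usage with random tensors (replace with real ViT features and masks).
base_f = torch.randn(384, 32, 32)
novel_f = torch.randn(384, 40, 40)
masks = F.one_hot(torch.randint(0, 3, (32, 32)), 3).permute(2, 0, 1).float()
print(transfer_part_labels(base_f, novel_f, masks).shape)  # torch.Size([40, 40])
```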
See installation instructions.
See Preparing Datasets and Preparing Models.
See Getting Started for demo, training and inference.
We provide a large set of baseline results and trained models in the Model Zoo.
The majority of this project is licensed under an MIT License. Portions of the project are available under the separate licenses of the referenced projects, including CLIP, Detic and dino-vit-features. Many thanks for their wonderful work.
If you use VLPart in your research or wish to refer to the baseline results published here, please use the following BibTeX entry:
```BibTeX
@article{peize2023vlpart,
  title   = {Going Denser with Open-Vocabulary Part Segmentation},
  author  = {Sun, Peize and Chen, Shoufa and Zhu, Chenchen and Xiao, Fanyi and Luo, Ping and Xie, Saining and Yan, Zhicheng},
  journal = {arXiv preprint arXiv:2305.11173},
  year    = {2023}
}
```
- Grounded Segment Anything: From Objects to Parts: a dialogue system to detect, segment and edit anything at the part level in an image.
- Semantic-SAM: a universal image segmentation model that segments and recognizes anything at any desired granularity.