This repo provide easy scripts to construct image dataset based on keywords from Common Crawl data. The dataset is constructed by downloading images from the webpages that contain the keywords. The images are filtered based on OPENAI CLIP model.
- Install Pytorch from the official website: https://pytorch.org/get-started/locally/ e.g., on Linux, you can install Pytorch by running the following command:
conda install pytorch torchvision pytorch-cuda=12.1 -c pytorch -c nvidia
- Install other required packages by running the following command:
pip install -r requirements.txt
(Make sure Python >= 3.10)
You can find crawls at Common Craw website. e.g., CC-MAIN-2023-50
.
You need to setup AWS credentials to download the crawl data. You can find the command line instructions at AWS website.
You can query the crawl by running the following command:
python query_crawl.py --crawl CC-MAIN-2023-50 --metadata_file /path/to/ketword/json --archive_folder /path/to/output --output_folder /path/to/output
--crawl
: the crawl id. e.g.,CC-MAIN-2023-50
.--keyword_json
: the path to the json file that contains the keywords. e.g.,keywords.json
.--archive_folder
: the path to the folder to store the temporary data downloaded crawl data. e.g.,temp_data/cc
.--output_folder
: the path to the folder to store the output json which contains the matched items and image url. e.g.,output_data/cc_matches
.
If you did not setup AWS credentials, you can specify the credientials by using the following arguments:
--aws_access_key_id
: the access key id.--aws_secret_access_key
: the secret access key.--aws_session_token
: the session token.
You can download images by running the following command:
python download_images.py --meta_folder /path/to/matches/json --output_folder /path/for/output/images --num_workers 25 --keyword_json /path/to/keyword/json
You can filter images based on CLIP model by running the following command:
python filter_images.py --image_folder /path/to/images --meta_folder /path/to/matches/json --model_name "ViT-B/32" --device "cuda:0" --output_folder /path/to/output --threshold 27.5 --delete_images
The threshold can be adjusted based on the specific task. It is recommanded to use a desired subset of data to find the threshold. The delete_images
argument is used to delete the images that do not pass the threshold.