
Construct image dataset from Common Crawl data

This repo provides scripts to construct an image dataset from Common Crawl data based on a set of keywords. The dataset is built by downloading images from webpages that contain the keywords; the downloaded images are then filtered with OpenAI's CLIP model.

Prerequisites

  1. Install PyTorch from the official website: https://pytorch.org/get-started/locally/. For example, on Linux you can install PyTorch by running the following command:
conda install pytorch torchvision pytorch-cuda=12.1 -c pytorch -c nvidia
  2. Install the other required packages by running the following command:
pip install -r requirements.txt

(Make sure you are running Python >= 3.10.)
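To verify the installation, you can run a quick sanity check:

import torch
print(torch.__version__)          # installed PyTorch version
print(torch.cuda.is_available())  # True if a CUDA GPU is visible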

Usage

Step 1: Specify a crawl

You can find available crawls on the Common Crawl website, e.g., CC-MAIN-2023-50.
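You can also list crawl IDs programmatically from the Common Crawl index at https://index.commoncrawl.org/collinfo.json, for example:

import requests

# Each entry in collinfo.json describes one crawl; the "id" field is the crawl ID.
crawls = requests.get("https://index.commoncrawl.org/collinfo.json").json()
print([c["id"] for c in crawls[:3]])  # most recent crawl IDs, e.g., CC-MAIN-2023-50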

Step 2: Set up AWS credentials (optional, recommended)

Setting up AWS credentials allows the scripts to download the crawl data from S3. You can find the command line instructions on the AWS website.
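To confirm that your credentials are visible from Python (assuming the scripts read them through boto3, which is not documented here), you can run:

import boto3

# Returns None if no credentials are configured in the environment or ~/.aws.
creds = boto3.Session().get_credentials()
print("credentials found" if creds else "no credentials found")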

Step 3: Query the crawl

You can query the crawl by running the following command:

python query_crawl.py --crawl CC-MAIN-2023-50 --keyword_json /path/to/keyword/json --archive_folder /path/to/archive --output_folder /path/to/output
  • --crawl: the crawl ID, e.g., CC-MAIN-2023-50.
  • --keyword_json: the path to the JSON file that contains the keywords, e.g., keywords.json (see the sketch after this list).
  • --archive_folder: the path to the folder for storing the temporarily downloaded crawl data, e.g., temp_data/cc.
  • --output_folder: the path to the folder for storing the output JSON, which contains the matched items and image URLs, e.g., output_data/cc_matches.
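The schema of the keyword JSON is defined by query_crawl.py and is not documented here; as a purely hypothetical illustration, a flat mapping from class name to keyword list could be generated like this (the class names and layout are assumptions, not the script's required format):

import json

# Hypothetical layout: class name -> list of search keywords.
keywords = {
    "golden_retriever": ["golden retriever", "golden retriever dog"],
    "tabby_cat": ["tabby cat", "tabby kitten"],
}
with open("keywords.json", "w") as f:
    json.dump(keywords, f, indent=2)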

If you did not set up AWS credentials, you can specify the credentials by using the following arguments:

  • --aws_access_key_id: the access key id.
  • --aws_secret_access_key: the secret access key.
  • --aws_session_token: the session token.
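Conceptually, querying a crawl means scanning its WARC/WET records for keyword hits. The sketch below illustrates that idea with the warcio library; it is an illustration of the general technique, not the repo's actual implementation:

from warcio.archiveiterator import ArchiveIterator  # pip install warcio

KEYWORDS = {"golden retriever", "tabby cat"}  # hypothetical keywords

def matching_urls(wet_path):
    # Yield (url, keyword) for each WET text record that mentions a keyword.
    with open(wet_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "conversion":  # WET extracted-text records
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", "replace").lower()
            for kw in KEYWORDS:
                if kw in text:
                    yield url, kw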

Step 4: Download images

You can download images by running the following command:

python download_images.py --meta_folder /path/to/matches/json --output_folder /path/for/output/images --num_workers 25 --keyword_json /path/to/keyword/json
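The --num_workers argument suggests that downloads run in parallel. As a rough sketch of that pattern (an assumption about the script's internals, not its actual code), a thread pool downloader might look like:

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import requests

def download(url, out_dir):
    # Fetch one image and write it to disk; skip URLs that fail.
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        name = url.rstrip("/").split("/")[-1] or "image.jpg"
        (out_dir / name).write_bytes(resp.content)
    except requests.RequestException:
        pass

urls = ["https://example.com/cat.jpg"]  # hypothetical URLs from the Step 3 output
out_dir = Path("output_data/images")
out_dir.mkdir(parents=True, exist_ok=True)
with ThreadPoolExecutor(max_workers=25) as pool:
    for url in urls:
        pool.submit(download, url, out_dir)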

Step 5: Filter images with the CLIP model

You can filter images with the CLIP model by running the following command:

python filter_images.py --image_folder /path/to/images --meta_folder /path/to/matches/json --model_name "ViT-B/32" --device "cuda:0" --output_folder /path/to/output --threshold 27.5 --delete_images

The threshold can be adjusted for the specific task; it is recommended to tune it on a small, manually checked subset of the data. The --delete_images flag deletes the images that do not pass the threshold.
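For context, CLIP's image-text logits are cosine similarities scaled by a learned temperature, which is why the example threshold is 27.5 rather than a value between 0 and 1. Below is a minimal sketch of this kind of scoring using OpenAI's clip package (an assumption; filter_images.py may compute scores differently, and the image path and prompt are hypothetical):

import clip  # pip install git+https://github.com/openai/CLIP.git
import torch
from PIL import Image

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_score(image_path, prompt):
    # Scaled image-text similarity logit for one image and one prompt.
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    tokens = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        logits_per_image, _ = model(image, tokens)
    return logits_per_image.item()

# Keep an image only if its score clears the threshold.
keep = clip_score("output_data/images/cat.jpg", "a photo of a tabby cat") >= 27.5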
