Near-duplicate, object, and metadata detection for video files.
To find out more about the project, installation, and running the tool, you may review our e-Learning module: https://benetech.github.io/VideoDeduplication/
The easiest, most consistent method for installing Docker on Ubuntu can be found at https://get.docker.com/. Run:
curl -fsSL https://get.docker.com -o get-docker.sh
followed by:
bash get-docker.sh
To allow Docker to be used by non-root users:
Create the docker group:
sudo groupadd docker
Add your user to the docker group:
sudo usermod -aG docker $USER
Log out and log back in so that your group membership is re-evaluated.
If testing on a virtual machine, it may be necessary to restart the virtual machine for the changes to take effect.
On a desktop Linux environment such as X Windows, log out of your session completely and then log back in.
On Linux, you can also run the following command to activate the group changes:
newgrp docker
Once the above has been completed, open a terminal and run the `docker` command to confirm that the Docker service is available and returns its help output.
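You can also verify the installation end-to-end with Docker's standard test image:

```bash
docker --version        # confirm the client is installed
docker run hello-world  # confirm the daemon can pull and run containers
```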
Assuming Docker has been installed, run the following command to install the NVIDIA Docker runtime using the script in the main project folder [GPU LINUX ONLY]:
bash install_nvidia_docker.sh
Run:
sudo curl -L "https://github.com/docker/compose/releases/download/1.26.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
then modify permissions:
sudo chmod +x /usr/local/bin/docker-compose
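You can confirm that docker-compose was installed correctly:

```bash
docker-compose --version
```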
Next, clone the repository:
git clone https://github.com/benetech/VideoDeduplication.git
The default approach to building and running the application is to use the docker-compose utility.
Shortcut commands to run the application are:
- `make run` - build and run the application
- `make stop` - stop the application
The `make run` command will ask you the following questions:
- Location of your source video files
- Availability of Nvidia GPU support for Docker (see Enable GPU support for Docker)
- Whether you want to use pre-built images
The docker-compose.yml configuration relies on various environment variables. The only required variable is `BENETECH_DATA_LOCATION` - the path to the root folder containing your video files. You can set environment variables in the `.env` file at the repository root folder.
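As a quick illustration, a minimal .env file might look like the following; the values shown are placeholders, not project defaults:

```bash
# .env at the repository root
BENETECH_DATA_LOCATION=/home/user/videos  # required: root folder containing your video files
BENETECH_PG_PORT=5433                     # optional: alternative Postgres port (see below)
BENETECH_PREBUILT=YES                     # optional: use prebuilt images from Docker Hub
```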
By default docker-compose will build all required containers and assume Nvidia GPU support is available. You can also use various predefined configuration extensions placed in the ./docker-compose directory (see ./docker-compose/README.md).
The `make run` shortcut is a tiny wrapper around the `docker-compose` command which chooses the appropriate configuration extensions. If you specified the `BENETECH_DATA_LOCATION` environment variable (either in your shell or in the `.env` file), you can simply execute `sudo docker-compose up -d` to run the default configuration.
The command above might throw an error if you already have a Postgres server running. If that's the case, run `systemctl stop postgresql` (Linux) before using docker-compose, or choose an alternative Postgres port by setting the `BENETECH_PG_PORT` environment variable.
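For example (the port number here is an illustrative choice):

```bash
# Either free the default port by stopping the local server...
sudo systemctl stop postgresql
# ...or run the containerized Postgres on another port via .env:
echo "BENETECH_PG_PORT=5433" >> .env
```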
Once docker-compose is running, you will be able to access the following:
- User interface at http://localhost:5000
- Project notebooks at http://localhost:8888
- pgAdmin at http://localhost:16543
You can check your running instances using this command:
sudo docker ps
Take note of the following names:
- Deduplication App -> `videodeduplication_dedup-app_1`
- User Interface -> `videodeduplication_server_1`
- Postgres Server -> `videodeduplication_postgres_1`
- PgAdmin -> `videodeduplication_pgadmin-compose_1`
In order to use pgAdmin, follow these instructions:
- Go to http://localhost:16543 and use the credentials defined in the docker-compose.yml file.
- Click "Create new server".
- Choose a reference name for the server.
- Go to the Connection tab and set the host name to `postgres`, the maintenance database to `videodeduplicationdb`, and the user / password to `postgres` and `admin`.
In order to run the main scripts, enter the app's Docker container by running the following command:
docker exec -it videodeduplication_dedup-app_1 /bin/bash
Once within the container, run one of the main scripts as described in the "Running" section of this documentation.
If you don't want to build Docker images locally, you can use prebuilt images hosted on Docker Hub.
If you use the `make run` command, you can set `BENETECH_PREBUILT=YES` in the `.env` file.
If you use `docker-compose` explicitly, you can run:
sudo docker-compose -f docker-compose.yml -f docker-compose/prebuilt.yml up -d
To pull the prebuilt images, run:
docker pull johnhbenetech/videodeduplication:gpu
You can build and run containers manually:
sudo docker build -f docker/Dockerfile.dedup-gpu -t benetech-dedup:gpu .
sudo docker build -f docker/Dockerfile.server -t benetech-server .
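If you go the manual route, something like the following sketch can start the deduplication container; note that the --gpus flag, the /project/data mount target, and the data path are assumptions for illustration, not the project's documented invocation:

```bash
# Hypothetical manual run of the locally built GPU image; adjust the
# mount target to match the paths configured in config.yaml.
sudo docker run --gpus all -it \
  -v "$BENETECH_DATA_LOCATION":/project/data \
  benetech-dedup:gpu /bin/bash
```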
This repo contains four main scripts that perform the following tasks:
1. extract_features.py: Signature extraction pipeline
2. generate_matches.py: Signature to matches (saved as CSV)
3. template_matching.py: Uses source templates to query the extracted embeddings and generates a report containing potential matches
4. audio_processing.py: Audio processing pipeline developed in collaboration with Microsoft, as described on our [wiki](https://github.com/benetech/VideoDeduplication/wiki/Audio-Processing)
Important notebooks (located inside the notebooks folder) include:
1. Visualization and Annotation Tool.ipynb: Allows the output of the generate_matches script to be reviewed and annotated.
2. Template Matching Demo.ipynb: Allows the output of the extract_features script to be queried against known videos / images [as defined in custom templates built by the user].
These scripts use the 'config.yaml' file to define where to collect data from, hyperparameters, and other settings:
video_source_folder: Directory where the source video files are located
destination_folder: Destination of the output files generated by the scripts
root_folder_intermediate: Folder name used for the intermediate representations (make sure it's compatible with the next parameter)
match_distance: Distance threshold that determines whether two videos are a match [FLOAT - 0.0 to 1.0]
video_list_filename: Name of the file that contains the list of processed video files (to be saved by the extraction script)
filter_dark_videos: [true / false] Whether to remove dark videos from the final output files
filter_dark_videos_thr: [1-10 int range] Higher numbers mean the filtering of dark videos will be less strict
min_video_duration_seconds: Minimum video duration in seconds
detect_scenes: [true / false] Whether to run scene detection
minimum_scene_duration: [1-5 int range] Higher numbers mean smaller scenes will be appended into larger ones
use_pretrained_model_local_path: [true / false] Whether to use the pretrained model from your local file system
pretrained_model_local_path: Absolute path to the pretrained model, in case the user doesn't want to download it from S3
use_db: [true / false] Whether to save results in the database
conninfo: Connection string (eg. postgres://[USER]:[PASSWORD]@[URL]:[PORT]/[DBNAME]). When using our Docker workflow, the URL should default to "videodeduplication_postgres_1" instead of localhost
keep_fileoutput: [true / false] Whether to keep regular file output even when results are being saved in the DB
templates_source_path: Directory where templates of interest are located (should be the path to a directory where each sub-folder contains images related to one template - eg. if set to datadrive/templates/, this folder could contain sub-folders like plane, smoke, or bomb, each with its respective images)
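As a sketch, you could bootstrap a minimal config.yaml from the shell as shown below; every value is a placeholder assumption, not a project default:

```bash
# Write a minimal, hypothetical config.yaml (adjust values to your setup).
cat > config.yaml <<'EOF'
video_source_folder: /project/data/videos   # where the source videos live
destination_folder: /project/data/output    # where output files are written
match_distance: 0.75                        # [0.0 - 1.0] match threshold
filter_dark_videos: true
detect_scenes: true
use_db: true
conninfo: postgres://postgres:admin@videodeduplication_postgres_1:5432/videodeduplicationdb
keep_fileoutput: true
EOF
```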
Within the docker command line:
Extract video signatures
python extract_features.py
Arguments:
- '--config', '-cp': Path to the project config file [default: 'config.yml']
- '--list-of-files', '-lof': Path to a txt file with a list of files for processing - overrides the source folder from the config file [default: '']
- '--frame-sampling', '-fs': Sets the sampling strategy (values from 1 to 10 - eg. sample one frame every X seconds) - overrides frame sampling from the config file [default: 1]
- '--save-frames', '-sf': Whether to save the frames sampled from the videos - overrides save_frames in the config file [default: False]
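For example (my_videos.txt is a hypothetical file list):

```bash
# Extract signatures for a specific list of files, sampling one frame per second
python extract_features.py --config config.yaml --list-of-files my_videos.txt --frame-sampling 1
```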
Generate matches
python generate_matches.py
Arguments:
- '--config', '-cp': Path to the project config file [default: 'config.yml']
- '--list-of-files', '-lof': Path to a txt file with a list of files for processing and generating matches / scene detection / metadata extraction - overrides loading all signatures available from the file system
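For example (my_videos.txt is a hypothetical file list):

```bash
# Generate matches only for the signatures of the listed files
python generate_matches.py --list-of-files my_videos.txt
```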
Audio processing
python audio_processing.py
Arguments:
- '--config', '-cp': Path to the project config file [default: 'config.yml']
- '--list-of-files', '-lof': Path to a txt file with a list of files for processing and generating matches / scene detection / metadata extraction - overrides loading all signatures available from the file system
- '--cores': Number of cores to be used in parallel processing routines [default: 5]
- '--model', '-m': Path to the audio processing model [default: 'data/audio_model.h5']
Template Object Matching
python template_matching.py
Arguments:
- '--override', '-ovr': Overrides the previous template matches saved in the DB [default: False]
- '--template-dir', '-td': Path to a directory containing templates - overrides the source folder from the config file [default: '']
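For example (the template directory is a hypothetical path; see templates_source_path above for the expected layout):

```bash
# Query the extracted signatures against a custom set of templates
python template_matching.py --template-dir datadrive/templates/
```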
Exif Extraction
python extract_exif.py
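A minimal invocation might look like the following, assuming extract_exif.py accepts the same '--config' flag as the other scripts (an assumption, since its arguments are not documented above):

```bash
# Extract EXIF metadata from the configured videos
python extract_exif.py --config config.yaml  # --config flag is assumed
```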
Benchmarks
We have created a few benchmarking scripts to allow performance testing for a few features of the project.
In order to evaluate video deduplication, please run the script below:
python benchmarks/evaluate.py --benchmark augmented_dataset
This script will download our testing dataset and run our pipeline on it. Results are stress-tested using random sampling to create random query/answer pairs at different levels of positive/negative examples (eg. what's the performance of our model when 10% of the content is duplicated? What about at 15%?). The results of the benchmarking script are saved at the root of the data folder.
For more details about our evaluation metric, please refer to our wiki.
In order to evaluate template matching, please run the script below:
python benchmarks/evaluate.py --benchmark landmarks
This script will download our subset of the Google Landmarks dataset. Our script uses samples of landmarks to create query templates and runs those templates against random subsets of landmarks.
The results of the benchmarking script are saved at the root of the data folder.
In order to evaluate scene detection, please run the script below:
python benchmarks/evaluate.py --benchmark scene_detection
This script will download our subset of the Planet Earth dataset.
The results of the benchmarking script are saved at the root of the data folder.