This tool extracts video features using an action recognition model from GluonCV, audio features using panns_inference, masked autoencoder features using VideoMAE, ASR features using Whisper for transcription and BERT for tokenization, and CLIP features using CLIP. These features will be used to train an audio descriptive captioning model.
The following step requires Conda to be installed. Run the following command to create the Conda environment with all dependencies:
conda env create -f ENV.yml
Run the command below to start the extractor, with --videos set to the directory containing the videos and --output set to the directory where the extracted features should be stored:
python full_extraction.py --videos=video_path --output=output_path
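For processing several dataset splits in one go, a small wrapper script can be handy. The sketch below is a hypothetical example, not part of this repo: the ./data/<split> and ./features/<split> directory layout is an assumption, and the actual extraction call is left commented out so the loop can be tried safely first.

```shell
#!/usr/bin/env sh
# Hypothetical batch wrapper (assumed layout: videos under ./data/<split>).
for split in train val test; do
  videos="./data/$split"
  out="./features/$split"
  # Create the output directory for this split's features.
  mkdir -p "$out"
  # Uncomment to run the real extraction for each split:
  # python full_extraction.py --videos="$videos" --output="$out"
  echo "would extract $videos -> $out"
done
```

Adjust the split names and directory layout to match your dataset before uncommenting the extraction line.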