Video transcription with speaker diarization using Pyannote and Whisper.cpp

Uses yt-dlp to download and convert media, Whisper.cpp to transcribe audio, and then performs speaker diarization with Pyannote.
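The first two stages can be sketched as command lines assembled in Python. This is a minimal sketch, not the contents of main.py: the function names, file paths, and the `whisper-cli` binary name are assumptions (older Whisper.cpp builds name the binary `main`), and the 16 kHz mono conversion is included because Whisper.cpp expects 16 kHz WAV input.

```python
def build_download_cmd(video_url, out_template="audio.%(ext)s"):
    """Build a yt-dlp command that extracts audio as 16 kHz mono WAV.

    The output template is a hypothetical choice for this sketch.
    """
    return [
        "yt-dlp",
        "-x",                       # extract audio only
        "--audio-format", "wav",    # convert to WAV via ffmpeg
        "--postprocessor-args", "ffmpeg:-ar 16000 -ac 1",  # resample for Whisper.cpp
        "-o", out_template,
        video_url,
    ]


def build_transcribe_cmd(wav_path, model_path, max_len=None, binary="whisper-cli"):
    """Build a Whisper.cpp command that writes a JSON transcript.

    `binary` and both paths are assumptions; adjust them to the local build.
    """
    cmd = [binary, "-m", model_path, "-f", wav_path, "--output-json"]
    if max_len is not None:
        # Optional cap on segment length, in characters.
        cmd += ["--max-len", str(max_len)]
    return cmd
```

Either list can be passed to `subprocess.run(cmd, check=True)`; building the commands as lists keeps the URL and paths from being interpreted by a shell.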

Usage

Set HF_TOKEN (Hugging Face token) and VIDEO_URL environment variables in docker-compose.yml, and then run main.py with docker compose up.
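As an illustration of how a script might pick up these two variables inside the container, here is a minimal sketch. The `load_settings` helper and its return shape are hypothetical and not taken from main.py; only the variable names HF_TOKEN and VIDEO_URL come from this README.

```python
import os


def load_settings(env=None):
    """Read HF_TOKEN and VIDEO_URL, failing early if either is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in ("HF_TOKEN", "VIDEO_URL") if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {"hf_token": env["HF_TOKEN"], "video_url": env["VIDEO_URL"]}
```

Failing at startup with a clear message is usually easier to debug than a later authentication or download error inside the pipeline.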

The large Whisper model is downloaded automatically; to use a different model, change it in the Dockerfile.

Diarization quality seems to improve when Whisper's segment length is reduced, for example with --max-len 50 (the maximum segment length in characters).
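One plausible reason shorter segments help: if each transcript segment is given the speaker whose turn overlaps it most, a long segment that spans a speaker change can only receive one label. The sketch below illustrates that effect with plain tuples. It is an assumption about how segments and speaker turns might be combined, not the method used in src, and the timestamps and speaker labels are made up for illustration.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each (start, end, text) segment with the speaker of greatest overlap.

    `turns` is a list of (start, end, speaker) tuples from diarization.
    Segments that overlap no turn are labeled None.
    """
    labeled = []
    for start, end, text in segments:
        totals = {}
        for t_start, t_end, speaker in turns:
            totals[speaker] = totals.get(speaker, 0.0) + overlap(start, end, t_start, t_end)
        best = max(totals, key=totals.get) if totals else None
        if best is not None and totals[best] == 0.0:
            best = None
        labeled.append((best, text))
    return labeled


# Hypothetical diarization output: the speaker changes at 5 seconds.
turns = [(0.0, 5.0, "SPEAKER_00"), (5.0, 8.0, "SPEAKER_01")]

# One long segment hides the speaker change ...
long_result = assign_speakers([(0.0, 8.0, "hi there. hello")], turns)
# ... while two shorter segments each get their own label.
short_result = assign_speakers([(0.0, 5.0, "hi there."), (5.0, 8.0, "hello")], turns)
```

Here `long_result` attributes the whole utterance to SPEAKER_00, while `short_result` splits it between the two speakers.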
