A Python-based tool that converts PDF files to text using OCR (Optical Character Recognition) technology.
- Python 3.13 or higher
- Tesseract OCR engine installed on your system
- UV package manager
- Install Tesseract OCR on your system:
- Ubuntu/Debian:
sudo apt-get install tesseract-ocr
- macOS:
brew install tesseract
- Windows: Download and install from GitHub Tesseract releases
- Install UV if you haven't already:
curl -LsSf https://astral.sh/uv/install.sh | sh
- Clone the repository:
git clone https://github.com/davibusanello/pdf2txt.git
cd pdf2txt
- Create and activate a virtual environment with UV:
uv venv
source .venv/bin/activate # On Unix/macOS
# or
.venv\Scripts\activate # On Windows
- Install dependencies:
uv sync
To enable shell completion (for bash/zsh), add this to your shell's rc file (~/.bashrc, ~/.zshrc):
eval "$(register-python-argcomplete $(which pdf2txt))"
# or with explicit path
eval "$(register-python-argcomplete /path/to/your/virtual/env/bin/pdf2txt)"
After setting up completion, you can use TAB to:
- Auto-complete PDF files when entering the input file
- Auto-complete directories and .txt files when entering the output file
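For readers curious how this kind of completion is usually wired up, here is a minimal sketch using argparse with the optional argcomplete package. It is not taken from the pdf2txt source: `build_parser` is a hypothetical name, and only the `--max-threads` and `--chunk-size` flags and their defaults come from the usage notes in this README.

```python
# PYTHON_ARGCOMPLETE_OK
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a pdf2txt-like parser (hypothetical; the real CLI may differ)."""
    parser = argparse.ArgumentParser(prog="pdf2txt")
    input_arg = parser.add_argument("input", help="PDF file or quoted wildcard")
    output_arg = parser.add_argument("output", help="Text file to write")
    parser.add_argument("--max-threads", type=int, default=4)
    parser.add_argument("--chunk-size", type=int, default=3)

    try:
        import argcomplete
        from argcomplete.completers import FilesCompleter
    except ImportError:
        # Completion is optional; argument parsing still works without it.
        return parser

    # Offer PDFs (and directories, for navigation) for the input argument,
    # and .txt files plus directories for the output argument.
    input_arg.completer = FilesCompleter(allowednames=("pdf",), directories=True)
    output_arg.completer = FilesCompleter(allowednames=("txt",), directories=True)
    # Does nothing unless the shell invoked the program in completion mode.
    argcomplete.autocomplete(parser)
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

The `eval "$(register-python-argcomplete ...)"` line in the rc file is what makes the shell call the program in completion mode when TAB is pressed.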
- pdf2image: For converting PDF pages to images
- Pillow: For image processing
- pytesseract: Python wrapper for Google's Tesseract OCR engine
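To show how these three libraries typically fit together, the following is a rough sketch rather than the project's actual code. The function names `ocr_pages` and `join_pages`, the `dpi` and `lang` defaults, and the blank-line page separator are all assumptions made for illustration.

```python
from typing import Iterable, List


def ocr_pages(pdf_path: str, first_page: int, last_page: int,
              lang: str = "eng", dpi: int = 300) -> List[str]:
    """OCR an inclusive, 1-based page range and return one string per page."""
    # Imported lazily so this module loads even without the OCR stack installed.
    from pdf2image import convert_from_path
    import pytesseract

    # pdf2image renders each page to a Pillow image (it needs Poppler installed).
    images = convert_from_path(pdf_path, dpi=dpi,
                               first_page=first_page, last_page=last_page)
    # pytesseract passes each image to the Tesseract binary and returns its text.
    return [pytesseract.image_to_string(image, lang=lang) for image in images]


def join_pages(page_texts: Iterable[str]) -> str:
    """Strip each page's text and separate pages with a blank line (assumed format)."""
    return "\n\n".join(text.strip() for text in page_texts) + "\n"
```

With the dependencies installed, `join_pages(ocr_pages("input.pdf", 1, 3))` would return the text of the first three pages as a single string.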
# Basic usage (default 4 threads and 3 pages per thread)
uv run pdf2txt input.pdf output.txt
# With wildcards
uv run pdf2txt "documents/*.pdf" outputs.txt
# Changing threads and chunk size
uv run pdf2txt input.pdf output.txt --max-threads 8 --chunk-size 5
# Help
uv run pdf2txt --help
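The `--max-threads` and `--chunk-size` options suggest that pages are grouped into chunks and handed to a thread pool. The sketch below shows one way that could work. It is an assumption about the approach, not the tool's real implementation, and `page_chunks`, `process_in_chunks`, and the `worker` callable are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def page_chunks(total_pages: int, chunk_size: int = 3) -> List[Tuple[int, int]]:
    """Split pages 1..total_pages into inclusive (first, last) ranges."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [(first, min(first + chunk_size - 1, total_pages))
            for first in range(1, total_pages + 1, chunk_size)]


def process_in_chunks(total_pages: int,
                      worker: Callable[[int, int], List[str]],
                      max_threads: int = 4,
                      chunk_size: int = 3) -> List[str]:
    """Run worker(first, last) on each chunk and return page texts in page order."""
    chunks = page_chunks(total_pages, chunk_size)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # map() yields results in submission order, so pages stay in sequence
        # even when chunks finish at different times.
        results = pool.map(lambda chunk: worker(*chunk), chunks)
        return [text for chunk_texts in results for text in chunk_texts]
```

Under this scheme, a 12-page PDF with `--chunk-size 5` would be split into the ranges (1, 5), (6, 10), and (11, 12).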
When a wildcard matches multiple input files, each output file is named output_filename.txt, where filename is the name of the input PDF.
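One possible reading of this naming rule is sketched below. It assumes that the `output` part is the stem of the output argument and that a single input keeps the output name unchanged. Neither detail is confirmed by this README, and `expand_inputs` and `output_paths` are hypothetical helpers.

```python
import glob
from pathlib import Path
from typing import Dict, List


def expand_inputs(pattern: str) -> List[str]:
    """Expand a quoted wildcard such as "documents/*.pdf" into sorted paths."""
    return sorted(glob.glob(pattern))


def output_paths(inputs: List[str], output: str) -> Dict[str, str]:
    """Map each input PDF to an output path (assumed naming scheme)."""
    if len(inputs) <= 1:
        # A single input writes straight to the requested output file.
        return {pdf: output for pdf in inputs}
    out = Path(output)
    # Several inputs: append each PDF's stem to the output stem.
    return {pdf: str(out.with_name(f"{out.stem}_{Path(pdf).stem}{out.suffix}"))
            for pdf in inputs}
```

Quoting the pattern on the command line presumably keeps the shell from expanding it, so that the tool receives the wildcard itself.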
MIT Copyright (c) 2025 Davi Busanello [email protected]
Contributions are welcome! Please feel free to submit a Pull Request.