A simple CLI to convert PDF files into TXT using OCR

davibusanello/pdf2txt

PDF to Text Converter (pdf2txt)

A Python-based tool that converts PDF files to text using OCR (Optical Character Recognition) technology.

Requirements

  • Python 3.13 or higher
  • Tesseract OCR engine installed on your system
  • UV package manager
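
The prerequisites above can be verified with a small check script before installing. This is a minimal sketch and not part of pdf2txt: the function name check_requirements is hypothetical, and it assumes the executables are named tesseract and uv and are reachable on PATH.

```python
import shutil
import sys


def check_requirements(min_python=(3, 13), tools=("tesseract", "uv")):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    # Compare only (major, minor) against the required minimum version.
    if sys.version_info[:2] < tuple(min_python):
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    # shutil.which returns None when an executable is not on PATH.
    for tool in tools:
        if shutil.which(tool) is None:
            problems.append(f"'{tool}' not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print(problem)
```

If the script prints nothing, the Python version and both executables were found.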

Installation

  1. Install Tesseract OCR on your system:

    • Ubuntu/Debian: sudo apt-get install tesseract-ocr
    • macOS: brew install tesseract
    • Windows: download and install from the Tesseract releases page on GitHub
  2. Install UV if you haven't already:

    curl -LsSf https://astral.sh/uv/install.sh | sh
  3. Clone the repository:

    git clone https://github.com/davibusanello/pdf2txt.git
    cd pdf2txt
  4. Create and activate a virtual environment with UV:

    uv venv
    source .venv/bin/activate  # On Unix/macOS
    # or
    .venv\Scripts\activate     # On Windows
  5. Install dependencies:

    uv sync

Shell Completion

To enable shell completion (for bash/zsh), add this to your shell's rc file (~/.bashrc, ~/.zshrc):

eval "$(register-python-argcomplete $(which pdf2txt))"
# or with explicit path
eval "$(register-python-argcomplete /path/to/your/virtual/env/bin/pdf2txt)"

After setting up completion, you can use TAB to:

  • Auto-complete PDF files when entering the input file
  • Auto-complete directories and .txt files when entering the output file
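
Completion like this is usually wired up by attaching argcomplete completers to argparse arguments. The sketch below is an assumption about how such a parser might look, not the project's actual code: build_parser is a hypothetical name, while the option names and defaults are taken from the Usage section.

```python
import argparse

try:
    import argcomplete
    from argcomplete.completers import FilesCompleter
except ImportError:  # completion is optional; the parser still works without it
    argcomplete = None


def build_parser():
    """Build a pdf2txt-like parser with file completion (hypothetical sketch)."""
    parser = argparse.ArgumentParser(prog="pdf2txt")
    input_arg = parser.add_argument("input", help="input PDF file or wildcard")
    output_arg = parser.add_argument("output", help="output .txt file")
    parser.add_argument("--max-threads", type=int, default=4)
    parser.add_argument("--chunk-size", type=int, default=3)
    if argcomplete is not None:
        # FilesCompleter offers directories plus files with the listed extensions.
        input_arg.completer = FilesCompleter(allowednames=("pdf",))
        output_arg.completer = FilesCompleter(allowednames=("txt",))
        # Does nothing unless the shell invoked the program in completion mode.
        argcomplete.autocomplete(parser)
    return parser
```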

Dependencies

  • pdf2image: For converting PDF pages to images
  • Pillow: For image processing
  • pytesseract: Python wrapper for Google's Tesseract OCR engine
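
To show how these libraries fit together, here is a minimal sketch of the usual pipeline: pdf2image renders each page to a Pillow image and pytesseract extracts the text from it. The function ocr_pdf and its convert and recognize parameters are hypothetical and only illustrate the flow; the tool's real implementation may differ.

```python
def ocr_pdf(pdf_path, convert=None, recognize=None):
    """OCR every page of a PDF and return the combined text."""
    # Imported lazily so the function can be exercised with stand-ins
    # on machines without Tesseract installed.
    if convert is None:
        from pdf2image import convert_from_path as convert
    if recognize is None:
        from pytesseract import image_to_string as recognize

    pages = convert(pdf_path)  # one Pillow image per PDF page
    return "\n".join(recognize(page) for page in pages)
```

Calling ocr_pdf("input.pdf") with both libraries installed runs the real OCR; passing custom callables makes the flow easy to try without them.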

Usage

# Basic usage (default 4 threads and 3 pages per thread)
uv run pdf2txt input.pdf output.txt

# With wildcards
uv run pdf2txt "documents/*.pdf" outputs.txt

# Changing threads and chunk size
uv run pdf2txt input.pdf output.txt --max-threads 8 --chunk-size 5

# Help
uv run pdf2txt --help
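
The --max-threads and --chunk-size options suggest that pages are split into fixed-size chunks that are processed by a pool of worker threads. The following is a sketch under that assumption; page_chunks, process_in_chunks, and the ocr_chunk callback are hypothetical names, not the tool's API.

```python
from concurrent.futures import ThreadPoolExecutor


def page_chunks(total_pages, chunk_size=3):
    """Yield inclusive 1-based (first_page, last_page) ranges."""
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)


def process_in_chunks(total_pages, ocr_chunk, max_threads=4, chunk_size=3):
    """Run ocr_chunk(first, last) on each chunk and join results in page order."""
    chunks = list(page_chunks(total_pages, chunk_size))
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # map() yields results in submission order, so pages stay ordered
        # even when chunks finish at different times.
        results = list(pool.map(lambda c: ocr_chunk(*c), chunks))
    return "\n".join(results)
```

With the defaults, a 7-page document is split into the ranges (1, 3), (4, 6), and (7, 7).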

When a wildcard matches multiple input files, each result is written to its own file named output_filename.txt, where filename is the name of the input PDF.
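
One plausible reading of this naming rule is sketched below. It assumes that output is the stem of the output name given on the command line and that filename is the input PDF's name without its extension; output_path_for is a hypothetical helper, and the tool's actual behavior may differ.

```python
from pathlib import Path


def output_path_for(input_pdf, output):
    """Derive a per-input output path such as outputs_report.txt."""
    out = Path(output)
    suffix = out.suffix or ".txt"  # fall back to .txt if none was given
    return out.with_name(f"{out.stem}_{Path(input_pdf).stem}{suffix}")


print(output_path_for("documents/report.pdf", "outputs.txt"))  # outputs_report.txt
```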

License

MIT Copyright (c) 2025 Davi Busanello [email protected]

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
