PDF to Text Converter (pdf2txt)

A Python-based tool that converts PDF files to text using OCR (Optical Character Recognition) technology.

Requirements

- Python 3.13 or higher
- Tesseract OCR engine installed on your system
- UV package manager

Installation

Install Tesseract OCR on your system:
- Ubuntu/Debian: sudo apt-get install tesseract-ocr
- macOS: brew install tesseract
- Windows: Download and install from the Tesseract releases on GitHub

Install UV if you haven't already:

curl -LsSf https://astral.sh/uv/install.sh | sh

Clone the repository:

git clone https://github.com/davibusanello/pdf2txt.git
cd pdf2txt

Create and activate a virtual environment with UV:

uv venv
source .venv/bin/activate  # On Unix/macOS
# or
.venv\Scripts\activate     # On Windows

Install dependencies:
```
uv sync
```

Shell Completion

To enable shell completion (for bash/zsh), add this to your shell's rc file (~/.bashrc, ~/.zshrc):

eval "$(register-python-argcomplete $(which pdf2txt))"
# or with explicit path
eval "$(register-python-argcomplete /path/to/your/virtual/env/bin/pdf2txt)"

After setting up completion, you can use TAB to:

- Auto-complete PDF files when entering the input file
- Auto-complete directories and .txt files when entering the output file
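
For readers curious how this kind of completion is typically wired up, here is a minimal sketch using argparse and argcomplete. It is not the project's actual parser: the function name `build_parser` and the positional argument names `input` and `output` are assumptions made for illustration. Only the `--max-threads` and `--chunk-size` flags and their defaults (4 and 3) come from this README.

```python
# PYTHON_ARGCOMPLETE_OK
import argparse

try:
    import argcomplete
    from argcomplete.completers import FilesCompleter
except ImportError:
    # Completion is optional; the parser still works without argcomplete.
    argcomplete = None


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the CLI described in this README."""
    parser = argparse.ArgumentParser(prog="pdf2txt")
    input_arg = parser.add_argument("input", help="PDF file or quoted wildcard pattern")
    output_arg = parser.add_argument("output", help="Output .txt file")
    parser.add_argument("--max-threads", type=int, default=4)
    parser.add_argument("--chunk-size", type=int, default=3)

    if argcomplete is not None:
        # Suggest .pdf files for the input and .txt files for the output.
        # FilesCompleter also offers directories unless directories=False.
        input_arg.completer = FilesCompleter(allowednames=("pdf",))
        output_arg.completer = FilesCompleter(allowednames=("txt",), directories=True)
        # Returns immediately unless the shell is requesting completions.
        argcomplete.autocomplete(parser)
    return parser
```

The `PYTHON_ARGCOMPLETE_OK` marker near the top of the script is what argcomplete's global completion looks for. With the explicit `eval` lines shown above it is not strictly required.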

Dependencies

- pdf2image: For converting PDF pages to images
- Pillow: For image processing
- pytesseract: Python wrapper for Google's Tesseract OCR engine
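
As a rough illustration of how these three libraries usually fit together, the sketch below renders each page to a Pillow image with pdf2image and passes it to pytesseract. The function name `ocr_pdf` and its `render` and `recognize` parameters are hypothetical, introduced only so the two steps can be swapped or tested in isolation. The tool's real implementation may be organized differently.

```python
from typing import Callable, Iterable, Optional


def ocr_pdf(
    pdf_path: str,
    render: Optional[Callable[[str], Iterable[object]]] = None,
    recognize: Optional[Callable[[object], str]] = None,
) -> str:
    """Render each PDF page to an image, OCR it, and join the page texts."""
    if render is None:
        # pdf2image returns a list of Pillow Image objects, one per page.
        from pdf2image import convert_from_path
        render = convert_from_path
    if recognize is None:
        # pytesseract hands the image to the Tesseract engine and returns text.
        import pytesseract
        recognize = pytesseract.image_to_string

    return "\n".join(recognize(image) for image in render(pdf_path))
```

Calling `ocr_pdf("input.pdf")` with the defaults requires the dependencies listed above and a working Tesseract installation.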

Usage

# Basic usage (default 4 threads and 3 pages per thread)
uv run pdf2txt input.pdf output.txt

# With wildcards
uv run pdf2txt "documents/*.pdf" outputs.txt

# Changing threads and chunk size
uv run pdf2txt input.pdf output.txt --max-threads 8 --chunk-size 5

# Help
uv run pdf2txt --help
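
To make the `--max-threads` and `--chunk-size` options more concrete, here is a minimal sketch of one way pages could be split into chunks and processed by a thread pool. The names `page_chunks`, `process_in_chunks`, and `ocr_chunk` are assumptions for illustration, and the project may schedule its work differently. The defaults mirror the ones stated above (4 threads, 3 pages per thread).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def page_chunks(total_pages: int, chunk_size: int = 3) -> List[Tuple[int, int]]:
    """Split pages 1..total_pages into inclusive (first, last) ranges."""
    return [
        (first, min(first + chunk_size - 1, total_pages))
        for first in range(1, total_pages + 1, chunk_size)
    ]


def process_in_chunks(
    total_pages: int,
    ocr_chunk: Callable[[int, int], str],
    max_threads: int = 4,
    chunk_size: int = 3,
) -> str:
    """Run ocr_chunk(first, last) for every chunk on a thread pool."""
    chunks = page_chunks(total_pages, chunk_size)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # map() yields results in submission order, so page order is preserved.
        return "".join(pool.map(lambda c: ocr_chunk(*c), chunks))
```

With pdf2image, each `(first, last)` range could be passed as the `first_page` and `last_page` arguments of `convert_from_path`, so only a few page images are held in memory per thread.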

When a wildcard matches multiple input files, one output file is written per input. Each is named output_filename.txt, where filename is the name of the corresponding input PDF.
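
The naming rule can be sketched as follows. This reads "output" as the stem of the output argument (so `outputs.txt` yields the prefix `outputs`), which is an assumption and is not confirmed by this README. The function name `output_paths` is likewise hypothetical.

```python
from pathlib import Path
from typing import List


def output_paths(inputs: List[str], output: str) -> List[Path]:
    """Derive one output path per input PDF (assumed naming scheme)."""
    out = Path(output)
    if len(inputs) == 1:
        # A single input is written to the output path as given.
        return [out]
    # Multiple inputs: <output stem>_<input stem>.txt next to the output path.
    return [out.with_name(f"{out.stem}_{Path(p).stem}.txt") for p in inputs]
```

Under this assumption, `documents/a.pdf` and `documents/b.pdf` with `outputs.txt` would produce `outputs_a.txt` and `outputs_b.txt`.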

License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
