GitHub - genieincodebottle/parsemypdf: Collection of PDF parsing libraries like AI based docling, claude, openai, llama-vision, unstructured-io, and pdfminer, pymupdf, pdfplumber etc for efficient snapshot, text, table, and metadata extraction.

📑 Complex PDF Parsing

A comprehensive example codes for extracting content from PDFs

Also, check -> Pdf Parsing Guide

📌 Core Features

📤 Content Extraction

Multiple extraction methods with different tools/libraries:
- Cloud-based: Claude 3.5 Sonnet, GPT-4 Vision, Unstructured.io
- Local: Llama 3.2 11B, Docling, PDFium
- Specialized: Camelot (tables), PDFMiner (text), PDFPlumber (mixed), PyPdf etc
Maintains document structure and formatting
Handles complex PDFs with mixed content including extracting image data

📦 Implementation Options

1. ☁️ Cloud-Based Methods

Claude 3.5 Sonnet: Excellent for complex PDFs with mixed content
GPT-4 Vision: Excellent for visual content analysis
Unstructured.io: Advanced content partitioning and classification
llama-parse
Amazon Textract: Advanced content partitioning and classification

2. 🖥️ Local Methods

Llama 3.2 11B Vision: Good for Image-based PDF processing.
Docling: Excellent for complex PDFs with mixed content. Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding, and providing seamless integrations with the gen AI ecosystem.
markitdown : Excellent for complex PDFs with mixed content. MarkItDown is a utility for converting various files to Markdown (e.g., for indexing, text analysis, etc). It supports: PDF, PowerPoint, Word, Excel, Images (EXIF metadata and OCR), Audio (EXIF metadata and speech transcription), HTML, Text-based formats (CSV, JSON, XML), ZIP files (iterates over contents)
Marker : Marker quickly converts PDFs and images to Markdown, JSON, and HTML with high accuracy. It supports all languages and document types while handling tables, forms, math, links, and code blocks. It extracts images, removes artifacts, and allows customization with user-defined formatting and logic. Accuracy can be enhanced with LLM integration, and it runs on GPU, CPU, or MPS. Code is not included here but can be checked on their GitHub repo.
Camelot: Specialized table extraction
PyPdf: pypdf is a free and open-source pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. pypdf can retrieve text and metadata from PDFs as well.
PDFMiner: Basic text and layout extraction
PDFPlumber: Basic text and layout extraction
PyMUPDF: PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF
pdfium: High-fidelity processing using Chrome's PDF engine
PyPdfDirectory: Batch PDF Content Extraction Script using PyPDF2 Directory Loader

🔗 Dependencies

📚 Python Libraries

# PDF Processing Libraries
pypdf
pymupdf
pdfplumber
PyPDF2<3.0
camelot-py[cv]
Ghostscript
docling # IBM's Opensource
markitdown # Microsoft's Opensource 

# Computer Vision
opencv-python

# LLM related Libraries
ollama
tiktoken
openai
anthropic
langchain_ollama
langchain_huggingface
langchain_community

# Vector Store and Embeddings
faiss-cpu
sentence_transformers

# AWS Libraries
boto3
amazon-textract-caller>=0.2.0

# Utilities
python-dotenv

🛠️ Setup

Environment Variables

ANTHROPIC_API_KEY=your_key_here    # For Claude
OPENAI_API_KEY=your_key_here       # For OpenAI
UNSTRUCTURED_API_KEY=your_key_here # For Unstructured.io
LLAMA_CLOUD_API_KEY=your_key_here # For llama-parse

For ANTHROPIC_API_KEY follow this -> https://console.anthropic.com/settings/keys

For OPENAI_API_KEY follow this -> https://platform.openai.com/api-keys

For UNSTRUCTURED_API_KEY follow this -> https://unstructured.io/api-key-free

For LLAMA_CLOUD_API_KEY follow this -> https://cloud.llamaindex.ai/api-key

Install Dependencies

pip install -r requirements.txt

Install Ollama & Models (for local processing)

# Install Ollama
curl https://ollama.ai/install.sh | sh

# Pull required models
ollama pull llama3.1
ollama pull x/llama3.2-vision:11b

📈 Usage

Place PDF files in input/ directory

📄 Example Complex Pdf placed in Input folder

sample-1.pdf: Standard tables
sample-2.pdf: Image-based simple tables
sample-3.pdf: Image-based complex tables
sample-4.pdf: Mixed content (text, tables, images)
sample-5.pdf: Multi-column Texts

📝 Notes

System resources needed for local LLM operations
API keys required for cloud based implementations
Consider PDF complexity when choosing implementation
Ghostscript required for Camelot
Different processors suit different use cases
- Cloud: Complex documents, mixed content
- Local: Simple text, basic tables
- Specialized: Specific content types (tables, forms)

Name		Name	Last commit message	Last commit date
Latest commit History 38 Commits
converted_images/llama		converted_images/llama
input		input
output		output
parser		parser
utils		utils
.env		.env
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
__init__.py		__init__.py
pdf-parsing-guide.pdf		pdf-parsing-guide.pdf
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

📑 Complex PDF Parsing

📌 Core Features

📤 Content Extraction

📦 Implementation Options

1. ☁️ Cloud-Based Methods

2. 🖥️ Local Methods

🔗 Dependencies

📚 Python Libraries

🛠️ Setup

📈 Usage

📄 Example Complex Pdf placed in Input folder

📝 Notes

About

Releases

Packages

Languages

License

genieincodebottle/parsemypdf

Folders and files

Latest commit

History

Repository files navigation

📑 Complex PDF Parsing

📌 Core Features

📤 Content Extraction

📦 Implementation Options

1. ☁️ Cloud-Based Methods

2. 🖥️ Local Methods

🔗 Dependencies

📚 Python Libraries

🛠️ Setup

📈 Usage

📄 Example Complex Pdf placed in Input folder

📝 Notes

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages