# PDFz

Developed by [codad5](https://github.com/codad5)
PDFz streamlines the extraction and processing of text from PDF files, making it easier to manage and analyze large volumes of documents. By using Rust for the Extractor Service, the project avoids common performance bottlenecks and keeps PDF processing fast and efficient.

PDFz is a microservices-based system for extracting and processing PDF files. It includes:
- Extractor Service (Rust): Handles file processing using Tesseract OCR.
- API Service (Node.js): Provides endpoints for uploading and managing file extraction.
- Redis: Caching and tracking progress.
- RabbitMQ: Message queuing between API and Extractor.
- Docker: Containerized deployment for all services.
## Features

- Upload PDF files via the API.
- Queue files for processing.
- Extract text from PDFs using OCR (Tesseract).
- Track file processing progress.
- Store extracted data.
## Upcoming: Ollama Integration

We are working on integrating Ollama to enable advanced text processing with locally run large language models (LLMs). This will allow you to:
- Summarize extracted text from PDFs.
- Perform question-answering on the content.
- Generate insights or reports from the processed data.
Stay tuned for updates!
## Architecture

- API Service: Handles file uploads and processing requests.
- Extractor Service: Processes queued files asynchronously.
- Redis: Tracks file processing states.
- RabbitMQ: Message queue for job dispatch.
## Prerequisites

- Docker & Docker Compose
- API Service:
  - Node.js & npm
  - Redis
  - RabbitMQ
- Extractor Service:
  - Rust & Cargo
  - Redis
  - RabbitMQ
  - Tesseract OCR
## Setup

1. Clone the repository:

   ```sh
   git clone https://github.com/codad5/pdfz.git
   cd pdfz
   ```

2. Create an `.env` file for environment variables:

   ```sh
   cp .env.example .env
   ```

3. Update the `.env` variables (e.g., ports, RabbitMQ and Redis credentials).

4. Build and start the services:

   ```sh
   docker-compose up --build
   ```
## Environment Variables

### Extractor Service

- `RUST_LOG=debug` - Log level
- `REDIS_URL` - Redis connection URL
- `RABBITMQ_URL` - RabbitMQ connection URL
- `EXTRACTOR_PORT` - Service port
- `SHARED_STORAGE_PATH` - Mounted storage path
- `TRAINING_DATA_PATH` - Path to Tesseract training data
- `PROTO_PATH` - Path to Protobuf files
### API Service

- `NODE_ENV=development` - Node environment
- `REDIS_URL` - Redis connection URL
- `RABBITMQ_URL` - RabbitMQ connection URL
- `API_PORT` - API listening port
- `SHARED_STORAGE_PATH` - Mounted storage path
- `PROTO_PATH` - Path to Protobuf files
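A minimal `.env` might look like the following sketch. The hostnames, ports, and paths here are illustrative assumptions, not the repository's actual defaults; adjust them to your deployment (only `SHARED_STORAGE_PATH=/shared_storage` is implied by the upload paths shown later in this document):

```env
# Illustrative values only - adjust for your environment
RUST_LOG=debug
NODE_ENV=development
REDIS_URL=redis://redis:6379
RABBITMQ_URL=amqp://guest:guest@rabbitmq:5672
API_PORT=3000
EXTRACTOR_PORT=50051
SHARED_STORAGE_PATH=/shared_storage
TRAINING_DATA_PATH=/usr/share/tesseract-ocr/tessdata
PROTO_PATH=/app/proto
```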
## API Endpoints

### `POST /upload`

Request: multipart form-data with a `pdf` file.

Response:

```json
{
  "success": true,
  "message": "File uploaded successfully",
  "data": {
    "id": "file-id",
    "filename": "file.pdf",
    "path": "/shared_storage/upload/pdf/file.pdf",
    "size": 12345
  }
}
```
### `POST /process/:id`

Request: JSON body

```json
{
  "startPage": 1,
  "pageCount": 10,
  "priority": 1
}
```

Response:

```json
{
  "success": true,
  "message": "File processing started",
  "data": {
    "id": "file-id",
    "file": "file.pdf",
    "options": {
      "startPage": 1,
      "pageCount": 10,
      "priority": 1
    },
    "status": "queued",
    "progress": 0,
    "queuedAt": "2023-10-01T12:00:00Z"
  }
}
```
### `GET /progress/:id`

Response:

```json
{
  "success": true,
  "message": "Progress retrieved successfully",
  "data": {
    "id": "file-id",
    "progress": 50,
    "status": "processing"
  }
}
```
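A caller can poll this endpoint until processing completes. The sketch below pulls the `progress` field out of a response using only POSIX tools (a real client would more likely use `jq`); the sample response is copied from the documentation above:

```shell
# Sample response from GET /progress/:id (copied from the docs above)
response='{"success":true,"message":"Progress retrieved successfully","data":{"id":"file-id","progress":50,"status":"processing"}}'

# Extract the numeric "progress" value with sed (no jq required)
progress=$(printf '%s' "$response" | sed -n 's/.*"progress":\([0-9]*\).*/\1/p')

echo "$progress"   # prints 50
```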
### `GET /content/:id`

Response:

```json
{
  "success": true,
  "message": "Processed content retrieved successfully",
  "data": {
    "id": "file-id",
    "content": [
      {
        "page_num": 1,
        "text": "This is the text from page 1."
      },
      {
        "page_num": 2,
        "text": "This is the text from page 2."
      }
    ],
    "status": "completed"
  }
}
```
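Putting the endpoints together, a typical session from the command line might look like this. This is a sketch, not a tested script: it assumes the API is reachable at `http://localhost:3000` (substitute your `API_PORT`) and that `jq` is installed.

```sh
# 1. Upload a PDF and capture the file id from the response
id=$(curl -s -F "pdf=@document.pdf" http://localhost:3000/upload | jq -r '.data.id')

# 2. Queue the file for processing
curl -s -X POST "http://localhost:3000/process/$id" \
  -H "Content-Type: application/json" \
  -d '{"startPage": 1, "pageCount": 10, "priority": 1}'

# 3. Poll progress until status is "completed"
curl -s "http://localhost:3000/progress/$id" | jq '.data.progress'

# 4. Fetch the extracted text, one page per line
curl -s "http://localhost:3000/content/$id" | jq -r '.data.content[].text'
```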
## Local Development

### API Service

1. Install dependencies:

   ```sh
   cd api
   npm install
   ```

2. Start the API service:

   ```sh
   npm run dev
   ```

3. Ensure Redis and RabbitMQ are running locally.
### Extractor Service

1. Install dependencies:

   ```sh
   cd extractor
   cargo build
   ```

2. Install Tesseract OCR:

   - On Ubuntu:

     ```sh
     sudo apt install tesseract-ocr
     ```

   - On macOS:

     ```sh
     brew install tesseract
     ```

3. Start the Extractor service:

   ```sh
   cargo run
   ```

4. Ensure Redis and RabbitMQ are running locally.
## Docker Services

The `docker-compose.yml` file defines the following services:

- `extractor`: Rust-based service for processing PDFs.
- `api`: Node.js-based service for handling API requests.
- `redis`: Redis instance for caching and tracking progress.
- `rabbitmq`: RabbitMQ instance for message queuing.
### Volumes

- `cargo-cache`: Caches Rust dependencies.
- `training_data`: Stores Tesseract training data.
- `redis-data`: Persists Redis data.
- `shared_storage`: Shared storage for uploaded and processed files.
- `rabbitmq-data`: Persists RabbitMQ data.
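For orientation, a compose file wiring up these services and volumes typically has roughly the following shape. This is a sketch only: the image tags, ports, build contexts, and mount points are assumptions, not the repository's actual `docker-compose.yml`.

```yaml
# Sketch of the service/volume layout - see docker-compose.yml in the repo for the real definition
services:
  api:
    build: ./api                     # assumed build context
    ports:
      - "3000:3000"                  # assumed API_PORT
    volumes:
      - shared_storage:/shared_storage
    depends_on: [redis, rabbitmq]
  extractor:
    build: ./extractor               # assumed build context
    volumes:
      - shared_storage:/shared_storage
      - cargo-cache:/usr/local/cargo/registry
      - training_data:/training_data
    depends_on: [redis, rabbitmq]
  redis:
    image: redis:7
    volumes:
      - redis-data:/data
  rabbitmq:
    image: rabbitmq:3-management
    volumes:
      - rabbitmq-data:/var/lib/rabbitmq

volumes:
  cargo-cache:
  training_data:
  redis-data:
  shared_storage:
  rabbitmq-data:
```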
For more details, visit the [GitHub repository](https://github.com/codad5/pdfz).
## Contributing

1. Fork the repository and create a new branch.
2. Make your changes and test them locally.
3. Submit a pull request.
## License

MIT License