PDFz

Developed by codad5

PDFz is designed to streamline the extraction and processing of text from PDF files, making it easier to manage and analyze large volumes of documents. By leveraging Rust for the Extractor Service, the project addresses performance bottlenecks, ensuring efficient and fast processing of PDF files.

This project is a microservices-based system for extracting and processing PDF files. It includes:

  • Extractor Service (Rust): Handles file processing using Tesseract OCR.
  • API Service (Node.js): Provides endpoints for uploading and managing file extraction.
  • Redis: Caches data and tracks processing progress.
  • RabbitMQ: Queues messages between the API and Extractor services.
  • Docker: Containerized deployment for all services.

Features

  • Upload PDF files via the API.
  • Queue files for processing.
  • Extract text from PDFs using OCR (Tesseract).
  • Track file processing progress.
  • Store extracted data.

Upcoming Features

Ollama Integration (Coming Soon)

We are working on integrating Ollama to enable advanced text processing capabilities using locally run large language models (LLMs). This will allow you to:

  • Summarize extracted text from PDFs.
  • Perform question-answering on the content.
  • Generate insights or reports from the processed data.

Stay tuned for updates!


Architecture

  • API Service: Handles file uploads and processing requests.
  • Extractor Service: Processes queued files asynchronously.
  • Redis: Tracks file processing states.
  • RabbitMQ: Message queue for job dispatch.
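
At a high level, the API publishes a job message to RabbitMQ, the Extractor consumes it, and both report state through Redis. The sketch below illustrates what publishing such a job from the Node.js side could look like; the queue name, message shape, and use of the amqplib package are assumptions for illustration, not the project's actual implementation.

import amqp from 'amqplib';

// Hypothetical job message; the fields mirror the /process endpoint options.
interface ExtractionJob {
  id: string;        // file id returned by /upload
  startPage: number;
  pageCount: number;
  priority: number;
}

async function enqueueJob(job: ExtractionJob): Promise<void> {
  const conn = await amqp.connect(process.env.RABBITMQ_URL ?? 'amqp://localhost:5672');
  const channel = await conn.createChannel();
  const queue = 'pdf_extraction'; // assumed queue name
  await channel.assertQueue(queue, { durable: true });
  channel.sendToQueue(queue, Buffer.from(JSON.stringify(job)), { persistent: true });
  await channel.close();
  await conn.close();
}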

Setup

Prerequisites

For Docker Deployment:

  • Docker & Docker Compose

For Local Development:

  • API Service:
    • Node.js & npm
    • Redis
    • RabbitMQ
  • Extractor Service:
    • Rust & Cargo
    • Redis
    • RabbitMQ
    • Tesseract OCR

Installation

  1. Clone the repository:

    git clone https://github.com/codad5/pdfz.git
    cd pdfz
  2. Create a .env file for environment variables:

    cp .env.example .env
  3. Update .env variables (e.g., ports, RabbitMQ, Redis credentials).

  4. Build and start the services:

    docker-compose up --build

Services & Environment Variables

Extractor Service (Rust)

  • RUST_LOG=debug - Log level
  • REDIS_URL - Redis connection URL
  • RABBITMQ_URL - RabbitMQ connection URL
  • EXTRACTOR_PORT - Service port
  • SHARED_STORAGE_PATH - Mounted storage
  • TRAINING_DATA_PATH - Path to Tesseract training data
  • PROTO_PATH - Path to Protobuf files

API Service (Node.js)

  • NODE_ENV=development - Node environment
  • REDIS_URL - Redis connection URL
  • RABBITMQ_URL - RabbitMQ connection URL
  • API_PORT - API listening port
  • SHARED_STORAGE_PATH - Mounted storage
  • PROTO_PATH - Path to Protobuf files
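
As a reference for wiring these variables together, here is a minimal sketch of how the API service might read its configuration; the default values are assumptions, not the project's actual defaults.

// config.ts - illustrative only; defaults are assumptions
export const config = {
  nodeEnv: process.env.NODE_ENV ?? 'development',
  redisUrl: process.env.REDIS_URL ?? 'redis://localhost:6379',
  rabbitmqUrl: process.env.RABBITMQ_URL ?? 'amqp://localhost:5672',
  apiPort: Number(process.env.API_PORT ?? 3000),
  sharedStoragePath: process.env.SHARED_STORAGE_PATH ?? '/shared_storage',
  protoPath: process.env.PROTO_PATH ?? './proto',
};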

API Endpoints

Upload a File

POST /upload

Request: Multipart form-data with a pdf file.

Response:

{
  "success": true,
  "message": "File uploaded successfully",
  "data": {
    "id": "file-id",
    "filename": "file.pdf",
    "path": "/shared_storage/upload/pdf/file.pdf",
    "size": 12345
  }
}
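
A minimal client sketch for this endpoint (Node.js 18+), assuming the API listens on localhost:3000 and expects the file under a form field named pdf; adjust both to match your setup.

import { readFile } from 'node:fs/promises';

async function uploadPdf(path: string) {
  const form = new FormData();
  const bytes = await readFile(path);
  // Field name "pdf" and the base URL are assumptions for illustration.
  form.append('pdf', new Blob([bytes], { type: 'application/pdf' }), 'file.pdf');
  const res = await fetch('http://localhost:3000/upload', { method: 'POST', body: form });
  return res.json(); // { success, message, data: { id, filename, path, size } }
}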

Process a File

POST /process/:id

Request: JSON body

{
  "startPage": 1,
  "pageCount": 10,
  "priority": 1
}

Response:

{
  "success": true,
  "message": "File processing started",
  "data": {
    "id": "file-id",
    "file": "file.pdf",
    "options": {
      "startPage": 1,
      "pageCount": 10,
      "priority": 1
    },
    "status": "queued",
    "progress": 0,
    "queuedAt": "2023-10-01T12:00:00Z"
  }
}
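
A matching client sketch, again assuming the API is reachable on localhost:3000:

// Start processing an uploaded file; the body fields mirror the request above.
async function processFile(id: string) {
  const res = await fetch(`http://localhost:3000/process/${id}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ startPage: 1, pageCount: 10, priority: 1 }),
  });
  return res.json();
}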

Track Progress

GET /progress/:id

Response:

{
  "success": true,
  "message": "Progress retrieved successfully",
  "data": {
    "id": "file-id",
    "progress": 50,
    "status": "processing"
  }
}
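
Since processing is asynchronous, a client typically polls this endpoint. A simple sketch (base URL assumed):

// Poll until the file reaches the "completed" status shown in the next section.
async function waitForCompletion(id: string, intervalMs = 2000) {
  for (;;) {
    const res = await fetch(`http://localhost:3000/progress/${id}`);
    const { data } = await res.json();
    if (data.status === 'completed') return data;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}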

Retrieve Processed Content

GET /content/:id

Response:

{
  "success": true,
  "message": "Processed content retrieved successfully",
  "data": {
    "id": "file-id",
    "content": [
      {
        "page_num": 1,
        "text": "This is the text from page 1."
      },
      {
        "page_num": 2,
        "text": "This is the text from page 2."
      }
    ],
    "status": "completed"
  }
}
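
A client sketch that joins the per-page text into a single string (base URL assumed):

async function getExtractedText(id: string): Promise<string> {
  const res = await fetch(`http://localhost:3000/content/${id}`);
  const { data } = await res.json();
  return data.content.map((page: { page_num: number; text: string }) => page.text).join('\n');
}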

Local Development

Running the API Locally

  1. Install dependencies:

    cd api
    npm install
  2. Start the API service:

    npm run dev
  3. Ensure Redis and RabbitMQ are running locally.
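
If Redis and RabbitMQ are not installed natively, one option is to run them with Docker; the image tags and ports below are common defaults, so adjust them to match your .env:

    docker run -d --name redis -p 6379:6379 redis
    docker run -d --name rabbitmq -p 5672:5672 -p 15672:15672 rabbitmq:3-management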


Running the Extractor Locally

  1. Install dependencies:

    cd extractor
    cargo build
  2. Install Tesseract OCR:

    • On Ubuntu:
      sudo apt install tesseract-ocr
    • On macOS:
      brew install tesseract
  3. Start the Extractor service:

    cargo run
  4. Ensure Redis and RabbitMQ are running locally.


Docker Compose Setup

The docker-compose.yml file defines the following services:

  • extractor: Rust-based service for processing PDFs.
  • api: Node.js-based service for handling API requests.
  • redis: Redis instance for caching and tracking progress.
  • rabbitmq: RabbitMQ instance for message queuing.

Volumes:

  • cargo-cache: Caches Rust dependencies.
  • training_data: Stores Tesseract training data.
  • redis-data: Persists Redis data.
  • shared_storage: Shared storage for uploaded and processed files.
  • rabbitmq-data: Persists RabbitMQ data.

Repository

For more details, visit the GitHub repository: https://github.com/codad5/pdfz.


Contributing

  1. Fork the repository and create a new branch.
  2. Make changes and test locally.
  3. Submit a pull request.

License

MIT License