
Simple FastAPI wrapper for Document-to-Markdown conversion using Microsoft's MarkItDown library.


9bow/markitdown-api-fly-io


MarkItDown API

A REST API service that converts documents and web content to Markdown. Supports various file formats using Microsoft's MarkItDown and web content extraction using Trafilatura and python-readability.

Features

  • Document-to-Markdown conversion via file upload or URL
    • Office documents (DOCX, XLSX, PPTX)
    • PDF files
    • Images (PNG, JPEG, GIF, WebP)
    • Data files (CSV, JSON, XML)
  • Web content extraction and conversion
    • Primary extraction using Trafilatura
    • Fallback to python-readability for robust content extraction
    • Intelligent character encoding detection
    • Clean Markdown output with preserved formatting
  • Rich metadata for conversion results
  • API Key-based authentication and OpenAPI documentation
  • Robust content type detection and handling
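
The primary-then-fallback extraction described above can be sketched as a chain of extractor callables. This is a minimal illustration, not the project's actual code: `extract_with_fallback` and the two stand-in extractors are hypothetical names, and in the real service the callables would presumably wrap Trafilatura and python-readability.

```python
from typing import Callable, Optional, Sequence, Tuple

# An extractor takes raw HTML and returns text, or None when it finds nothing.
Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(
    html: str, extractors: Sequence[Tuple[str, Extractor]]
) -> Tuple[str, str]:
    """Try each extractor in order; return (method_name, text) for the first non-empty result."""
    for name, extractor in extractors:
        try:
            text = extractor(html)
        except Exception:
            # A failing extractor should not abort the chain; move on to the next one.
            continue
        if text and text.strip():
            return name, text
    raise ValueError("no extractor produced content")


# Hypothetical stand-ins: the primary finds nothing, so the fallback is used.
def primary_stub(html: str) -> Optional[str]:
    return None


def fallback_stub(html: str) -> Optional[str]:
    return "# Title\n\nBody text"


method, markdown = extract_with_fallback(
    "<html><body><h1>Title</h1><p>Body text</p></body></html>",
    [("primary", primary_stub), ("fallback", fallback_stub)],
)
```

Returning the name of the extractor that succeeded makes it easy to record which method produced the output, similar to the `conversion_method` field in the response metadata.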

Installation & Development

Prerequisites

  • Python 3.8 or higher
  • pip (Python package installer)
  • Virtual environment (recommended)

Clone & Local Setup

  1. Clone the repository
git clone https://github.com/9bow/markitdown-api-fly-io.git
cd markitdown-api-fly-io
  2. Create and activate a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies
pip install -r requirements.txt
  4. Configure environment variables
# Create a .env file from the provided example file
cp .env.example .env
# Update the following variables in .env
VERSION=0.0.1
MAX_DOWNLOAD_SIZE=52428800  # 50MB in bytes
TIMEOUT_SECONDS=30
  5. Run the development server
cd app/
uvicorn main:app --reload
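
To illustrate how the three variables from the configuration step might be consumed, here is a minimal sketch that reads them from the environment. The `Settings` class and `load_settings` function are hypothetical names, not taken from the repository, and the defaults simply mirror the sample values shown above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    version: str
    max_download_size: int  # bytes
    timeout_seconds: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping, falling back to the sample values (an assumption)."""
    return Settings(
        version=env.get("VERSION", "0.0.1"),
        # 50 MB expressed in bytes: 50 * 1024 * 1024 = 52428800
        max_download_size=int(env.get("MAX_DOWNLOAD_SIZE", 50 * 1024 * 1024)),
        timeout_seconds=int(env.get("TIMEOUT_SECONDS", 30)),
    )


settings = load_settings(
    {"VERSION": "0.0.1", "MAX_DOWNLOAD_SIZE": "52428800", "TIMEOUT_SECONDS": "30"}
)
```

Passing the environment in as a mapping keeps the function easy to exercise without touching the real process environment.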

Deployment (via Fly.io)

  1. Install Fly.io CLI
curl -L https://fly.io/install.sh | sh
  2. Log in and deploy
flyctl auth login
flyctl launch
flyctl secrets set API_KEY="your-secure-api-key"
flyctl deploy

API Usage

Authentication

All API endpoints require authentication using either:

  • API key in the X-API-Key header
  • Bearer token in the Authorization header
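
The two header styles can be built on the client side as in the sketch below. The helper name `auth_headers` is hypothetical, and the sketch assumes the Bearer token is the same value as the API key, which the description above does not state explicitly.

```python
from typing import Dict


def auth_headers(credential: str, scheme: str = "x-api-key") -> Dict[str, str]:
    """Return the HTTP header for one of the two documented authentication styles."""
    if not credential:
        raise ValueError("credential must not be empty")
    if scheme == "x-api-key":
        return {"X-API-Key": credential}
    if scheme == "bearer":
        # Assumption: the bearer token is the same secret as the API key.
        return {"Authorization": f"Bearer {credential}"}
    raise ValueError(f"unknown scheme: {scheme!r}")


api_key_header = auth_headers("your-secure-api-key")
bearer_header = auth_headers("your-secure-api-key", scheme="bearer")
```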

Endpoints

Health Check

curl -X GET \
  -H "X-API-Key: your-secure-api-key" \
  http://localhost:8000/health

Convert Document

# via file upload
curl -X POST \
  -H "X-API-Key: your-secure-api-key" \
  -F "file=@document.pdf" \
  http://localhost:8000/convert

# via file URL
curl -X POST \
  -H "X-API-Key: your-secure-api-key" \
  -F "url=https://example.com/document.pdf" \
  http://localhost:8000/convert
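
The URL-based call can also be made from Python. The sketch below uses only the standard library and builds the same multipart form request that `curl -F` sends; the helper names `encode_form_fields` and `build_convert_request` are hypothetical. The request is constructed but not sent, so the snippet runs without a server.

```python
import urllib.request
import uuid
from typing import Dict, Tuple


def encode_form_fields(fields: Dict[str, str]) -> Tuple[bytes, str]:
    """Encode plain text fields as multipart/form-data; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append(f"--{boundary}")
        lines.append(f'Content-Disposition: form-data; name="{name}"')
        lines.append("")  # blank line separates part headers from the value
        lines.append(value)
    lines.append(f"--{boundary}--")  # closing boundary
    lines.append("")
    body = "\r\n".join(lines).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def build_convert_request(
    base_url: str, api_key: str, document_url: str
) -> urllib.request.Request:
    """Build a POST /convert request carrying the document URL in the 'url' form field."""
    body, content_type = encode_form_fields({"url": document_url})
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/convert",
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": content_type},
        method="POST",
    )


request = build_convert_request(
    "http://localhost:8000",
    "your-secure-api-key",
    "https://example.com/document.pdf",
)
# To send it against a running server:
# with urllib.request.urlopen(request, timeout=30) as response:
#     payload = response.read().decode("utf-8")
```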

Response Format

Successful conversions return a JSON object with the following structure:

{
  "result": "# Converted Markdown Content...",
  "metadata": {
    "content_type": "application/pdf",
    "file_size": 12345,
    "processing_time": 0.532,
    "original_url": "https://example.com/document.pdf",
    "conversion_method": "markitdown"
  }
}
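
A client might parse this structure into a typed object, as in the sketch below. `ConversionResult` and `parse_conversion_response` are hypothetical names, and treating every metadata field as optional is a defensive assumption rather than documented behavior.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConversionResult:
    markdown: str
    content_type: Optional[str] = None
    file_size: Optional[int] = None
    processing_time: Optional[float] = None
    original_url: Optional[str] = None
    conversion_method: Optional[str] = None


def parse_conversion_response(payload: str) -> ConversionResult:
    """Parse the JSON body of a successful conversion into a ConversionResult."""
    data = json.loads(payload)
    metadata = data.get("metadata") or {}
    return ConversionResult(
        markdown=data["result"],
        content_type=metadata.get("content_type"),
        file_size=metadata.get("file_size"),
        processing_time=metadata.get("processing_time"),
        original_url=metadata.get("original_url"),
        conversion_method=metadata.get("conversion_method"),
    )


sample = """{
  "result": "# Converted Markdown Content...",
  "metadata": {
    "content_type": "application/pdf",
    "file_size": 12345,
    "processing_time": 0.532,
    "original_url": "https://example.com/document.pdf",
    "conversion_method": "markitdown"
  }
}"""
parsed = parse_conversion_response(sample)
```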

Error Handling

The API returns appropriate HTTP status codes and error messages:

  • 400: Bad Request (invalid input)
    • Unsupported file format
    • Invalid URL
    • Missing file/URL
  • 401: Unauthorized (invalid API key)
  • 408: Request Timeout
  • 413: Payload Too Large (file size exceeds limit)
  • 500: Internal Server Error
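
On the client side, the status codes listed above can be mapped to a message and a retry hint. This is a sketch under assumptions: `classify_status` is a hypothetical helper, and treating 408 and 500 as retryable is a suggested client policy, not something the API documents.

```python
from typing import Tuple

STATUS_MEANINGS = {
    400: "Bad Request: unsupported format, invalid URL, or missing file/URL",
    401: "Unauthorized: invalid API key",
    408: "Request Timeout",
    413: "Payload Too Large: file size exceeds limit",
    500: "Internal Server Error",
}

# Assumption: only timeouts and server errors are worth retrying unchanged.
RETRYABLE = frozenset({408, 500})


def classify_status(status: int) -> Tuple[str, bool]:
    """Return (description, should_retry) for an HTTP status code."""
    if 200 <= status < 300:
        return "OK", False
    meaning = STATUS_MEANINGS.get(status, f"Unexpected status {status}")
    return meaning, status in RETRYABLE


description, should_retry = classify_status(413)
```

Errors in the 400, 401, and 413 cases need a change to the request itself (different input, a valid key, or a smaller file), so repeating the same call is unlikely to help.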

Content Type Support

Documents

  • PDF (.pdf)
  • Microsoft Word (.docx)
  • Microsoft Excel (.xlsx)
  • Microsoft PowerPoint (.pptx)

Web Content

  • HTML pages (.html, .htm)
  • XML documents (.xml)

Data Files

  • CSV (.csv)
  • JSON (.json)
  • XML (.xml)

Images

  • JPEG (.jpg, .jpeg)
  • PNG (.png)
  • GIF (.gif)
  • WebP (.webp)
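
A client can check a filename against the extensions listed above before uploading. The sketch below is a hypothetical helper (`categories_for` is not part of the API), and the server may detect content types by other means than the file extension.

```python
from pathlib import PurePosixPath
from typing import Dict, FrozenSet, List

# Extensions copied from the lists above; .xml appears under two categories.
SUPPORTED_EXTENSIONS: Dict[str, FrozenSet[str]] = {
    "documents": frozenset({".pdf", ".docx", ".xlsx", ".pptx"}),
    "web content": frozenset({".html", ".htm", ".xml"}),
    "data files": frozenset({".csv", ".json", ".xml"}),
    "images": frozenset({".jpg", ".jpeg", ".png", ".gif", ".webp"}),
}


def categories_for(filename: str) -> List[str]:
    """Return the sorted list of categories whose extensions match the filename."""
    suffix = PurePosixPath(filename).suffix.lower()
    return sorted(
        category
        for category, extensions in SUPPORTED_EXTENSIONS.items()
        if suffix in extensions
    )


pdf_categories = categories_for("report.PDF")
xml_categories = categories_for("feed.xml")
unsupported = categories_for("archive.zip")
```

An empty list means the extension is not in any of the documented categories, which would likely lead to a 400 response.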

Development

Running Tests

pytest

Type Checking

mypy app/

License

This project is licensed under the MIT License - see the LICENSE file for details.
