Add a small utility to fix broken pdf font #671

Merged
merged 4 commits into from
Oct 2, 2023
49 changes: 49 additions & 0 deletions parsr-fix-pdf-font/README.md
@@ -0,0 +1,49 @@
# Parsr - Fix PDF Font

**Parsr-fix-pdf-font** is a utility designed specifically to repair broken Unicode maps for PDF fonts. Broken Unicode maps can arise for various reasons, including incomplete or corrupt font embedding, or problems during the PDF creation process. A PDF with such a font may still display correctly, but the text extracted or copied from it is unreadable or undecipherable.

This tool leverages Tesseract.js, an optical character recognition engine, to recognize the broken glyphs present in the PDF. Once these glyphs are identified, **Parsr-fix-pdf-font** rebuilds the unicode map, ensuring that the PDF becomes readable and retains its original design and layout.
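
The rebuilding step can be pictured with a minimal sketch, assuming the OCR stage has already produced a table from 2-byte character codes to recognized text. The names `buildToUnicodeCMap` and `toUtf16Hex` are hypothetical and are not taken from the Parsr source; the sketch only illustrates the general shape of a PDF ToUnicode CMap, not how this tool writes it.

```javascript
// Sketch only: these helpers are hypothetical, not part of the Parsr source.

// Encode a string as big-endian UTF-16 hex, the form used in ToUnicode CMaps.
function toUtf16Hex(text) {
  let hex = '';
  for (let i = 0; i < text.length; i++) {
    hex += text.charCodeAt(i).toString(16).toUpperCase().padStart(4, '0');
  }
  return hex;
}

// glyphToText: Map of 2-byte character code -> text recognized by OCR.
function buildToUnicodeCMap(glyphToText) {
  const entries = [...glyphToText.entries()]
    .sort((a, b) => a[0] - b[0])
    .map(([code, text]) =>
      `<${code.toString(16).toUpperCase().padStart(4, '0')}> <${toUtf16Hex(text)}>`);

  const lines = [
    '/CIDInit /ProcSet findresource begin',
    '12 dict begin',
    'begincmap',
    '/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def',
    '/CMapName /Adobe-Identity-UCS def',
    '/CMapType 2 def',
    '1 begincodespacerange',
    '<0000> <FFFF>',
    'endcodespacerange',
  ];

  // A single bfchar section may hold at most 100 entries.
  for (let i = 0; i < entries.length; i += 100) {
    const chunk = entries.slice(i, i + 100);
    lines.push(`${chunk.length} beginbfchar`, ...chunk, 'endbfchar');
  }

  lines.push('endcmap', 'CMapName currentdict /CMap defineresource pop', 'end', 'end');
  return lines.join('\n');
}
```

Calling `buildToUnicodeCMap(new Map([[0x24, 'A'], [0x25, 'B']]))` returns the CMap text that would be stored as the font's ToUnicode stream. Ligatures recognized as several characters (for example `fi`) map one code to a multi-character string.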

## Features

- OCR Powered Correction: Uses Tesseract.js to perform Optical Character Recognition on the broken glyphs, ensuring accurate text representation.

- Rebuilding Unicode Maps: After identifying the incorrect mappings, the tool regenerates the correct unicode map, preserving the original design of the PDF.

- Easy-to-Use Command Line Interface: Simplified command line usage for quick fixes.

## Requirements

- Node.js > 18
- ImageMagick (`convert` command)


## Usage
Use the command line interface to run the Parsr tool:

```
parsr-fix-pdf-font --input <path-to-pdf> --output <path-to-out-pdf> --lang eng
```
Parameters:

- --input <path-to-pdf>: Specifies the path to the source PDF file that needs to be fixed.
- --output <path-to-out-pdf>: Designates the path where the fixed PDF will be saved. If the specified file already exists, it will be overwritten.
- --lang eng: Sets the language for the OCR process. By default, it's set to English (eng). Tesseract supports multiple languages, so ensure you choose the appropriate one for your document.
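
For callers that want to run the tool from another Node.js script, the following is a minimal sketch. It assumes the `parsr-fix-pdf-font` executable is on the `PATH` and uses the flag names registered in `parsr-fix-pdf-font.js` (`--input`, `--output`, `--lang`). The helpers `buildFixFontArgs` and `fixPdfFonts` are hypothetical and not part of the package.

```javascript
const { spawnSync } = require('child_process');

// Hypothetical helper: assembles the argument list for the CLI.
function buildFixFontArgs({ input, output, lang = 'eng' }) {
  if (!input || !output) {
    throw new Error('both input and output paths are required');
  }
  return ['--input', input, '--output', output, '--lang', lang];
}

// Hypothetical wrapper: runs the CLI and returns its exit status.
// Assumes `parsr-fix-pdf-font` is installed and reachable on the PATH.
function fixPdfFonts(paths) {
  const result = spawnSync('parsr-fix-pdf-font', buildFixFontArgs(paths), {
    stdio: 'inherit',
  });
  if (result.error) throw result.error;
  return result.status;
}
```

Passing the arguments as an array rather than a single shell string avoids quoting problems with paths that contain spaces.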

## Troubleshooting
If you encounter any issues:

Inspect PDF: Ensure that the PDF isn't password protected or encrypted. If it is, decrypt it before running the tool.

Language Mismatch: If the OCR isn't accurate, ensure you've chosen the correct language setting for the document.
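
The encryption check can be approximated before running the tool. This is a rough heuristic sketch, not something the tool does itself: it assumes that an encrypted PDF carries an `/Encrypt` key in its trailer or cross-reference stream dictionary, and it simply searches the raw bytes for that name. The function names `looksEncrypted` and `isPdfEncrypted` are hypothetical, and the byte search can report false positives if the same name happens to appear elsewhere in the file.

```javascript
const fs = require('fs');

// Heuristic: an encrypted PDF references its encryption dictionary through an
// /Encrypt key in the trailer (or cross-reference stream dictionary).
function looksEncrypted(pdfBytes) {
  return pdfBytes.includes('/Encrypt', 0, 'latin1');
}

// Convenience wrapper that reads the file from disk first.
function isPdfEncrypted(pdfPath) {
  return looksEncrypted(fs.readFileSync(pdfPath));
}
```

If `isPdfEncrypted('input.pdf')` returns `true`, decrypt the file with a dedicated PDF tool before passing it to `--input`.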

## Limits

Tesseract OCR is not very accurate on isolated single glyphs, so some characters may be mapped incorrectly. Even so, the resulting text is readable and understandable, for example by an LLM.

We do not reconstruct the XREF table yet. Running a tool such as `mutool clean` on the output will rebuild it if needed.

## Contribution
Parsr is an open-source tool. Contributions in the form of bug reports, feature requests, or code are always welcome. Check our GitHub repository for more details.

Binary file added parsr-fix-pdf-font/eng.traineddata
Binary file not shown.
158 changes: 158 additions & 0 deletions parsr-fix-pdf-font/package-lock.json


22 changes: 22 additions & 0 deletions parsr-fix-pdf-font/package.json
@@ -0,0 +1,22 @@
{
  "name": "fixfontinpdf",
  "version": "1.0.0",
  "description": "# Usage",
  "main": "parsr-fix-pdf-font.js",
  "directories": {
    "test": "test"
  },
  "bin": {
    "parsr-fix-pdf-font": "parsr-fix-pdf-font.js"
  },
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC",
  "dependencies": {
    "commander": "^11.0.0",
    "opentype.js": "^1.3.4",
    "tesseract.js": "^5.0.0"
  }
}
49 changes: 49 additions & 0 deletions parsr-fix-pdf-font/parsr-fix-pdf-font.js
@@ -0,0 +1,49 @@
#!/usr/bin/env node
// The shebang lets npm expose this file through the "bin" entry in package.json.

const fs = require('fs');
const path = require('path');
const { Command } = require('commander');

const extractAndCorrectFontsFromPDF = require('./src/extractAndCorrectFontsFromPDF.js');

// Working directory passed to the extraction step.
const outDirPath = path.join(__dirname, 'tmp');

const program = new Command();

program
  .name('parsr-fix-pdf-font')
  .description('CLI to fix PDF fonts')
  .version('0.0.1')
  .option('--input <pdf-input-file-path>')
  .option('--output <pdf-output-file-path>')
  .option('--lang <language-code>')
  .parse();

const options = program.opts();

// Exit with a non-zero status so that callers can detect the usage error.
if (!options.input) {
  console.error('--input is required');
  process.exit(1);
}

if (!options.output) {
  console.error('--output is required');
  process.exit(1);
}

async function main(input, output, lang = 'eng') {
  if (!fs.existsSync(outDirPath)) {
    fs.mkdirSync(outDirPath);
  }

  await extractAndCorrectFontsFromPDF(input, output, lang, outDirPath);
}

// options.lang is undefined when --lang is omitted, so the 'eng' default applies.
main(options.input, options.output, options.lang).catch((err) => {
  console.error(err);
  process.exit(1);
});