PDF Parser

All-in-one PDF parser toolkit

Introduction

pdf_parser is a command-line tool and Python library that wraps several PDF parsing backends behind one interface. It is intended to extract images, structural information, tables, and references from PDFs. The table under Development progress shows which backend currently supports which output type.

Requirements

This project requires Python 3.6+ and Java 1.8+. Java is only needed by some backends; see the per-backend requirements table below. The Grobid backend also requires that your machine can reach a Grobid server over the network.

Installation

Clone the repository, install the dependencies with pip, then install the package:

git clone https://github.com/Acemap/pdf_parser.git
cd pdf_parser
pip install -r requirements.txt
python setup.py install

Usage

Command Line

Run the package as a module, passing the backend, the output type, the input (a single PDF file or a directory of PDFs), and the output directory. The tables under Development progress list the available values for backend and type.

python -m pdf_parser --backend=grobid --type=text <pdf_file|directory> output_directory
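
For scripted runs, the same command can be assembled from Python. The following is a minimal sketch: build_command is a hypothetical helper, not part of pdf_parser, and it assumes only the flag syntax shown above. The file names are placeholders.

```python
import sys


def build_command(backend, parse_type, input_path, output_dir):
    """Return the argv list for the command-line call shown above.

    Hypothetical helper: it mirrors the documented flags only and
    does not validate the backend or type values.
    """
    return [
        sys.executable, "-m", "pdf_parser",
        f"--backend={backend}",
        f"--type={parse_type}",
        input_path,
        output_dir,
    ]


cmd = build_command("grobid", "text", "paper.pdf", "out")
print(" ".join(cmd[1:]))

# To actually run it (requires pdf_parser to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when paths contain spaces.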

API

The Parser class takes a backend parameter that selects which parsing backend to use.

class pdf_parser.Parser(backend='grobid')

To parse the structural information of all PDF files in input_dir and save the results to output_dir, call:

pdf_parser.Parser.parse('text', input_dir, output_dir, n_threads=0)

To parse the image information of all PDF files in input_dir and save the results to output_dir, call:

pdf_parser.Parser.parse('figure', input_dir, output_dir, n_threads=0)

Note: The n_threads parameter specifies the number of threads to use for parsing. The default value is 0, which means it will use all available CPU cores.
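
One common way to implement the "0 means all cores" convention is sketched below. It illustrates the convention only: resolve_threads is a hypothetical name, and the package's actual implementation may differ.

```python
import os


def resolve_threads(n_threads=0):
    """Map the n_threads convention to a concrete worker count."""
    if n_threads < 0:
        raise ValueError("n_threads must be non-negative")
    if n_threads == 0:
        # os.cpu_count() can return None, so fall back to 1.
        return os.cpu_count() or 1
    return n_threads


print(resolve_threads(4))  # an explicit value is used as given
print(resolve_threads(0))  # 0 resolves to the number of CPU cores
```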

Example:

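from pdf_parser import Parser

# Select the CERMINE backend.
parser = Parser('cermine')
# Extract structural information from a single PDF, using 50 threads.
parser.parse('text', '/path/to/xxx.pdf', '/path/to/output', n_threads=50)
# Extract figures from every PDF in a directory.
parser.parse('figure', '/path/to/pdf_dir', '/path/to/output', n_threads=50)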
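
Because the input can be either a single PDF or a directory, it can be useful to check in advance which files a run would cover. The sketch below does that with the standard library only. collect_pdfs is a hypothetical helper, not part of pdf_parser, and it looks only at the top level of a directory; whether pdf_parser itself descends into subdirectories is not stated here.

```python
import tempfile
from pathlib import Path


def collect_pdfs(input_path):
    """Return the PDF files named by input_path (a file or a directory)."""
    path = Path(input_path)
    if path.is_file():
        return [path] if path.suffix.lower() == ".pdf" else []
    if path.is_dir():
        return sorted(
            p for p in path.iterdir()
            if p.is_file() and p.suffix.lower() == ".pdf"
        )
    raise FileNotFoundError(f"No such file or directory: {input_path}")


# Small demonstration on a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "notes.txt", "c.PDF"):
        (Path(tmp) / name).touch()
    found = [p.name for p in collect_pdfs(tmp)]

print(found)
```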

Development progress

Backend↓ / Type→	text	image	reference
grobid	√	×	×
cermine	√	√	×
scienceparse	√	×	×
pdffigures	×	√	×
pdffigures2	√	√	×
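
The support table above can be encoded as a small lookup so that a script can pick a backend for a given output type. This is a sketch: SUPPORT and backends_for are hypothetical names that are not part of pdf_parser. The type keys follow the table's column headers, while the API example passes 'figure' for images, so adjust the keys to whatever your installed version accepts.

```python
# Output types each backend supports, transcribed from the table above.
SUPPORT = {
    "grobid": {"text"},
    "cermine": {"text", "image"},
    "scienceparse": {"text"},
    "pdffigures": {"image"},
    "pdffigures2": {"text", "image"},
}


def backends_for(parse_type):
    """Return the backends that support parse_type, in alphabetical order."""
    return sorted(
        name for name, types in SUPPORT.items() if parse_type in types
    )


print(backends_for("image"))
print(backends_for("reference"))
```

An empty result for "reference" reflects the table: no backend is marked as supporting reference extraction yet.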

Detailed requirements

Backend↓ / Requirements→	OS	Java	Other
grobid	All (Windows/Linux/Mac)	Not required	None
cermine	All (Windows/Linux/Mac)	Required	None
scienceparse	All (Windows/Linux/Mac)	Required	None
pdffigures	Linux/Mac	Not required	leptonica & poppler (Ubuntu: sudo apt install libpoppler-private-dev libleptonica-dev)
pdffigures2	All (Windows/Linux/Mac)	Required	None
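
The requirements in the table above can be checked before a run. The following sketch flags a missing Java runtime and an unsupported operating system. NEEDS_JAVA and find_problems are hypothetical names, not part of pdf_parser, and the check does not verify the Java version or the leptonica and poppler libraries.

```python
import platform
import shutil

# Whether each backend needs Java, transcribed from the table above.
NEEDS_JAVA = {
    "grobid": False,
    "cermine": True,
    "scienceparse": True,
    "pdffigures": False,
    "pdffigures2": True,
}


def find_problems(backend, system=None, java_available=None):
    """Return a list of unmet requirements for backend (empty if none)."""
    if backend not in NEEDS_JAVA:
        raise ValueError(f"Unknown backend: {backend}")
    if system is None:
        system = platform.system()  # 'Windows', 'Linux' or 'Darwin'
    if java_available is None:
        java_available = shutil.which("java") is not None

    problems = []
    if NEEDS_JAVA[backend] and not java_available:
        problems.append("Java not found on PATH")
    if backend == "pdffigures" and system == "Windows":
        problems.append("pdffigures supports Linux/Mac only")
    return problems


print(find_problems("cermine"))
```

The system and java_available arguments exist so the check can be exercised without depending on the current machine.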

Citation

@misc{sciparser,
  author = {Cheng Deng and Yuting Jia and Shuhao Li},
  title = {pdf_parser: All in one PDF Parser Toolkits},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/Acemap/pdf_parser}},
}
