Update documentation and bump version
icaropires committed Jul 29, 2020
1 parent b98f170 commit 483c3cb
Showing 2 changed files with 62 additions and 17 deletions.
77 changes: 61 additions & 16 deletions README.md
@@ -12,7 +12,7 @@ Converts a whole subdirectory with a big volume of PDF documents to a dataset (p
* Incremental writing of resulting DataFrame, to save memory
* Ability to save processing progress and resume from it
* Error tracking of faulty documents
-* Ability to extracting text through [pdftotext](https://github.com/jalan/pdftotext)
+* Ability to extract text through [pdftotext](https://github.com/jalan/pdftotext)
* Ability to use OCR for extracting text through [pytesseract](https://github.com/madmaze/pytesseract) and [pdf2image](https://github.com/Belval/pdf2image)
* Custom behavior through parameters (number of CPUs, text language, etc)

@@ -58,11 +58,11 @@ $ poetry install
### Simple - CLI

``` bash
-# Reads all PDFs from my_pdfs_folder and saves the resultant dataframe to my_df.parquet.gzip
-$ pdf2dataset my_pdfs_folder my_df.parquet.gzip  # Most basic
-$ pdf2dataset my_pdfs_folder my_df.parquet.gzip --num-cpus 1  # Reduce parallelism to the maximum
-$ pdf2dataset my_pdfs_folder my_df.parquet.gzip --ocr true  # For scanned PDFs
-$ pdf2dataset my_pdfs_folder my_df.parquet.gzip --ocr true --lang eng  # Scanned documents with english text
+# Reads all PDFs from my_pdfs_dir and saves the resultant dataframe to my_df.parquet.gzip
+$ pdf2dataset my_pdfs_dir my_df.parquet.gzip  # Most basic
+$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --num-cpus 1  # Reduce parallelism as much as possible
+$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --ocr true  # For scanned PDFs
+$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --ocr true --lang eng  # For scanned documents with English text
```

### Save Processing Progress - CLI
@@ -71,7 +71,7 @@
It's possible to save the progress to a temporary folder and resume from the saved state in case of
any error or interruption. To resume the processing, just use the `--tmp-dir [directory]` flag:

``` bash
-$ pdf2dataset my_pdfs_folder my_df.parquet.gzip --tmp-dir my_progress
+$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --tmp-dir my_progress
```

The indicated temporary directory can also be used for debugging purposes and **is not** deleted
@@ -85,7 +85,7 @@
The `extract_text` function can be used analogously to the CLI:
``` python
from pdf2dataset import extract_text

-extract_text('my_pdfs_folder', 'my_df.parquet.gzip', tmp_dir='my_progress')
+extract_text('my_pdfs_dir', 'my_df.parquet.gzip', tmp_dir='my_progress')
```

#### Small
@@ -105,9 +105,55 @@
The complete list of differences is:
``` python
from pdf2dataset import extract_text

-df = extract_text('my_pdfs_folder', small=True)
+df = extract_text('my_pdfs_dir', small=True)
# ...
```

#### Passing specific tasks

If you don't want to specify a directory for the documents, you can instead specify the tasks
to be processed.

Each task can be of the form `(document_name, document_bytes, page_number)`
or just `(document_name, document_bytes)`. _document_name_ must end with `.pdf` but
doesn't need to be a real file, _document_bytes_ holds the raw bytes of the PDF document, and _page_number_
is the number of the page to process (all pages if not specified).

##### Example:

``` python
from pdf2dataset import extract_text

tasks = [
('a.pdf', a_bytes), # Processing all pages of this document
('b.pdf', b_bytes, 1),
('b.pdf', b_bytes, 2),
]

# 'df' will contain all pages from 'a.pdf' and pages 1 and 2 from 'b.pdf'
df = extract_text(tasks, 'my_df.parquet.gzip', small=True)

# ...
```

#### Returning a list with the contents, instead of DataFrame

If you are just interested in the texts, it's possible to return a list that contains only
the pages' content. Each document is represented as a list in which each element is a page.

##### Example:

``` python
>>> from pdf2dataset import extract_text
>>> extract_text('tests/samples', return_list=True)
[[''],
['First page', 'Second page', 'Third page'],
['My beautiful sample!'],
['First page', 'Second page', 'Third page'],
['My beautiful sample!']]
```

_Note:_ Pages/documents with parsing errors will have an empty string as their text result.
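The nested lists above can be flattened into per-page records with plain Python. A minimal sketch, reusing a subset of the sample output shown above:

``` python
# Flatten the per-document page lists returned by
# extract_text(..., return_list=True) into (doc_index, page_number, text) rows
docs = [
    [''],
    ['First page', 'Second page', 'Third page'],
    ['My beautiful sample!'],
]

rows = [
    (doc_idx, page_num, text)
    for doc_idx, pages in enumerate(docs)
    for page_num, text in enumerate(pages, start=1)  # pages numbered from 1
]

print(rows[:2])  # → [(0, 1, ''), (1, 1, 'First page')]
```

Note that with `return_list=True` the file names are not part of the output, so documents can only be identified by their position in the list.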

### Results File

The resulting "file" is a parquet hive written with [fastparquet](https://github.com/dask/fastparquet), it can be
@@ -152,9 +198,8 @@
With version >= 0.2.0, only the head node needs to have access to the documents

```
$ pdf2dataset -h
-usage: pdf2dataset [-h] [--tmp-dir TMP_DIR] [--lang LANG] [--ocr OCR]
-                   [--num-cpus NUM_CPUS] [--address ADDRESS]
-                   [--webui-host WEBUI_HOST] [--redis-password REDIS_PASSWORD]
+usage: pdf2dataset [-h] [--tmp-dir TMP_DIR] [--lang LANG] [--ocr OCR] [--chunksize CHUNKSIZE] [--num-cpus NUM_CPUS] [--address ADDRESS] [--webui-host WEBUI_HOST]
+                   [--redis-password REDIS_PASSWORD]
input_dir results_file
Extract text from all PDF files in a directory
@@ -165,11 +210,11 @@
positional arguments:
optional arguments:
-h, --help show this help message and exit
-  --tmp-dir TMP_DIR     The folder to keep all the results, including log
-                        files and intermediate files
+  --tmp-dir TMP_DIR     The folder to keep all the results, including log files and intermediate files
--lang LANG Tesseract language
-  --ocr OCR             'pytesseract' if true, else 'pdftotext'. default:
-                        false
+  --ocr OCR             'pytesseract' if true, else 'pdftotext'. default: false
+  --chunksize CHUNKSIZE
+                        Chunksize to use while processing pages, otherwise is calculated
--num-cpus NUM_CPUS Number of cpus to use
--address ADDRESS Ray address to connect
--webui-host WEBUI_HOST
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "pdf2dataset"
version = "0.2.1"
version = "0.3.0"
readme = "README.md"
description = "Easily convert a big folder with PDFs into a dataset, with extracted text using OCR"
authors = ["Ícaro Pires <[email protected]>"]
