Update documentation and bump version
icaropires committed Jul 29, 2020
1 parent b98f170 commit 483c3cb
Showing 2 changed files with 62 additions and 17 deletions.
77 changes: 61 additions & 16 deletions README.md
@@ -12,7 +12,7 @@ Converts a whole subdirectory with a big volume of PDF documents to a dataset (p
* Incremental writing of resulting DataFrame, to save memory
* Ability to save processing progress and resume from it
* Error tracking of faulty documents
-* Ability to extracting text through [pdftotext](https://github.com/jalan/pdftotext)
+* Ability to extract text through [pdftotext](https://github.com/jalan/pdftotext)
* Ability to use OCR for extracting text through [pytesseract](https://github.com/madmaze/pytesseract) and [pdf2image](https://github.com/Belval/pdf2image)
* Custom behavior through parameters (number of CPUs, text language, etc)

@@ -58,11 +58,11 @@ $ poetry install
### Simple - CLI

``` bash
-# Reads all PDFs from my_pdfs_folder and saves the resultant dataframe to my_df.parquet.gzip
-$ pdf2dataset my_pdfs_folder my_df.parquet.gzip  # Most basic
-$ pdf2dataset my_pdfs_folder my_df.parquet.gzip --num-cpus 1  # Reduce parallelism to the maximum
-$ pdf2dataset my_pdfs_folder my_df.parquet.gzip --ocr true  # For scanned PDFs
-$ pdf2dataset my_pdfs_folder my_df.parquet.gzip --ocr true --lang eng  # Scanned documents with english text
+# Reads all PDFs from my_pdfs_dir and saves the resultant dataframe to my_df.parquet.gzip
+$ pdf2dataset my_pdfs_dir my_df.parquet.gzip  # Most basic
+$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --num-cpus 1  # Reduce parallelism as much as possible
+$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --ocr true  # For scanned PDFs
+$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --ocr true --lang eng  # For scanned documents with English text
```

### Save Processing Progress - CLI
@@ -71,7 +71,7 @@
It's possible to save the progress to a temporary folder and resume from the saved state in case of
any error or interruption. To resume the processing, just use the `--tmp-dir [directory]` flag:

``` bash
-$ pdf2dataset my_pdfs_folder my_df.parquet.gzip --tmp-dir my_progress
+$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --tmp-dir my_progress
```

The indicated temporary directory can also be used for debugging purposes and **is not** deleted
@@ -85,7 +85,7 @@
The `extract_text` function can be used analogously to the CLI:
``` python
from pdf2dataset import extract_text

-extract_text('my_pdfs_folder', 'my_df.parquet.gzip', tmp_dir='my_progress')
+extract_text('my_pdfs_dir', 'my_df.parquet.gzip', tmp_dir='my_progress')
```

#### Small
@@ -105,9 +105,55 @@
The complete list of differences is:
``` python
from pdf2dataset import extract_text

-df = extract_text('my_pdfs_folder', small=True)
+df = extract_text('my_pdfs_dir', small=True)
# ...
```

#### Passing specific tasks

If you don't want to specify a directory for the documents, you can instead specify the tasks
to be processed.

Each task can be of the form `(document_name, document_bytes, page_number)`
or just `(document_name, document_bytes)`. _document_name_ must end with `.pdf` but
doesn't need to be a real file, _document_bytes_ holds the raw bytes of the PDF document, and _page_number_
is the number of the page to process (all pages if not specified).

##### Example:

``` python
from pdf2dataset import extract_text

tasks = [
('a.pdf', a_bytes), # Processing all pages of this document
('b.pdf', b_bytes, 1),
('b.pdf', b_bytes, 2),
]

# 'df' will contain all pages from 'a.pdf' and pages 1 and 2 from 'b.pdf'
df = extract_text(tasks, 'my_df.parquet.gzip', small=True)

# ...
```

#### Returning a list with the contents, instead of DataFrame

If you are just interested in the texts, it's possible to return a list that contains only
the pages' content. Each document is represented as a list in which each element is a page.

##### Example:

``` python
>>> from pdf2dataset import extract_text
>>> extract_text('tests/samples', return_list=True)
[[''],
['First page', 'Second page', 'Third page'],
['My beautiful sample!'],
['First page', 'Second page', 'Third page'],
['My beautiful sample!']]
```

_Note:_ Pages/documents with parsing errors will have an empty string as their text result.
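The nested lists above can be flattened into per-page records with plain Python. A minimal sketch, reusing a subset of the sample output shown above:

``` python
# Flatten the per-document page lists returned by
# extract_text(..., return_list=True) into (doc_index, page_number, text) rows
docs = [
    [''],
    ['First page', 'Second page', 'Third page'],
    ['My beautiful sample!'],
]

rows = [
    (doc_idx, page_num, text)
    for doc_idx, pages in enumerate(docs)
    for page_num, text in enumerate(pages, start=1)  # pages numbered from 1
]

print(rows[:2])  # → [(0, 1, ''), (1, 1, 'First page')]
```

Note that with `return_list=True` the file names are not part of the output, so documents can only be identified by their position in the list.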

### Results File

The resulting "file" is a parquet hive written with [fastparquet](https://github.com/dask/fastparquet), it can be
@@ -152,9 +198,8 @@
With version >= 0.2.0, only the head node needs to have access to the documents

```
$ pdf2dataset -h
-usage: pdf2dataset [-h] [--tmp-dir TMP_DIR] [--lang LANG] [--ocr OCR]
-                   [--num-cpus NUM_CPUS] [--address ADDRESS]
-                   [--webui-host WEBUI_HOST] [--redis-password REDIS_PASSWORD]
+usage: pdf2dataset [-h] [--tmp-dir TMP_DIR] [--lang LANG] [--ocr OCR] [--chunksize CHUNKSIZE] [--num-cpus NUM_CPUS] [--address ADDRESS] [--webui-host WEBUI_HOST]
+                   [--redis-password REDIS_PASSWORD]
input_dir results_file
Extract text from all PDF files in a directory
@@ -165,11 +210,11 @@
positional arguments:
optional arguments:
-h, --help show this help message and exit
-  --tmp-dir TMP_DIR     The folder to keep all the results, including log
-                        files and intermediate files
+  --tmp-dir TMP_DIR     The folder to keep all the results, including log files and intermediate files
--lang LANG Tesseract language
-  --ocr OCR             'pytesseract' if true, else 'pdftotext'. default:
-                        false
+  --ocr OCR             'pytesseract' if true, else 'pdftotext'. default: false
+  --chunksize CHUNKSIZE
+                        Chunksize to use while processing pages, otherwise is calculated
--num-cpus NUM_CPUS Number of cpus to use
--address ADDRESS Ray address to connect
--webui-host WEBUI_HOST
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "pdf2dataset"
version = "0.2.1"
version = "0.3.0"
readme = "README.md"
description = "Easily convert a big folder with PDFs into a dataset, with extracted text using OCR"
authors = ["Ícaro Pires <[email protected]>"]
