Merge branch 'master' into docs/how-to-target-page-toc
pietermarsman authored Dec 29, 2023
2 parents f3243e3 + adf95a4 commit b861528
Showing 7 changed files with 210 additions and 8 deletions.
6 changes: 3 additions & 3 deletions .github/pull_request_template.md
@@ -8,8 +8,8 @@ Please *remove* this paragraph with a description of how this PR has been tested

**Checklist**

- [ ] I have read [CONTRIBUTING.md](../CONTRIBUTING.md).
- [ ] I have added a concise human-readable description of the change to [CHANGELOG.md](../CHANGELOG.md).
- [ ] I have read [CONTRIBUTING.md](https://github.com/pdfminer/pdfminer.six/blob/master/CONTRIBUTING.md).
- [ ] I have added a concise human-readable description of the change to [CHANGELOG.md](https://github.com/pdfminer/pdfminer.six/blob/master/CHANGELOG.md).
- [ ] I have tested that this fix is effective or that this feature works.
- [ ] I have added docstrings to newly created methods and classes.
- [ ] I have updated the [README.md](../README.md) and the [readthedocs](../docs/source) documentation. Or verified that this is not necessary.
- [ ] I have updated the [README.md](https://github.com/pdfminer/pdfminer.six/blob/master/README.md) and the [readthedocs](https://github.com/pdfminer/pdfminer.six/tree/master/docs/source) documentation. Or verified that this is not necessary.
10 changes: 8 additions & 2 deletions CONTRIBUTING.md
@@ -17,7 +17,7 @@ Any contribution is appreciated! You might want to:
* Help others by sharing your thoughts in comments on issues and pull requests.
* Join the chat on [gitter](https://gitter.im/pdfminer-six/Lobby)

## Guidelines for creating issues
## Guideline for creating issues

* Search previous issues, as yours might be a duplicate.
* When creating a new issue for a bug, include a minimal reproducible example.
@@ -37,14 +37,20 @@ Any contribution is appreciated! You might want to:
* Check spelling and grammar.
* Don't forget to update the [CHANGELOG.md](CHANGELOG.md#[Unreleased]).

## Guidelines for posting comments
## Guideline for posting comments

* [Be cordial and positive](https://www.kennethreitz.org/essays/be-cordial-or-be-on-your-way)

## Guidelines for publishing

* Publishing is automated. Add a YYYYMMDD version tag and GitHub workflows will do the rest.

## Guideline for dependencies

* This package is distributed under the [MIT license](LICENSE).
* All dependencies should be compatible with this license.
* Use [licensecheck](https://pypi.org/project/licensecheck/) to validate if new packages are compatible.
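
As a rough complement to the license guideline above, the sketch below reads the license classifiers that an installed distribution declares about itself. It is not how `licensecheck` works and does not replace it; it only uses the standard-library `importlib.metadata` module, and the helper names `license_classifiers` and `installed_license_info` are made up for this illustration.

```python
from importlib import metadata
from typing import List

LICENSE_PREFIX = 'License :: '


def license_classifiers(classifiers: List[str]) -> List[str]:
    """Return the license part of every 'License :: ...' trove classifier."""
    return [
        c[len(LICENSE_PREFIX):]
        for c in classifiers
        if c.startswith(LICENSE_PREFIX)
    ]


def installed_license_info(dist_name: str) -> List[str]:
    """License classifiers declared by an installed distribution.

    Raises importlib.metadata.PackageNotFoundError if dist_name is not installed.
    """
    meta = metadata.metadata(dist_name)
    # get_all returns None when the distribution declares no classifiers.
    return license_classifiers(meta.get_all('Classifier') or [])
```

Not every package declares license classifiers, so an empty list means "unknown", not "incompatible"; the result still needs a human check against the MIT license.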

## Getting started

1. Clone the repository
34 changes: 34 additions & 0 deletions docs/README.md
@@ -0,0 +1,34 @@
# Working on documentation

The pdfminer.six docs are generated with [Sphinx](https://www.sphinx-doc.org/en/master/), using
[reStructuredText](https://docutils.sourceforge.io/rst.html).

The documentation is hosted on https://pdfminersix.readthedocs.io/.

## Deploying new documentation

New documentation is deployed automatically when pull requests are merged.

## Building documentation locally

You can build the documentation locally on your machine using the following steps.

1. (Recommended) Create and activate a Python virtual environment.

```console
python -m venv .venv
source .venv/bin/activate
```
2. With the virtual environment activated, install the dependencies for building the documentation.

```console
pip install '.[docs]'
```
3. Build the documentation.

```console
make clean && make html
```

1 change: 0 additions & 1 deletion docs/requirements.txt

This file was deleted.

162 changes: 162 additions & 0 deletions docs/source/howto/character_properties.rst
@@ -0,0 +1,162 @@
.. _character_properties:

How to extract font names and sizes from PDFs
*********************************************

Before you start, make sure you have :ref:`installed pdfminer.six<install>`.

The following code sample shows how to extract the font name and size of each character. It uses
`simple1.pdf <https://raw.githubusercontent.com/pdfminer/pdfminer.six/master/samples/simple1.pdf>`_.

.. code-block:: python

    from pathlib import Path
    from typing import Iterable, Any

    from pdfminer.high_level import extract_pages


    def show_ltitem_hierarchy(o: Any, depth=0):
        """Show font, stroking color and text of an LTItem and all its descendants"""
        if depth == 0:
            print('element                        font                  stroking color  text')
            print('------------------------------ --------------------- --------------  ----------')

        print(
            f'{get_indented_name(o, depth):<30.30s} '
            f'{get_optional_fontinfo(o):<20.20s} '
            f'{get_optional_color(o):<17.17s}'
            f'{get_optional_text(o)}'
        )

        # Pages, boxes and lines are iterable containers of their children
        if isinstance(o, Iterable):
            for i in o:
                show_ltitem_hierarchy(i, depth=depth + 1)


    def get_indented_name(o: Any, depth: int) -> str:
        """Indented name of class"""
        return '  ' * depth + o.__class__.__name__


    def get_optional_fontinfo(o: Any) -> str:
        """Font info of LTChar if available, otherwise empty string"""
        if hasattr(o, 'fontname') and hasattr(o, 'size'):
            return f'{o.fontname} {round(o.size)}pt'
        return ''


    def get_optional_color(o: Any) -> str:
        """Stroking color of LTChar if available, otherwise empty string"""
        if hasattr(o, 'graphicstate'):
            return f'{o.graphicstate.scolor}'
        return ''


    def get_optional_text(o: Any) -> str:
        """Text of LTItem if available, otherwise empty string"""
        if hasattr(o, 'get_text'):
            return o.get_text().strip()
        return ''


    path = Path('samples/simple1.pdf').expanduser()
    pages = extract_pages(path)
    show_ltitem_hierarchy(pages)

.. note::

The output is shown below. It reflects the hierarchical structure of the layout elements: the layout algorithm
groups characters into lines and lines into boxes, and the boxes appear on a page. Pages, boxes and lines carry no
font information because the font can change for each character. The stroking color is always `None` in this example,
but it contains the color if the PDF specifies one.

```
element                        font                  stroking color  text
------------------------------ --------------------- --------------  ----------
generator
  LTPage
    LTTextBoxHorizontal                                              Hello
      LTTextLineHorizontal                                           Hello
        LTChar                 Helvetica 24pt       None             H
        LTChar                 Helvetica 24pt       None             e
        LTChar                 Helvetica 24pt       None             l
        LTChar                 Helvetica 24pt       None             l
        LTChar                 Helvetica 24pt       None             o
        LTChar                 Helvetica 24pt       None
        LTAnno
    LTTextBoxHorizontal                                              World
      LTTextLineHorizontal                                           World
        LTChar                 Helvetica 24pt       None             W
        LTChar                 Helvetica 24pt       None             o
        LTChar                 Helvetica 24pt       None             r
        LTChar                 Helvetica 24pt       None             l
        LTChar                 Helvetica 24pt       None             d
        LTAnno
    LTTextBoxHorizontal                                              Hello
      LTTextLineHorizontal                                           Hello
        LTChar                 Helvetica 24pt       None             H
        LTChar                 Helvetica 24pt       None             e
        LTChar                 Helvetica 24pt       None             l
        LTChar                 Helvetica 24pt       None             l
        LTChar                 Helvetica 24pt       None             o
        LTChar                 Helvetica 24pt       None
        LTAnno
    LTTextBoxHorizontal                                              World
      LTTextLineHorizontal                                           World
        LTChar                 Helvetica 24pt       None             W
        LTChar                 Helvetica 24pt       None             o
        LTChar                 Helvetica 24pt       None             r
        LTChar                 Helvetica 24pt       None             l
        LTChar                 Helvetica 24pt       None             d
        LTAnno
    LTTextBoxHorizontal                                              H e l l o
      LTTextLineHorizontal                                           H e l l o
        LTChar                 Helvetica 24pt       None             H
        LTAnno
        LTChar                 Helvetica 24pt       None             e
        LTAnno
        LTChar                 Helvetica 24pt       None             l
        LTAnno
        LTChar                 Helvetica 24pt       None             l
        LTAnno
        LTChar                 Helvetica 24pt       None             o
        LTAnno
        LTChar                 Helvetica 24pt       None
        LTAnno
    LTTextBoxHorizontal                                              W o r l d
      LTTextLineHorizontal                                           W o r l d
        LTChar                 Helvetica 24pt       None             W
        LTAnno
        LTChar                 Helvetica 24pt       None             o
        LTAnno
        LTChar                 Helvetica 24pt       None             r
        LTAnno
        LTChar                 Helvetica 24pt       None             l
        LTAnno
        LTChar                 Helvetica 24pt       None             d
        LTAnno
    LTTextBoxHorizontal                                              H e l l o
      LTTextLineHorizontal                                           H e l l o
        LTChar                 Helvetica 24pt       None             H
        LTAnno
        LTChar                 Helvetica 24pt       None             e
        LTAnno
        LTChar                 Helvetica 24pt       None             l
        LTAnno
        LTChar                 Helvetica 24pt       None             l
        LTAnno
        LTChar                 Helvetica 24pt       None             o
        LTAnno
        LTChar                 Helvetica 24pt       None
        LTAnno
    LTTextBoxHorizontal                                              W o r l d
      LTTextLineHorizontal                                           W o r l d
        LTChar                 Helvetica 24pt       None             W
        LTAnno
        LTChar                 Helvetica 24pt       None             o
        LTAnno
        LTChar                 Helvetica 24pt       None             r
        LTAnno
        LTChar                 Helvetica 24pt       None             l
        LTAnno
        LTChar                 Helvetica 24pt       None             d
        LTAnno
```
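
The same traversal can also aggregate instead of print. The following is a minimal sketch, assuming only what the how-to already relies on: layout containers are iterable and character objects expose `fontname` and `size`. The names `count_fonts` and `count_fonts_in_pdf` are hypothetical helpers for this illustration, not part of pdfminer.six.

```python
from collections import Counter
from typing import Any, Iterable, Optional


def count_fonts(o: Any, counts: Optional[Counter] = None) -> Counter:
    """Count characters per (font name, rounded size) in a layout tree."""
    if counts is None:
        counts = Counter()
    # Only character-like objects expose both attributes.
    if hasattr(o, 'fontname') and hasattr(o, 'size'):
        counts[(o.fontname, round(o.size))] += 1
    # Containers (the page generator, pages, boxes, lines) are iterable.
    # Strings are excluded because iterating a string yields more strings.
    if isinstance(o, Iterable) and not isinstance(o, (str, bytes)):
        for child in o:
            count_fonts(child, counts)
    return counts


def count_fonts_in_pdf(path: str) -> Counter:
    """Hypothetical convenience wrapper around extract_pages."""
    # Imported here so count_fonts stays usable without pdfminer.six installed.
    from pdfminer.high_level import extract_pages

    return count_fonts(extract_pages(path))
```

For the sample file used above, `count_fonts_in_pdf('samples/simple1.pdf')` should return a counter keyed by pairs such as `('Helvetica', 24)`. Rounding the size mirrors the `round(o.size)` call in the how-to; drop it if fractional sizes matter.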
1 change: 1 addition & 0 deletions docs/source/howto/index.rst
@@ -11,3 +11,4 @@ How-to guides help you to solve specific problems with pdfminer.six.
images
acro_forms
toc_target_page
character_properties
4 changes: 2 additions & 2 deletions pdfminer/pdfinterp.py
@@ -115,8 +115,8 @@ def reset(self) -> None:
Color = Union[
float, # Greyscale
Tuple[float, float, float], # R, G, B
Tuple[float, float, float, float],
] # C, M, Y, K
Tuple[float, float, float, float], # C, M, Y, K
]


class PDFGraphicState:
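
To illustrate how a value of the `Color` union in this hunk might be consumed, here is a small sketch that tells the three shapes apart. The alias is copied from the diff so the block is self-contained; `color_space_name` is a hypothetical helper, not a pdfminer.six function, and the `None` case is an assumption based on the `None` stroking color shown in the how-to output.

```python
from typing import Optional, Tuple, Union

# Copied from the hunk above so this sketch runs on its own.
Color = Union[
    float,  # Greyscale
    Tuple[float, float, float],  # R, G, B
    Tuple[float, float, float, float],  # C, M, Y, K
]


def color_space_name(color: Optional[Color]) -> str:
    """Name the color space implied by the shape of a Color value."""
    if color is None:
        return 'not set'
    # A bare number is a greyscale level.
    if isinstance(color, (int, float)):
        return 'greyscale'
    # Tuples are distinguished by their number of components.
    if len(color) == 3:
        return 'RGB'
    if len(color) == 4:
        return 'CMYK'
    raise ValueError(f'unexpected color value: {color!r}')
```

Dispatching on the component count works only because the three alternatives have different shapes; it cannot distinguish other three- or four-component color spaces.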
