Merge branch 'master' into add-contains-ltcomponent

pdfminer · Dec 28, 2023 · 73b1b34
2 parents edd91d1 + 38686dd
commit 73b1b34
Showing 29 changed files with 270 additions and 97 deletions.
diff --git a/.github/workflows/actions.yml b/.github/workflows/actions.yml
@@ -20,9 +20,9 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - name: Checkout code
-        uses: actions/checkout@v2
+        uses: actions/checkout@v4
       - name: Set up Python ${{ env.default-python }}
-        uses: actions/setup-python@v2
+        uses: actions/setup-python@v4
         with:
           python-version: ${{ env.default-python }}
       - name: Upgrade pip, Install nox
@@ -38,9 +38,9 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - name: Checkout code
-        uses: actions/checkout@v2
+        uses: actions/checkout@v4
       - name: Set up Python ${{ env.default-python }}
-        uses: actions/setup-python@v2
+        uses: actions/setup-python@v4
         with:
           python-version: ${{ env.default-python }}
       - name: Upgrade pip, Install nox
@@ -56,9 +56,9 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - name: Checkout code
-        uses: actions/checkout@v2
+        uses: actions/checkout@v4
       - name: Set up Python ${{ env.default-python }}
-        uses: actions/setup-python@v2
+        uses: actions/setup-python@v4
         with:
           python-version: ${{ env.default-python }}
       - name: Upgrade pip, Install nox
@@ -75,20 +75,20 @@ jobs:
     strategy:
       matrix:
         os: [ ubuntu-latest ]
-        python-version: [ "3.6", "3.7", "3.8", "3.9", "3.10" ]
+        python-version: [ "3.8", "3.9", "3.10", "3.11", "3.12" ]
     steps:
       - name: Checkout code
-        uses: actions/checkout@v2
+        uses: actions/checkout@v4
       - name: Set up Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v2
+        uses: actions/setup-python@v4
         with:
           python-version: ${{ matrix.python-version }}
       - name: Determine pip cache directory
         id: pip-cache
         run: |
           echo "::set-output name=dir::$(pip cache dir)"
       - name: Cache pip cache
-        uses: actions/cache@v2
+        uses: actions/cache@v3
         with:
           path: ${{ steps.pip-cache.outputs.dir }}
           key: ${{ runner.os }}-pip${{ matrix.python-version }}
@@ -105,9 +105,9 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - name: Checkout code
-        uses: actions/checkout@v2
+        uses: actions/checkout@v4
       - name: Set up Python ${{ env.default-python }}
-        uses: actions/setup-python@v2
+        uses: actions/setup-python@v4
         with:
           python-version: ${{ env.default-python }}
       - name: Upgrade pip and install nox
@@ -129,19 +129,9 @@ jobs:
       - build-docs
     steps:
       - name: Checkout code
-        uses: actions/checkout@v2
+        uses: actions/checkout@v4
       - name: Install dependencies
         run: python -m pip install wheel
-      - name: Set version
-        run: |
-          if [[ "${{ github.ref }}" == "refs/tags/"* ]]
-          then
-            VERSION=$(echo "${{ github.ref }}" | sed -e 's,.*/\(.*\),\1,' | sed -e 's/^v//')
-          else
-            VERSION=$(date +%Y%m%d).$(date +%H%M%S)
-          fi
-          echo ${VERSION}
-          sed -i "s/__VERSION__/${VERSION}/g" pdfminer/__init__.py
       - name: Build package
         run: python setup.py sdist bdist_wheel
       - name: Generate changelog
@@ -161,4 +151,4 @@ jobs:
           body_path: ${{ github.workspace }}-CHANGELOG.md
           files: |
             dist/*.tar.gz
-            dist/*.whl
+            dist/*.whl
diff --git a/.gitignore b/.gitignore
@@ -26,3 +26,4 @@ Pipfile.lock
 .vscode/
 pyproject.toml
 poetry.lock
+.eggs
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,24 +7,48 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
 ### Added
 
+- Adds `contains` method to `LTComponent` to check whether it contains another `LTComponent`.
+
+## [20231228]
+
+### Removed
+- Support for Python 3.6 and 3.7 ([#921](https://github.com/pdfminer/pdfminer.six/pull/921))
+
+### Added
+
 - Output converter for the hOCR format ([#651](https://github.com/pdfminer/pdfminer.six/pull/651))
 - Font name aliases for Arial, Courier New and Times New Roman ([#790](https://github.com/pdfminer/pdfminer.six/pull/790))
-- Adds `contains` method to `LTComponent` to check whether it contains another `LTComponent`.
+- Documentation on why special characters can sometimes not be extracted ([#829](https://github.com/pdfminer/pdfminer.six/pull/829))
+- Storing Bezier path and dashing style of line in LTCurve ([#801](https://github.com/pdfminer/pdfminer.six/pull/801))
 
 ### Fixed
 
+- Broken CI/CD pipeline by setting upper version limit for black, mypy, pip and setuptools ([#921](https://github.com/pdfminer/pdfminer.six/pull/921))
+- `flake8` failures ([#921](https://github.com/pdfminer/pdfminer.six/pull/921))
 - `ValueError` when bmp images with 1 bit channel are decoded ([#773](https://github.com/pdfminer/pdfminer.six/issues/773))
 - `ValueError` when trying to decrypt empty metadata values ([#766](https://github.com/pdfminer/pdfminer.six/issues/766))
 - Sphinx errors during building of documentation ([#760](https://github.com/pdfminer/pdfminer.six/pull/760))
 - `TypeError` when getting default width of font ([#720](https://github.com/pdfminer/pdfminer.six/issues/720))
 - Installing typing-extensions on Python 3.6 and 3.7 ([#775](https://github.com/pdfminer/pdfminer.six/pull/775))
 - `TypeError` in cmapdb.py when parsing null characters ([#768](https://github.com/pdfminer/pdfminer.six/pull/768))
-- Color "convenience operators" now (per spec) also set color space ([#779](https://github.com/pdfminer/pdfminer.six/issues/779))
+- Color "convenience operators" now (per spec) also set color space ([#794](https://github.com/pdfminer/pdfminer.six/pull/794))
+- `ValueError` when extracting images, due to breaking changes in Pillow ([#827](https://github.com/pdfminer/pdfminer.six/pull/827))
+- Small typos and issues in the documentation ([#828](https://github.com/pdfminer/pdfminer.six/pull/828))
+- Ignore non-Unicode cmaps in TrueType fonts ([#806](https://github.com/pdfminer/pdfminer.six/pull/806))
+
+### Changed
+
+- Using non-hardcoded version string and setuptools-git-versioning to enable installation from source and building on Python 3.12 ([#922](https://github.com/pdfminer/pdfminer.six/issues/922))
+
 
 ### Deprecated
 
 - Usage of `if __name__ == "__main__"` where it was only intended for testing purposes ([#756](https://github.com/pdfminer/pdfminer.six/pull/756))
 
+### Removed
+
+- Support for Python 3.6 and 3.7 because they are end-of-life ([#923](https://github.com/pdfminer/pdfminer.six/pull/923))
+
 ## [20220524]
 
 ### Fixed

diff --git a/README.md b/README.md
@@ -39,18 +39,27 @@ Features
 How to use
 ----------
 
-* Install Python 3.6 or newer.
-* Install
+* Install Python 3.8 or newer.
+* Install pdfminer.six.
 
   `pip install pdfminer.six`
 
 * (Optionally) install extra dependencies for extracting images.
 
   `pip install 'pdfminer.six[image]'`
 
-* Use command-line interface to extract text from pdf:
+* Use the command-line interface to extract text from pdf.
 
-  `python pdf2txt.py samples/simple1.pdf`
+  `pdf2txt.py example.pdf`
+
+* Or use it with Python. 
+
+```python
+from pdfminer.high_level import extract_text
+
+text = extract_text("example.pdf")
+print(text)
+```
 
 Contributing
 ------------

diff --git a/docs/source/faq.rst b/docs/source/faq.rst
@@ -7,11 +7,11 @@ Why is it called pdfminer.six?
 ==============================
 
 Pdfminer.six is a fork of the `original pdfminer created by Euske
-<https://github.com/euske>`_. Almost all of the code and architecture is in
-fact created by Euske. But, for a long time this original pdfminer did not
+<https://github.com/euske>`_. Almost all of the code and architecture are in
+fact created by Euske. But, for a long time, this original pdfminer did not
 support Python 3. Until 2020 the original pdfminer only supported Python 2.
 The original goal of pdfminer.six was to add support for Python 3. This was
-done with the six package. The six package helps to write code that is
+done with the `six` package. The `six` package helps to write code that is
 compatible with both Python 2 and Python 3. Hence, pdfminer.six.
 
 As of 2020, pdfminer.six dropped the support for Python 2 because it was
@@ -27,15 +27,42 @@ also equal to six feet.
 How does pdfminer.six compare to other forks of pdfminer?
 ==========================================================
 
-Pdfminer.six is now an independent and community maintained package for
-extracting text from PDF's with Python. We actively fix bugs (also for PDF's
+Pdfminer.six is now an independent and community-maintained package for
+extracting text from PDFs with Python. We actively fix bugs (also for PDFs
 that don't strictly follow the PDF Reference), add new features and improve
 the usability of pdfminer.six. This community separates pdfminer.six from the
 other forks of the original pdfminer. PDF as a format is very diverse and
 there are countless deviations from the official format. The only way to
-support all the PDF's out there is to have a community that actively uses and
+support all the PDFs out there is to have a community that actively uses and
 improves pdfminer.
 
 Since 2020, the original pdfminer is `dormant
 <https://github.com/euske/pdfminer#pdfminer>`_, and pdfminer.six is the fork
 which Euske recommends if you need an actively maintained version of pdfminer.
+
+Why are there `(cid:x)` values in the textual output?
+=====================================================
+
+One of the most common issues with pdfminer.six is that the textual output
+contains raw character IDs `(cid:x)`. This is often experienced as confusing
+because the text is shown fine in a PDF viewer and other text from the same
+PDF is extracted properly.
+
+The underlying problem is that a PDF has two different representations
+of each character. Each character is mapped to a glyph that determines
+how the character is shown in a PDF viewer. And each character is also
+mapped to its Unicode value that is used when copy-pasting the character.
+Some PDFs have incomplete Unicode mappings and therefore it is impossible
+to convert the character to Unicode. In these cases pdfminer.six defaults
+to showing the raw character ID `(cid:x)`.
+
+A quick test to see if pdfminer.six should be able to do better is to
+copy-paste the text from a PDF viewer to a text editor. If the result
+is proper text, pdfminer.six should also be able to extract proper text.
+If the result is gibberish, pdfminer.six will also not be able to convert
+the characters to Unicode.
+
+References: 
+
+#. `Chapter 5: Text, PDF Reference 1.7 <https://opensource.adobe.com/dc-acrobat-sdk-docs/pdflsdk/index.html#pdf-reference>`_
+#. `Text: PDF, Wikipedia <https://en.wikipedia.org/wiki/PDF#Text>`_
diff --git a/docs/source/howto/acro_forms.rst b/docs/source/howto/acro_forms.rst
@@ -65,7 +65,7 @@ Only AcroForm interactive forms are supported, XFA forms are not supported.
               
             print(name, values)
 
-This code snippet will print all the fields name and value and save them in the "data" dictionary.
+This code snippet will print all the fields' names and values and save them in the "data" dictionary.
 
 
 How it works:
@@ -77,9 +77,9 @@ How it works:
     parser = PDFParser(fp)
     doc = PDFDocument(parser)
 
-- Get the catalog
+- Get the Catalog
 
-  (the catalog contains references to other objects defining the document structure, see section 7.7.2 of PDF 32000-1:2008 specs: https://www.adobe.com/devnet/pdf/pdf_reference.html)
+  (the catalog contains references to other objects defining the document structure, see section 7.7.2 of PDF 32000-1:2008 specs: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdflsdk/index.html#pdf-reference)
 
 .. code-block:: python
 
@@ -122,7 +122,7 @@ How it works:
 
 - Call the value(s) decoding method as needed
 
-  (a single field can hold multiple values, for example a combo box can hold more than one value at time)
+  (a single field can hold multiple values, for example, a combo box can hold more than one value at a time)
 
 .. code-block:: python
 
@@ -131,7 +131,7 @@ How it works:
     else:
         values = decode_value(values)
         
-(the decode_value method takes care of decoding the fields value returning a string)
+(the decode_value method takes care of decoding the field's value, returning a string)
 
 - Decode PSLiteral and PSKeyword field values
 

diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -59,18 +59,31 @@ Features
 Installation instructions
 =========================
 
-Before using it, you must install it using Python 3.6 or newer.
+* Install Python 3.8 or newer.
+* Install pdfminer.six.
 
 ::
+    $ pip install pdfminer.six
 
-    $ pip install pdfminer.six
+* (Optionally) install extra dependencies for extracting images.
 
+::
+    $ pip install 'pdfminer.six[image]'
 
-Optionally install extra dependencies that are needed to extract jpg images.
+* Use the command-line interface to extract text from pdf.
 
 ::
+    $ pdf2txt.py example.pdf
+
+* Or use it with Python.
+
+.. code-block:: python
+
+    from pdfminer.high_level import extract_text
+
+    text = extract_text("example.pdf")
+    print(text)
 
-    $ pip install 'pdfminer.six[image]'
 
 
 Contributing

diff --git a/docs/source/topic/converting_pdf_to_text.rst b/docs/source/topic/converting_pdf_to_text.rst
@@ -3,7 +3,7 @@
 Converting a PDF file to text
 *****************************
 
-Most PDF files look like they contain well structured text. But the reality  is
+Most PDF files look like they contain well-structured text. But the reality is
 that a PDF file does not contain anything that resembles paragraphs,
 sentences or even words. When it comes to text, a PDF file is only aware of
 the characters and their placement.
@@ -14,7 +14,7 @@ compose the table, the page footer or the description of a figure. Unlike
 other document formats, like a `.txt` file or a word document, the PDF format
 does not contain a stream of text.
 
-A PDF document does consists of a collection of objects that together describe
+A PDF document consists of a collection of objects that together describe
 the appearance of one or more pages, possibly accompanied by additional
 interactive elements and higher-level application data. A PDF file contains
 the objects making up a PDF document along with associated structural
@@ -53,7 +53,7 @@ uses these bounding boxes to decide which characters belong together.
 
 Characters that are both horizontally and vertically close are grouped onto
 one line. How close they should be is determined by the `char_margin`
-(M in figure) and the `line_overlap` (not in figure) parameter. The horizontal
+(M in the figure) and the `line_overlap` (not in figure) parameter. The horizontal
 *distance* between the bounding boxes of two characters should be smaller than
 the `char_margin` and the vertical *overlap* between the bounding boxes should
 be smaller than the `line_overlap`.
@@ -76,7 +76,7 @@ be separated by a space.
 
 The result of this stage is a list of lines. Each line consists of a list of
 characters. These characters are either original `LTChar` characters that
-originate from the PDF file, or inserted `LTAnno` characters that
+originate from the PDF file or inserted `LTAnno` characters that
 represent spaces between words or newlines at the end of each line.
 
 Grouping lines into boxes
@@ -91,7 +91,7 @@ Lines that are both horizontally overlapping and vertically close are grouped.
 How vertically close the lines should be is determined by the `line_margin`.
 This margin is specified relative to the height of the bounding box. Lines
 are close if the gap between the tops (see L :sub:`1` in the figure) and bottoms
-(see L :sub:`2`) in the figure) of the bounding boxes is closer together
+(see L :sub:`2` in the figure) of the bounding boxes is smaller
 than the absolute line margin, i.e. the `line_margin` multiplied by the
 height of the bounding box.
 
@@ -120,7 +120,7 @@ Working with rotated characters
 
 The algorithm described above assumes that all characters have the same
 orientation. However, any writing direction is possible in a PDF. To
-accommodate for this, pdfminer.six allows to detect vertical writing with the
+accommodate for this, pdfminer.six allows detecting vertical writing with the
 `detect_vertical` parameter. This will apply all the grouping steps as if the
 pdf was rotated 90 (or 270) degrees
 

diff --git a/docs/source/tutorial/commandline.rst b/docs/source/tutorial/commandline.rst
@@ -18,7 +18,7 @@ pdf2txt.py
 
 ::
 
-    $ python tools/pdf2txt.py example.pdf
+    $ pdf2txt.py example.pdf
     all the text from the pdf appears on the command line
 
 The :ref:`api_pdf2txt` tool extracts all the text from a PDF. It uses layout
@@ -29,7 +29,7 @@ dumppdf.py
 
 ::
 
-    $ python tools/dumppdf.py -a example.pdf
+    $ dumppdf.py -a example.pdf
     <pdf><object id="1">
     ...
     </object>