Remove text from a PDF #4221

samuelbradshaw · 2025-01-13T16:25:52Z


Hi! Is there an efficient way to remove/delete all text from a PDF with PyMuPDF?
Also, is there a way to remove all text that uses a specific font from a PDF?

Answered by JorjMcKie


JorjMcKie · 2025-01-13T17:07:31Z

Maintainer

The easiest way to remove all text is using "redaction annotations" (from all or selected pages):

import pymupdf

doc = pymupdf.open("input.pdf")
page = doc[0]  # 0 or any 0-based page number
page.add_redact_annot(page.rect)  # redaction annotation covering the full page
page.apply_redactions(
    images=pymupdf.PDF_REDACT_IMAGE_NONE,  # keep the images
    graphics=pymupdf.PDF_REDACT_LINE_ART_NONE,  # keep vector graphics
)

Erasing specific text works the same way, except that you have to determine the bounding box to use instead of page.rect.

# extract text spans with their full metadata only (no image blocks)
for block in page.get_text("dict", flags=pymupdf.TEXTFLAGS_TEXT)["blocks"]:
    for line in block["lines"]:
        for span in line["spans"]:
            if "unwanted font name" in span["font"]:
                page.add_redact_annot(span["bbox"])  # cover text span with redact annot
# apply all redactions of the page in one go, after the loops have finished
page.apply_redactions(
    images=pymupdf.PDF_REDACT_IMAGE_NONE,  # keep the images
    graphics=pymupdf.PDF_REDACT_LINE_ART_NONE,  # keep vector graphics
)

Important:

Save the PDF with garbage collection and compression: doc.ez_save(..).
All text that touches (intersects) any redaction annotation will be deleted. This may be more than desired if, for example, pieces of text overlap each other or lines are spaced more closely than the font-specific line height. In such cases, the redaction rectangles must be defined with extra care.
The second approach can of course be adapted to other selection criteria, such as text color or font size.
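
The last two notes can be sketched as follows. This is a minimal sketch and not code from the thread: select_span_boxes, shrink_vertically and redact_spans are hypothetical helper names. The predicate receives each span dictionary of page.get_text("dict"), so it can test keys such as "font", "size" or "color". Shrinking each rectangle vertically is only one possible precaution for tightly spaced lines; it assumes, as stated above, that text is removed when it intersects the rectangle, and whether it is sufficient for a given PDF needs to be checked.

```python
def shrink_vertically(bbox, fraction=0.1):
    """Trim `fraction` of the height off the top and the bottom of a bbox."""
    x0, y0, x1, y1 = bbox
    dy = (y1 - y0) * fraction
    return (x0, y0 + dy, x1, y1 - dy)


def select_span_boxes(text_dict, predicate, shrink=0.0):
    """Return the (optionally shrunk) bboxes of all spans accepted by `predicate`.

    `text_dict` has the layout of page.get_text("dict").
    """
    boxes = []
    for block in text_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line.get("spans", []):
                if predicate(span):
                    boxes.append(shrink_vertically(span["bbox"], shrink))
    return boxes


def redact_spans(page, predicate, shrink=0.0):
    """Redact every text span on `page` accepted by `predicate`; return the count."""
    # Imported here so that the two helpers above also work on a stored text dict.
    import pymupdf

    text_dict = page.get_text("dict", flags=pymupdf.TEXTFLAGS_TEXT)
    boxes = select_span_boxes(text_dict, predicate, shrink)
    for bbox in boxes:
        page.add_redact_annot(bbox)
    if boxes:
        page.apply_redactions(
            images=pymupdf.PDF_REDACT_IMAGE_NONE,  # keep the images
            graphics=pymupdf.PDF_REDACT_LINE_ART_NONE,  # keep vector graphics
        )
    return len(boxes)
```

For example, redact_spans(page, lambda s: s["size"] < 8) would target small print, and redact_spans(page, lambda s: s["color"] == 0xFF0000, shrink=0.1) would target pure red text with slightly tighter rectangles.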

8 replies

samuelbradshaw Jan 14, 2025
Author

It seems that this method of removing text works on some PDFs, but not others. Here's my code:

import pymupdf

doc = pymupdf.open('/Users/sbradshaw/Downloads/conferencereport151chur.pdf')
for pg, page in enumerate(doc):
  print(f'Page {pg}')
  page.add_redact_annot(page.rect)
  page.apply_redactions(images=pymupdf.PDF_REDACT_IMAGE_NONE, graphics=pymupdf.PDF_REDACT_LINE_ART_NONE)
doc.save('/Users/sbradshaw/Downloads/conferencereport151chur_notext.pdf')
doc.close()

An example PDF where text gets removed correctly:
https://assets.churchofjesuschrist.org/6e/4e/6e4ecd14ebb511ee863aeeeeac1eaa7887580260/come_thou_fount_of_every_blessing.pdf

And a PDF where text doesn’t get removed:
https://ia902808.us.archive.org/23/items/conferencereport1880a/conferencereport151chur.pdf
The text in this second PDF is from OCR – I'm not sure if that should make a difference (I don't expect the visible text in the scanned page images to be removed, of course – only the invisible text from OCR).

Is this a bug – should I log a GitHub issue? (EDIT: For the record, it seems like pypdf has the same problem: py-pdf/pypdf#3049)

samuelbradshaw Jan 14, 2025
Author

I checked again, and it is working for both PDFs – my Mac was being smart and making the text from the image selectable (Live Text). It was responding so quickly that I assumed there was still a text layer. This works great! Thanks for your help and patience.
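
Because a viewer feature such as Live Text can make scanned text selectable, it may be more reliable to check the result from PyMuPDF itself rather than in a viewer. A minimal sketch, assuming the output file has been reopened with pymupdf.open(); pages_with_text is a hypothetical helper name, not part of PyMuPDF:

```python
def pages_with_text(doc):
    """Return the 0-based numbers of pages that still contain extractable text."""
    return [
        number
        for number, page in enumerate(doc)
        if page.get_text().strip()  # whitespace-only output counts as "no text"
    ]
```

An empty list for the saved file would indicate that no text layer is left, whatever the viewer shows.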

zergb Jan 14, 2025

Thanks a lot for the solution! The method is really elegant!

I still have one more question about the solution. The text inside widgets cannot be removed. It looks like widgets are always above any other element on the page, including redactions, so their text is not "covered" by the redaction. Is there any way to handle them with the same method?

JorjMcKie Jan 14, 2025
Maintainer

You are right to say that annotations and widgets are "above" the page. I like to use the metaphor "like dust on a glass frame of a painting".
You have the option to remove annotations or widgets via the respective methods page.delete_annot() / page.delete_widget().
You can also "bake in" both types document-wide via the method bake(). This converts annotations and / or widgets to permanent page content and thus makes them reachable for redactions.

zergb Jan 15, 2025

Understood. The doc.bake() function works perfectly, as expected! After that, the text in input widgets can be hidden by redaction.
