Add contains method to LTComponent #808

Closed
wants to merge 5 commits into from

dmlls
Contributor

dmlls commented Sep 8, 2022

Pull request
This PR adds a utility method to check whether an LTComponent is contained within another LTComponent.

How Has This Been Tested?

test_doc.pdf

from pdfminer.high_level import extract_pages


def main():
    for page_layout in extract_pages("test_doc.pdf"):
        # Unpacking assumes each page yields exactly three layout
        # elements: two text boxes and one rectangle.
        text_1, text_2, rect = page_layout

        # `contains` is the method added by this PR.
        print("Rect contains Text 1? (true):", rect.contains(text_1))
        assert rect.contains(text_1)

        print("Rect contains Text 2? (false):", rect.contains(text_2))
        assert not rect.contains(text_2)

        print("Text 1 contains Text 2? (false):", text_1.contains(text_2))
        assert not text_1.contains(text_2)

        print("Text 2 contains Rect? (false):", text_2.contains(rect))
        assert not text_2.contains(rect)

        # A component is expected to contain itself.
        print("Rect contains itself? (true):", rect.contains(rect))
        assert rect.contains(rect)


if __name__ == "__main__":
    main()

Checklist

  • I have read CONTRIBUTING.md.
  • I have added a concise human-readable description of the change to CHANGELOG.md.
  • I have tested that this fix is effective or that this feature works.
  • I have added docstrings to newly created methods and classes.
  • I have updated the README.md and the readthedocs documentation. Or verified that this is not necessary.

Add utility method to check whether a `LTComponent` is contained within another `LTComponent`.
@KunalGehlot
Contributor

Was there any particular reason you needed to implement this feature? I could do the same in my project natively.

Also, `contains` is not a good name for the method; maybe rename it to LTcontainer or containsLT? Either way, I don't see the feature being handy.

@dmlls
Contributor Author

dmlls commented Sep 14, 2022

Hi @KunalGehlot. I'm writing a header/footer remover and I need to know whether a text block is contained within the upper/lower part of a page.
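To make the use case concrete, here is a minimal sketch of such a header/footer check. It is an illustration only, not code from this PR: the function name `classify_block`, the 10% band size, and the plain `(x0, y0, x1, y1)` tuple are all assumptions. It also assumes the usual PDF coordinate convention, with the origin at the bottom-left of the page and y increasing upward.

```python
from typing import Tuple

# (x0, y0, x1, y1) with the origin at the bottom-left corner of the page.
BBox = Tuple[float, float, float, float]


def classify_block(bbox: BBox, page_height: float, band: float = 0.1) -> str:
    """Return "header", "footer", or "body" for a block on a page.

    A block counts as a header only if it lies entirely inside the top
    `band` fraction of the page, and as a footer only if it lies entirely
    inside the bottom `band` fraction.
    """
    _, y0, _, y1 = bbox
    if y0 >= page_height * (1 - band):
        return "header"
    if y1 <= page_height * band:
        return "footer"
    return "body"


# Hypothetical blocks on a 792pt-tall (US Letter) page.
print(classify_block((72, 750, 540, 770), 792))  # header
print(classify_block((72, 20, 540, 40), 792))    # footer
print(classify_block((72, 400, 540, 420), 792))  # body
```

With pdfminer.six, the tuple could come from a layout element's `bbox` attribute and the page height from the page layout's `height`.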

The name (and functionality) of it was "inspired" by MuPDF. I'm open to any other better naming :)

@pietermarsman
Member

I agree with @ZackCodes.ai. If we add the method, we are extending the interface of the class, and people will start to depend on it. To be added, the method should either be used internally in the layout analysis or be a no-brainer.

The alternative for you is also rather simple:

def contains(a: LTContainer, b: LTContainer) -> bool:
    ...
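One way such a standalone helper could be filled in is a plain bounding-box comparison. The sketch below is an assumption, not the code from this PR or from pdfminer.six. It relies only on the `x0`, `y0`, `x1`, `y1` attributes that `LTComponent` exposes, and it treats touching edges as contained so that a component contains itself. The names `bbox_contains`, `HasBBox`, and `Box` are hypothetical; `Box` is a stand-in so the example runs without a PDF.

```python
from dataclasses import dataclass
from typing import Protocol


class HasBBox(Protocol):
    """Anything with bounding-box corners, such as a pdfminer LTComponent."""

    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class Box:
    """Hypothetical stand-in for a layout element, used only for the demo."""

    x0: float
    y0: float
    x1: float
    y1: float


def bbox_contains(outer: HasBBox, inner: HasBBox) -> bool:
    """Return True if inner's bounding box lies fully inside outer's."""
    return (
        outer.x0 <= inner.x0
        and outer.y0 <= inner.y0
        and outer.x1 >= inner.x1
        and outer.y1 >= inner.y1
    )


rect = Box(50, 50, 300, 200)
text_1 = Box(60, 100, 200, 120)  # inside rect
text_2 = Box(60, 300, 200, 320)  # above rect

print(bbox_contains(rect, text_1))  # True
print(bbox_contains(rect, text_2))  # False
print(bbox_contains(rect, rect))    # True
```

Keeping the helper in user code, as suggested, leaves the choice of edge handling (inclusive or strict) to each project instead of fixing it in the library interface.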

3 participants