How can I read a table that starts on page 1 and extends over multiple pages? #192

Open
dejanmarkovic opened this issue Oct 19, 2024 · 4 comments

Comments

@dejanmarkovic

pypdf_table_extraction/camelot does not recognize the table on pages after page 1 with the lattice flavor.

With the stream flavor, I get garbled output like this:

   0            1            2                                  3                       4         5
0                                                                      2059001013453712313
1                               289 Transakcije po nalogu građana                    PBO:
2                                                                        MARY MILAN
3  5  12.05.2024.  12.05.2024.     n 9001013454849 III rata   maj                    PBZ:  1.600,00
4                                                                  KNEZ MILET 456 4 11
5                                                   Instant nalog            FT241123YJFB4
6                                                         Belgrade

This is the lattice output from page one, which looks great:

0  REDNI\nBROJ  DATUM\nPRIJEMA  DATUM\nIZVRŠENJA  ...  REFERENCA KLIJENTA\nREFERENCA PARTNERA\nREFERE...  NA TERET  U KORIST
1            1     11.05.2024.       12.05.2024.  ...                           PBO:\nPBZ:\nFT201661TXR4            4.200,00
2            2     12.05.2024.       12.05.2024.  ...                           PBO:\nPBZ:\nFT20122CK6Y6            5.600,00
3            3     12.05.2024.       12.05.2024.  ...                           PBO:\nPBZ:\nFT20134Y5NWL            5.600,00
4            4     12.05.2024.       12.05.2024.  ...                           PBO:\nPBZ:\nFT20124QY6JZ            5.600,00

The document is a PDF bank statement.
NOTE: I have randomized the numbers in the output for privacy and security purposes.

@bosd
Collaborator

bosd commented Oct 20, 2024

Are there 2 separate issues here?

  1. "pypdf_table_extraction/camelot does not recognize the table on pages after page 1 with the lattice flavor."

     This could be a bug.

  2. Merging tables that span multiple pages is, as far as I know, not a covered use case. Merging the tables can be done in post-processing.

Have you tried the output with the Network parser?
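As a rough illustration of the post-processing merge mentioned above, here is a minimal sketch. It assumes each page's table has already been extracted and converted to a plain list of rows (for example via `table.df.values.tolist()`), and that a continuation page may or may not repeat the header row. The function name `merge_page_tables` and the sample rows are hypothetical, not part of the library.

```python
from typing import List, Optional

Row = List[str]


def merge_page_tables(pages: List[List[Row]]) -> List[Row]:
    """Concatenate per-page row lists, keeping the header row only once.

    Assumption: the first row of the first non-empty page is the header.
    A later page whose first row equals that header has it dropped.
    """
    merged: List[Row] = []
    header: Optional[Row] = None
    for rows in pages:
        if not rows:
            continue  # skip pages where no table rows were found
        if header is None:
            header = rows[0]
            merged.append(header)
            body = rows[1:]
        elif rows[0] == header:
            body = rows[1:]  # repeated header on a continuation page
        else:
            body = rows  # continuation page without a header
        merged.extend(body)
    return merged


# Made-up sample data standing in for three extracted pages.
page1 = [["No", "Date", "Amount"], ["1", "11.05.2024.", "4.200,00"]]
page2 = [["No", "Date", "Amount"], ["2", "12.05.2024.", "5.600,00"]]
page3 = [["3", "12.05.2024.", "5.600,00"]]

merged = merge_page_tables([page1, page2, page3])
for row in merged:
    print(row)
```

This only helps once every page yields a usable table, so it does not address the lattice detection problem on later pages.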

@dejanmarkovic
Author

dejanmarkovic commented Oct 21, 2024

With this code

```python
import pypdf_table_extraction

file_path = r"C:\Projects\temp123\attachments\test\er\er3.pdf"

flavors = ["hybrid", "lattice", "network", "stream"]

for flavor in flavors:
    print(f"\nTrying {flavor} flavor:")
    try:
        tables = pypdf_table_extraction.read_pdf(
            file_path, pages="all", flavor=flavor  # Use the current flavor
        )

        print(f"Number of tables found: {len(tables)}")

        for i, table in enumerate(tables):
            print(f"\nTable {i} data:")
            print(table.df)

            csv_path = f"{flavor}_table_{i}.csv"
            table.df.to_csv(csv_path, index=False)
            print(f"Table {i} saved to {csv_path}")

        for i, table in enumerate(tables):
            print(f"\nParsing report for {flavor} Table {i}:")
            print(table.parsing_report)

    except Exception as e:
        # Report the failure and move on to the next flavor
        print(f"An error occurred with {flavor} flavor: {str(e)}")
        continue

print("\nTable extraction process completed.")
```
I am getting the following errors:

  1. Trying hybrid flavor:
     An error occurred with hybrid flavor: Unknown flavor specified. Use either 'lattice' or 'stream'
  2. An error occurred with network flavor: Unknown flavor specified. Use either 'lattice' or 'stream'

NOTE: I uninstalled both Camelot and pypdf_table_extraction and then reinstalled only the pypdf_table_extraction library, so there should be no conflicts or other issues.

Can you please help/advise?

@bosd
Collaborator

bosd commented Oct 21, 2024

Based on the following error message:

2. An error occurred with network flavor: Unknown flavor specified. Use either 'lattice' or 'stream'

It looks like you are somehow running an old code base.
As of v0.0.2, the error message changed to:


        raise NotImplementedError(
            "Unknown flavor specified."
            " Use either 'lattice', 'stream', 'network' or 'hybrid'"
        )

Maybe uninstall both again.
Then reinstall pypdf_table_extraction.
What is the output of pip show pypdf_table_extraction or camelot --version?
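The same check can be done from inside the Python interpreter that actually runs the script, which helps when conda and pip environments are mixed. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `module_location` are hypothetical, and whether the `camelot` module name is present depends on what is installed.

```python
import importlib.util
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def module_location(module_name: str) -> Optional[str]:
    """Return the file a top-level module would be imported from, if any."""
    spec = importlib.util.find_spec(module_name)
    return spec.origin if spec is not None else None


# Distribution names as published on PyPI.
for dist in ("pypdf-table-extraction", "camelot-py"):
    print(f"{dist}: version {installed_version(dist)}")

# Import names; a leftover copy of an old code base would show up here.
for mod in ("pypdf_table_extraction", "camelot"):
    print(f"{mod}: imported from {module_location(mod)}")
```

If the reported version is 0.0.2 but the import path points somewhere unexpected (for example a different environment's site-packages), that would be consistent with an old code base being picked up.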

@dejanmarkovic
Author

dejanmarkovic commented Oct 21, 2024

Maybe my response is too long so I have cut out most of the data to make it more concise.
I am using both conda and pip.

Here is the listing of pypdf libraries that I have installed:

pypdf 4.3.1 pypi_0 pypi
pypdf-table-extraction 0.0.2 pypi_0 pypi
pypdf2 2.11.1 pyhd8ed1ab_0 conda-forge

I have uninstalled camelot, as per @stefan6419846's suggestion in #191 (comment), to avoid side effects.

Any feedback is greatly appreciated.
