How can I read a table that starts on page 1 and extends over multiple pages? #192

dejanmarkovic · 2024-10-19T14:17:38Z

pypdf_table_extraction/camelot does not recognize the table on pages after page 1 with the lattice flavor.

With the stream flavor, I get messed-up output like this:

   0            1            2                                  3                       4         5
0                                                                      2059001013453712313
1                               289 Transakcije po nalogu građana                    PBO:
2                                                                        MARY MILAN
3  5  12.05.2024.  12.05.2024.     n 9001013454849 III rata   maj                    PBZ:  1.600,00
4                                                                  KNEZ MILET 456 4 11
5                                                   Instant nalog            FT241123YJFB4
6                                                         Belgrade

This is the lattice output from page one, which looks great:

0  REDNI\nBROJ  DATUM\nPRIJEMA  DATUM\nIZVRŠENJA  ...  REFERENCA KLIJENTA\nREFERENCA PARTNERA\nREFERE...  NA TERET  U KORIST
1            1     11.05.2024.       12.05.2024.  ...                           PBO:\nPBZ:\nFT201661TXR4            4.200,00
2            2     12.05.2024.       12.05.2024.  ...                           PBO:\nPBZ:\nFT20122CK6Y6            5.600,00
3            3     12.05.2024.       12.05.2024.  ...                           PBO:\nPBZ:\nFT20134Y5NWL            5.600,00
4            4     12.05.2024.       12.05.2024.  ...                           PBO:\nPBZ:\nFT20124QY6JZ            5.600,00

The document is a PDF bank statement.
NOTE: I have randomized the numbers in the output for privacy and security purposes.


bosd · 2024-10-20T14:29:17Z

Are there 2 separate issues here?

pypdf_table_extraction/camelot does not recognize the table on pages after page 1 with the lattice flavor.

This could be a bug.

Merging tables which span multiple pages is afaik not a covered use case. Merging the tables can be done in post processing.
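A minimal sketch of what such post-processing could look like, assuming each page's table has already been extracted and converted to a list of rows (for example via `table.df.values.tolist()`), that all pages share the same column layout, and that the header row may be repeated on later pages. The helper name `merge_page_tables` is hypothetical and not part of the library:

```python
def merge_page_tables(page_tables):
    """Concatenate row lists from consecutive pages into one table.

    page_tables: list of tables, one per page, each a list of rows.
    The first row of the first table is treated as the header. If a later
    page starts with an identical header row, that repeat is dropped.
    Returns (header, body_rows).
    """
    if not page_tables or not page_tables[0]:
        return [], []

    header = list(page_tables[0][0])
    body = [list(row) for row in page_tables[0][1:]]

    for rows in page_tables[1:]:
        rows = [list(row) for row in rows]
        # Skip a repeated header at the top of a continuation page.
        if rows and rows[0] == header:
            rows = rows[1:]
        body.extend(rows)

    return header, body
```

This only helps once the tables on the later pages are detected at all; it does not address the lattice detection problem itself.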

Have you tried the output with the Network parser?

dejanmarkovic · 2024-10-21T15:27:28Z

With this code

`import pypdf_table_extraction

file_path = r"C:\Projects\temp123\attachments\test\er\er3.pdf"

flavors = ["hybrid", "lattice", "network", "stream"]

for flavor in flavors:
    print(f"\nTrying {flavor} flavor:")
    try:
        tables = pypdf_table_extraction.read_pdf(
            file_path, pages="all", flavor=flavor  # Use the current flavor
        )

        print(f"Number of tables found: {len(tables)}")

        # Print each extracted table and save it as CSV
        for i, table in enumerate(tables):
            print(f"\nTable {i} data:")
            print(table.df)

            csv_path = f"{flavor}_table_{i}.csv"
            table.df.to_csv(csv_path, index=False)
            print(f"Table {i} saved to {csv_path}")

        # Print the parsing report for each table
        for i, table in enumerate(tables):
            print(f"\nParsing report for {flavor} Table {i}:")
            print(table.parsing_report)

    except Exception as e:
        # Report the failure and move on to the next flavor
        print(f"An error occurred with {flavor} flavor: {str(e)}")
        continue

print("\nTable extraction process completed.")
`
I am getting the following errors:

Trying hybrid flavor:
An error occurred with hybrid flavor: Unknown flavor specified. Use either 'lattice' or 'stream'
An error occurred with network flavor: Unknown flavor specified. Use either 'lattice' or 'stream'

NOTE: I have uninstalled both Camelot and pypdf_table_extraction and then reinstalled only the pypdf_table_extraction library, so there should be no conflicts or other issues.

Can you please help/advise?

bosd · 2024-10-21T17:03:01Z

Based on the following error message:

2. An error occurred with network flavor: Unknown flavor specified. Use either 'lattice' or 'stream'

It looks like you are somehow running an old code base.
As of v0.0.2, the error message changed to:


        raise NotImplementedError(
            "Unknown flavor specified."
            " Use either 'lattice', 'stream', 'network' or 'hybrid'"
        )

Maybe uninstall both again, then reinstall pypdf_table_extraction.
What is the output of pip show pypdf_table_extraction or camelot --version?
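A complementary check from inside Python could look like the sketch below. It uses only the standard library (`importlib.metadata` and `importlib.util`) to report which distribution versions are installed and which file each module name resolves to, which may help spot a stale copy in a mixed conda/pip environment. The function name `describe_install` is hypothetical, and `camelot-py` is assumed to be the distribution name of the older Camelot package:

```python
from importlib import metadata, util


def describe_install(dist_names, module_names):
    """Report installed distribution versions and module file locations.

    Missing distributions or modules are reported as None.
    """
    report = {"distributions": {}, "modules": {}}

    for dist in dist_names:
        try:
            report["distributions"][dist] = metadata.version(dist)
        except metadata.PackageNotFoundError:
            report["distributions"][dist] = None

    for mod in module_names:
        # find_spec locates a top-level module without importing it.
        spec = util.find_spec(mod)
        report["modules"][mod] = spec.origin if spec is not None else None

    return report


if __name__ == "__main__":
    info = describe_install(
        ["pypdf-table-extraction", "camelot-py"],
        ["pypdf_table_extraction", "camelot"],
    )
    for section, entries in info.items():
        for name, value in entries.items():
            print(f"{section}: {name} -> {value}")
```

If a module path points somewhere unexpected (for instance a different environment's site-packages), that would be consistent with old code being picked up.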

dejanmarkovic · 2024-10-21T17:24:43Z

Maybe my response is too long so I have cut out most of the data to make it more concise.
I am using both conda and pip.

Here is the listing of pypdf libraries that I have installed:

pypdf 4.3.1 pypi_0 pypi
pypdf-table-extraction 0.0.2 pypi_0 pypi
pypdf2 2.11.1 pyhd8ed1ab_0 conda-forge

I have uninstalled Camelot, as @stefan6419846 suggested in the thread at #191 (comment), to avoid side effects.

Any feedback is greatly appreciated.
