Information Extraction from Invoices

Objective

The project extracts structured information from invoices using Donut transformer models, alongside a GPT-based pipeline that works on OCR output.

📂 Directory Overview

🔵 01_webapp

A Streamlit application for uploading and processing images or PDFs using the Donut model. Outputs can be saved in JSON or Excel format.

🔵 02_donut

Contains the main data processing scripts and associated resources for the Donut model.

Directories:
- excels/: For Excel files.
- metadata/: Contains metadata formatted as .jsonl.
- pdfs_1077/: Stores PDFs awaiting processing.
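
As an illustration of the .jsonl metadata mentioned above, here is a minimal sketch of one record in the layout commonly used for Donut fine-tuning datasets: a `file_name` plus a `ground_truth` string that wraps the target fields under `gt_parse`. Whether this repository uses exactly these keys is an assumption, and the helper name `to_metadata_line` and the sample field values are hypothetical.

```python
import json


def to_metadata_line(file_name, fields):
    """Build one JSONL line: the ground truth is itself a JSON-encoded string."""
    # Assumed layout: {"file_name": ..., "ground_truth": "{\"gt_parse\": {...}}"}
    ground_truth = json.dumps({"gt_parse": fields}, ensure_ascii=False)
    return json.dumps(
        {"file_name": file_name, "ground_truth": ground_truth},
        ensure_ascii=False,
    )


# Hypothetical invoice fields, for illustration only.
line = to_metadata_line("invoice_0001.png", {"vendor": "ACME GmbH", "total": "119.00"})

# Reading a record back requires decoding twice: the line, then the nested string.
record = json.loads(line)
parsed_fields = json.loads(record["ground_truth"])["gt_parse"]
print(parsed_fields["vendor"])
```

Each line of the file is an independent JSON object, so the file can be appended to or streamed without loading everything at once.
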

Python Scripts Overview:
- 1_xlsx_to_jsonl_.py: Converts Excel files to the JSONL metadata format.
- 2_pdf_to_pic.py: Converts PDFs into images at different resolutions.
- 3_check_sizes.py: Reports image dimensions.
- 4_padding.py: Pads images that fall below a minimum size.
- 5_create_stratas_for_dataset.py: Builds strata for stratified sampling, based on vendor details.
- 6_prepare_data.py: Assigns images to the train, validation, or test split.
- 7_remove_hf_dataset.py: Deletes datasets from a Hugging Face repository.
- 8_dataset_generator.py: Defines the classes DonutDatasetGenerator (provides image details), DonutDatasetUploader (uploads datasets to Hugging Face), and DonutDatasetTester (tests dataset samples and reports their details).
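
The stratified sampling and split assignment described above can be sketched as follows. This is a minimal standard-library illustration, not the repository's implementation: the function name `stratified_split`, the `vendor` key, the 80/10/10 ratios, and the sample records are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(records, key="vendor", ratios=(0.8, 0.1, 0.1), seed=42):
    """Split records into train/validation/test, one stratum per value of `key`."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_group = defaultdict(list)
    for rec in records:
        by_group[rec[key]].append(rec)

    splits = {"train": [], "validation": [], "test": []}
    for group in sorted(by_group):
        items = by_group[group][:]
        rng.shuffle(items)
        n_train = int(len(items) * ratios[0])
        n_val = int(len(items) * ratios[1])
        splits["train"].extend(items[:n_train])
        splits["validation"].extend(items[n_train:n_train + n_val])
        # Whatever is left over after rounding down goes to the test split.
        splits["test"].extend(items[n_train + n_val:])
    return splits


# Hypothetical data: 30 invoices spread evenly over 3 vendors.
invoices = [
    {"file_name": f"inv_{i:03d}.png", "vendor": f"vendor_{i % 3}"}
    for i in range(30)
]
splits = stratified_split(invoices)
print({name: len(rows) for name, rows in splits.items()})
# → {'train': 24, 'validation': 3, 'test': 3}
```

Because the counts are rounded down per vendor, vendors with very few invoices end up mostly in the test split; a real pipeline may want to merge rare vendors into one stratum first.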

🔵 03_gpt

Contains the GPT-based extraction pipeline: OCR of the PDFs, followed by conversion of the OCR text into structured JSON with GPT-4.

Directories:
- excels/: Holds the initial Excel datasets.
- final_results/: Holds the processed results.
- _template/: Sub-folders include:
  - csv: Stores CSV files.
  - images: Stores images related to the results.
  - merged_jsonl: Contains merged results in JSONL format.
  - report: Likely for generated reports.
  - results_json: Holds processed results in JSON format.

Additional directories include:
- gpt_3.5_outputs: Contains GPT-3.5 outputs.
- gpt_4_outputs: Contains GPT-4 outputs.
- images: Contains images derived from various sources like PDFs.
- metadata: Contains metadata about the datasets.
- ocrs: Contains OCR outputs from scanned invoices.
- pdfs_1077: Contains the PDFs ready for processing.

Scripts Overview:
- 1_sort_out_pdfs.py: Filters PDFs using the metadata. PDFs with no matching metadata entry are set aside.
- 2_xlsx_to_jsonl_.py: Converts Excel invoice details to JSONL metadata and prints a summary of the processed invoices.
- 3_pdf_to_ocr_v03.py: Runs OCR on the PDFs. Steps include image preprocessing, text extraction with Tesseract, and related operations.
- 4_gpt4_ocr_to_json_v04.py: Converts OCR outputs to structured JSON using GPT-4. Also handles token counting, error handling, logging, and saving the results.
- 5_json_to_merged.jsonl_gpt4.py: Merges GPT-4 outputs with the metadata into a single file.
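
The final merge step can be sketched as below. This is an assumption-laden illustration rather than the script's actual logic: it supposes one JSON output file per invoice, named after the PDF's file stem, and the function name `merge_with_metadata`, the `file_name` and `prediction` keys, and the demo files are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def merge_with_metadata(metadata_path, outputs_dir, merged_path, key="file_name"):
    """Attach each model output to the metadata record with the same file stem."""
    merged, missing = [], []
    with open(metadata_path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines in the JSONL file
            record = json.loads(line)
            output_file = Path(outputs_dir) / f"{Path(record[key]).stem}.json"
            if not output_file.exists():
                missing.append(record[key])  # keep track of invoices without output
                continue
            record["prediction"] = json.loads(output_file.read_text(encoding="utf-8"))
            merged.append(record)
    with open(merged_path, "w", encoding="utf-8") as fh:
        for record in merged:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(merged), missing


# Tiny demo with two metadata records and a single model output.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    outputs = tmp / "outputs"
    outputs.mkdir()
    (tmp / "metadata.jsonl").write_text(
        '{"file_name": "inv_001.pdf", "vendor": "ACME"}\n'
        '{"file_name": "inv_002.pdf", "vendor": "Foo"}\n',
        encoding="utf-8",
    )
    (outputs / "inv_001.json").write_text('{"total": "119.00"}', encoding="utf-8")
    count, missing = merge_with_metadata(
        tmp / "metadata.jsonl", outputs, tmp / "merged.jsonl"
    )
    merged_lines = (tmp / "merged.jsonl").read_text(encoding="utf-8").splitlines()

print(count, missing)
```

Returning the list of unmatched file names makes it easy to see which invoices never produced a model output.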

🔵 04_eda

Contains the exploratory data analysis (EDA) of the project's data.

🔵 05_experiments_and_error_analysis

Sets up and runs the experiments (36 in total) and analyses the resulting errors. Contains:

- error_analysis_setup_12.ipynb: Notebook that examines GPT-4 output errors in detail.
- Experiment_setup_12.ipynb: Notebook for running the experiments on a GPU.
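
One common way to approach this kind of error analysis is per-field accuracy: compare each predicted field with its ground-truth value and count the matches. The sketch below shows that idea only; it is not taken from the notebooks, and the function names, the normalisation rule, and the sample invoices are assumptions.

```python
from collections import Counter


def normalise(value):
    """Collapse whitespace and ignore case so trivial differences do not count as errors."""
    return " ".join(str(value).split()).casefold()


def field_accuracy(pairs):
    """Return the share of exact (normalised) matches per ground-truth field."""
    correct, total = Counter(), Counter()
    for truth, prediction in pairs:
        for field, expected in truth.items():
            total[field] += 1
            # A field missing from the prediction counts as an error.
            if field in prediction and normalise(prediction[field]) == normalise(expected):
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}


# Hypothetical (ground truth, prediction) pairs for two invoices.
pairs = [
    ({"vendor": "ACME GmbH", "total": "119.00"}, {"vendor": "acme  gmbh", "total": "119.00"}),
    ({"vendor": "Foo Ltd", "total": "20.50"}, {"vendor": "Foo Ltd", "total": "25.50"}),
]
accuracy = field_accuracy(pairs)
print(accuracy)
# → {'vendor': 1.0, 'total': 0.5}
```

Exact matching is strict; numeric fields such as totals may be better compared after parsing them as numbers.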

yonlas/information-extraction-from-invoices
