The project's primary aim is to extract structured information from invoices utilising Donut transformer models.
It includes a Streamlit application for uploading and processing images or PDFs with the Donut model. Outputs can be saved in JSON or Excel format.
Encompasses the main data processing scripts and associated resources.
- Directories:
  - excels/: For Excel files.
  - metadata/: Contains metadata formatted as .jsonl.
  - pdfs_1077/: Stores PDFs awaiting processing.
- Python Scripts Overview:
  - 1_xlsx_to_jsonl_.py: Transforms Excel files to the JSONL metadata format.
  - 2_pdf_to_pic.py: Converts PDFs into images at different resolutions.
  - 3_check_sizes.py: Offers insights into image dimensions.
  - 4_padding.py: Adjusts images below a certain size by adding padding.
  - 5_create_stratas_for_dataset.py: Facilitates stratified sampling based on vendor details.
  - 6_prepare_data.py: Categorises images into train, validation, or test groups.
  - 7_remove_hf_dataset.py: Deletes datasets from a Hugging Face repository.
  - 8_dataset_generator.py: Includes classes like DonutDatasetGenerator (provides image details), DonutDatasetUploader (uploads datasets to Hugging Face), and DonutDatasetTester (tests and provides details of dataset samples).
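The stratified sampling and train/validation/test steps described for scripts 5 and 6 can be illustrated with a minimal sketch. This is not the code from those scripts: the function name `stratified_split`, the record keys `file_name` and `vendor`, and the 80/10/10 ratios are assumptions chosen for illustration. The idea is simply to group records by vendor and split each group separately, so every vendor is represented in each split in roughly the same proportion.

```python
import random
from collections import defaultdict


def stratified_split(records, key="vendor", ratios=(0.8, 0.1, 0.1), seed=42):
    """Split records into train/validation/test, stratified by `key`.

    Hypothetical helper: the key name and ratios are illustrative assumptions.
    """
    rng = random.Random(seed)

    # Group the records into strata, one per vendor.
    strata = defaultdict(list)
    for rec in records:
        strata[rec[key]].append(rec)

    splits = {"train": [], "validation": [], "test": []}
    for vendor in sorted(strata):
        items = strata[vendor][:]
        rng.shuffle(items)

        # Validation and test sizes are rounded down; train takes the rest.
        n_val = int(len(items) * ratios[1])
        n_test = int(len(items) * ratios[2])
        n_train = len(items) - n_val - n_test

        splits["train"].extend(items[:n_train])
        splits["validation"].extend(items[n_train:n_train + n_val])
        splits["test"].extend(items[n_train + n_val:])
    return splits


# Illustrative data: 20 invoices from vendor "A" and 10 from vendor "B".
records = [
    {"file_name": f"inv_{i}.png", "vendor": "A" if i < 20 else "B"}
    for i in range(30)
]
splits = stratified_split(records)
```

Because sizes are rounded down per vendor, very small vendors (fewer than ten invoices at these ratios) end up entirely in the training split; a real pipeline may want to handle such vendors explicitly.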
Primarily related to GPT-based operations.
- excels/: Holds the initial Excel datasets.
- final_results/: Consists of the processed results.
- _template/: Sub-folders include:
  - csv: Where CSV files are stored.
  - images: For images related to results.
  - merged_jsonl: Contains merged results in JSONL format.
  - report: Likely for generated reports.
  - results_json: Holds processed results in JSON format.
- Additional directories include:
  - gpt_3.5_outputs: Contains GPT-3.5 outputs.
  - gpt_4_outputs: Contains GPT-4 outputs.
  - images: Contains images derived from various sources like PDFs.
  - metadata: Contains metadata about the datasets.
  - ocrs: Contains OCR outputs from scanned invoices.
  - pdfs_1077: Contains the PDFs ready for processing.
- 1_sort_out_pdfs.py: Filters PDFs using metadata. Unmatched PDFs are segregated.
- 2_xlsx_to_jsonl_.py: Converts Excel invoice details to JSONL formatted metadata. Provides a summary of processed invoices.
- 3_pdf_to_ocr_v03.py: Focuses on OCR of PDFs. Steps include image preprocessing, using Tesseract for OCR, and other associated operations.
- 4_gpt4_ocr_to_json_v04.py: Converts OCR outputs to structured JSON using GPT-4. Manages various tasks like token counting, error handling, logging, and data saving.
- 5_json_to_merged.jsonl_gpt4.py: Merges GPT-4 outputs with metadata into a single file.
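The final merge step can be sketched roughly as follows. This is an illustration under stated assumptions rather than the logic of the actual script: the helper names `merge_outputs_with_metadata` and `to_jsonl`, the join key `file_name`, and the output field `gpt4_output` are all hypothetical. The sketch joins each metadata row with the GPT-4 result for the same file and serialises the merged rows as JSONL, one JSON object per line.

```python
import json


def merge_outputs_with_metadata(metadata_rows, gpt_outputs):
    """Attach the GPT-4 output to each metadata row that has one.

    Hypothetical helper: rows are joined on an assumed "file_name" key.
    Returns the merged rows and the names of files without a GPT-4 output.
    """
    merged, missing = [], []
    for row in metadata_rows:
        name = row["file_name"]
        if name in gpt_outputs:
            merged.append({**row, "gpt4_output": gpt_outputs[name]})
        else:
            missing.append(name)
    return merged, missing


def to_jsonl(rows):
    """Serialise rows as JSONL text: one JSON object per line."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


# Illustrative inputs; the field names and values are made up.
metadata_rows = [
    {"file_name": "inv_001.pdf", "vendor": "Acme"},
    {"file_name": "inv_002.pdf", "vendor": "Globex"},
]
gpt_outputs = {"inv_001.pdf": {"total": "120.50", "currency": "EUR"}}

merged, missing = merge_outputs_with_metadata(metadata_rows, gpt_outputs)
jsonl_text = to_jsonl(merged)
```

Returning the unmatched file names separately makes it easy to log or re-run invoices for which no GPT-4 output was produced.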
Focuses on the Exploratory Data Analysis (EDA) of all contained data.
Geared towards setting up experiments (covering 36 experiments in total) and conducting error analysis. Contains:
- error_analysis_setup_12.ipynb: Notebook to delve deep into GPT-4 output errors.
- Experiment_setup_12.ipynb: For conducting the experiments using a GPU.