added updated docs #503

Open. Wants to merge 6 commits into base: main.

167 changes: 133 additions & 34 deletions OCR/README.md
@@ -1,44 +1,63 @@
## OCR
# OCR Layer - ReportVision

The **OCR Layer** in the ReportVision project processes document images, performs segmentation and optical character recognition (OCR), and computes accuracy metrics by comparing OCR outputs to ground truth data.

---

## Table of Contents
1. [Introduction](#introduction)
2. [Installation](#installation)
3. [Running the Application](#running-the-application)
4. [Development Tools](#development-tools)
5. [Testing](#testing)
6. [End-to-End Benchmarking](#end-to-end-benchmarking)
7. [Dockerized Development](#dockerized-development)
8. [Project Architecture](#project-architecture)
9. [API Endpoints](#api-endpoints)


---

## Introduction

The OCR layer uses **Poetry** for dependency management and virtual environment setup. It provides:
- An API for performing OCR operations.
- Support for benchmarking OCR accuracy.
- Configuration for different OCR models and segmentation templates.

### Installation

### Prerequisites
- Python 3.9 or later
- [Poetry](https://python-poetry.org/) for dependency management
- Docker (optional for containerized development)

```shell
pipx install poetry
```

### Running The Application
Activate the virtual environment and install the dependencies. All subsequent commands assume you are in the virtual environment.

```shell
poetry shell
poetry install
```

Run unit tests

```shell
poetry run pytest
```

Run benchmark tests

```shell
cd tests
poetry run pytest benchmark_test.py -v
fastapi dev ocr/api.py
```

poetry run pytest bench_test.py -v
### Testing

Run main, hoping to convert this to a cli at some point
Run unit tests

```shell
poetry run main
```

```shell
poetry run pytest
```

To build the OCR service into an executable artifact

```shell
poetry run build
```
### Development Tools

Adding new dependencies

@@ -82,12 +101,37 @@ To run the API in prod mode

```shell
poetry run api
```

### Test Data Sets

You can also run the script `pytest run reportvision-dataset-1/medical_report_import.py` to pull in all relevant data.
To build the OCR service into an executable artifact

```shell
poetry run build
```

### Dockerized Development

It is also possible to run the project in a collection of docker containers. This is useful for development and testing purposes as it doesn't require any additional dependencies to be installed.

To start the containers, run the following command:

```shell
docker compose -f dev-env.yaml up
```

This will start the following containers:

- ocr: The OCR service container
- frontend: The frontend container

The frontend container will automatically reload when changes are made to the frontend. To access the frontend, navigate to http://localhost:5173 in your browser.

The OCR service container will restart automatically when changes are made to the OCR code. To access the API, navigate to http://localhost:8000/ in your browser.


### Run end-to-end benchmarking
### End-to-End Benchmarking

#### Overview
End-to-end benchmarking evaluates OCR accuracy by:

End-to-end benchmarking scripts can:

@@ -117,21 +161,76 @@ Run notes:
* Benchmark takes one second per segment for OCR using the default `trocr` model. Please be patient or set a counter to limit the number of files processed.
* Only one segment can be input at a time
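
The run notes above suggest limiting the number of files processed. A minimal sketch of such a limit follows, assuming the benchmark iterates over a collection of file paths. The helper name `limit_files` and its parameters are hypothetical and are not part of the benchmarking scripts.

```python
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator, Optional


def limit_files(paths: Iterable[Path], max_files: Optional[int] = None) -> Iterator[Path]:
    """Yield at most max_files paths in sorted order (hypothetical helper).

    max_files=None means no limit, so every path is yielded.
    """
    ordered = sorted(paths)  # deterministic order so repeated runs hit the same files
    return islice(ordered, max_files)


# Example: process only the first two files of a small illustrative list.
for path in limit_files([Path("c.png"), Path("a.png"), Path("b.png")], max_files=2):
    print(path.name)
```

At roughly one second per segment, capping the file count this way keeps a trial run short before committing to a full benchmark.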

### Dockerized Development

It is also possible to run the entire project in a collection of docker containers. This is useful for development and testing purposes as it doesn't require any additional dependencies to be installed on your local machine.
### Test Data Sets

To start the containers, run the following command:
You can run the script `pytest run reportvision-dataset-1/medical_report_import.py` to pull in all relevant data.

```shell
docker compose -f dev-env.yaml up
```

This will start the following containers:
## Project Architecture

- ocr: The OCR service container
- frontend: The frontend container
The OCR Layer is organized as follows:

The frontend container will automatically reload when changes are made to the frontend code. To access the frontend, navigate to http://localhost:5173 in your browser.
- **`ocr/`**:
  - **`api.py`**: Defines the API for the OCR service.
  - **`main.py`**: Entry point script to run the OCR service.
  - **`segmenter.py`**: Handles image segmentation based on templates and labels.
  - **`ocr_engine.py`**: Implements the OCR logic using the specified OCR models.
  - **`metrics.py`**: Computes metrics (e.g., confidence, Levenshtein distance) by comparing OCR results with ground truth.
  - **`config.py`**: Contains configuration for paths, environment variables, and model settings.

The OCR service container will restart automatically when changes are made to the OCR code. To access the API, navigate to http://localhost:8000/ in your browser.
- **`tests/`**: Contains unit tests, integration tests, and benchmarking scripts.
  - **`benchmark_test.py`**: Tests benchmarking logic for OCR and metrics.
  - **`unit_test.py`**: Includes unit tests for individual components of the OCR service.
  - **`benchmark_main.py`**: Main script for running end-to-end benchmarking, including segmentation, OCR, and metrics computation.

- **`data/`**: Location of segmentation templates, labels, ground truth, and test datasets (not included in the repository by default).

- **`reportvision-dataset-1/`**: Example dataset folder for running benchmarks and tests.
  - **`medical_report_import.py`**: Script to import and prepare medical reports for testing.

- **`Dockerfile`**: Defines the container for running the OCR service in a Dockerized environment.

- **`dev-env.yaml`**: Docker Compose file for setting up a development environment with containers for the OCR service and frontend.

- **`pyproject.toml`**: Poetry configuration file specifying project dependencies and settings.

- **`poetry.lock`**: Lock file generated by Poetry to ensure dependency consistency.
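
The architecture list above mentions that `metrics.py` computes Levenshtein distance between OCR results and ground truth. For readers unfamiliar with the metric, here is a minimal sketch of the standard dynamic-programming algorithm. It is an illustration only and is not taken from the project's `metrics.py`; the function name `levenshtein` is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory

    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


# Example: an OCR result missing one character of the ground truth.
print(levenshtein("Jon Smith", "John Smith"))
```

A distance of 0 means the OCR output matches the ground truth exactly; larger values mean more character-level corrections would be needed.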

## API Endpoints

The OCR service exposes the following API endpoints:

#### Health Check
- **`GET /`**
  - **Description**: Returns the status of the OCR service.
  - **Response**: Status message indicating the service's health.

Review comment (Collaborator): Are we also doing Swagger-type docs as well?


#### Image Alignment
- **`POST /image_alignment/`**
  - **Description**: Aligns a source image with a segmentation template.
  - **Request Body**:
    - `source_image` (Base64-encoded string): The source image to align.
    - `segmentation_template` (Base64-encoded string): The segmentation template to align with.
  - **Response**:
    - Base64-encoded string of the aligned image.

#### Image File to Text
- **`POST /image_file_to_text/`**
  - **Description**: Processes an image file and a segmentation template to extract text based on labeled regions.
  - **Request Body**:
    - `source_image` (file): The uploaded source image file.
    - `segmentation_template` (file): The uploaded segmentation template file.
    - `labels` (JSON string): Defines labeled regions in the segmentation template.
  - **Response**:
    - JSON object containing text extracted from labeled regions.

#### Image to Text
- **`POST /image_to_text`**
  - **Description**: Processes Base64-encoded images and extracts text from labeled regions.
  - **Request Body**:
    - `source_image` (Base64-encoded string): The source image.
    - `segmentation_template` (Base64-encoded string): The segmentation template.
    - `labels` (JSON string): Defines labeled regions in the segmentation template.
  - **Response**:
    - JSON object containing text extracted from labeled regions.
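
The `/image_to_text` request body described above combines two Base64-encoded images with a JSON string of labels. The sketch below shows one way a client might assemble those three fields. The field names come from the list above; the helper name `build_image_to_text_payload`, the shape of the `labels` entries, and how the payload is finally sent (JSON body versus form fields) are assumptions, not confirmed by the README.

```python
import base64
import json


def build_image_to_text_payload(source_image: bytes, segmentation_template: bytes, labels: list) -> dict:
    """Assemble the three documented fields for POST /image_to_text (hypothetical helper)."""
    return {
        # Images are sent as Base64 text rather than raw bytes.
        "source_image": base64.b64encode(source_image).decode("ascii"),
        "segmentation_template": base64.b64encode(segmentation_template).decode("ascii"),
        # The README describes labels as a JSON string, so serialize them here.
        "labels": json.dumps(labels),
    }


# Example with placeholder bytes; the label structure shown is purely illustrative.
payload = build_image_to_text_payload(b"abc", b"xyz", [{"label": "patient_name"}])
print(sorted(payload))
```

In real use the byte strings would come from reading the image files, for example with `Path("scan.png").read_bytes()`.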
4 changes: 2 additions & 2 deletions OCR/ocr/services/phdc_converter/builder.py
@@ -644,7 +644,7 @@ def _build_patient(self, patient: Patient) -> ET.Element:
)
patient_data.append(v)
else:
- logging.warning(f"Race code {patient.race_code} not found in " "the OMB classification.")
+ logging.warning(f"Race code {patient.race_code} not found in the OMB classification.")

if patient.ethnic_group_code is not None:
if patient.ethnic_group_code in ethnicity_code_and_mapping:
@@ -658,7 +658,7 @@ def _build_patient(self, patient: Patient) -> ET.Element:
)
patient_data.append(v)
else:
- logging.warning(f"Ethnic group code {patient.ethnic_group_code} not " "found in OMB classification.")
+ logging.warning(f"Ethnic group code {patient.ethnic_group_code} not found in OMB classification.")

return patient_data

5 changes: 3 additions & 2 deletions README.md
@@ -19,8 +19,9 @@

## Overview

Describe the purpose of your project. Add additional sections as necessary to help collaborators and potential collaborators understand and use your project.

Please see the User Guide to get an overview of this project.

Review comment (Collaborator): Would be good to include the link to this.



## Public Domain Standard Notice
This repository constitutes a work of the United States Government and is not
subject to domestic copyright protection under 17 USC § 105. This repository is in
Binary file added arcdiagram.png

Review comment (Collaborator): You can use mermaid.js to create this if you don't want to embed a PNG:

https://github.blog/developer-skills/github/include-diagrams-markdown-files-mermaid/

154 changes: 154 additions & 0 deletions backend/README.md
@@ -0,0 +1,154 @@
# Backend Middleware - Spring Boot Application

This document provides a guide for the **Backend Middleware** of the ReportVision project. This middleware bridges the **frontend app** with the **OCR backend**.

---

## Table of Contents
1. [Introduction](#introduction)
2. [Installation](#installation)
3. [Testing](#testing)
4. [Project Architecture](#project-architecture)
5. [Key Features](#key-features)
6. [API Endpoints](#api-endpoints)
7. [Troubleshooting](#troubleshooting)


## Introduction

The backend of ReportVision is a **Spring Boot** application designed to:
- Serve as middleware connecting the frontend with the OCR backend.
- Manage storage of templates in the database.
- Act as a middle layer that passes data to the OCR backend for extraction.


## Installation

To run the project, make sure you have Docker set up.

1. Clone the repository:
   ```bash
   git clone https://github.com/CDCgov/ReportVision.git
   cd ReportVision/backend
   ```
2. Run the app. Make sure you are in the root directory:

   ```shell
   docker-compose -f backend.yaml up --build
   ```

3. Verify the app is running by visiting http://localhost:8080/api/health
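
The verification in step 3 can also be scripted. The sketch below is a small Python check, assuming the health endpoint answers with HTTP 200 when the app is up (the README does not state the status code or response body). The function name `is_healthy` is hypothetical.

```python
import urllib.request


def is_healthy(url: str = "http://localhost:8080/api/health", timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200, False on any connection or HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False
```

Calling `is_healthy()` after `docker-compose` finishes starting gives a quick yes or no without opening a browser.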

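## Testing

You can run the Gradle tests from a bash shell inside the container. First, list the running containers to get the container ID:

```shell
docker ps
```

Then open a shell in the container:

```shell
docker exec -it <CONTAINER_ID> /bin/bash
```

Finally, run the tests:

```shell
./gradlew test
```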

## Project Architecture

The backend is organized into the following directories and files:

- **`src/main/java/gov/cdc/reportvision/`**:
  - **`controllers/`**: Handle API requests from the frontend.
  - **`services/`**: Service layer for managing templates, data extraction, and interactions with the OCR backend.
  - **`models/`**: Data models representing application entities.
  - **`repositories/`**: Interfaces for database operations.
  - **`config/`**: Configuration files for security, database connections, and CORS policies.
  - **`utils/`**: Utility classes for validation, logging, and file manipulation.
- **`src/test/`**: Includes unit and integration tests for the backend.
- **`Dockerfile`**: Docker configuration file for containerizing the application.

- **`README.md`**: Documentation for the backend application.


## Key Features

#### Template Management
- **Upload, retrieve, and delete templates**:
  - Allows users to upload new templates for document segmentation.
  - Retrieve a list of all saved templates.
  - Delete templates by their unique ID.

#### Data Extraction
- **Document Processing**:
  - Connects to the OCR backend to process documents using predefined templates.
  - Extracts data based on segmented areas defined in the templates.
  - Returns structured extracted data.

#### Validation and Error Handling
- **Data Integrity Checks**:
  - Validates user inputs and template configurations.
  - Provides error messages for invalid requests or processing failures.

#### Secure Integration
- **Authentication**:
  - Implements JWT-based authentication.
  - Configurable CORS policies to control frontend and third-party access.


## API Endpoints

The backend middleware exposes the following RESTful API endpoints:

#### Health Check
- **`GET /api/health`**
  - **Description**: Returns the status of the backend server.
  - **Response**: Status message indicating the server's health.

#### Template Management
- **`POST /api/templates`**
  - **Description**: Upload a new template for document segmentation.
  - **Request Body**: JSON containing template details.
  - **Response**: Confirmation of the uploaded template.

- **`GET /api/templates`**
  - **Description**: Retrieve a list of all available templates.
  - **Response**: JSON array of template metadata.

- **`DELETE /api/templates/{id}`**
  - **Description**: Delete a specific template by its unique ID.
  - **Response**: Confirmation of deletion.

#### Data Extraction
- **`POST /api/extract`**
  - **Description**: Process a document using a selected template and return extracted data from OCR.
  - **Request Body**: JSON containing the document and selected template ID.
  - **Response**: JSON object with extracted data.
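
As a rough illustration of the `POST /api/extract` call described above, the sketch below builds (but does not send) a request with a JSON body. The README only says the body contains "the document and selected template ID", so the JSON keys `document` and `templateId`, the Base64 encoding of the document, and the helper name `build_extract_request` are all assumptions. Any JWT authentication header the backend may require is also omitted.

```python
import json
import urllib.request


def build_extract_request(base_url: str, document_b64: str, template_id: str) -> urllib.request.Request:
    """Build a POST /api/extract request; the JSON key names are hypothetical."""
    body = json.dumps({"document": document_b64, "templateId": template_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/extract",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Example: inspect the request without contacting the server.
request = build_extract_request("http://localhost:8080", "ZmFrZQ==", "template-1")
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it; check the actual controller for the real field names before relying on this shape.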

#### Configuration Management
- **`GET /api/config`**
  - **Description**: Retrieve the current configuration settings of the application.
  - **Response**: JSON object with configuration details.


## Troubleshooting

### Common Issues

#### Database Connection Fails
- **Cause**: The backend is unable to connect to the database.
- **Solution**:
  - Ensure the database server is running.
  - Verify that the `DB_URL`, `DB_USERNAME`, and `DB_PASSWORD` environment variables are correctly configured.

#### CORS Errors
- **Cause**: Frontend requests are being blocked due to Cross-Origin Resource Sharing (CORS) policies.
- **Solution**:
  - Update the `CorsConfig` class in the `config/` directory.
  - Add the necessary origins to the allowed list.

#### OCR Service Not Responding
- **Cause**: The backend is unable to communicate with the OCR service.
- **Solution**:
  - Verify that the `OCR_SERVICE_URL` is correctly set.
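
Several of the fixes above come down to checking environment variables. A minimal sketch of such a check follows, using the variable names mentioned in this section. The helper name `missing_env_vars` is hypothetical, and the script only detects unset or blank values; it cannot tell whether a value is actually correct.

```python
import os

# Variable names taken from the troubleshooting notes above.
REQUIRED_VARS = ("DB_URL", "DB_USERNAME", "DB_PASSWORD", "OCR_SERVICE_URL")


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or blank in env."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


# Example: report anything missing from the current shell environment.
missing = missing_env_vars()
if missing:
    print("Missing or empty:", ", ".join(missing))
else:
    print("All required variables are set.")
```

Running this inside the backend container (or with the same environment file) narrows down which setting to inspect first.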