
'±' is recognised as '+' #4286

Open · DominicMukilan opened this issue Jul 15, 2024 · 4 comments

DominicMukilan commented Jul 15, 2024

Current Behavior

No response

Expected Behavior

No response

Suggested Fix

No response

tesseract -v

tesseract v5.4.0.20240606
leptonica-1.84.1
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 3.0.1) : libpng 1.6.43 : libtiff 4.6.0 : zlib 1.3 : libwebp 1.4.0 : libopenjp2 2.5.2
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found libarchive 3.7.4 zlib/1.3.1 liblzma/5.6.1 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.6

Operating System

Windows 11

Other Operating System

No response

uname -a

No response

Compiler

No response

CPU

No response

Virtualization / Containers

No response

Other Information

Tesseract is recognising '±' as '+'. In some places it does not recognise the character at all.

Python 3.12
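
For reference, the misread can be reproduced on a single cropped symbol with nothing but pytesseract (a minimal sketch; `plus_minus_crop.png` is a hypothetical crop containing '±', and the default `eng` model is assumed):

```python
import pytesseract
from PIL import Image

# Hypothetical crop that contains a '±' symbol
img = Image.open("plus_minus_crop.png")

# Treat the crop as a single text line so layout analysis does not interfere
text = pytesseract.image_to_string(img, lang="eng", config="--psm 7")
print(repr(text))  # expected something like '±.003', observed '+.003'
```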

stweil (Contributor) commented Jul 15, 2024

Which model / language did you use?
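
As a quick check, the installed language models (and the one pytesseract falls back to) can be listed like this (a sketch; `get_languages` is available in recent pytesseract releases and wraps `tesseract --list-langs`):

```python
import pytesseract

# List the traineddata files Tesseract can see
print(pytesseract.get_languages(config=""))

# If no lang argument is passed, pytesseract defaults to 'eng'
```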

DominicMukilan (Author) commented Jul 15, 2024

Python 3.12
IDE: PyCharm

Reprex:
```python
import json
import cv2
import pytesseract
from PIL import Image
import pandas as pd

# Load the JSON file
json_path = "new_pred.json"
with open(json_path, "r") as file:
    annotations = json.load(file)

# Extract all coordinates without filtering
coordinates = [annotation["box"] for annotation in annotations]

# Load the image
image_path = "new_pred.jpg"
image = cv2.imread(image_path)

# Load the image with PIL to get its dimensions
image_pil = Image.open(image_path)
image_width, image_height = image_pil.size

# Function to crop image based on coordinates and perform OCR with boundary check
def crop_and_ocr_with_boundary_check(image, coordinates, image_width, image_height):
    ocr_results = []
    skipped_coordinates = []
    for i, (x1, y1, x2, y2) in enumerate(coordinates):
        # Adjust the coordinates to be within image boundaries
        original_coords = (x1, y1, x2, y2)
        x1 = max(0, min(x1, image_width - 1))
        y1 = max(0, min(y1, image_height - 1))
        x2 = max(0, min(x2, image_width))
        y2 = max(0, min(y2, image_height))

        # Check if the box is too small
        if x2 - x1 < 5 or y2 - y1 < 5:
            skipped_coordinates.append((i, original_coords, "Too small"))
            continue

        # Crop the region from the image
        cropped_img = image[y1:y2, x1:x2]

        # Perform OCR on the cropped image
        text = pytesseract.image_to_string(cropped_img)

        # Append the OCR result
        ocr_results.append({
            "coordinates": (x1, y1, x2, y2),
            "text": text.strip()  # Remove leading/trailing whitespace
        })

    return ocr_results, skipped_coordinates

# Perform OCR on the annotated regions with boundary check
ocr_results, skipped_coordinates = crop_and_ocr_with_boundary_check(image, coordinates, image_width, image_height)

# Convert OCR results to a DataFrame
ocr_df = pd.DataFrame(ocr_results)

# Print debugging information
print(f"Total annotations in JSON: {len(annotations)}")
print(f"Total OCR results: {len(ocr_results)}")
print(f"Skipped coordinates: {len(skipped_coordinates)}")
for skip in skipped_coordinates:
    print(f"  Index: {skip[0]}, Coordinates: {skip[1]}, Reason: {skip[2]}")

# Display the DataFrame
print(ocr_df)

# Optionally, save the results to a CSV file
ocr_df.to_csv("ocr_results.csv", index=False)
print("Results saved to ocr_results.csv")

# Print image dimensions
print(f"Image dimensions: {image_width}x{image_height}")
```

stweil (Contributor) commented Jul 15, 2024

Please also add your image (or its URL, if it is online) to this issue report.

DominicMukilan (Author) commented Jul 15, 2024

Python output:

```
Total annotations in JSON: 50
Total OCR results: 50
Skipped coordinates: 0
    coordinates                 text
0   (1763, 5732, 2293, 5861)
1   (1785, 5974, 2332, 6064)   | 1.314.01 Le
2   (1848, 6119, 2648, 6215)
3   (2901, 4062, 3223, 4164)   03 X 45°
4   (1029, 577, 1510, 665)
5   (8511, 2174, 8895, 2267)   188
6   (6735, 306, 7311, 411)     —e| [a—( 188 )
7   (1732, 3857, 2147, 3941)   — w=! 64 ba
8   (3571, 508, 4259, 604)     | |e ——_ 188+.003
9   (1069, 1827, 1666, 1940)   @d .615+.002\n\nLa
10  (2349, 5867, 2629, 5952)
11  (2409, 3895, 3120, 3987)   —e| -— .382+.003
12  (4672, 2200, 5422, 2320)   a 2.487+.002 ——=
13  (7402, 3622, 7733, 3817)   30°
14  (8679, 2312, 9175, 2417)
15  (9044, 4051, 9409, 4597)
16  (786, 771, 1275, 853)
17  (1721, 1328, 1869, 1528)
18  (3437, 2321, 3790, 2432)   -| 860
19  (2097, 4084, 2295, 4270)
20  (1159, 4032, 1699, 4153)   3 3/4-10 UNS-2A
21  (3918, 4779, 4131, 4931)   2.973
22  (8506, 531, 8895, 626)     1.595+.002
23  (8901, 997, 9284, 1168)
24  (5060, 1791, 5344, 1954)
25  (7401, 2650, 7823, 2751)   =| 420
26  (1850, 2369, 2072, 2481)   R.125
27  (3355, 651, 3604, 761)     ‘a
28  (8916, 3820, 9281, 4032)
29  (1778, 4305, 1924, 4589)
30  (5060, 1715, 5384, 1950)   ngle: 0.46
31  (984, 1257, 1145, 1370)    a\nA
32  (7791, 4801, 8101, 4951)
33  (8217, 2315, 9202, 2415)
34  (2343, 5511, 2888, 5609)
35  (8267, 1997, 8656, 2096)
36  (1462, 1665, 1715, 1764)
37  (433, 1303, 546, 1757)
38  (8384, 1476, 8512, 1830)
39  (1517, 5565, 2035, 5665)   332.01-— |
40  (6247, 3077, 6399, 4105)
41  (4327, 2035, 4901, 2181)   (.078 = —
42  (9207, 1376, 9383, 1751)
43  (4671, 4768, 4947, 4940)   |\n03.875
44  (4986, 896, 5622, 1195)    ay\n>
45  (4886, 1063, 5328, 1186)
46  (4751, 1802, 4996, 1954)
47  (3044, 4667, 3238, 4771)   -R.03
48  (8680, 3010, 8963, 3225)
49  (5117, 890, 5620, 1258)
Results saved to ocr_results.csv
Image dimensions: 10200x6600
```
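
To narrow down which of the boxes above are suspected '±' misreads, the `ocr_df` built by the script above can be filtered for texts containing a bare '+' (a small pandas sketch):

```python
# Rows whose OCR text contains a '+' are candidates for a misread '±'
suspects = ocr_df[ocr_df["text"].str.contains("+", regex=False)]
print(suspects[["coordinates", "text"]])
```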
new_pred.json
new_pred
requirements.txt
