This package converts unusual Unicode symbols to their ordinary equivalents using Tesseract OCR.
First, you need to install Tesseract. See the official Tesseract installation instructions.
If you don't want to use the pre-built Docker image, you'll also need a TTF file with a Unicode font. I highly recommend GNU Unifont, which is freely available for download.
For now, only Python 3.8 is supported. You can try other versions, but there is no guarantee that they will work properly.
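Since only Python 3.8 is supported, a script can check the interpreter version before importing the package. The following is a minimal sketch; the helper name `is_supported_python` is hypothetical and not part of this package.

```python
import sys

# The only version declared as supported above.
SUPPORTED = (3, 8)


def is_supported_python(version_info=None):
    """Return True if the major.minor version equals the supported one."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == SUPPORTED


if not is_supported_python():
    major, minor = sys.version_info[:2]
    print(f'Warning: running on Python {major}.{minor}; only 3.8 is supported')
```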
Clone repository:
git clone https://github.com/theseus-automl/ocr-unicode-normalizer
cd ocr-unicode-normalizer
Install package:
python setup.py install
from pathlib import Path
from ocr_unicode_normalizer import Normalizer
norm = Normalizer(font_path=Path('/path/to/font'))
print(norm.normalize('hello', lang='eng'))
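When several strings need to be normalized, the same `Normalizer` instance can be reused. The sketch below relies only on the `normalize(text, lang=...)` call shown above; `normalize_many` is a hypothetical helper, not part of the package, and it skips duplicates on the assumption that each OCR call is comparatively expensive.

```python
from typing import Dict, Iterable


def normalize_many(norm, texts: Iterable[str], lang: str = 'eng') -> Dict[str, str]:
    """Map each distinct input string to its normalized form."""
    results = {}
    for text in texts:
        # Normalize every distinct string only once.
        if text not in results:
            results[text] = norm.normalize(text, lang=lang)
    return results


# Usage (requires Tesseract and a font file):
# norm = Normalizer(font_path=Path('/path/to/font'))
# print(normalize_many(norm, ['hello', 'world', 'hello']))
```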
To run the API server, install additional requirements:
python -m pip install -r api-requirements.txt
Place config.yaml file next to main.py:
data:
  "tesseract_path": "/path/to/tesseract"
  "tessdata_path": "/path/to/tessdata"
  "font_path": "/path/to/font"
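A wrong path in the config is easy to miss until the server fails, so it can help to check the file before starting it. The following is a minimal sketch, assuming the three keys are nested under `data` and that the file has already been parsed into a dict (for example with PyYAML's `yaml.safe_load`). `check_config` is a hypothetical helper and is not part of this package.

```python
from pathlib import Path

# Keys taken from the config example above.
REQUIRED_KEYS = ('tesseract_path', 'tessdata_path', 'font_path')


def check_config(config: dict) -> list:
    """Return a list of problems; an empty list means the config looks usable."""
    # Assumption: all keys live under the top-level "data" mapping.
    data = config.get('data') or {}
    problems = []
    for key in REQUIRED_KEYS:
        value = data.get(key)
        if not value:
            problems.append(f'missing key: data.{key}')
        elif not Path(value).exists():
            problems.append(f'data.{key} points to a non-existent path: {value}')
    return problems


# Usage (requires PyYAML):
# import yaml
# with open('config.yaml') as f:
#     for problem in check_config(yaml.safe_load(f)):
#         print(problem)
```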
Start server:
uvicorn main:app --port 9000
Make request:
import requests
resp = requests.get('http://localhost:9000/normalize', json={'text': 'hello', 'lang': 'eng'})
print(resp.json())
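In a longer-running script, the request above can be wrapped with a timeout and an HTTP status check. The following is a sketch under assumptions: `normalize_remote` is a hypothetical helper, the response is returned as parsed JSON without assuming its structure, and the `get` parameter exists only so the function can be tried without a running server.

```python
def normalize_remote(text, lang='eng', base_url='http://localhost:9000',
                     get=None, timeout=10.0):
    """Call the /normalize endpoint and return the decoded JSON response."""
    if get is None:
        # Imported lazily so a stub can be passed in instead.
        import requests
        get = requests.get
    resp = get(
        f'{base_url}/normalize',
        json={'text': text, 'lang': lang},  # same payload as in the example above
        timeout=timeout,                    # avoid hanging forever if the server is down
    )
    resp.raise_for_status()  # raise on HTTP 4xx/5xx responses
    return resp.json()


# Usage (requires a running server):
# print(normalize_remote('hello', lang='eng'))
```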
A Docker image is also provided. You can view the available tags on Docker Hub.
Pull image:
docker pull sn4kebyt3/ocr-unicode-normalizer:0.1.0
Run container:
docker container run -p 9000:80 -d --name normalizer sn4kebyt3/ocr-unicode-normalizer:0.1.0
The request example shown above for the local server works here too, because the container's port 80 is published on host port 9000.
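Because the container is started in detached mode, the first request may arrive before the server is listening. The following sketch polls the published port until it accepts TCP connections; `wait_for_port` is a hypothetical helper, and an open port does not guarantee that the application itself has finished starting.

```python
import socket
import time


def wait_for_port(host='localhost', port=9000, timeout=30.0, interval=0.5):
    """Return True once a TCP connection succeeds, False if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Open and immediately close a connection to probe the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait and retry.
            time.sleep(interval)
    return False


# Usage:
# if wait_for_port('localhost', 9000):
#     print('container is accepting connections')
```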
Feel free to open issues, send pull requests and ask any questions!