An OCR program based on pytesseract, a Python wrapper for the Tesseract OCR engine. It includes language models to improve OCR performance.
Install Tesseract
- For Mac users: brew install tesseract
- For Windows users: The latest installer can be downloaded from here.
- For Linux users: sudo apt install tesseract-ocr -y
Add the Tesseract installation path to your system's PATH environment variable
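To confirm that the path was picked up, a quick check from Python can help. This is a minimal sketch using only the standard library; the helper names `find_tesseract` and `tesseract_version` are illustrative and not part of this repository.

```python
import shutil
import subprocess


def find_tesseract(binary_name="tesseract"):
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    return shutil.which(binary_name)


def tesseract_version(binary_path):
    """Return the first line printed by `tesseract --version`."""
    result = subprocess.run(
        [binary_path, "--version"], capture_output=True, text=True, check=True
    )
    # Depending on the release, the version text may land on stdout or stderr.
    output = result.stdout or result.stderr
    return output.splitlines()[0]


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract was not found on PATH; check the environment variable.")
    else:
        print(path, "->", tesseract_version(path))
```

If the executable cannot be placed on PATH, pytesseract also lets you point at it directly by setting `pytesseract.pytesseract.tesseract_cmd` to the full path of the binary.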
Download language models here.
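After downloading, it can be useful to check which language models Tesseract actually sees. The sketch below parses the output of `tesseract --list-langs`; the function names are hypothetical, and the codes shown in the usage note (`eng`, `swe`) are only examples of what may be installed.

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    codes = []
    for line in output.splitlines():
        line = line.strip()
        # The first line is a header such as "List of available languages (3):".
        if line and not line.startswith("List of"):
            codes.append(line)
    return codes


def installed_languages(binary="tesseract"):
    """Ask the Tesseract executable which language models it can find."""
    result = subprocess.run(
        [binary, "--list-langs"], capture_output=True, text=True, check=True
    )
    # Depending on the release, the list may land on stdout or stderr.
    return parse_list_langs(result.stdout or result.stderr)


def lang_argument(codes):
    """Join codes with '+', the form Tesseract accepts for several languages at once."""
    return "+".join(codes)
```

For example, `lang_argument(["eng", "swe"])` gives `"eng+swe"`, which can be passed as the `lang` argument of pytesseract calls when both models are installed.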
- Run OCR_singleImage.ipynb to produce a searchable PDF and the OCR text as output.
- Run OCR_multipleImage.ipynb to process multiple images in batches.
- Run OCR_textExtractor.ipynb to perform text detection using pytesseract.
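For readers who want to see roughly what these notebooks do, here is a minimal sketch of the three tasks using pytesseract and Pillow. It is not the notebooks' actual code: the function names, the set of file suffixes, and the output layout are assumptions made for illustration.

```python
from pathlib import Path

# Assumed set of image types; adjust to your data.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def find_images(folder):
    """Return the image files in `folder`, sorted by name."""
    return sorted(p for p in Path(folder).iterdir() if p.suffix.lower() in IMAGE_SUFFIXES)


def ocr_image(image_path, out_dir, lang="eng"):
    """Write <name>.txt and a searchable <name>.pdf for one image; return the text."""
    # Imported here so the pure helpers in this sketch work without the OCR stack.
    import pytesseract
    from PIL import Image

    image_path, out_dir = Path(image_path), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with Image.open(image_path) as img:
        text = pytesseract.image_to_string(img, lang=lang)
        pdf_bytes = pytesseract.image_to_pdf_or_hocr(img, extension="pdf", lang=lang)
    (out_dir / f"{image_path.stem}.txt").write_text(text, encoding="utf-8")
    (out_dir / f"{image_path.stem}.pdf").write_bytes(pdf_bytes)
    return text


def ocr_folder(folder, out_dir, lang="eng"):
    """Batch mode: run ocr_image on every image in `folder`."""
    return {p.name: ocr_image(p, out_dir, lang) for p in find_images(folder)}


def words_from_data(data, min_conf=0.0):
    """Turn an image_to_data dict into (text, left, top, width, height) tuples."""
    boxes = []
    for i, word in enumerate(data["text"]):
        # Rows for pages, blocks, and lines have empty text and a confidence of -1.
        if word.strip() and float(data["conf"][i]) >= min_conf:
            boxes.append((word, data["left"][i], data["top"][i],
                          data["width"][i], data["height"][i]))
    return boxes


def detect_words(image_path, lang="eng", min_conf=0.0):
    """Text detection: return word-level bounding boxes for one image."""
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        data = pytesseract.image_to_data(
            img, lang=lang, output_type=pytesseract.Output.DICT
        )
    return words_from_data(data, min_conf)
```

A call such as `ocr_folder("scans", "output", lang="eng")` would process every image in a hypothetical `scans` directory, while `detect_words("page.png", min_conf=60)` would keep only words Tesseract reports with a confidence of at least 60.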
If you come from a non-technical background and would like to set up pytesseract on your computer from scratch, please refer to the instructions here: Mac, Windows. The guide also covers setting up Python and a virtual environment.
- Our implementation is based on Google's Tesseract OCR engine.
Ekta Vats
Centre for Digital Humanities
Uppsala University
Sweden