Tesseract's Recognition Is Appallingly Inept: Even Custom Training Fails Completely. Trash #409

Open
kinghelong opened this issue Dec 8, 2024 · 1 comment

@kinghelong

Tesseract 5.5 is fundamentally flawed. The pre-trained model's accuracy for Chinese characters is abysmal, barely reaching 20% in my tests. Frustrated by this, I attempted to train my own model, following the available tutorials meticulously. Despite my best efforts, the results were a complete failure – empty pages or no output at all. Below, I outline my process in detail to highlight just how broken Tesseract truly is.

My Training Process:
Dataset Preparation:

I created a custom dataset of Chinese characters.
Each character was rendered as an image using a program I wrote. The images are 50x50 pixels with black text on a white background, use a font size of approximately 48pt, and are entirely noise-free and synthetically generated, ensuring a perfect input for OCR.
For each image:
I manually wrote a .box file to ensure the coordinates were correct. (The auto-generated .box files were completely wrong, often splitting a single character into multiple entries.)
I also created .gt.txt files containing the corresponding ground truth.
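
A quick consistency check on the ground-truth folder can rule out mismatched files before training. The sketch below assumes the tesstrain convention of pairing each image with a same-named `.gt.txt` file in `data/train-ground-truth`. The helper name `find_unpaired` and the list of image suffixes are hypothetical and not part of tesstrain.

```python
from pathlib import Path

# Assumed image types; adjust to whatever the dataset actually contains.
IMAGE_SUFFIXES = (".tif", ".png")


def find_unpaired(ground_truth_dir):
    """Return sorted names of images lacking a non-empty .gt.txt beside them."""
    root = Path(ground_truth_dir)
    missing = []
    for image in sorted(root.iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        # str.tif is expected to be paired with str.gt.txt
        gt = image.with_name(image.stem + ".gt.txt")
        if not gt.is_file() or not gt.read_text(encoding="utf-8").strip():
            missing.append(image.name)
    return missing


if __name__ == "__main__":
    folder = Path("data/train-ground-truth")
    if folder.is_dir():
        for name in find_unpaired(folder):
            print("no usable ground truth for", name)
```

An empty result only shows that the files are paired. It says nothing about whether the transcriptions or the images themselves are suitable for training.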
Training Execution:

Following Tesseract's documentation, I used the tesstrain utility to start training.
My training setup:
Directory: data/train-ground-truth for all my prepared images and accompanying files.
Command:
make training MODEL_NAME=train START_MODEL=chi_sim TESSDATA=../tessdata/ MAX_ITERATIONS=500 LEARNING_RATE=0.01
The training process completed without any errors or warnings, suggesting everything was properly configured.
Testing the Trained Model:

After training, I copied the resulting train.traineddata file into Tesseract's tessdata folder.
I then used Tesseract to test the model on one of the same training images:
tesseract str.tif output -l train
Result: Empty page.
Tesseract didn’t recognize a single pixel of text from the image it had supposedly been trained on.
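
One thing worth ruling out for a single-glyph image like this is the page segmentation mode. By default Tesseract runs automatic page layout analysis (`--psm 3`), which may report an empty page when the image holds only one tightly cropped character, while `--psm 10` treats the image as a single character. Whether that changes the result here is an assumption, not something confirmed in this report. The helper `build_cmd` below is hypothetical and only assembles the command line, reusing the file and model names from the command above.

```python
import shutil
import subprocess
from pathlib import Path


def build_cmd(image, out_base, lang, psm=10):
    """Assemble a tesseract CLI call with an explicit page segmentation mode."""
    return ["tesseract", str(image), str(out_base), "-l", lang, "--psm", str(psm)]


if __name__ == "__main__":
    cmd = build_cmd("str.tif", "output", "train")
    print(" ".join(cmd))
    # Only run when the binary and the test image are actually present.
    if shutil.which("tesseract") and Path("str.tif").is_file():
        subprocess.run(cmd, check=False)
```

Comparing the output for `psm=10` against a line-oriented mode such as `psm=7` on the same image would show whether segmentation, rather than the trained model, is what produces the empty page.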
Further Debugging Attempts:

I tried enlarging the image dimensions (e.g., 100x100, 200x200), keeping the font size unchanged. This had no effect – the output remained "empty page."
The images were noise-free, perfectly clean, and in the simplest possible format (black text on a white background). Yet Tesseract completely failed to process them.
Issues with Tesseract:
Pre-trained Models Are Useless: The chi_sim model cannot recognize even basic Chinese text with reasonable accuracy, making it effectively worthless for practical use.

Training Is a Black Box: While the training process runs without errors, the results are completely non-functional. Tesseract provides no meaningful diagnostics or tools to identify where the problem lies.

Empty Page for Perfect Input: The input images used for testing were the same as those used for training. These images are synthetic, noise-free, and contain single characters. If Tesseract cannot recognize this, what can it recognize?

Broken Auto-Generated Files: The .box files generated during training are absurdly wrong, often splitting a single character into multiple entries. I had to manually correct them, which is unreasonable and error-prone for large datasets.

My Conclusion:
Tesseract, even after decades of development, is incapable of handling the simplest, cleanest OCR tasks reliably. Its current state is an embarrassment for any project that claims to be a leading open-source OCR engine.

If Tesseract developers cannot fix these fundamental issues, perhaps it would be better to release the core algorithms and let the community rebuild something functional from scratch. Right now, Tesseract is nothing more than a collection of bugs masquerading as a tool.
[Attached image: str13]
You can't even recognize such a simple word. Just delete the database of your product. It's useless anyway.

@aria-afk

aria-afk commented Jan 6, 2025

  1. MAX_ITERATIONS=500 is very low, why did you think this was a good number?
  2. How many sets of data did you provide in your ground truth? And was this data single lines of characters?
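
To put the first question in numbers: assuming each training iteration consumes one ground-truth image (an assumption about how `MAX_ITERATIONS` is counted, not something stated in this thread), the number of times each sample is seen follows directly from the dataset size. The sample counts below are made-up illustrations, since the issue does not say how many images were used.

```python
def passes_per_sample(max_iterations, num_samples):
    """Average number of times each sample is seen, assuming one sample per iteration."""
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    return max_iterations / num_samples


if __name__ == "__main__":
    # Hypothetical dataset sizes, for illustration only.
    for n in (100, 500, 3000):
        print(f"{n} samples: {passes_per_sample(500, n):.2f} passes each")
```

With a few thousand distinct characters, 500 iterations would not show every sample to the trainer even once, which is the concern behind question 1.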
