Plans for tesseract 5.x.y #3673

Open
amitdo opened this issue Dec 5, 2021 · 135 comments

@amitdo
Collaborator

amitdo commented Dec 5, 2021

I suggest focusing on 5.x for 2022 at least.

That means we should not break the API (and ABI?). Use C++17, not C++20/C++23.

@stweil
Contributor

stweil commented Dec 22, 2021

What about releasing a 5.0.1 after Christmas at the end of December? I think there are several fixes since 5.0.0 which would be good for a new release.

@amitdo
Collaborator Author

amitdo commented Dec 22, 2021

Mind reader :-)
I was about to suggest releasing 5.0.1 before year end. It would be nice if we could fix #3683 before releasing 5.0.1.

@amitdo
Collaborator Author

amitdo commented Dec 22, 2021

Right before tagging 5.0.1, you can update this sentence from the README:

The latest stable version is 5.0.0, released on November 30, 2021.

@amitdo amitdo mentioned this issue Dec 23, 2021
@egorpugin
Contributor

What should be added into v5?
5.x changes could be merged into a separate branch and cherry-picked into the v6 main branch.

@stweil
Contributor

stweil commented Dec 23, 2021

We already have a wish list for improved training, a lot of issues with layout detection, a desire for improved logging, and much more. Maintaining two branches did not work well with 4.x, and I am afraid it would not work better with 5.x.

@egorpugin
Contributor

Maybe keep 5.0 as is? It is a good release with a number of changes.
Everything else will go straight into 6?

@amitdo
Collaborator Author

amitdo commented Dec 26, 2021

@amitdo
Collaborator Author

amitdo commented Jan 1, 2022

> What about releasing a 5.0.1 after Christmas at the end of December? I think there are several fixes since 5.0.0 which would be good for a new release.

Do you plan to release 5.0.1 next week?

@stweil
Contributor

stweil commented Jan 1, 2022

Yes, unless we discover that something very important is still missing.

@stweil
Contributor

stweil commented Jan 6, 2022

> It would be nice if we can fix #3683 before releasing 5.0.1.

There is still no fix, and I have no clang-cl, so I cannot look for a fix myself. Should we release 5.0.1 without a fix? Are other things missing for 5.0.1 (besides updating of the documentation)?

@egorpugin
Contributor

clang-cl is not worth it currently.

@amitdo
Collaborator Author

amitdo commented Jan 6, 2022

You can release 5.0.1 without the clang-cl fix.

@stweil
Contributor

stweil commented Jan 7, 2022

Release 5.0.1 is now online.

@stweil
Contributor

stweil commented Jan 7, 2022

The next release could be a new minor version 5.1.0 with new features, maybe at the end of January (unless there is an urgent need for a bug fix release 5.0.2). I especially want to have image information in ALTO and hOCR output (see PR #3710, which implements that for hOCR), and maybe more from the project list. The new minor release would also disable OpenMP by default for autoconf builds.

@stweil stweil pinned this issue Feb 10, 2022
@amitdo
Collaborator Author

amitdo commented Feb 14, 2022

https://packages.ubuntu.com/search?keywords=tesseract-ocr

@AlexanderP,

Are you going to update Ubuntu 22.04 to 5.0.1 soon? The feature freeze date is February 24.

@AlexanderP

@amitdo

I uploaded:

I hope @jbreiden will upload them to debian.

@amitdo
Collaborator Author

amitdo commented Feb 27, 2022

Hi @AlexanderP,

> I hope @jbreiden will upload them to debian.

From https://tracker.debian.org/pkg/tesseract :

maintainer: [Alexander Pozdnyakov]

So, why can't you directly push new versions of Tesseract to Debian?

@stweil
Contributor

stweil commented Feb 28, 2022

I'd like to create a new release Tesseract 5.1.0 soon. Originally I had planned it for end of January.

Are there any pending contributions or important bug fixes which should still be included (then I'd wait), or can we release now?

@Shreeshrii
Collaborator

I suggest you go ahead with 5.1.0 now.

I would like to see improvements related to training and evaluation implemented, but they could go in a future release.

@stweil
Contributor

stweil commented Mar 1, 2022

Release 5.1.0 is now available.

@AlexanderP

@amitdo I have no rights to upload to Debian.

@stweil
Contributor

stweil commented May 29, 2022

There are now several fixes and improvements in git master, so I think it's time for a new release 5.1.1.

@egorpugin, is it possible to fix the CI sw build which is currently failing?

Are there any other pending contributions or important bug fixes which should still be included (then I'd wait), or can we release now? Ideally #3782 should also be included.

@egorpugin
Contributor

Yes, I'll check.

@zdenop
Contributor

zdenop commented Jun 1, 2022

Unfortunately, the Windows build does not work (for me): I tried Clang (14) and MS Visual Studio (2019). Here are the logs:
clang_build.zip
msvc_build.zip

@amitdo
Collaborator Author

amitdo commented Jun 1, 2022

The cmake-win64 action has been failing since March 29.

The cmake and vcpkg actions pass.

@egorpugin
Contributor

I fixed the sw build in CI.
Zdenko, does it fail only on VS2019? Can you check VS2022?

@zdenop
Contributor

zdenop commented Jun 1, 2022

The cmake-win64 action has a strange error: it already fails when unzipping zlib (or maybe even earlier, while setting up the shell?)

image

And vcpkg is IMO not building the HEAD, but 5.1.0:

image

And I see this with HEAD:

image

@amitdo
Collaborator Author

amitdo commented Mar 11, 2024

Let's continue the discussion about the PDF renderer in issue #2879.

@amitdo
Collaborator Author

amitdo commented Mar 14, 2024

OK. I decided to remove my objection to the recent changes in the pdf renderer.

@amitdo
Collaborator Author

amitdo commented Mar 18, 2024

@stweil,

What about the useless OpenCL code? It's about time we removed it.

@stweil
Contributor

stweil commented Mar 25, 2024

@jbarlow83, are the latest changes in Tesseract's PDF renderer compatible with OCRmyPDF, or would they break it?

@jbarlow83

@stweil The changes in the PDF renderer are compatible with OCRmyPDF and yield a slight improvement in text positioning on Evince. LGTM.

I tested Tesseract commit 2b07505, which includes egorpugin's changes, by examining visual results in Evince using both OCRmyPDF's wrapper around the Tesseract PDF renderer (--pdf-renderer sandwich) and the direct output from the PDF renderer. I did not check macOS Preview, where the trouble usually is. I also confirmed that the debugging changes commit (which could have an impact on production output) still produces a syntactically valid PDF when debugging is off.

@amitdo
Collaborator Author

amitdo commented Apr 11, 2024

The next release will be 5.4.0.

@amitdo
Collaborator Author

amitdo commented Apr 12, 2024

> amitdo commented Mar 18, 2024
>
> @stweil,
>
> What about the useless OpenCL code? It's about time we removed it.

Done in #4220.

@amitdo
Collaborator Author

amitdo commented Apr 24, 2024

@stweil,

Can you please release 5.4.0 in the next few days?

@stweil
Contributor

stweil commented Apr 24, 2024

That's my plan.

@amitdo
Collaborator Author

amitdo commented Jun 4, 2024

> on Apr 24
>
> That's my plan.

So... what's your current plan?

@stweil
Contributor

stweil commented Jun 6, 2024

Done now, 5.4.0 is available. Sorry for the delay. And as always many thanks to all contributors and supporters who helped with issues, discussions and pull requests.

Some things (pull requests) remain open for follow-up releases.

@amitdo
Collaborator Author

amitdo commented Jun 6, 2024

Thanks, @stweil!

b8961a7 is not mentioned in the 5.4.0 changes.

@stweil
Contributor

stweil commented Jun 6, 2024

The list of changes is generated automatically by GitHub, which only uses the information from pull requests. Therefore, direct commits made by maintainers might be missing. I have now updated the release information with an initial comment, but feel free to suggest further improvements (or change the release notes as required).

@stweil
Contributor

stweil commented Jun 9, 2024

I think we need a 5.4.1 because of a regression with legacy models (issue #4257) which is now fixed in main. Is there anything else which should be included in the bug fix release?

@zdenop
Contributor

zdenop commented Jun 9, 2024

The GA cmake win64 build started to crash 4 days ago.
I am not able to replicate the problem on my laptop with MSVC 2019 (GA should use MSVC 2022).
When I download the artifacts, they also crash for tesseract --list-langs...

Also, I was able to replicate the problem described in tesseract-ocr/tesstrain#394 on Windows with the 5.4.0 code:

git clone --depth 1 https://github.com/tesseract-ocr/tesstrain

cd tesstrain
mkdir data
unzip ocrd-testset.zip -d data/ocrd-testset-ground-truth

make training MODEL_NAME=ocrd-testset START_MODEL=ces TESSDATA=..../tessdata/best  2>&1 | tee training.log

I am checking whether the latest code also fixed this...

@stweil
Contributor

stweil commented Jun 11, 2024

I agree that it would be good to clarify these two issues, but I cannot reproduce them up to now.

@zdenop
Contributor

zdenop commented Jun 11, 2024

The GA cmake win64 failure seems to be a GA/Windows environment issue, so you do not need to wait for this. (I already have a minimal working version; now I am trying to add steps to find the real cause of the problem.)

@zdenop
Contributor

zdenop commented Jun 11, 2024

It seems that the Release and RelWithDebInfo builds (Visual Studio 17 2022; MSVC 19.40.33811.0 / MSC v.1940) produce a tesseract that crashes. Debug works fine, so I will keep it for GA.
The Release build with MSVC 19.39.33523.0 / MSC v.1939 worked fine. To me this looks like a compiler issue, but I have no clue where and how to report it.
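When comparing a working and a crashing build as above, it can help to record which compiler produced each binary. The following is only a minimal sketch: `CompilerDescription` is a hypothetical helper name and is not part of Tesseract. It relies on the predefined compiler macros `_MSC_VER` / `_MSC_FULL_VER` (MSVC), `__clang_version__` (Clang) and `__GNUC__` / `__GNUC_MINOR__` (GCC).

```cpp
#include <string>

// Hypothetical helper (not part of Tesseract): describes the compiler that
// built this translation unit, using predefined compiler macros only.
std::string CompilerDescription() {
#if defined(__clang__)
  // Checked first because clang-cl defines _MSC_VER as well.
  return std::string("clang ") + __clang_version__;
#elif defined(_MSC_VER)
  // _MSC_VER is the value reported as "MSC v.1939" / "MSC v.1940" above;
  // _MSC_FULL_VER additionally encodes the build number.
  return "MSC v." + std::to_string(_MSC_VER) + " (full version " +
         std::to_string(_MSC_FULL_VER) + ")";
#elif defined(__GNUC__)
  return "GCC " + std::to_string(__GNUC__) + "." +
         std::to_string(__GNUC_MINOR__);
#else
  return "unknown compiler";
#endif
}
```

Printing the returned string once at startup (or in a version banner) would make it easier to tell from a CI log which toolchain built a given artifact.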

@stweil
Contributor

stweil commented Jun 11, 2024

The new release 5.4.1 which fixes the regression with legacy or mixed models is now available.

Many thanks for bug reports, reviews and other contributions!

@kloczek

kloczek commented Jun 11, 2024

5.4.1 has one issue: it uses the bundled googletest, which is included in the source tree as a submodule.
Why not use the system-installed gtest? 🤔 All distros provide gtest.

@amitdo
Collaborator Author

amitdo commented Jun 11, 2024

@kloczek,

This is not the right place to discuss the gtest issue.

The gtest issue is not new, and we discussed it in the past in #2838 and #3679.

@stweil
Contributor

stweil commented Oct 17, 2024

It's time for a new bug fix release. Is there anything urgent which should be included or fixed in the next release?

@zdenop
Contributor

zdenop commented Oct 17, 2024

I am in the process of creating cmake files with autotools (leptonica has it already). This is not critical, but it is taking more time than I expected...

@stweil
Contributor

stweil commented Oct 21, 2024

... and it currently breaks the autotools builds.

@zdenop
Contributor

zdenop commented Oct 22, 2024

This is an unrelated topic, as cmake generates tesseract.pc from a different template (tesseract.pc.cmake). Maybe the two could be unified, but that is not a topic for now.

@amitdo
Collaborator Author

amitdo commented Oct 22, 2024

@stweil, please go ahead with a new release.

@stweil
Contributor

stweil commented Oct 22, 2024

I'll try to fix the CI failures before tagging a new release.

@egorpugin
Contributor

I've checked this issue
https://github.com/tesseract-ocr/tesseract/pull/4330/files

TessBaseAPI::GetIterator() and some other methods (like GetUTF8Text()) return raw memory.
It would be nice if they returned unique_ptr<T> instead.
That way the return type itself clearly states who manages the memory of the returned objects, instead of relying on a mention in the documentation.

I propose to improve memory management of the public APIs in tess v6, because it is an API breakage.

In addition, the C API implementation would be updated from

TessResultIterator *TessBaseAPIGetIterator(TessBaseAPI *handle) {
  return handle->GetIterator();
}

to

TessResultIterator *TessBaseAPIGetIterator(TessBaseAPI *handle) {
  return handle->GetIterator().release();
}

So the C API will remain the same.
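The ownership pattern described above can be sketched in a self-contained form. This is only an illustration under stated assumptions: `DemoIterator`, `DemoAPI`, `DemoAPIGetIterator` and `DemoIteratorDelete` are hypothetical stand-ins, not the real Tesseract classes or C API functions, and a real C API would additionally declare the wrappers with C linkage.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for an iterator type (NOT the real Tesseract class).
struct DemoIterator {
  std::string text = "word";
};

// Hypothetical stand-in for the C++ API class.
class DemoAPI {
public:
  // Proposed style: the return type itself says that the caller owns the
  // result, so no documentation note about freeing it is needed.
  std::unique_ptr<DemoIterator> GetIterator() const {
    return std::make_unique<DemoIterator>();
  }
};

// C-style wrapper: still hands out a raw pointer, so its signature is
// unchanged. release() transfers ownership to the caller without deleting.
DemoIterator *DemoAPIGetIterator(DemoAPI *handle) {
  assert(handle != nullptr);
  return handle->GetIterator().release();
}

// Matching C-style cleanup function that the caller must invoke exactly once.
void DemoIteratorDelete(DemoIterator *it) {
  delete it;
}
```

A C++ caller would simply write `auto it = api.GetIterator();` and let the `unique_ptr` free the object at scope exit, while a caller of the C-style wrapper keeps pairing `DemoAPIGetIterator` with `DemoIteratorDelete` as before.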


So,

  1. How and when do we want API breaking changes?
  2. What other public API/ABI changes do we want? Do we need a tracking issue for them? Do we have one already?
  3. I think we have enough 5.x.x releases already. Maybe switch the master branch to v6 and create a separate v5 branch for small fixes?

@stweil
Contributor

stweil commented Oct 28, 2024

I just added #4336, and we can discuss and track API changes there.

11 participants