Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update requirement for six library: >=1.11,<2.0 #27

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Update requirement for six library: >=1.11,<2.0 #27

wants to merge 1 commit into from

Conversation

DanielSwain
Copy link
Contributor

=1.11,<2.0 is the current Wagtail requirement for six

=1.11,<2.0 is the current Wagtail requirement for six
@codecov
Copy link

codecov bot commented Apr 23, 2019

Codecov Report

Merging #27 into master will decrease coverage by 2.05%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #27      +/-   ##
==========================================
- Coverage   89.72%   87.67%   -2.06%     
==========================================
  Files          15       15              
  Lines         146      146              
==========================================
- Hits          131      128       -3     
- Misses         15       18       +3
Impacted Files Coverage Δ
src/wagtail_textract/handlers.py 61.9% <0%> (-14.29%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 645b454...f83d134. Read the comment docs.

@DanielSwain
Copy link
Contributor Author

@khink Any suggestions for how to remedy this code coverage issue? Not sure how updating this dependency would have caused wagtail_textract to not pick up any text in your test document. Codecov says that the following lines were never even executed:

29.  if text:
30.     document.transcription = text.decode()
31.     document.save(transcribe=False)

Could it be that the 10 second delay for Tesseract to work was not long enough in this case since it never even executed the if text: statement?

As an aside, I looked at test_management_command.py and don't understand why it has:

assert 'CORRECT' not in document.transcription
assert 'STAPLE' not in document.transcription

Your note in this test file explains that 'CORRECT' and 'STAPLE' are the only two words that Tesseract DOES recognize, so I would have thought the assertion would be:

assert 'CORRECT' in document.transcription
assert 'STAPLE' in document.transcription

Your tests have obviously been working, so I'm sure I'm just not seeing things right.

@khink
Copy link
Contributor

khink commented May 3, 2019

Hi Dan,

Sorry it took some time to get back on this. It looks like you're right, that test does not test what it should, and it really doesn't prove much. I haven't been able to get the tests running myself yet due to dependency problems (new laptop). Not sure when i get around to fixing it. What would you propose to do? I'd be happy to get your PR in.

@DanielSwain
Copy link
Contributor Author

If you would approve this PR despite the code coverage problem, then one easy, short-term solution to the code coverage problem might be to type up a few sentences, print and scan that, and upload it in place of the current document that is being OCRed. In my attempts at OCRing, I've found that Tesseract 3 does not do well at all in identifying words that are large and handwritten, but it does do pretty well on regular typewritten documents. I even OCRed some meeting minutes from the 1950s, and it did an acceptable job (though it would be nice to step up Tesseract 4 at some point; however, I realize that this would ideally happen in Textract itself since they're only at 3). If you do the new scan, change the assertions to be positive, and perhaps set the delay to a longer period to give time for the OCRing to complete, then I would think the testing issue would be resolved.

@khink
Copy link
Contributor

khink commented May 15, 2019

Hi Dan,

Thanks for your research.

I'm reluctant to do it while tests for the PR are failing. That said, having no test is better than what we have now. I'm sorry i left it in this state.

Would you be willing to make a quick fix to get tests to pass first? Perhaps you could replace the current handwritten test file with a typewritten document, which is properly recognized, so the test makes sense. Or if you decide to just test that the management command runs without error, or that document.transcription is not empty, that's also fine by me.

Hope to hear from you.

@khink
Copy link
Contributor

khink commented Feb 4, 2021

Hi Dan,

See my reply above, how do you propose to proceed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants