Extract non ascii/unicode text from PDF #148

Open
AlmogEinstein opened this issue Aug 3, 2021 · 0 comments


AlmogEinstein commented Aug 3, 2021

Hey!
I'm trying to extract text from this file using TikaOnDotNet.TextExtraction.
The code is really basic:

    public static string Extract(string path)
    {
        var te = new TextExtractor();
        return te.Extract(path).Text;
    }

When extraction reaches the Arabic text in the attached PDF, I get a lot of warnings like the following:
WARN No Unicode mapping for behini (112) in font NSIEBX+OmegaSerifArabicOne
WARN No Unicode mapping for seenmed (148) in font NSIEBX+OmegaSerifArabicOne
WARN No Unicode mapping for meemfin (205) in font NSIEBX+OmegaSerifArabicOne
WARN No Unicode mapping for alifiso (109) in font NSIEBX+OmegaSerifArabicOne
WARN No Unicode mapping for lamini (191) in font NSIEBX+OmegaSerifArabicOne
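
To see which glyphs are affected, warnings of this shape can be grouped by font. The following is only a minimal sketch, assuming the log output has already been captured as plain text lines in exactly the format shown above. `GlyphWarningSummary` and `Summarize` are hypothetical names introduced for illustration and are not part of TikaOnDotNet or Tika.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of TikaOnDotNet: summarizes captured log lines.
public static class GlyphWarningSummary
{
    // Matches lines such as:
    // WARN No Unicode mapping for behini (112) in font NSIEBX+OmegaSerifArabicOne
    private static readonly Regex WarningPattern = new Regex(
        @"No Unicode mapping for (?<glyph>\S+) \((?<code>\d+)\) in font (?<font>\S+)");

    // Returns, for each font name, the sorted set of "glyphName (code)" entries
    // that were reported as having no Unicode mapping.
    public static Dictionary<string, SortedSet<string>> Summarize(IEnumerable<string> lines)
    {
        var byFont = new Dictionary<string, SortedSet<string>>();
        foreach (var line in lines)
        {
            var match = WarningPattern.Match(line);
            if (!match.Success)
            {
                continue; // skip log lines that are not mapping warnings
            }

            var font = match.Groups["font"].Value;
            if (!byFont.TryGetValue(font, out var glyphs))
            {
                glyphs = new SortedSet<string>(StringComparer.Ordinal);
                byFont[font] = glyphs;
            }

            glyphs.Add(match.Groups["glyph"].Value + " (" + match.Groups["code"].Value + ")");
        }
        return byFont;
    }
}
```

Passing the five warning lines above to `GlyphWarningSummary.Summarize` would yield a single entry keyed by `NSIEBX+OmegaSerifArabicOne`, which makes it easier to report the full list of unmapped glyph names for that font.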

This is the extracted text

I was wondering if there's an option to specify a decoding when extracting the text, or an option to convert all of the text to a different font that is supported in Tika?

P.S.
The English text is extracted fine :)
