Extract non ascii/unicode text from PDF #148

AlmogEinstein · 2021-08-03T11:15:16Z

Hey!
I'm trying to extract text from this file using tikaondotnet.extraction.
The code is really basic:
public static string Extract(string path)
{
    var te = new TextExtractor();
    return te.Extract(path).Text;
}

When I get to the Arabic text part in the attached PDF, I get a lot of warnings like the following:
WARN No Unicode mapping for behini (112) in font NSIEBX+OmegaSerifArabicOne
WARN No Unicode mapping for seenmed (148) in font NSIEBX+OmegaSerifArabicOne
WARN No Unicode mapping for meemfin (205) in font NSIEBX+OmegaSerifArabicOne
WARN No Unicode mapping for alifiso (109) in font NSIEBX+OmegaSerifArabicOne
WARN No Unicode mapping for lamini (191) in font NSIEBX+OmegaSerifArabicOne

This is the extracted text

I was wondering whether there is an option to specify a decoding when extracting the text, or an option to convert all the text to a different font that is supported in Tika?

P.S. The English text is extracted fine :)
