Hey!
I'm trying to extract text from this file using tikaondotnet.extraction. The code is really basic:
    public static string Extract(string path)
    {
        var te = new TextExtractor();
        return te.Extract(path).Text;
    }
When I get to the Arabic text in the attached PDF, I get a lot of warnings like the following:
WARN No Unicode mapping for behini (112) in font NSIEBX+OmegaSerifArabicOne
WARN No Unicode mapping for seenmed (148) in font NSIEBX+OmegaSerifArabicOne
WARN No Unicode mapping for meemfin (205) in font NSIEBX+OmegaSerifArabicOne
WARN No Unicode mapping for alifiso (109) in font NSIEBX+OmegaSerifArabicOne
WARN No Unicode mapping for lamini (191) in font NSIEBX+OmegaSerifArabicOne
This is the extracted text
I was wondering if there's an option to add a decode specification when extracting the text, or an option to convert all the text to a different font that is supported in Tika?
P.S. The English text is extracted fine :)
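
For anyone trying to reproduce this, one quick way to tell whether the Arabic glyphs are actually missing from the output (rather than only producing warnings) is to count Arabic code points in the extracted string. This is only a rough sketch: `ArabicCheck`, `IsArabic` and `CountArabic` are hypothetical helper names, not part of TikaOnDotNet, and the ranges assume that only the main Arabic block and the two Arabic Presentation Forms blocks matter here.

```csharp
using System;
using System.Linq;

public static class ArabicCheck
{
    // True for characters in the main Arabic block (U+0600..U+06FF),
    // Arabic Presentation Forms-A (U+FB50..U+FDFF), or
    // Arabic Presentation Forms-B up to U+FEFC (U+FEFF, the BOM, is excluded).
    public static bool IsArabic(char c) =>
        (c >= '\u0600' && c <= '\u06FF') ||
        (c >= '\uFB50' && c <= '\uFDFF') ||
        (c >= '\uFE70' && c <= '\uFEFC');

    // Counts Arabic characters in the text; null or empty input counts as zero.
    public static int CountArabic(string text) =>
        string.IsNullOrEmpty(text) ? 0 : text.Count(IsArabic);
}
```

Calling `ArabicCheck.CountArabic(Extract(path))` with the `Extract` method above and getting a count at or near zero for a PDF that visibly contains Arabic would suggest that the unmapped glyphs are being dropped or replaced rather than decoded.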