Can we extract text from word page by page #129

Anupam750 · 2018-11-05T11:10:08Z

Please explain: how can I get the text from a PDF or Word document page by page?

KevM · 2018-11-07T14:14:28Z

Look at the unit tests. 🤗

bugybunny · 2018-11-08T10:03:17Z

I'm also interested in this (but for PDFs). I guess you mean tikaondotnet/src/TikaOnDotNet.Tests/text_extraction.cs, right? I couldn't find an example for that.

Anupam750 · 2018-11-08T12:43:54Z

I have also checked it and did not find anything related to pages in the code you mentioned.

KevM · 2018-11-08T17:43:17Z

Not sure I follow what you are trying to do. Can you give me more detail about what you need? There are examples of text extraction against many file types in the tests.

bugybunny · 2018-11-11T19:51:55Z

We want to extract the text of a .pdf or .docx (or .doc, I don't know, I can't speak for Anupam750) page by page. I currently just use new TextExtractor().Extract("filename.pdf").Text to get the text from a PDF, but I would love to know where the page breaks are.
I read somewhere that PDFBox outputs a <br> at the beginning of each page, but I don't know which method I need to call to get this behaviour from Tika/TikaOnDotNet.

KevM · 2018-11-13T13:18:20Z

I don't believe that is an option that Tika offers. Tika is a high-level abstraction. If you want to do this more precisely, I would look at using POI directly: https://poi.apache.org/text-extraction.html

bugybunny · 2018-11-14T10:07:46Z

Can you make something out of this answer? https://stackoverflow.com/a/6271696/4040068

I thought that it's not possible, and I know that Tika is just an abstraction layer for getting the content out of many different formats. I have to use .NET (so PDFBox itself is not an option; there is a PDFBox build for IKVM, but it's not maintained anymore), and TikaOnDotNet was the only PDF extraction tool I can use given its license and low cost. So I was hoping I could use it for a bit more than plain extraction, but it's not really a problem :)

KevM · 2018-11-15T14:18:56Z

You might be able to hook into Tika to get the raw markup. Not sure. Let me know what you find out.
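For anyone trying the raw-markup route, here is a minimal sketch of the post-processing step only. It assumes you have already obtained Tika's XHTML output as a string (how to do that through TikaOnDotNet is not shown here and is not confirmed in this thread), and that the PDF parser wraps each page in a `<div class="page">` element. The names `PageSplitter` and `split_pages` are made up for illustration. It is written in Python with the standard library only; the same logic should port to C#. Word documents generally do not carry fixed page boundaries, so this is unlikely to help for .doc/.docx.

```python
from html.parser import HTMLParser


class PageSplitter(HTMLParser):
    """Collect text per <div class="page"> block (assumed Tika-style XHTML)."""

    def __init__(self):
        super().__init__()
        self.pages = []   # one list of text chunks per page found so far
        self._depth = 0   # div nesting depth inside the current page; 0 = outside

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth > 0:
            # Nested div inside a page: track it so we know when the page ends.
            self._depth += 1
        elif "page" in (dict(attrs).get("class") or "").split():
            # A new page starts here.
            self.pages.append([])
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth == 0:
            return
        if tag == "p":
            # Keep paragraph boundaries as line breaks.
            self.pages[-1].append("\n")
        elif tag == "div":
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.pages[-1].append(data)


def split_pages(xhtml):
    """Return a list with the text of each page, in document order."""
    parser = PageSplitter()
    parser.feed(xhtml)
    parser.close()
    return ["".join(chunks).strip() for chunks in parser.pages]


if __name__ == "__main__":
    # Hand-written sample markup standing in for real Tika output.
    sample = (
        '<html><body>'
        '<div class="page"><p>First page</p></div>'
        '<div class="page"><p>Second</p><p>page</p></div>'
        '</body></html>'
    )
    for number, text in enumerate(split_pages(sample), start=1):
        print(f"--- page {number} ---")
        print(text)
```

Text outside any page div is ignored, so input without page markers yields an empty list rather than one big page.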
