This project is a simple wrapper around the excellent and robust Tika text extraction Java library. It produces two NuGet packages:
- TikaOnDotNet - A straight IKVM-hosted port of the Java Tika project.
- TikaOnDotNet.TextExtractor - Use Tika to extract text from rich documents.
The best way to get started is to:
- Add a NuGet dependency on TikaOnDotNet.TextExtractor.
- Instantiate a new TextExtractor object and call one of the Extract methods.
// using TikaOnDotNet.TextExtraction;
var textExtractor = new TextExtractor();
var wordDocContents = textExtractor.Extract(@".\path\to\my favorite word.docx");
var webPageContents = textExtractor.Extract(new Uri("https://google.com"));
Take a look at our tests for more usage examples.
Have an idea to make this project better? Great! Start out by taking a look at our Contributing Guide.
Search the Issues first, as your problem may be a common one. If you don't find it there, please create an issue. Contributors here will chime in when they can.