Extract text from PDFs that contain searchable text. The module is a wrapper that calls the pdftotext
command to perform the actual extraction.
npm install pdf-text-extract
You will need the pdftotext binary available on your path. Packages are available for many different operating systems.
See https://github.com/nisaacson/pdf-extract#osx for how to install the pdftotext command.
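Before calling the module, it can help to confirm that the binary is actually reachable. The following is a minimal sketch, not part of pdf-text-extract: the helper name isCommandAvailable is hypothetical, and it assumes that a failure to launch the process (for example ENOENT) is the only signal worth checking.

```javascript
var spawnSync = require('child_process').spawnSync

// Hypothetical helper (not part of pdf-text-extract).
// Returns true if `command` can be started from the current PATH.
// Only a failure to launch (for example ENOENT) counts as "missing";
// the exit code is ignored because it can differ between tools.
function isCommandAvailable (command, args) {
  var result = spawnSync(command, args || [], { stdio: 'ignore' })
  return !result.error
}

// `pdftotext -v` prints version information and exits
if (!isCommandAvailable('pdftotext', ['-v'])) {
  console.error('pdftotext was not found on your PATH')
}
```

If the check fails, install pdftotext using the instructions linked above before running the examples.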
var path = require('path')
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var extract = require('pdf-text-extract')
extract(filePath, function (err, pages) {
  if (err) {
    console.dir(err)
    return
  }
  // pages is an array with one string per page
  console.dir(pages)
})
The output will be an array where each entry is a page of text. If you want a single string containing all pages, you can use pages.join(' ').
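As a small illustration of combining pages, here is a sketch that uses a stand-in array in place of real extraction output. The helper name joinPages and the sample strings are made up for this example.

```javascript
// Hypothetical helper: combine the per-page strings into one string.
// `separator` defaults to a single space, as suggested above.
function joinPages (pages, separator) {
  return pages.join(separator === undefined ? ' ' : separator)
}

// Stand-in for the `pages` array passed to the extract callback
var samplePages = ['first page text', 'second page text']

console.log(joinPages(samplePages))
// prints "first page text second page text"
```

Passing '\n\n' as the separator instead keeps a visible break between pages.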
If needed, you can pass an optional options argument to the extract function. These options will be passed to the command.
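var path = require('path')
var filePath = path.join(__dirname, 'test/data/multipage.pdf')
var extract = require('pdf-text-extract')
var options = {
  cwd: "./"
}
extract(filePath, options, function (err, pages) {
  if (err) {
    console.dir(err)
    return
  }
  // console.log prints both the label and the pages;
  // console.dir would treat its second argument as display options
  console.log('extracted pages', pages)
})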
npm install -g pdf-text-extract
Execute with the file path as an argument. The output will be a JSON-formatted array of pages.
pdf-text-extract ./test/data/multipage.pdf
# outputs
# ['<page 1 content...>', '<page 2 content...>']
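The command-line output can also be consumed from another script. The sketch below assumes the globally installed pdf-text-extract command prints a strictly valid JSON array on stdout. The sample above shows single quotes, so treat that as an assumption to verify. The names parsePages and extractViaCli are hypothetical.

```javascript
var execFileSync = require('child_process').execFileSync

// Hypothetical helper: parse the CLI's stdout into an array of page strings.
// Throws if the text is not valid JSON or is not an array.
function parsePages (stdout) {
  var pages = JSON.parse(stdout)
  if (!Array.isArray(pages)) {
    throw new TypeError('expected a JSON array of pages')
  }
  return pages
}

// Hypothetical helper: run the CLI on a file and return its pages.
// Assumes `pdf-text-extract` is on the PATH (installed with npm install -g).
function extractViaCli (filePath) {
  var stdout = execFileSync('pdf-text-extract', [filePath], { encoding: 'utf8' })
  return parsePages(stdout)
}
```

For example, extractViaCli('./test/data/multipage.pdf') would return the same array of pages that the callback API provides, if the assumption about the output format holds.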
# install dev dependencies
npm install
# run tests
npm test