Can I read PDF or Word Docs with Node.js?

不知归路 2021-02-02 14:11

I can't find any packages to do this. I know PHP has plenty of libraries for PDFs (such as http://www.fpdf.org/), but is there anything for Node?

8 answers

北荒 (original poster)

2021-02-02 14:19

You can use pdf-text for PDF files. It extracts the text from a PDF into an array of text 'chunks', which is useful for fuzzy parsing of structured PDF text.

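var path = require('path')
var fs = require('fs')
var pdfText = require('pdf-text')

var pathToPdf = path.join(__dirname, 'info.pdf')

// Option 1: pass a file path
pdfText(pathToPdf, function(err, chunks) {
  if (err) return console.error(err)
  // chunks is an array of strings
  // loosely corresponding to text objects within the PDF
  // for a more concrete example, view the test file in the pdf-text repo
  console.log(chunks)
})

// Option 2: pass a Buffer holding the PDF contents
var buffer = fs.readFileSync(pathToPdf)
pdfText(buffer, function(err, chunks) {
  if (err) return console.error(err)
  console.log(chunks)
})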
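As a rough illustration of the "fuzzy parsing" mentioned above, here is a minimal sketch that looks up the chunk following a label in the chunks array. The helper name valueAfterLabel and the sample data are made up for this example and are not part of pdf-text; real PDFs may split text differently, so treat it as a starting point only.

```javascript
// Hypothetical helper (not part of pdf-text): return the chunk that
// immediately follows a chunk matching `label`, ignoring case,
// surrounding whitespace and a trailing colon. Returns null if not found.
function valueAfterLabel(chunks, label) {
  var wanted = label.trim().toLowerCase();
  for (var i = 0; i < chunks.length - 1; i++) {
    var text = chunks[i].trim().toLowerCase().replace(/:$/, '');
    if (text === wanted) {
      return chunks[i + 1].trim();
    }
  }
  return null;
}

// Made-up chunks standing in for what the pdfText callback might receive.
var sampleChunks = ['Invoice', 'Total:', ' 42.00 ', 'Date', '2021-02-02'];
console.log(valueAfterLabel(sampleChunks, 'total'));
```

In practice you would call valueAfterLabel(chunks, 'Total') inside the pdfText callback, after checking err.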

For .docx files you can use mammoth, which extracts the text from .docx files.

var mammoth = require("mammoth");

mammoth.extractRawText({path: "./doc.docx"})
    .then(function(result){
        var text = result.value; // The raw text
        var messages = result.messages; // Any warnings raised during extraction
        console.log(text);
    })
    .catch(function(err){
        console.error(err);
    });
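If you need to handle both formats, one possible way to combine the two libraries is to choose by file extension. This is only a sketch: pickExtractor and extractText are hypothetical names introduced here, and it assumes pdf-text and mammoth are installed and behave as in the snippets above.

```javascript
var path = require('path');

// Hypothetical helper: decide which library should handle a file,
// based only on its extension.
function pickExtractor(filePath) {
  var ext = path.extname(filePath).toLowerCase();
  if (ext === '.pdf') return 'pdf-text';
  if (ext === '.docx') return 'mammoth';
  return null; // anything else, e.g. legacy .doc, is not covered here
}

// Hypothetical wrapper: returns a Promise for the extracted text as one string.
// The libraries are required lazily, so only the one that is needed gets loaded.
function extractText(filePath) {
  var kind = pickExtractor(filePath);
  if (kind === 'pdf-text') {
    return new Promise(function (resolve, reject) {
      require('pdf-text')(filePath, function (err, chunks) {
        if (err) return reject(err);
        resolve(chunks.join(' '));
      });
    });
  }
  if (kind === 'mammoth') {
    return Promise.resolve(require('mammoth').extractRawText({ path: filePath }))
      .then(function (result) { return result.value; });
  }
  return Promise.reject(new Error('Unsupported file type: ' + filePath));
}
```

Usage would look like extractText('./doc.docx').then(console.log).catch(console.error).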

I hope this helps.
