Can I read PDF or Word Docs with Node.js?

Backend  Unresolved  8  2077
不知归路 2021-02-02 14:11

I can't find any packages to do this. I know PHP has a ton of libraries for PDFs (like http://www.fpdf.org/), but is there anything for Node?

8 answers
  •  北荒 (original poster)
    2021-02-02 14:19

    You can use pdf-text for PDF files. It extracts the text from a PDF into an array of text 'chunks', which is useful for fuzzy parsing of structured PDF text.

    var fs = require('fs')
    var pdfText = require('pdf-text')

    var pathToPdf = __dirname + "/info.pdf"

    // Option 1: pass a file path
    pdfText(pathToPdf, function(err, chunks) {
      if (err) return console.error(err)
      // chunks is an array of strings
      // loosely corresponding to text objects within the pdf
      // for a more concrete example, view the test file in this repo
    })

    // Option 2: pass a buffer
    var buffer = fs.readFileSync(pathToPdf)
    pdfText(buffer, function(err, chunks) {
      if (err) return console.error(err)
      console.log(chunks)
    })
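
    To make the "fuzzy parsing" idea concrete, here is a minimal sketch of one way to use the chunks array: find the chunk that contains a label and take the chunk right after it as the value. The helper name valueAfterLabel and the sample data are made up for illustration and are not part of pdf-text; real PDFs may split text into chunks differently, so treat this as a starting point only.

```javascript
// Hypothetical helper (not part of pdf-text): return the chunk that follows
// the first chunk containing `label` (case-insensitive), or null if the label
// is missing or is the last chunk.
function valueAfterLabel(chunks, label) {
  const needle = label.toLowerCase();
  const index = chunks.findIndex(function (chunk) {
    return chunk.toLowerCase().includes(needle);
  });
  if (index === -1 || index + 1 >= chunks.length) {
    return null;
  }
  return chunks[index + 1].trim();
}

// Made-up sample data standing in for the `chunks` array from pdf-text.
const sampleChunks = ['Invoice No.', ' 10423 ', 'Total due', '$118.00'];

console.log(valueAfterLabel(sampleChunks, 'invoice no')); // prints "10423"
console.log(valueAfterLabel(sampleChunks, 'total'));      // prints "$118.00"
```

    In a real script you would call valueAfterLabel(chunks, ...) inside the pdfText callback instead of using sample data.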
    

    For Word documents you can use mammoth, which extracts the text from .docx files.

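    var mammoth = require("mammoth");

    mammoth.extractRawText({path: "./doc.docx"})
        .then(function(result){
            var text = result.value; // The raw text
            var messages = result.messages; // Any messages, such as warnings
            console.log(text);
        })
        .catch(function(error){
            // Use .catch() here: native promises have no .done() method
            console.error(error);
        });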
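
    If one script has to handle both kinds of file, a small dispatcher on the file extension can decide which of the two packages to call. This is only a sketch: pickExtractor is a hypothetical helper name, and it returns the package name as a string so the example runs without either package installed.

```javascript
const path = require('path');

// Hypothetical helper: map a file name to the package suggested above.
// Returns null for any extension this answer does not cover.
function pickExtractor(filePath) {
  const ext = path.extname(filePath).toLowerCase();
  if (ext === '.pdf') {
    return 'pdf-text';
  }
  if (ext === '.docx') {
    return 'mammoth';
  }
  return null;
}

console.log(pickExtractor('/tmp/info.pdf'));  // prints "pdf-text"
console.log(pickExtractor('./doc.DOCX'));     // prints "mammoth"
console.log(pickExtractor('notes.txt'));      // prints null
```

    The caller can then require the matching package and run the pdfText or mammoth.extractRawText call for that file.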
    

    I hope this will help.
