Can I read PDF or Word Docs with Node.js?

不知归路 2021-02-02 14:11

I can't find any packages to do this. I know PHP has a ton of libraries for PDFs (like http://www.fpdf.org/), but is there anything for Node?

8 answers
  • 2021-02-02 14:19

    You can use pdf-text for PDF files. It extracts text from a PDF into an array of text 'chunks', which is useful for fuzzy parsing of structured PDF text.

    var fs = require('fs')
    var pdfText = require('pdf-text')
    var pathToPdf = __dirname + "/info.pdf"

    // Option 1: pass the path to the file
    pdfText(pathToPdf, function(err, chunks) {
      if (err) return console.error(err)
      // chunks is an array of strings
      // loosely corresponding to text objects within the PDF
      // for a more concrete example, view the test file in the pdf-text repo
      console.log(chunks)
    })

    // Option 2: pass a buffer that is already in memory
    var buffer = fs.readFileSync(pathToPdf)
    pdfText(buffer, function(err, chunks) {
      if (err) return console.error(err)
      console.log(chunks)
    })
    

    For .docx files you can use mammoth, which extracts the raw text from a .docx file.

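    var mammoth = require("mammoth");

    mammoth.extractRawText({path: "./doc.docx"})
        .then(function(result){
            var text = result.value; // The raw text
            var messages = result.messages; // Any messages, such as warnings
            console.log(text);
        })
        .catch(function(error){
            // Report failures instead of silently dropping them
            console.error(error);
        });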
    

    I hope this will help.

  • 2021-02-02 14:19

    For parsing PDF files you can use the pdf2json Node module.

    It converts a PDF file to JSON as well as to raw text data.
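
    A minimal sketch of the raw-text path, assuming pdf2json is installed and that its event-based API (pdfParser_dataReady, pdfParser_dataError, loadPDF, getRawTextContent) matches the version you use. The extractRawText helper and the ./info.pdf path are illustrative names, not part of the library.

    ```javascript
    // Hypothetical helper: resolves with the raw text of a PDF via pdf2json.
    // The parser class is injectable so the wrapper can be exercised without a
    // real PDF; pdf2json itself is only loaded when no class is passed in.
    function extractRawText(pdfPath, PDFParser = require('pdf2json')) {
      return new Promise((resolve, reject) => {
        // A truthy second argument asks pdf2json to keep the raw text content.
        const parser = new PDFParser(null, 1);
        parser.on('pdfParser_dataError', (errData) => reject(errData.parserError || errData));
        parser.on('pdfParser_dataReady', () => resolve(parser.getRawTextContent()));
        parser.loadPDF(pdfPath);
      });
    }

    // Usage (needs `npm install pdf2json`):
    //   extractRawText('./info.pdf').then(console.log, console.error);
    ```

    If you need positions or form fields rather than plain text, the object passed to the pdfParser_dataReady handler is the JSON representation.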

  • 2021-02-02 14:22

    Here is an example showing how to download and extract text from a PDF using PDF.js:

    import superagent from 'superagent';
    import pdf from 'pdfjs-dist';

    const url = 'http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf';

    const main = async () => {
      // Download the PDF into memory
      const response = await superagent.get(url).buffer();
      const data = response.body;
      // getDocument() returns a loading task; await its promise to get the document
      const doc = await pdf.getDocument({ data }).promise;
      // Pages are numbered from 1
      for (let pageNumber = 1; pageNumber <= doc.numPages; pageNumber++) {
        const page = await doc.getPage(pageNumber);
        const content = await page.getTextContent();
        for (const { str } of content.items) {
          console.log(str);
        }
      }
    };

    main().catch(error => console.error(error));
    
  • 2021-02-02 14:26

    I would suggest looking into unoconv for your initial conversion. It uses LibreOffice or OpenOffice for the actual conversion, which adds some overhead.

    I'd set up a few workers with all the necessities installed, and use a request/response queue to handle the conversions (you may want to look into kue or zmq).

    In general this is a CPU-bound, heavy task that should be offloaded. Pandoc and others specifically mention .docx, not .doc, so they may or may not be options as well.
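
    The worker idea above can be sketched in a single process. This is only an illustration: createQueue and convertToPdf are hypothetical helpers, the in-memory queue stands in for a real job queue such as kue or zmq, and it assumes the unoconv binary is on the PATH and accepts `-f pdf <file>`.

    ```javascript
    const { execFile } = require('child_process');

    // Hypothetical in-memory queue: runs at most `limit` jobs at once,
    // the rest wait in line until a slot frees up.
    function createQueue(limit) {
      let active = 0;
      const waiting = [];

      const next = () => {
        if (active >= limit || waiting.length === 0) return;
        active += 1;
        const { job, resolve, reject } = waiting.shift();
        Promise.resolve()
          .then(job)
          .then(resolve, reject)
          .finally(() => {
            active -= 1;
            next();
          });
      };

      // Enqueue a function that returns a promise; resolves with its result.
      return (job) =>
        new Promise((resolve, reject) => {
          waiting.push({ job, resolve, reject });
          next();
        });
    }

    // Assumed invocation: `unoconv -f pdf <file>` writes the PDF next to the input.
    function convertToPdf(inputPath) {
      return new Promise((resolve, reject) => {
        execFile('unoconv', ['-f', 'pdf', inputPath], (err) =>
          (err ? reject(err) : resolve(inputPath)));
      });
    }

    // Usage: never run more than two conversions in parallel.
    //   const enqueue = createQueue(2);
    //   ['a.docx', 'b.docx', 'c.docx'].forEach((file) =>
    //     enqueue(() => convertToPdf(file))
    //       .then(() => console.log('converted', file), console.error));
    ```

    Capping concurrency matters here because each conversion starts a heavy external process; a dedicated queue adds persistence and retries on top of the same idea.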


    Note: I know this question is old, just wanted to provide a current answer for others coming across this.

  • Looks like there are a few for PDF, but I didn't find any for Word.

    CPU-bound processing like that isn't really Node's strong point anyway (i.e. you get no additional benefit from using Node over any other language). A pragmatic approach would be to find a good tool and invoke it from Node.

    I have heard good things around the office about docsplit http://documentcloud.github.com/docsplit/

    While it's not Node, you could easily invoke it from Node with http://nodejs.org/docs/latest/api/all.html#child_process.exec
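
    A sketch of that approach using child_process.execFile, which passes arguments as an array and so avoids the shell quoting issues of exec. The run helper is hypothetical, and the commented docsplit command line (`docsplit text <file> -o <dir>`) is an assumption based on its documented CLI, so check it against your installed version.

    ```javascript
    const { execFile } = require('child_process');

    // Hypothetical helper: run an external tool and resolve with its stdout.
    function run(command, args) {
      return new Promise((resolve, reject) => {
        execFile(command, args, { maxBuffer: 16 * 1024 * 1024 }, (err, stdout, stderr) => {
          if (err) {
            err.stderr = stderr; // keep the tool's own error output for debugging
            reject(err);
          } else {
            resolve(stdout);
          }
        });
      });
    }

    // Usage (assumes docsplit is installed and on the PATH):
    //   run('docsplit', ['text', 'report.pdf', '-o', 'out'])
    //     .then(() => console.log('text written to out/'))
    //     .catch((err) => console.error('docsplit failed:', err.message));
    ```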

  • 2021-02-02 14:32

    textract is a great library that supports PDF, .doc, .docx, and other formats.
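
    A minimal sketch, assuming textract is installed along with any external tools it needs for your formats. It wraps the callback-style textract.fromFileWithPath(path, callback) in a promise; extractText is a hypothetical helper name.

    ```javascript
    // Hypothetical helper: promise wrapper around textract's callback API.
    // The module is injectable for testing; textract is only loaded if none is given.
    function extractText(filePath, textract = require('textract')) {
      return new Promise((resolve, reject) => {
        textract.fromFileWithPath(filePath, (error, text) => {
          if (error) reject(error);
          else resolve(text);
        });
      });
    }

    // Usage (needs `npm install textract`):
    //   extractText('./doc.docx').then(console.log, console.error);
    ```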
