Can I read PDF or Word Docs with Node.js?

I can't find any packages to do this. I know PHP has a ton of libraries for PDFs (like http://www.fpdf.org/), but is there anything for Node?

8 answers

北荒

2021-02-02 14:19

You can use pdf-text for PDF files. It extracts the text from a PDF into an array of text "chunks", which is useful for fuzzy parsing of structured PDF text.

var fs = require('fs')
var pdfText = require('pdf-text')
var pathToPdf = __dirname + "/info.pdf"

// Option 1: pass the path to the PDF
pdfText(pathToPdf, function(err, chunks) {
  if (err) return console.error(err)
  // chunks is an array of strings
  // loosely corresponding to text objects within the pdf
  // for a more concrete example, view the test file in the pdf-text repo
  console.log(chunks)
})

// Option 2: pass a Buffer holding the PDF contents
var buffer = fs.readFileSync(pathToPdf)
pdfText(buffer, function(err, chunks) {
  if (err) return console.error(err)
  console.log(chunks)
})

For .docx files you can use mammoth, which extracts the text from a .docx file.

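var mammoth = require("mammoth");

mammoth.extractRawText({path: "./doc.docx"})
    .then(function(result){
        var text = result.value; // The raw text
        console.log(text);
        var messages = result.messages; // Any warnings reported during extraction
    })
    .catch(function(error){
        // Report read or parse failures instead of dropping them
        console.error(error);
    });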

I hope this will help.

梦如初夏

2021-02-02 14:19

For parsing PDF files you can use the pdf2json Node module.

It can convert a PDF file to JSON as well as to raw text.
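
A rough sketch of how that might look, not taken from the pdf2json documentation verbatim. It assumes the parsed JSON keeps its pages under `Pages` (or `formImage.Pages` in older releases) and that each text run's `T` field is URI-encoded, so check both against the version you install. `decodeRun`, `pagesToText` and `pdfToText` are names made up for this example.

```javascript
// Assumption: pdf2json URI-encodes each text run; fall back to the raw
// string if it turns out not to be valid URI-encoded text.
function decodeRun(t) {
  try {
    return decodeURIComponent(t);
  } catch (e) {
    return t;
  }
}

// Flatten the parsed JSON into plain text, one line per page.
// Assumption: pages live at pdfData.Pages (newer releases) or
// pdfData.formImage.Pages (older releases).
function pagesToText(pdfData) {
  const pages = pdfData.Pages || (pdfData.formImage && pdfData.formImage.Pages) || [];
  return pages
    .map(page => page.Texts
      .map(text => text.R.map(run => decodeRun(run.T)).join(''))
      .join(' '))
    .join('\n');
}

// Parse a PDF file and resolve with its text.
function pdfToText(pdfPath) {
  // Loaded lazily so the helpers above work without the package installed.
  const PDFParser = require('pdf2json');
  return new Promise((resolve, reject) => {
    const parser = new PDFParser();
    parser.on('pdfParser_dataError', errData => reject(errData.parserError));
    parser.on('pdfParser_dataReady', pdfData => resolve(pagesToText(pdfData)));
    parser.loadPDF(pdfPath);
  });
}
```

Usage would be something like `pdfToText(__dirname + '/info.pdf').then(console.log).catch(console.error)`. Spacing between runs is approximate, since the JSON stores positioned text fragments rather than lines.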

予麋鹿

2021-02-02 14:22

Here is an example showing how to download and extract text from a PDF using PDF.js:

import _ from 'lodash';
import superagent from 'superagent';
import pdf from 'pdfjs-dist';

const url = 'http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf';

const main = async () => {
  // Download the whole PDF into memory as binary data
  const response = await superagent.get(url).buffer();
  const data = new Uint8Array(response.body);
  // getDocument() returns a loading task; await its promise to get the document
  const doc = await pdf.getDocument({ data }).promise;
  // Page numbers are 1-based
  for (const i of _.range(doc.numPages)) {
    const page = await doc.getPage(i + 1);
    const content = await page.getTextContent();
    // Each item is one positioned text fragment on the page
    for (const { str } of content.items) {
      console.log(str);
    }
  }
};

main().catch(error => console.error(error));

别那么骄傲

2021-02-02 14:26

I would suggest looking into unoconv for your initial conversion. It uses LibreOffice or OpenOffice for the actual conversion, which adds some overhead.

I'd set up a few workers with everything they need installed, and use a request/response queue to handle the conversions (you may want to look into kue or zmq).

In general this is a heavy, CPU-bound task that should be offloaded. Pandoc and others specifically mention .docx, not .doc, so they may or may not be options as well.
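
To make the worker idea concrete, here is a minimal sketch that assumes a single Node process and an in-memory queue. It is a stand-in for kue or zmq, not an example of either. The `ConversionQueue` class and its methods are made up for illustration; each job is any function that returns a promise, for instance one that spawns the converter.

```javascript
// In-memory job queue that never runs more than `concurrency` jobs at once.
class ConversionQueue {
  constructor(concurrency) {
    this.concurrency = concurrency;
    this.running = 0;   // jobs currently in flight
    this.pending = [];  // jobs waiting for a free slot
  }

  // Enqueue a job (a function returning a promise).
  // Resolves or rejects with the job's own result.
  push(job) {
    return new Promise((resolve, reject) => {
      this.pending.push({ job, resolve, reject });
      this._next();
    });
  }

  // Start waiting jobs while there are free slots.
  _next() {
    while (this.running < this.concurrency && this.pending.length > 0) {
      const { job, resolve, reject } = this.pending.shift();
      this.running += 1;
      Promise.resolve()
        .then(job)
        .then(resolve, reject)
        .finally(() => {
          // Free the slot and pick up the next waiting job, if any.
          this.running -= 1;
          this._next();
        });
    }
  }
}
```

With a concurrency of 2, pushing five jobs starts two immediately and leaves three waiting. Unlike a real queue such as kue, nothing here survives a process restart, so treat it as a way to cap load rather than as durable job storage.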

Note: I know this question is old, just wanted to provide a current answer for others coming across this.

不要未来只要你来

2021-02-02 14:29

Looks like there are a few for PDF, but I didn't find any for Word.

CPU-bound processing like that isn't really Node's strong point anyway (i.e. you get no additional benefit from using Node over any other language). A pragmatic approach would be to find a good tool and utilise it from Node.

I have heard good things around the office about docsplit http://documentcloud.github.com/docsplit/

While it's not Node, you could easily invoke it from Node with http://nodejs.org/docs/latest/api/all.html#child_process.exec
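
Calling an external tool could look roughly like the sketch below. It uses `child_process.execFile` rather than `exec`, so arguments are passed as an array and never go through a shell. `runTool` and `docsplitTextArgs` are hypothetical helpers written for this example, and the `text` subcommand and `--output` flag are my assumption about the docsplit CLI, so confirm them with `docsplit --help`.

```javascript
const { execFile } = require('child_process');

// Run an external program and resolve with everything it wrote to stdout.
function runTool(command, args, options = {}) {
  return new Promise((resolve, reject) => {
    // Raise maxBuffer so large text output is not cut off.
    const execOptions = { maxBuffer: 16 * 1024 * 1024, ...options };
    execFile(command, args, execOptions, (error, stdout, stderr) => {
      if (error) {
        error.stderr = stderr; // keep the tool's own error output for debugging
        return reject(error);
      }
      resolve(stdout);
    });
  });
}

// Hypothetical helper: argument list for extracting text with docsplit.
// Assumption: `docsplit text <file> --output <dir>` is the right invocation.
function docsplitTextArgs(inputPath, outputDir) {
  return ['text', inputPath, '--output', outputDir];
}
```

A call would then be `runTool('docsplit', docsplitTextArgs('report.pdf', 'out'))`, with the extracted text read back from the output directory. The child process does the heavy lifting, so Node's event loop stays free while it runs.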

小鲜肉

2021-02-02 14:32

textract is a great library that supports PDF, .doc, .docx, and other formats.
