How to correctly extract text from a PDF using pdf.js

I'm new to ES6 and Promises. I'm trying to use pdf.js to extract the text from all pages of a PDF file into a string array. When extraction is done, I want to parse the array som…

4 Answers

清酒与你

2020-12-15 10:41

A slightly cleaner version of @async5's answer, updated for "pdfjs-dist": "^2.0.943":

import PDFJS from "pdfjs-dist";
import PDFJSWorker from "pdfjs-dist/build/pdf.worker.js"; // add this to fit 2.3.0

PDFJS.disableTextLayer = true;
PDFJS.disableWorker = true; // not available anymore since 2.3.0 (see imports)

const getPageText = async (pdf: Pdf, pageNo: number) => {
  const page = await pdf.getPage(pageNo);
  const tokenizedText = await page.getTextContent();
  const pageText = tokenizedText.items.map(token => token.str).join("");
  return pageText;
};

/* see example of a PDFSource below */
export const getPDFText = async (source: PDFSource): Promise<string> => {
  Object.assign(window, {pdfjsWorker: PDFJSWorker}); // added to fit 2.3.0
  const pdf: Pdf = await PDFJS.getDocument(source).promise;
  const maxPages = pdf.numPages;
  const pageTextPromises = [];
  for (let pageNo = 1; pageNo <= maxPages; pageNo += 1) {
    pageTextPromises.push(getPageText(pdf, pageNo));
  }
  const pageTexts = await Promise.all(pageTextPromises);
  return pageTexts.join(" ");
};

This is the corresponding TypeScript declaration file I used, in case anyone needs it:

declare module "pdfjs-dist";

type TokenText = {
  str: string;
};

type PageText = {
  items: TokenText[];
};

type PdfPage = {
  getTextContent: () => Promise<PageText>;
};

type Pdf = {
  numPages: number;
  getPage: (pageNo: number) => Promise<PdfPage>;
};

type PDFSource = Buffer | string;

declare module 'pdfjs-dist/build/pdf.worker.js'; // needed in 2.3.0

Example of how to get a PDFSource from a File using Buffer (from the Node types):

file.arrayBuffer().then((ab: ArrayBuffer) => {
  const pdfSource: PDFSource = Buffer.from(ab);
});
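
Note that joining every item with an empty string drops the line structure of a page. Below is a minimal sketch of one common workaround. It assumes each item returned by getTextContent() carries a `transform` array whose sixth entry (`transform[5]`) is the item's vertical position, and that items arrive in reading order. The helper name `itemsToLines` and the tolerance of 1 unit are hypothetical choices for illustration, not part of pdf.js.

```javascript
// Group text items into lines by comparing their vertical positions.
// Assumption: items are in reading order and item.transform[5] is the y position.
function itemsToLines(items, tolerance = 1) {
  const lines = [];
  let current = []; // strings belonging to the line being built
  let lastY = null;
  for (const item of items) {
    const y = item.transform[5];
    // A vertical jump larger than the tolerance starts a new line.
    if (lastY !== null && Math.abs(y - lastY) > tolerance && current.length > 0) {
      lines.push(current.join(""));
      current = [];
    }
    current.push(item.str);
    lastY = y;
  }
  if (current.length > 0) {
    lines.push(current.join(""));
  }
  return lines;
}
```

Inside getPageText, this could replace the plain join, for example `itemsToLines(tokenizedText.items).join("\n")`. Multi-column layouts would likely need more than a y comparison.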

醉话见心

2020-12-15 10:46

Here's a shorter (not necessarily better) version:

async function getPdfText(data) {
    let doc = await pdfjsLib.getDocument({data}).promise;
    let pageTexts = Array.from({length: doc.numPages}, async (v,i) => {
        return (await (await doc.getPage(i+1)).getTextContent()).items.map(token => token.str).join('');
    });
    return (await Promise.all(pageTexts)).join('');
}

Here, data is a string or buffer (or you could change the function to take a URL, etc., instead).
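
The question asked for a string array with one entry per page, to be processed once extraction has finished. Here is a sketch of that variant, using only the `numPages`, `getPage`, and `getTextContent` calls already shown above. `getPageTexts` is a hypothetical name, and `doc` is assumed to be the object that `pdfjsLib.getDocument({data}).promise` resolves to.

```javascript
// Resolve to an array of strings, one per page, in page order.
// Assumption: doc exposes numPages and getPage(pageNo) as in the answer above.
async function getPageTexts(doc) {
  const pending = [];
  for (let pageNo = 1; pageNo <= doc.numPages; pageNo += 1) {
    pending.push(
      doc.getPage(pageNo)
        .then(page => page.getTextContent())
        .then(content => content.items.map(item => item.str).join(""))
    );
  }
  // Promise.all keeps the input order: index 0 is page 1, index 1 is page 2, ...
  return Promise.all(pending);
}
```

Any parsing of the array then goes in the continuation, e.g. `getPageTexts(doc).then(pageTexts => { /* parse here */ })`, which only runs once every page has been read.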

有刺的猬

2020-12-15 10:49

If you use the PDFViewer component, here is my solution, which involves no promises or asynchronous code:

function getDocumentText(viewer) {
    let text = '';
    for (let i = 0; i < viewer.pagesCount; i++) {
        const { textContentItemsStr } = viewer.getPageView(i).textLayer;
        for (let item of textContentItemsStr)
            text += item;
    }
    return text;
}
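
One caveat: this reads whatever the text layers currently hold, so it may fail or return incomplete text if some pages have not been rendered yet. The defensive sketch below skips pages whose text layer is absent or empty and reports them. It uses the same `pagesCount`, `getPageView`, `textLayer`, and `textContentItemsStr` members as the function above. The name `getRenderedText` and the shape of the returned object are hypothetical.

```javascript
// Collect text from the pages whose text layer is available.
// Returns the text plus the 1-based numbers of pages that had nothing to read.
function getRenderedText(viewer) {
  const parts = [];
  const missingPages = [];
  for (let i = 0; i < viewer.pagesCount; i++) {
    const pageView = viewer.getPageView(i);
    const layer = pageView && pageView.textLayer;
    const items = layer && layer.textContentItemsStr;
    if (!items || items.length === 0) {
      // Text layer absent or empty: note the page and move on.
      missingPages.push(i + 1);
      continue;
    }
    parts.push(items.join(""));
  }
  return { text: parts.join(""), missingPages };
}
```

A caller can check `missingPages.length` to decide whether to retry after more pages have rendered.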

渐次进展

2020-12-15 10:55

Similar to https://stackoverflow.com/a/40494019/1765767: collect the page promises using Promise.all, and don't forget to chain the then calls:

function gettext(pdfUrl){
  var pdf = PDFJS.getDocument(pdfUrl);
  return pdf.then(function(pdf) { // get all pages text
    var maxPages = pdf.numPages; // page count is exposed on the document object
    var countPromises = []; // collecting all page promises
    for (var j = 1; j <= maxPages; j++) {
      var page = pdf.getPage(j);
      countPromises.push(page.then(function(page) { // add page promise
        var textContent = page.getTextContent();
        return textContent.then(function(text){ // return content promise
          return text.items.map(function (s) { return s.str; }).join(''); // value page text 
        });
      }));
    }
    // Wait for all pages and join text
    return Promise.all(countPromises).then(function (texts) {
      return texts.join('');
    });
  });
}

// wait for gettext to complete, or to fail with an error
gettext("https://cdn.mozilla.net/pdfjs/tracemonkey.pdf").then(function (text) {
  alert('parse ' + text);
}, 
function (reason) {
  console.error(reason);
});

<script src="https://npmcdn.com/pdfjs-dist/build/pdf.js"></script>
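
Why the chaining matters: a then callback that starts asynchronous work without returning its promise lets the outer promise resolve early, with undefined. The small self-contained sketch below shows the difference. `fakeGetTextContent` is a hypothetical stand-in for a call such as page.getTextContent() and is not part of pdf.js.

```javascript
// Stand-in for an asynchronous call such as page.getTextContent().
function fakeGetTextContent() {
  return Promise.resolve({ items: [{ str: "Hello" }, { str: "PDF" }] });
}

// Broken: the inner promise is not returned, so the caller receives undefined.
function textWithoutReturn() {
  return Promise.resolve().then(function () {
    fakeGetTextContent().then(function (text) {
      return text.items.map(function (s) { return s.str; }).join("");
    });
  });
}

// Correct: returning the inner promise chains it into the outer one.
function textWithReturn() {
  return Promise.resolve().then(function () {
    return fakeGetTextContent().then(function (text) {
      return text.items.map(function (s) { return s.str; }).join("");
    });
  });
}
```

Here textWithoutReturn() resolves to undefined, while textWithReturn() resolves to "HelloPDF". The same rule applies to every level of nesting in gettext above.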
