How to correctly extract text from a pdf using pdf.js

生来不讨喜 2020-12-15 10:30

I'm new to ES6 and Promises. I'm trying to use pdf.js to extract the text from all pages of a PDF file into a string array. And when extraction is done, I want to parse the array som

4 answers
  • 2020-12-15 10:41

    A slightly cleaner version of @async5's answer, updated for the latest version of "pdfjs-dist": "^2.0.943"

    import PDFJS from "pdfjs-dist";
    import PDFJSWorker from "pdfjs-dist/build/pdf.worker.js"; // add this to fit 2.3.0
    
    PDFJS.disableTextLayer = true;
    PDFJS.disableWorker = true; // not available anymore since 2.3.0 (see imports)
    
    const getPageText = async (pdf: Pdf, pageNo: number) => {
      const page = await pdf.getPage(pageNo);
      const tokenizedText = await page.getTextContent();
      const pageText = tokenizedText.items.map(token => token.str).join("");
      return pageText;
    };
    
    /* see example of a PDFSource below */
    export const getPDFText = async (source: PDFSource): Promise<string> => {
      Object.assign(window, {pdfjsWorker: PDFJSWorker}); // added to fit 2.3.0
      const pdf: Pdf = await PDFJS.getDocument(source).promise;
      const maxPages = pdf.numPages;
      const pageTextPromises = [];
      for (let pageNo = 1; pageNo <= maxPages; pageNo += 1) {
        pageTextPromises.push(getPageText(pdf, pageNo));
      }
      const pageTexts = await Promise.all(pageTextPromises);
      return pageTexts.join(" ");
    };
    

    This is the corresponding TypeScript declaration file I used, in case anyone needs it.

    declare module "pdfjs-dist";
    
    type TokenText = {
      str: string;
    };
    
    type PageText = {
      items: TokenText[];
    };
    
    type PdfPage = {
      getTextContent: () => Promise<PageText>;
    };
    
    type Pdf = {
      numPages: number;
      getPage: (pageNo: number) => Promise<PdfPage>;
    };
    
    type PDFSource = Buffer | string;
    
    declare module 'pdfjs-dist/build/pdf.worker.js'; // needed in 2.3.0
    

    Example of how to get a PDFSource from a File using Buffer (from the Node types):

    file.arrayBuffer().then((ab: ArrayBuffer) => {
      const pdfSource: PDFSource = Buffer.from(ab);
    });
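
    Buffer comes from Node, so it may not exist in a plain browser build. A minimal sketch of an alternative, assuming the bytes are passed to getDocument as { data: ... } as in the shorter answer below; toPdfData is a hypothetical helper name, not part of pdf.js:

    ```javascript
    // Hypothetical helper (not part of pdf.js): view an ArrayBuffer as bytes.
    // new Uint8Array(arrayBuffer) wraps the same memory, it does not copy it.
    function toPdfData(arrayBuffer) {
      return new Uint8Array(arrayBuffer);
    }

    // Possible usage, assuming `file` is a browser File and `pdfjsLib` is loaded:
    //   const ab = await file.arrayBuffer();
    //   const pdf = await pdfjsLib.getDocument({ data: toPdfData(ab) }).promise;
    ```

    The typed array plays the same role as the Buffer above, so the rest of the extraction code should not need to change.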
    
  • 2020-12-15 10:46

    Here's a shorter (not necessarily better) version:

    async function getPdfText(data) {
        let doc = await pdfjsLib.getDocument({data}).promise;
        let pageTexts = Array.from({length: doc.numPages}, async (v,i) => {
            return (await (await doc.getPage(i+1)).getTextContent()).items.map(token => token.str).join('');
        });
        return (await Promise.all(pageTexts)).join('');
    }
    

    Here, data is a string or buffer (or you could change the function to take a URL, etc., instead).
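
    The question asks for a string array, while the function above joins everything into one string. A minimal sketch that keeps one string per page, assuming doc is an already loaded document with the same numPages / getPage / getTextContent shape used above; getPageTexts is a hypothetical name:

    ```javascript
    // Sketch: resolve to an array with one string per page.
    // `doc` is assumed to expose numPages and getPage(n), where each page has
    // getTextContent() resolving to { items: [{ str }, ...] }.
    async function getPageTexts(doc) {
      const pagePromises = Array.from({ length: doc.numPages }, async (_, i) => {
        const page = await doc.getPage(i + 1); // pages are numbered from 1
        const content = await page.getTextContent();
        return content.items.map(item => item.str).join("");
      });
      // Promise.all keeps the results in page order.
      return Promise.all(pagePromises);
    }

    // Possible usage:
    //   const doc = await pdfjsLib.getDocument({ data }).promise;
    //   const pages = await getPageTexts(doc); // pages[0] is the text of page 1
    ```

    Keeping the pages separate leaves the choice of separator (space, newline, nothing) to the code that parses the array afterwards.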

  • 2020-12-15 10:49

    If you use the PDFViewer component, here is my solution that doesn't involve any promise or asynchrony:

    function getDocumentText(viewer) {
        let text = '';
        for (let i = 0; i < viewer.pagesCount; i++) {
            const { textContentItemsStr } = viewer.getPageView(i).textLayer;
            for (let item of textContentItemsStr)
                text += item;
        }
        return text;
    }
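
    This reads whatever the viewer's text layers currently hold. A more defensive sketch, under the assumption that a page view which has not been rendered yet may have no textLayer (or no textContentItemsStr); getRenderedText and missingPages are hypothetical names:

    ```javascript
    // Sketch: same idea as above, but skip pages whose text layer is not there
    // and report them instead of throwing.
    function getRenderedText(viewer) {
      const parts = [];
      const missingPages = []; // 1-based page numbers with no text layer
      for (let i = 0; i < viewer.pagesCount; i++) {
        const pageView = viewer.getPageView(i);
        const layer = pageView && pageView.textLayer;
        if (!layer || !layer.textContentItemsStr) {
          missingPages.push(i + 1);
          continue;
        }
        parts.push(layer.textContentItemsStr.join(""));
      }
      return { text: parts.join(""), missingPages };
    }
    ```

    If missingPages is not empty, the promise-based answers are probably the safer route for those pages.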
    
  • 2020-12-15 10:55

    Similar to https://stackoverflow.com/a/40494019/1765767: collect the page promises with Promise.all, and don't forget to return each chained then so the results propagate to the caller:

    function gettext(pdfUrl){
      var pdf = PDFJS.getDocument(pdfUrl);
      return pdf.then(function(pdf) { // get all pages text
        var maxPages = pdf.pdfInfo.numPages;
        var countPromises = []; // collecting all page promises
        for (var j = 1; j <= maxPages; j++) {
          var page = pdf.getPage(j); // promise for page j
          countPromises.push(page.then(function(page) { // add page promise
            var textContent = page.getTextContent();
            return textContent.then(function(text){ // return content promise
              return text.items.map(function (s) { return s.str; }).join(''); // text of this page
            });
          }));
        }
        // Wait for all pages and join text
        return Promise.all(countPromises).then(function (texts) {
          return texts.join('');
        });
      });
    }
    
    // Wait for gettext to complete, or report the error
    gettext("https://cdn.mozilla.net/pdfjs/tracemonkey.pdf").then(function (text) {
      alert('parse ' + text);
    }, 
    function (reason) {
      console.error(reason);
    });
    <script src="https://npmcdn.com/pdfjs-dist/build/pdf.js"></script>
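
    The two points above (Promise.all keeps the input order, and every then must be returned) can be tried without pdf.js. A small sketch in which fakePageText is a hypothetical stand-in for the getPage / getTextContent chain, and the delays are arbitrary:

    ```javascript
    // Hypothetical stand-in for loading one page's text: resolves after delayMs.
    function fakePageText(pageNo, delayMs) {
      return new Promise(function (resolve) {
        setTimeout(function () { resolve("page " + pageNo); }, delayMs);
      });
    }

    function collectInOrder() {
      var delays = [30, 20, 10]; // page 1 finishes last, page 3 first
      var promises = [];
      for (var pageNo = 1; pageNo <= delays.length; pageNo++) {
        promises.push(fakePageText(pageNo, delays[pageNo - 1]));
      }
      // Returning the chained promise is what lets the caller wait for the text.
      return Promise.all(promises).then(function (texts) {
        return texts.join(" | ");
      });
    }
    ```

    Even though page 3 resolves first, the joined string lists the pages in the order the promises were pushed. Dropping the return in front of Promise.all would make collectInOrder return undefined instead of a promise.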
