I\'m new to ES6 and Promise. I\'m trying pdf.js to extract texts from all pages of a pdf file into a string array. And when extraction is done, I want to parse the array som
A bit more cleaner version of @async5 and updated according to the latest version of "pdfjs-dist": "^2.0.943"
import PDFJS from "pdfjs-dist";
import PDFJSWorker from "pdfjs-dist/build/pdf.worker.js"; // add this to fit 2.3.0
PDFJS.disableTextLayer = true;
PDFJS.disableWorker = true; // not availaible anymore since 2.3.0 (see imports)
const getPageText = async (pdf: Pdf, pageNo: number) => {
const page = await pdf.getPage(pageNo);
const tokenizedText = await page.getTextContent();
const pageText = tokenizedText.items.map(token => token.str).join("");
return pageText;
};
/* see example of a PDFSource below */
export const getPDFText = async (source: PDFSource): Promise<string> => {
Object.assign(window, {pdfjsWorker: PDFJSWorker}); // added to fit 2.3.0
const pdf: Pdf = await PDFJS.getDocument(source).promise;
const maxPages = pdf.numPages;
const pageTextPromises = [];
for (let pageNo = 1; pageNo <= maxPages; pageNo += 1) {
pageTextPromises.push(getPageText(pdf, pageNo));
}
const pageTexts = await Promise.all(pageTextPromises);
return pageTexts.join(" ");
};
This is the corresponding typescript declaration file that I have used if anyone needs it.
declare module "pdfjs-dist";
type TokenText = {
str: string;
};
type PageText = {
items: TokenText[];
};
type PdfPage = {
getTextContent: () => Promise<PageText>;
};
type Pdf = {
numPages: number;
getPage: (pageNo: number) => Promise<PdfPage>;
};
type PDFSource = Buffer | string;
declare module 'pdfjs-dist/build/pdf.worker.js'; // needed in 2.3.0
Example of how to get a PDFSource from a File with Buffer (from node types) :
file.arrayBuffer().then((ab: ArrayBuffer) => {
const pdfSource: PDFSource = Buffer.from(ab);
});
Here's a shorter (not necessarily better) version:
async function getPdfText(data) {
let doc = await pdfjsLib.getDocument({data}).promise;
let pageTexts = Array.from({length: doc.numPages}, async (v,i) => {
return (await (await doc.getPage(i+1)).getTextContent()).items.map(token => token.str).join('');
});
return (await Promise.all(pageTexts)).join('');
}
Here, data
is a string or buffer (or you could change it to take the url, etc., instead).
If you use the PDFViewer
component, here is my solution that doesn't involve any promise or asynchrony:
function getDocumentText(viewer) {
let text = '';
for (let i = 0; i < viewer.pagesCount; i++) {
const { textContentItemsStr } = viewer.getPageView(i).textLayer;
for (let item of textContentItemsStr)
text += item;
}
return text;
}
Similar to https://stackoverflow.com/a/40494019/1765767 -- collect page promises using Promise.all and don't forget to chain then's:
function gettext(pdfUrl){
var pdf = PDFJS.getDocument(pdfUrl);
return pdf.then(function(pdf) { // get all pages text
var maxPages = pdf.pdfInfo.numPages;
var countPromises = []; // collecting all page promises
for (var j = 1; j <= maxPages; j++) {
var page = pdf.getPage(j);
var txt = "";
countPromises.push(page.then(function(page) { // add page promise
var textContent = page.getTextContent();
return textContent.then(function(text){ // return content promise
return text.items.map(function (s) { return s.str; }).join(''); // value page text
});
}));
}
// Wait for all pages and join text
return Promise.all(countPromises).then(function (texts) {
return texts.join('');
});
});
}
// waiting on gettext to finish completion, or error
gettext("https://cdn.mozilla.net/pdfjs/tracemonkey.pdf").then(function (text) {
alert('parse ' + text);
},
function (reason) {
console.error(reason);
});
<script src="https://npmcdn.com/pdfjs-dist/build/pdf.js"></script>