Question
I used the code from this tutorial to set up PDF-to-text conversion: http://ourcodeworld.com/articles/read/405/how-to-convert-pdf-to-text-extract-text-from-pdf-with-javascript
I looked all over https://mozilla.github.io/pdf.js/ for hints on how to format the conversion, but couldn't find anything. Does anyone know how to display line breaks as \n when parsing text using pdf.js?
Thanks in advance.
Answer 1:
In PDF there is no such thing as controlling layout with control characters such as '\n'; glyphs in a PDF are positioned using exact coordinates. Use the text's y-coordinate (which can be extracted from the transform matrix) to detect a line change.
var url = "https://cdn.mozilla.net/pdfjs/tracemonkey.pdf";
var pageNumber = 2;
// Load document
PDFJS.getDocument(url).then(function (doc) {
  // Get a page
  return doc.getPage(pageNumber);
}).then(function (pdfPage) {
  // Get page text content
  return pdfPage.getTextContent();
}).then(function (textContent) {
  var p = null;
  var lastY = -1;
  textContent.items.forEach(function (i) {
    // Track the y-coordinate; when it changes, start a new <p> element
    if (lastY != i.transform[5]) {
      p = document.createElement("p");
      document.body.appendChild(p);
      lastY = i.transform[5];
    }
    p.textContent += i.str;
  });
});
<script src="https://npmcdn.com/pdfjs-dist/build/pdf.js"></script>
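The snippet above writes each line into its own p element, while the question asks for '\n' in the extracted text. The same y-coordinate check can build a plain string instead. This is only a minimal sketch: `itemsToText` and its `tolerance` parameter are hypothetical names, not part of pdf.js, and it assumes each item has the `str` and `transform` fields used above, with items arriving in reading order.

```javascript
// Hypothetical helper (not part of pdf.js): joins text items into one
// string, inserting "\n" whenever the y-coordinate changes.
// `tolerance` is an assumed knob for ignoring tiny baseline differences.
function itemsToText(items, tolerance) {
  var tol = tolerance === undefined ? 0 : tolerance;
  var text = "";
  var lastY = null;
  items.forEach(function (item) {
    var y = item.transform[5]; // vertical translation of the text matrix
    if (lastY !== null && Math.abs(y - lastY) > tol) {
      text += "\n";
    }
    text += item.str;
    lastY = y;
  });
  return text;
}

// Made-up items shaped like textContent.items, for illustration only
var sample = [
  { str: "Hello ", transform: [1, 0, 0, 1, 10, 700] },
  { str: "world", transform: [1, 0, 0, 1, 50, 700] },
  { str: "Next line", transform: [1, 0, 0, 1, 10, 688] }
];
console.log(JSON.stringify(itemsToText(sample))); // "Hello world\nNext line"
```

In the promise chain above, the last callback could then call itemsToText(textContent.items) instead of creating p elements. A small non-zero tolerance may help when superscripts or mixed fonts shift the baseline slightly, at the cost of possibly merging very tightly spaced lines.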
Source: https://stackoverflow.com/questions/44376415/display-line-breaks-as-n-in-pdf-to-text-conversion-using-pdf-js