PDFBox 0.7.3 convert pdf to text

Question

I want to convert PDF files to text files, but some PDF files do not work with the PDFBox DLL because their PDF version is newer than the one used by Acrobat 5.x.

What should I do? This is the code I am using:

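// Assumes: using org.pdfbox.pdmodel; using org.pdfbox.util; (PDFBox 0.7.3 via IKVM)
output.WriteLine("Begin Parsing.....");
output.WriteLine(DateTime.Now.ToString());

// Load the PDF and extract the text of all pages
PDDocument doc = PDDocument.load(path);
PDFTextStripper stripper = new PDFTextStripper();

output.Write(stripper.getText(doc));
doc.close(); // release the underlying file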

Answer 1:

Your first step should be to try a current version of PDFBox. Your version, 0.7.3, dates back to 2006! PDFBox has meanwhile become an Apache project, located at http://pdfbox.apache.org/, and the current version (as of May 2013) is 1.8.1. I'm very sure that PDFBox nowadays supports PDF object streams and cross-reference streams, which were new in PDF Reference version 1.5, the version Adobe Acrobat 6 was built for.
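To check whether a failing file really declares a PDF version of 1.5 or later, you can read its header line, which has the form %PDF-1.x. The following is only a minimal sketch in plain Java, not part of PDFBox or iText; the class name PdfHeaderVersion and its methods are made up for illustration. Note that the document catalog can override the header version with a /Version entry, which this sketch does not look at.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for diagnosis only; not part of PDFBox or iText.
public class PdfHeaderVersion {
    // A PDF file starts with a header line such as "%PDF-1.5".
    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d)\\.(\\d)");

    // Returns the header version (e.g. "1.5"), or null if no header is
    // found in the first `length` bytes of `head`.
    static String parseVersion(byte[] head, int length) {
        String text = new String(head, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER.matcher(text);
        return m.find() ? m.group(1) + "." + m.group(2) : null;
    }

    // True if the version is 1.5 or later, i.e. the file may use object
    // streams and cross-reference streams.
    static boolean isAtLeast15(String version) {
        int major = version.charAt(0) - '0';
        int minor = version.charAt(2) - '0';
        return major > 1 || (major == 1 && minor >= 5);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfHeaderVersion <file.pdf>");
            return;
        }
        byte[] buf = new byte[1024];
        int n = -1;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            // The header sits at the very start, so one read is enough here.
            n = in.read(buf);
        }
        String version = n > 0 ? parseVersion(buf, n) : null;
        if (version == null) {
            System.out.println("No PDF header found");
        } else {
            System.out.println("PDF header version " + version
                + (isAtLeast15(version) ? " (may need a newer parser)" : ""));
        }
    }
}
```

If the files that fail all report 1.5 or later while the working ones report 1.4 or earlier, that supports the suspicion that the old parser is the problem.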

If that does not work, you might want to try other PDF libraries, e.g. iText (or iTextSharp in your case) version 5.4.x if the AGPL (or alternatively buying a license) is no problem for you.

Information on text parsing using iText(Sharp) can be found in chapter 15, "Marked content and parsing PDF", of iText in Action, 2nd Edition. The samples from that chapter are available online for both Java and .NET.

For a first test the sample ExtractPageContentSorted2.cs / ExtractPageContentSorted2.java would be a good start. The central code:

// Assumes: using System.Text; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
PdfReader reader = new PdfReader(PDF_FILE);
StringBuilder sb = new StringBuilder();
// iText page numbers are 1-based
for (int i = 1; i <= reader.NumberOfPages; i++) {
    sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, i));
}
reader.Close(); // release the underlying file

If neither a current PDFBox version nor a current iText(Sharp) version can parse your PDF, you might want to post a sample for inspection; there are ways to drop all information required for text parsing from a PDF...

Source: https://stackoverflow.com/questions/16374746/pdfbox-0-7-3-convert-pdf-to-text

Tags

itextsharp

pdfbox

pdftotext