PDFBox 0.7.3 convert pdf to text

Submitted by ╄→尐↘猪︶ㄣ on 2019-12-25 07:18:20

Question


I want to convert PDF files to text files, but some PDF files do not work with the PDFBox DLL because they were created with a version of Acrobat newer than Acrobat 5.x.

What should I do?

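output.WriteLine("Begin Parsing.....");
output.WriteLine(DateTime.Now.ToString());

// Load the PDF, extract its text, and always close the document afterwards.
PDDocument doc = PDDocument.load(path);
try {
    PDFTextStripper stripper = new PDFTextStripper();
    output.Write(stripper.getText(doc));
} finally {
    doc.close();
}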

Answer 1:


Your first step should be to try a current version of PDFBox. Version 0.7.3 dates back to 2006! PDFBox has since become an Apache project, located at http://pdfbox.apache.org/, and the current version (as of May 2013) is 1.8.1. I'm quite sure that PDFBox nowadays supports PDF object streams and cross-reference streams, which were new in PDF Reference version 1.5, the version Adobe Acrobat 6 was built for.
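To judge whether a failing file might rely on those PDF 1.5 features, one quick check is to read the version declared in its header. Below is a minimal sketch in plain Java (PDFBox itself is a Java library, and the logic ports directly to C#). The class PdfVersionCheck and its methods are hypothetical helpers for illustration, not part of PDFBox or iText. Treat the result as a hint only: a /Version entry in the document catalog can override the header, and a 1.5+ file does not necessarily use object streams.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox or iText.
final class PdfVersionCheck {
    // A PDF file starts with a header comment of the form "%PDF-M.N".
    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d)\\.(\\d)");

    /** Returns {major, minor} from the header, or null if none is found. */
    static int[] headerVersion(byte[] firstBytes) {
        int len = Math.min(firstBytes.length, 1024);
        // ISO-8859-1 maps every byte to one char, so binary data decodes safely.
        String head = new String(firstBytes, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER.matcher(head);
        if (!m.find()) {
            return null;
        }
        return new int[] { Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)) };
    }

    /** Reads the first bytes of a file and parses its header version. */
    static int[] headerVersion(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[1024];
            int n = Math.max(in.read(buf), 0); // read() returns -1 for an empty file
            return headerVersion(Arrays.copyOf(buf, n));
        }
    }

    /** True if the header declares PDF 1.5 or later. */
    static boolean declaresPdf15OrLater(byte[] firstBytes) {
        int[] v = headerVersion(firstBytes);
        return v != null && (v[0] > 1 || (v[0] == 1 && v[1] >= 5));
    }
}
```

Calling PdfVersionCheck.headerVersion(Paths.get("input.pdf")) on a problem file (the file name is a placeholder) returns the declared version. If it reports 1.5 or later, an outdated parser is a plausible cause of the failure.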

If that does not work, you might want to try other PDF libraries, e.g. iText (or iTextSharp in your case) version 5.4.x if the AGPL (or alternatively buying a license) is no problem for you.

Information on text parsing using iText(Sharp) can be found in chapter 15, "Marked content and parsing PDF", of iText in Action, 2nd Edition. The samples from that chapter can be found online for both Java and .NET.

For a first test the sample ExtractPageContentSorted2.cs / ExtractPageContentSorted2.java would be a good start. The central code:

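PdfReader reader = new PdfReader(PDF_FILE);
// Not used by the loop below; PdfTextExtractor works directly on the reader.
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
StringBuilder sb = new StringBuilder();
// Page numbers are 1-based.
for (int i = 1; i <= reader.NumberOfPages; i++) {
    sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, i));
}
reader.Close(); // release the underlying file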

If neither a current PDFBox version nor a current iText(Sharp) version can parse your PDF, you might want to post a sample for inspection; there are ways to drop all information required for text parsing from a PDF...



Source: https://stackoverflow.com/questions/16374746/pdfbox-0-7-3-convert-pdf-to-text
