Reading PDF document with iTextSharp creates string with repeating first page

Posted by 与世无争的帅哥 on 2020-01-06 19:07:13

Question


I currently use iTextSharp to read in some PDF files and parse them using the string I get back. I have encountered strange behavior with some PDF files. For a 4-page PDF, for example, the returned string contains the pages in the following order:

1 2 1 3 1 4

My code for reading the files is as follows:

using (PdfReader reader = new PdfReader(fileStream))
{
     StringBuilder sb = new StringBuilder();

     ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
     for (int page = 0; page < reader.NumberOfPages; page++)
     {
         string text = PdfTextExtractor.GetTextFromPage(reader, page + 1, strategy);
         if (!string.IsNullOrWhiteSpace(text))
             sb.Append(Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text))));
     }

     Debug.WriteLine(sb.ToString());
}

Here is a link to a file with which this behaviour occurs:

https://onedrive.live.com/redir?resid=D9FEFF3BF45E05FD!1536&authkey=!AFLRlskAvlg89yY&ithint=file%2cpdf

Hope you guys can help me out!


Answer 1:


Thanks to Chris Haas I found out what was going wrong. The samples found online on how to use iTextSharp are either incorrect or at least incorrect for my implementation.

A new SimpleTextExtractionStrategy needs to be instantiated for every page you read. The strategy keeps the text it has already extracted, so reusing one instance makes each call return the text of the earlier pages again in addition to the current page.
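The effect of reusing one instance can be illustrated without iTextSharp. The sketch below assumes the strategy appends to a single internal buffer and never clears it; AccumulatingStrategy and Demo are hypothetical stand-ins written for this illustration, not iTextSharp types.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for a stateful extraction strategy: it keeps
// appending to one internal buffer and never resets it.
class AccumulatingStrategy
{
    private readonly StringBuilder buffer = new StringBuilder();

    public string Extract(string pageText)
    {
        buffer.Append(pageText);
        return buffer.ToString(); // everything seen so far, not just this page
    }
}

static class Demo
{
    static readonly string[] Pages = { "[1]", "[2]", "[3]" };

    // One shared instance: every call also returns the earlier pages.
    public static string ReuseOneStrategy()
    {
        var sb = new StringBuilder();
        var strategy = new AccumulatingStrategy();
        foreach (string page in Pages)
            sb.Append(strategy.Extract(page));
        return sb.ToString();
    }

    // A new instance per page: each call returns only that page.
    public static string FreshStrategyPerPage()
    {
        var sb = new StringBuilder();
        foreach (string page in Pages)
            sb.Append(new AccumulatingStrategy().Extract(page));
        return sb.ToString();
    }

    static void Main()
    {
        Console.WriteLine(ReuseOneStrategy());     // [1][1][2][1][2][3]
        Console.WriteLine(FreshStrategyPerPage()); // [1][2][3]
    }
}
```

With a shared instance the first page shows up again before every later page, which is the kind of repetition described in the question.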

Also, the line that appends to the StringBuilder does not need the encoding round trip, since text is already a .NET string. It can be changed from:

sb.Append(Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text))));

to

sb.Append(text);
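To see why dropping the conversion is safe, here is a small sketch of the same round trip with the encoding passed in as a parameter. RoundTrip and EncodingRoundTrip are hypothetical names, and Encoding.ASCII stands in for a legacy Encoding.Default purely so the result does not depend on the machine's code page. As far as I know, Encoding.Default is the system ANSI code page on .NET Framework and UTF-8 on .NET Core and later.

```csharp
using System;
using System.Text;

static class EncodingRoundTrip
{
    // Same shape as the original Append argument, with the encoding passed in.
    public static string RoundTrip(string text, Encoding legacy)
    {
        byte[] legacyBytes = legacy.GetBytes(text); // lossy if legacy cannot represent a character
        byte[] utf8Bytes = Encoding.Convert(legacy, Encoding.UTF8, legacyBytes);
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    static void Main()
    {
        Console.WriteLine(RoundTrip("Cafe", Encoding.ASCII));      // Cafe
        Console.WriteLine(RoundTrip("Caf\u00e9", Encoding.ASCII)); // Caf?
    }
}
```

At best the round trip returns the string unchanged. At worst it replaces characters the intermediate encoding cannot represent, so appending text directly is both simpler and safer.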

Thus the following code gives the correct result:

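// Namespaces used: System.Text, System.Diagnostics, iTextSharp.text.pdf, iTextSharp.text.pdf.parser
using (PdfReader reader = new PdfReader(fileStream))
{
    StringBuilder sb = new StringBuilder();

    // iTextSharp page numbers are 1-based, hence page + 1 below
    for (int page = 0; page < reader.NumberOfPages; page++)
    {
        // A fresh strategy per page, so no text carries over from earlier pages
        string text = PdfTextExtractor.GetTextFromPage(reader, page + 1, new SimpleTextExtractionStrategy());
        if (!string.IsNullOrWhiteSpace(text))
            sb.Append(text);
    }

    Debug.WriteLine(sb.ToString());
}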


Source: https://stackoverflow.com/questions/30188491/reading-pdf-document-with-itextsharp-creates-string-with-repeating-first-page
