How to convert a PDF to a text file with iTextSharp

时光说笑 2021-01-05 10:08

I have to retrieve the text from a PDF file, but with the following code I only get an empty text file.

for (int i = 0; i < n; i++)
{
    pagenumber = i + 1;
            


        
3 answers
  • 2021-01-05 10:44
    // Call Create_pdf() when the Done button is pressed.

    EditText et_Text = findViewById(R.id.EditText);

    String projectname = "MyPdf";

    // Writes the text entered in et_Text into a new PDF on external storage.
    public void Create_pdf() {
        Document doc = new Document();
        String outPath = Environment.getExternalStorageDirectory() + "/" + projectname + ".pdf";
        try {
            PdfWriter.getInstance(doc, new FileOutputStream(outPath));
            doc.open();
            doc.add(new Paragraph(et_Text.getText().toString()));
            doc.close();
        } catch (DocumentException e) {
            e.printStackTrace();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
    }
    
  • 2021-01-05 10:53
    using System;
    using System.IO;
    using System.Linq;
    using System.Text;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;
    
    namespace Pdf2Text
    {
        class Program
        {
            static void Main(string[] args)
            {
                if (!args.Any()) return;
    
                var file = args[0];
                var output = Path.ChangeExtension(file, ".txt");
                if (!File.Exists(file)) return;
    
                var bytes = File.ReadAllBytes(file);
                File.WriteAllText(output, ConvertToText(bytes), Encoding.UTF8);
            }
    
            private static string ConvertToText(byte[] bytes)
            {
                var sb = new StringBuilder();
    
                try
                {
                    var reader = new PdfReader(bytes);
                    var numberOfPages = reader.NumberOfPages;
    
                    for (var currentPageIndex = 1; currentPageIndex <= numberOfPages; currentPageIndex++)
                    {
                        // AppendLine keeps the text of consecutive pages from running together.
                        sb.AppendLine(PdfTextExtractor.GetTextFromPage(reader, currentPageIndex));
                    }
                }
                catch (Exception exception)
                {
                    Console.WriteLine(exception.Message);
                }
    
                return sb.ToString();
            }
        }
    }
    
  • 2021-01-05 11:01

    For text extraction with iTextSharp, use a current version of the library and call:

    PdfTextExtractor.GetTextFromPage(reader, pageNumber);
    

    Beware: some 5.3.x versions have a bug in the text extraction code that has since been fixed in trunk. You might therefore want to check out the trunk revision.
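
    As a rough illustration of that call in a page loop, here is a minimal sketch assuming iTextSharp 5.x. The class name PdfTextSketch and the method ExtractText are hypothetical names chosen for this example, and passing an explicit SimpleTextExtractionStrategy is just one possible choice, not necessarily what the asker needs. Note that page numbers are 1-based.

    ```csharp
    using System.Text;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;

    static class PdfTextSketch
    {
        // Hypothetical helper (not part of iTextSharp): returns the text of all pages.
        public static string ExtractText(string pdfPath)
        {
            var text = new StringBuilder();
            var reader = new PdfReader(pdfPath);
            try
            {
                // iTextSharp page numbers start at 1, not 0.
                for (int pageNumber = 1; pageNumber <= reader.NumberOfPages; pageNumber++)
                {
                    // Use a fresh strategy per page so no text is carried over between pages.
                    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                    text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy));
                }
            }
            finally
            {
                // Release the underlying file handle even if extraction throws.
                reader.Close();
            }
            return text.ToString();
        }
    }
    ```

    The result can then be written out with File.WriteAllText(outputPath, PdfTextSketch.ExtractText(inputPath), Encoding.UTF8), where inputPath and outputPath are placeholders. If the output is still empty, the PDF may contain only scanned images, in which case there is no text layer to extract.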
