How to convert a PDF file to Excel in C#

南方客 2021-01-24 09:36

I want to extract some data, such as email addresses, from tables in a PDF file, and then use the extracted addresses to send email to those people.

3 answers
  •  滥情空心
    2021-01-24 10:03

    You absolutely do not have to convert the PDF to Excel. First, determine whether your PDF contains actual text or is a scanned image. If it contains text, then you are right about using "some free dll". I recommend iTextSharp, as it is popular and easy to use.
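
    One rough way to make that check, as a sketch only: extract the text of each page (for instance with iTextSharp's PdfTextExtractor.GetTextFromPage) and count the visible characters. The names PdfTextCheck and HasExtractableText and the 20-character threshold are assumptions made for illustration, not part of iTextSharp. A scanned PDF usually yields no text at all unless it carries an OCR text layer.

    ```csharp
    using System.Collections.Generic;
    using System.Linq;

    public static class PdfTextCheck
    {
        // pageTexts: the text already extracted from each page of the PDF.
        // minVisibleChars: arbitrary threshold (an assumption); tune it for your files.
        public static bool HasExtractableText(IEnumerable<string> pageTexts, int minVisibleChars = 20)
        {
            // Count non-whitespace characters across all pages, skipping null entries.
            int visible = pageTexts
                .Where(t => t != null)
                .Sum(t => t.Count(c => !char.IsWhiteSpace(c)));
            return visible >= minVisibleChars;
        }
    }
    ```

    If this returns false, the file is probably a scan and would need OCR before any text-based approach can work.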

    Now the controversial part. If you don't need a rock-solid solution, the easiest approach is to read the whole PDF into a string and then pull out the emails with a regular expression.
    Here is an example (not perfect) of reading a PDF with iTextSharp and extracting emails:

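    using System.Collections.Generic;
    using System.Text;
    using System.Text.RegularExpressions;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;

    // Wrapper class added so the snippet compiles; the name is arbitrary.
    public class PdfEmailExtractor
    {
        // Reads the text of every page into a single string.
        public string PdfToString(string fileName)
        {
            var sb = new StringBuilder();
            var reader = new PdfReader(fileName);
            try
            {
                for (int page = 1; page <= reader.NumberOfPages; page++)
                {
                    var strategy = new SimpleTextExtractionStrategy();
                    string text = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
                    // GetTextFromPage already returns a .NET string, so no re-encoding is needed.
                    sb.Append(text);
                }
            }
            finally
            {
                reader.Close();
            }
            return sb.ToString();
        }

        // Adjust the expression as needed; the named group "email" captures
        // the text between the two labels.
        private readonly Regex emailRegex = new Regex("Email Address (?<email>.+?) Passport No");

        public IEnumerable<string> ExtractEmails(string content)
        {
            foreach (Match m in emailRegex.Matches(content))
            {
                yield return m.Groups["email"].Value;
            }
        }
    }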
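
    The label-based pattern above only works when every address sits between "Email Address" and "Passport No". If the layout varies, a looser pattern that matches anything shaped like an address may be easier to start from. This is a minimal sketch: the names EmailFinder and FindEmails and the pattern itself are assumptions for illustration, and the pattern is deliberately simple rather than a full validator.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;

    public static class EmailFinder
    {
        // Loose pattern: local part, "@", one or more domain labels, then a TLD of 2+ letters.
        private static readonly Regex AnyEmail = new Regex(
            @"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}",
            RegexOptions.Compiled);

        // Returns the distinct addresses found in content, in order of first appearance.
        public static List<string> FindEmails(string content)
        {
            return AnyEmail.Matches(content)
                .Cast<Match>()
                .Select(m => m.Value)
                .Distinct(StringComparer.OrdinalIgnoreCase) // same address in different case counts once
                .ToList();
        }
    }
    ```

    Calling EmailFinder.FindEmails on the string returned by PdfToString gives a de-duplicated list that can then be fed to whatever code sends the emails.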
    
