How to find text in a PDF image?

误落风尘 2021-02-11 01:27

I am developing a C# application in which I am converting a PDF document to an image and then rendering that image in a custom viewer.

I've come across a bit of a bric

2 Answers
  • 2021-02-11 01:48

    You can run Tesseract OCR on the image in console mode to recognize the text.

    I don't know of an SDK that does this for PDF directly.

    However, if you want the coordinates and text of every word, you can use the simple code below (thanks to nguyenq for the hOCR hint):

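    // Requires: using System.Diagnostics; using System.Drawing;
    //           using System.Drawing.Imaging; using System.IO;
    public void Recognize(Bitmap bitmap)
    {
        bitmap.Save("temp.png", ImageFormat.Png);

        // Runs "tesseract temp.png temp hocr"; tesseract.exe must be on the PATH.
        var startInfo = new ProcessStartInfo("tesseract.exe", "temp.png temp hocr");
        startInfo.WindowStyle = ProcessWindowStyle.Hidden;
        using (var process = Process.Start(startInfo))
        {
            process.WaitForExit();
        }

        // Depending on the Tesseract version, the output may be named temp.hocr instead of temp.html.
        var words = GetWords(File.ReadAllText("temp.html"));

        // Further actions with words
    }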
    
    // Requires: using System.Collections.Generic; using System.Drawing;
    //           using System.Linq; using System.Xml.Linq;
    public Dictionary<Rectangle, string> GetWords(string tesseractHtml)
    {
        var xml = XDocument.Parse(tesseractHtml);

        var rectsWords = new Dictionary<Rectangle, string>();

        // The (string) cast yields null instead of throwing when a span has no class attribute.
        var ocr_words = xml.Descendants("span").Where(element => (string)element.Attribute("class") == "ocr_word").ToList();
        foreach (var ocr_word in ocr_words)
        {
            // The title attribute has the form "bbox left top right bottom".
            var strs = ocr_word.Attribute("title").Value.Split(' ');
            int left = int.Parse(strs[1]);
            int top = int.Parse(strs[2]);
            int width = int.Parse(strs[3]) - left + 1;
            int height = int.Parse(strs[4]) - top + 1;
            rectsWords.Add(new Rectangle(left, top, width, height), ocr_word.Value);
        }

        return rectsWords;
    }
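
    Once the words and their bounding boxes are available, finding a search term in the image comes down to filtering that dictionary. A minimal sketch, assuming the dictionary has the shape returned by GetWords; the WordSearch class and FindWord method are hypothetical names used only for illustration, not part of Tesseract or any SDK:

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Drawing;
    using System.Linq;

    public static class WordSearch
    {
        // Returns the bounding box of every recognized word equal to `query`,
        // ignoring case and any punctuation attached to either end of the word.
        public static List<Rectangle> FindWord(Dictionary<Rectangle, string> words, string query)
        {
            return words
                .Where(pair => string.Equals(
                    pair.Value.Trim().Trim('.', ',', ';', ':', '!', '?'),
                    query,
                    StringComparison.OrdinalIgnoreCase))
                .Select(pair => pair.Key)
                .ToList();
        }
    }
    ```

    The rectangles are in pixel coordinates of the image passed to Tesseract, so they should line up with the rendered page only if the viewer displays that same image without scaling.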
    
  • 2021-02-11 01:56

    Use iTextSharp. Make sure the PDF is searchable, that is, it contains a real text layer rather than only scanned page images.

    Then use this code:

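    // Requires: using System.IO; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
    public static string GetTextFromAllPages(string pdfPath)
    {
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            StringWriter output = new StringWriter();

            // Page numbers are 1-based.
            for (int i = 1; i <= reader.NumberOfPages; i++)
                output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy()));

            return output.ToString();
        }
        finally
        {
            // Release the underlying file handle.
            reader.Close();
        }
    }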
    