How can I extract text from a PDF file in Perl?

花落未央 2020-12-03 05:08

I am trying to extract text from PDF files using Perl. I have been using pdftotext.exe from the command line (i.e., using Perl's system function) for extra

8 Answers
  • 2020-12-03 05:30

    I tried this module, which works fine for special characters in PDFs.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use PDF::OCR::Thorough;
    
    my $filename = "pdf.pdf";
    
    my $pdf = PDF::OCR::Thorough->new($filename);
    my $text = $pdf->get_text();
    print $text;
    
  • 2020-12-03 05:35

    I'm not a Perl user, but I imagine you'll struggle to find a better free text extractor than pdftotext.

    pdftotext usually recognises non-ASCII characters fine. Is it possible that it is extracting them correctly, but the app you're using to view the text file isn't using the correct encoding? If pdftotext on Windows is the same as the one on my Linux system, then it defaults to exporting as UTF-8.
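
    One way to rule out an encoding mismatch is to request UTF-8 explicitly and decode the output as UTF-8 when reading it back in Perl. The following is only a sketch: it assumes pdftotext is on the PATH and accepts the `-enc` option, and the helper names `pdftotext_args` and `extract_text` are made up for illustration.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: argument list for pdftotext with an explicit output encoding.
    sub pdftotext_args {
        my ($pdf_path, $txt_path) = @_;
        return ('pdftotext', '-enc', 'UTF-8', $pdf_path, $txt_path);
    }

    # Hypothetical helper: run pdftotext, then read the result back as decoded UTF-8.
    sub extract_text {
        my ($pdf_path, $txt_path) = @_;
        # Passing a list to system() avoids building one quoted command string by hand.
        system(pdftotext_args($pdf_path, $txt_path)) == 0
            or die "pdftotext failed (status $?)\n";
        open(my $fh, '<:encoding(UTF-8)', $txt_path)
            or die "Cannot read $txt_path: $!\n";
        local $/;    # slurp mode: read the whole file at once
        my $text = <$fh>;
        close($fh);
        return $text;
    }

    if (@ARGV == 2) {
        # Encode on output too, otherwise Perl warns about wide characters.
        binmode(STDOUT, ':encoding(UTF-8)');
        print extract_text(@ARGV);
    }
    ```

    Run it as `perl extract.pl input.pdf output.txt` (the script name is arbitrary). If the special characters look right here but wrong in your viewer, the viewer's encoding setting is the likely culprit.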

  • 2020-12-03 05:42

    There is getpdftext.pl, which is part of CAM::PDF.

  • 2020-12-03 05:47

    You can extract text from a PDF with these modules:

    PDF::API2

    CAM::PDF

    CAM::PDF::PageText

    From CPAN:

       use CAM::PDF;
       use CAM::PDF::PageText;

       my $pdf = CAM::PDF->new($filename);
       my $pageone_tree = $pdf->getPageContentTree(1);
       print CAM::PDF::PageText->render($pageone_tree);
    

    This module attempts to extract sequential text from a PDF page. This is not a robust process, as PDF text is graphically laid out in arbitrary order. This module uses a few heuristics to try to guess what text goes next to what other text, but may be fooled easily by, say, subscripts, non-horizontal text, changes in font, form fields etc.

    All those disclaimers aside, it is useful for a quick dump of text from a simple PDF file.
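
    The snippet above only handles page 1. A rough sketch of extending it to every page might look like the following; it assumes CAM::PDF is installed, and the helper names `format_page` and `dump_all_pages` are invented for this example.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: label each page's text so page boundaries stay visible.
    sub format_page {
        my ($page_number, $text) = @_;
        return "--- page $page_number ---\n$text\n";
    }

    # Hypothetical helper: render every page of a PDF with CAM::PDF::PageText.
    sub dump_all_pages {
        my ($filename) = @_;
        require CAM::PDF;              # loaded at run time, only when a PDF is processed
        require CAM::PDF::PageText;
        my $pdf = CAM::PDF->new($filename)
            or die "Cannot open $filename\n";
        my $output = '';
        for my $page (1 .. $pdf->numPages()) {
            my $tree = $pdf->getPageContentTree($page);
            next if !$tree;            # skip pages with no content tree
            $output .= format_page($page, CAM::PDF::PageText->render($tree));
        }
        return $output;
    }

    print dump_all_pages($ARGV[0]) if @ARGV;
    ```

    The same heuristics and caveats apply to each page, so expect the output order to be unreliable on pages with complex layouts.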

  • 2020-12-03 05:47

    Well, I tried two or three Perl modules, such as CAM::PDF and PDF::API2, but the problem remains the same. I'm parsing a PDF file containing main pages. CAM::PDF and PDF::API2 parse the plain text very well. However, they are not able to parse the code snippets [code snippets are usually in a different font and encoding than the plain text].

  • 2020-12-03 05:49

    Take a look at PDFBox. It is a library, but I think it also comes with a tool for text extraction.
