How can I extract text from a PDF file in Perl?

I am trying to extract text from PDF files using Perl. I have been using pdftotext.exe from the command line (i.e., using Perl's system function) for extra

8 Answers

挽巷

2020-12-03 05:30

I tried this module, and it works fine for special characters in PDFs:

```
#!/usr/bin/perl
use strict;
use warnings;
use PDF::OCR::Thorough;

my $filename = "pdf.pdf";

# Extract the text of the PDF and print it
my $pdf = PDF::OCR::Thorough->new($filename);
my $text = $pdf->get_text();
print $text;
```

Happy的楠姐

2020-12-03 05:35

I'm not a Perl user but I imagine you'll struggle to find a better free text extractor than pdftotext.

pdftotext usually recognises non-ASCII characters fine. Is it possible that it is extracting them correctly, but the application you're using to view the text file isn't using the correct encoding? If pdftotext on Windows is the same as the one on my Linux system, then it defaults to exporting as UTF-8.
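
To rule out an encoding mismatch, one option is to request UTF-8 explicitly and decode the output inside Perl rather than relying on the default of a particular build. The following is a minimal sketch, assuming `pdftotext` is on the `PATH`. The helper names `pdftotext_command` and `extract_text` are made up for illustration; `-enc` and `-` (write to standard output) are standard pdftotext options.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Build the pdftotext argument list. "-enc UTF-8" makes the output encoding
# explicit, and "-" as the output file sends the text to standard output.
sub pdftotext_command {
    my ($pdf_path) = @_;
    return ('pdftotext', '-enc', 'UTF-8', $pdf_path, '-');
}

# Run pdftotext and return the extracted text as decoded Perl characters.
sub extract_text {
    my ($pdf_path) = @_;
    # List-form pipe open avoids shell quoting problems. Assumption: on
    # Windows this form needs a reasonably recent Perl (5.22 or later).
    open(my $fh, '-|', pdftotext_command($pdf_path))
        or die "Cannot run pdftotext: $!";
    binmode($fh);    # read raw bytes, no newline translation
    my $bytes = do { local $/; <$fh> };
    close($fh) or die "pdftotext failed (exit status $?)";
    return decode('UTF-8', defined $bytes ? $bytes : '');
}

if (@ARGV) {
    binmode(STDOUT, ':encoding(UTF-8)');    # write UTF-8 back out
    print extract_text($ARGV[0]);
}
```

Run it as `perl extract.pl file.pdf > out.txt` (the script name is arbitrary) and open `out.txt` in an editor set to UTF-8. If the characters are still wrong, the viewer is probably not the cause.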

天涯浪人

2020-12-03 05:42

There is getpdftext.pl, which is part of CAM::PDF.
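
If you would rather stay inside your own script than shell out, something along these lines may work. It is a rough sketch that assumes CAM::PDF is installed, not a copy of what getpdftext.pl does internally. `all_page_text` is a hypothetical helper; `numPages` and `getPageText` are CAM::PDF methods.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect the text of every page. $pdf is expected to behave like a CAM::PDF
# object, i.e. to provide numPages() and getPageText($page_number).
sub all_page_text {
    my ($pdf) = @_;
    my @pages;
    for my $page (1 .. $pdf->numPages()) {
        my $text = $pdf->getPageText($page);
        push @pages, defined $text ? $text : '';    # a page may yield no text
    }
    return join("\n", @pages);
}

if (@ARGV) {
    require CAM::PDF;    # loaded only when a file is actually given
    my $pdf = CAM::PDF->new($ARGV[0])
        or die "Cannot open $ARGV[0]\n";
    print all_page_text($pdf), "\n";
}
```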

挽巷

2020-12-03 05:47
You can extract text from a PDF with these modules:

PDF::API2

CAM::PDF

CAM::PDF::PageText

From CPAN:
```
use CAM::PDF;
use CAM::PDF::PageText;

my $pdf = CAM::PDF->new($filename);
my $pageone_tree = $pdf->getPageContentTree(1);
print CAM::PDF::PageText->render($pageone_tree);
```
This module attempts to extract sequential text from a PDF page. This is not a robust process, as PDF text is graphically laid out in arbitrary order. This module uses a few heuristics to try to guess what text goes next to what other text, but may be fooled easily by, say, subscripts, non-horizontal text, changes in font, form fields etc.

All those disclaimers aside, it is useful for a quick dump of text from a simple PDF file.
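
To see why this kind of guessing is fragile, here is a toy version of a reading-order heuristic. It is not CAM::PDF::PageText's actual algorithm, and every name in it (`fragments_to_text`, the `x`/`y`/`text` keys, the tolerance value) is made up for illustration. Fragments are grouped into lines by similar vertical position, then each line is read left to right.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy reading-order heuristic. Each fragment is a hash with x, y (PDF-style
# coordinates, y grows upward) and text.
sub fragments_to_text {
    my ($fragments, $tolerance) = @_;
    $tolerance = 2 unless defined $tolerance;    # max y distance within a line

    my (@lines, $current_y);
    # Top of the page first, so sort by descending y.
    for my $frag (sort { $b->{y} <=> $a->{y} } @$fragments) {
        if (!defined $current_y || abs($frag->{y} - $current_y) > $tolerance) {
            push @lines, [];             # start a new line
            $current_y = $frag->{y};
        }
        push @{ $lines[-1] }, $frag;
    }

    # Within each line, read left to right.
    my @out;
    for my $line (@lines) {
        my @sorted = sort { $a->{x} <=> $b->{x} } @$line;
        push @out, join(' ', map { $_->{text} } @sorted);
    }
    return join("\n", @out);
}

my @fragments = (
    { x => 120, y => 700, text => 'world' },
    { x => 50,  y => 500, text => 'Second line' },
    { x => 50,  y => 700, text => 'Hello' },
    { x => 160, y => 704, text => '2' },    # superscript, slightly raised
);
print fragments_to_text(\@fragments), "\n";
```

With the default tolerance, the raised "2" ends up on a line of its own ahead of "Hello world". With a tolerance of 5 it is appended to that line instead. Neither setting is right for every document, which is the kind of problem the disclaimers above describe.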
自闭症患者

2020-12-03 05:47

Well, I tried two or three Perl modules, such as CAM::PDF and PDF::API2, but the problem remains the same! I'm parsing a PDF file containing main pages. CAM::PDF and PDF::API2 parse the plain text very well. However, they are not able to parse the code snippets (code snippets are usually in a different font and encoding than the plain text).

别那么骄傲

2020-12-03 05:49

Take a look at PDFBox. It is a library, but I think it also comes with a tool for extracting text.
