How can I extract text from a PDF file in Perl?

I am trying to extract text from PDF files using Perl. I have been using pdftotext.exe from the command line (i.e., using Perl's system function) for extra

8 Answers

小蘑菇

2020-12-03 05:50

James Healy is correct. I tried CAM::PDF and PDF::API2 (I had some success reading text with the former), but downloading pdftotext worked well for a number of my implementations.

If you are on Windows, download the precompiled xpdf binary here: http://www.foolabs.com/xpdf/download.html

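Then, if you need to run this from within Perl, use system, e.g.: system('C:\Utilities\xpdfbin-win-3.04\bin64\pdftotext.exe', $saveName);

where $saveName is the full path to your PDF file. Note the single quotes around the program path: in a double-quoted string, Perl would interpret sequences such as \U, \x and \b as escapes and mangle the path.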

If the conversion succeeds, this leaves you with a text file (by default, the PDF's name with a .txt extension) that you can open and parse in Perl.
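
The steps above could be wrapped up roughly as follows. This is only a minimal sketch: the helper names txt_path_for and extract_pdf_text are made up for illustration, the path to pdftotext is passed in by the caller, and the output file name is given explicitly as pdftotext's second argument rather than relying on its default.

```perl
use strict;
use warnings;

# Hypothetical helper: build the name of the text file to write,
# i.e. the PDF's name with its extension replaced by .txt.
sub txt_path_for {
    my ($pdf_path) = @_;
    my $txt_path = $pdf_path;
    # If there is no .pdf extension to replace, just append .txt
    # so the output never overwrites the input file.
    $txt_path .= '.txt' unless $txt_path =~ s/\.pdf\z/.txt/i;
    return $txt_path;
}

# Hypothetical helper: run pdftotext on one PDF and return the extracted text.
# $pdftotext is the full path to the executable on your machine.
sub extract_pdf_text {
    my ($pdftotext, $pdf_path) = @_;
    my $txt_path = txt_path_for($pdf_path);

    # List form of system: arguments are passed separately instead of
    # being interpolated into one command string.
    system($pdftotext, $pdf_path, $txt_path) == 0
        or die "pdftotext failed for $pdf_path (exit status $?)\n";

    open my $fh, '<', $txt_path or die "Cannot open $txt_path: $!\n";
    local $/;            # slurp mode: read the whole file in one go
    my $text = <$fh>;
    close $fh;
    return $text;
}
```

You would then call something like extract_pdf_text('C:\Utilities\xpdfbin-win-3.04\bin64\pdftotext.exe', $saveName) and parse the returned string.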

梦毁少年i

2020-12-03 05:52

You may never get an appropriate solution to your problem. A PDF can store text either as character codes with a font applied or as a bitmap image. If the tool that created your PDF encoded the special characters as a bitmap, you are out of luck (unless you want to get into OCR solutions, of course).
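
One rough way to tell which case you are in is to check whether the extracted text contains anything other than whitespace. The sketch below is only a heuristic, and looks_image_only is a made-up helper name: it assumes you already have the output of a tool such as pdftotext in a string. It cannot detect a PDF where only some characters are stored as bitmaps.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the extracted text is empty or consists
# only of whitespace (spaces, newlines, form feeds), which suggests the PDF
# has no text layer and would need OCR. Returns 0 otherwise.
sub looks_image_only {
    my ($text) = @_;
    return 1 unless defined $text;
    return $text =~ /\S/ ? 0 : 1;
}
```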
