How can I extract text from a PDF file in Perl?

花落未央 2020-12-03 05:08

I am trying to extract text from PDF files using Perl. I have been using pdftotext.exe from the command line (i.e., via Perl's system function) for extra

8 Answers
  • 2020-12-03 05:50

    James Healy is correct. I tried both CAM::PDF and PDF::API2 and had some success reading text with the former, but downloading pdftotext worked great for a number of my implementations.

    If you are on Windows, download the precompiled Xpdf binaries from http://www.foolabs.com/xpdf/download.html

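    Then, if you need to run it from within Perl, use system in its list form with a single-quoted path, so that the backslashes are not treated as escape sequences, e.g.: system('C:\Utilities\xpdfbin-win-3.04\bin64\pdftotext.exe', $saveName);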

    where $saveName is the full path to your PDF file.

    When no output file is given, pdftotext writes a .txt file with the same base name next to the PDF, which you can then open and parse in Perl.
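
    The steps above can be sketched as a small script. This is only a minimal sketch under a few assumptions: the install path is the one from the example above and will differ on your machine, and the helper names (pdftotext_command, text_path_for, extract_text) are made up for illustration. It passes an explicit output path to pdftotext and then reads the result back.

    ```perl
    use strict;
    use warnings;

    # Assumed install location (taken from the example above); adjust as needed.
    my $PDFTOTEXT = 'C:\Utilities\xpdfbin-win-3.04\bin64\pdftotext.exe';

    # Hypothetical helper: build the argument list for system().
    # Using a list avoids shell quoting problems with spaces in paths.
    sub pdftotext_command {
        my ($pdf_path, $txt_path) = @_;
        return ($PDFTOTEXT, $pdf_path, $txt_path);
    }

    # Hypothetical helper: swap a trailing .pdf for .txt,
    # or append .txt if the name has no .pdf extension.
    sub text_path_for {
        my ($pdf_path) = @_;
        my $txt_path = $pdf_path;
        $txt_path =~ s/\.pdf\z/.txt/i
            or $txt_path .= '.txt';
        return $txt_path;
    }

    # Run pdftotext and return the extracted text as one string.
    sub extract_text {
        my ($pdf_path) = @_;
        my $txt_path = text_path_for($pdf_path);

        system(pdftotext_command($pdf_path, $txt_path)) == 0
            or die "pdftotext failed for $pdf_path (status $?)\n";

        open my $fh, '<', $txt_path
            or die "Cannot open $txt_path: $!\n";
        my $text = do { local $/; <$fh> };   # slurp the whole file
        close $fh;
        return $text;
    }

    # Usage: perl extract.pl C:\path\to\file.pdf
    if (@ARGV) {
        my $text = extract_text($ARGV[0]);
        print $text;
    }
    ```

    Checking the return value of system matters here, because a missing binary or an unreadable PDF would otherwise only show up later as a missing text file.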

  • 2020-12-03 05:52

    You may never get an appropriate solution to your problem. A PDF can store text as character codes with a font applied, or it can store it as a bitmap image. If the tool that created your PDF chose to render the special characters as a bitmap, you are out of luck (unless you want to get into OCR solutions, of course).
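
    One rough way to tell which case you are in is to check whether the extraction produced any visible characters at all. The sketch below is a heuristic only, and looks_image_only is a made-up name: it flags output that is empty or consists purely of whitespace and page breaks, which suggests a scanned, image-only PDF. It will not catch a document where only some characters were stored as bitmaps.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: returns 1 when the extracted text contains no
    # visible characters. Form feeds (page breaks) count as whitespace,
    # so they do not match \S.
    sub looks_image_only {
        my ($text) = @_;
        return 1 if !defined $text;
        return $text !~ /\S/ ? 1 : 0;
    }

    # Usage: perl check.pl extracted.txt
    # where extracted.txt is a text file produced by a PDF text extractor.
    if (@ARGV) {
        open my $fh, '<', $ARGV[0]
            or die "Cannot open $ARGV[0]: $!\n";
        my $text = do { local $/; <$fh> };   # slurp the whole file
        close $fh;

        my $message = looks_image_only($text)
            ? "No extractable text; the PDF may need OCR.\n"
            : "Text found.\n";
        print $message;
    }
    ```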
