How can I extract data in a Word document using Perl?

鱼传尺愫

How do I extract the data from a Word document using Perl?

5 Answers

后悔当初

2021-01-21 22:44

Word docs are no longer flat files. A .docx is a ZIP archive of XML files: find one, rename it with a .zip extension, and you can open it up and poke around inside to get a feel for how things are laid out. I would generally agree, though, that Microsoft has already provided ways to do this.
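To make the "poke around inside" idea concrete, here is a minimal sketch that reads word/document.xml straight out of a .docx with the core IO::Uncompress::Unzip module and reduces it to plain text. The helper name docx_xml_to_text is made up for this illustration, and the regex-based tag stripping is an assumption that holds only for simple documents; a real XML parser would be more robust.

```perl
use strict;
use warnings;
use IO::Uncompress::Unzip qw(unzip $UnzipError);

# Hypothetical helper: reduce the XML from word/document.xml to plain text.
# This is a rough regex sketch, not a real XML parse.
sub docx_xml_to_text {
    my ($xml) = @_;
    $xml =~ s{<w:br\b[^>]*/>}{\n}g;   # manual line breaks become newlines
    $xml =~ s{</w:p>}{\n}g;           # each paragraph ends with a newline
    $xml =~ s{<[^>]+>}{}g;            # drop every remaining tag
    # Decode the five predefined XML entities in a single pass.
    my %entity = (lt => '<', gt => '>', quot => '"', apos => "'", amp => '&');
    $xml =~ s/&(lt|gt|quot|apos|amp);/$entity{$1}/g;
    return $xml;
}

# Usage: perl docx2txt.pl FILE.docx
if (@ARGV) {
    my $xml;
    # Pull the single member word/document.xml out of the ZIP container.
    unzip($ARGV[0] => \$xml, Name => 'word/document.xml')
        or die "Cannot read word/document.xml from $ARGV[0]: $UnzipError\n";
    # The member is UTF-8 encoded; the bytes are passed through unchanged.
    print docx_xml_to_text($xml);
}
```

This only works for .docx files; the older binary .doc format is not a ZIP archive.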

佛祖请我去吃肉

2021-01-21 22:46

On Windows, you are best off using the COM interfaces to access Word's functionality.

If you want to do it cross-platform, consider running "catdoc" or using libwv.
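Calling catdoc from Perl could look roughly like the sketch below. It assumes catdoc is installed and on the PATH, and that it writes the extracted text to standard output when given a file name; the helper name capture_stdout is invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: run an external command (list form, so no shell is
# involved) and return everything it writes to standard output.
sub capture_stdout {
    my (@cmd) = @_;
    open(my $pipe, '-|', @cmd) or die "Cannot run $cmd[0]: $!\n";
    local $/;                       # slurp mode: read all output at once
    my $output = <$pipe>;
    close($pipe) or die "$cmd[0] failed (exit status " . ($? >> 8) . ")\n";
    return defined $output ? $output : '';
}

# Usage: perl doc2txt.pl FILE.doc
if (@ARGV) {
    # Assumption: catdoc is installed and prints the document text to stdout.
    print capture_stdout('catdoc', $ARGV[0]);
}
```

The list form of open avoids quoting problems with file names that contain spaces. The same pattern should work for other command-line converters that print to standard output.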

有刺的猬

2021-01-21 22:55

If you are not on Windows and don't have access to Win32::OLE, I think the best route is to convert the document first, for example with OpenOffice.

You could wrap the script from the link in your Perl program. Although the linked page starts with PDF conversion, if you read on you will see that it can also convert to text. Also see this Stack Overflow post about converting .doc and .docx files.
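The linked script is not reproduced here, so the following is only a sketch of the general idea under a different assumption: that LibreOffice's soffice binary is on the PATH and that its headless --convert-to option is used for the conversion. The helper names soffice_command and converted_path are hypothetical.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);
use File::Spec;

# Hypothetical helper: build the conversion command as a list (no shell).
# Assumption: LibreOffice's soffice is installed and on the PATH.
sub soffice_command {
    my ($input, $outdir) = @_;
    return ('soffice', '--headless', '--convert-to', 'txt:Text',
            '--outdir', $outdir, $input);
}

# Hypothetical helper: the converter writes <basename>.txt into the output directory.
sub converted_path {
    my ($input, $outdir) = @_;
    my ($name) = fileparse($input, qr/\.[^.]*/);   # strip the last extension
    return File::Spec->catfile($outdir, "$name.txt");
}

# Usage: perl convert.pl FILE.doc [OUTDIR]
if (@ARGV) {
    my ($input, $outdir) = ($ARGV[0], $ARGV[1] // '.');
    system(soffice_command($input, $outdir)) == 0
        or die "soffice failed (status $?)\n";
    my $txt = converted_path($input, $outdir);
    open(my $fh, '<', $txt) or die "Cannot read $txt: $!\n";
    print while <$fh>;
    close($fh);
}
```

Going through a temporary text file keeps the Perl side simple: once the conversion succeeds, the rest is ordinary file handling.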

慢半拍i

2021-01-21 23:01

You could use Win32::OLE if the script is to run on a Windows box with Word installed.

What platform are you using? Perhaps antiword could be invoked?
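For the Win32::OLE route, a minimal sketch might look like this. It assumes a Windows machine with Word installed; the helper names normalize_word_text and word_text_via_ole are invented for illustration, and Win32::OLE is loaded lazily so the file still compiles on other platforms.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: Word ends paragraphs with "\r" and marks manual line
# breaks with "\x0b"; turn both into ordinary newlines.
sub normalize_word_text {
    my ($text) = @_;
    $text =~ s/\r\n?|\x0b/\n/g;
    return $text;
}

# Hypothetical helper: drive Word through COM and return the document text.
sub word_text_via_ole {
    my ($path) = @_;
    require Win32::OLE;   # only available on Windows
    # The second argument asks Win32::OLE to call Quit when $word is destroyed.
    my $word = Win32::OLE->new('Word.Application', 'Quit')
        or die "Cannot start Word: " . Win32::OLE->LastError() . "\n";
    # An absolute path is used because Word resolves relative paths itself.
    my $doc = $word->Documents->Open(File::Spec->rel2abs($path))
        or die "Cannot open $path: " . Win32::OLE->LastError() . "\n";
    my $text = $doc->Content->Text;
    $doc->Close(0);       # 0 = wdDoNotSaveChanges
    return normalize_word_text($text);
}

# Usage: perl word2txt.pl FILE.doc
if (@ARGV) {
    print word_text_via_ole($ARGV[0]);
}
```

Because Word itself opens the file, this approach should handle any format that the installed Word version can read.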

花落未央

2021-01-21 23:08

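use strict;
use warnings;
use Win32::OLE;
use Win32::OLE::Enum;

# Usage: perl script.pl OUTPUT.txt INPUT.doc
my ($out_file, $doc_file) = @ARGV;
die "Usage: $0 OUTPUT.txt INPUT.doc\n" unless defined $doc_file;

# Requires Windows with Word installed.
my $document = Win32::OLE->GetObject($doc_file)
    or die "Cannot open $doc_file: " . Win32::OLE->LastError() . "\n";
open(my $fh, '>', $out_file) or die "Cannot write $out_file: $!\n";

print "Extracting Text ...\n";

my $paragraphs = $document->Paragraphs();
my $enumerate  = Win32::OLE::Enum->new($paragraphs);
while (defined(my $paragraph = $enumerate->Next()))
{
    # Each paragraph produces a "+" line with its style name ...
    my $style = $paragraph->{Style}->{NameLocal};
    print $fh "+$style\n";
    # ... followed by a "=" line with its text.
    my $text = $paragraph->{Range}->{Text};
    $text =~ s/[\n\r]//g;     # drop the paragraph terminator
    $text =~ s/\x0b/\n/g;     # a manual line break (\x0b) becomes a newline
    print $fh "=$text\n";
}

close($fh) or die "Cannot close $out_file: $!\n";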

stolen from here
