How do I extract the data from a Word doc using Perl?
Word docs are no longer flat files. Find a .docx, rename it with a .zip extension, and you can open it up and poke around inside to get a feel for how things are laid out. I would generally agree, though, that Microsoft has already provided ways to do this.
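To make the "rename it and poke around" idea concrete, here is a minimal sketch that reads the text straight out of the archive. It assumes the body text sits in the word/document.xml member, which is where Word normally puts it, and that a reasonably recent IO::Compress (which ships with Perl) is available. The helper names docx_xml_to_text and docx_to_text are made up for illustration, and the regexes are a shortcut rather than a real XML parser: headers, footnotes, and nested text boxes are not handled.

```perl
use strict;
use warnings;
use IO::Uncompress::Unzip qw($UnzipError);

my %ENTITY = (amp => '&', lt => '<', gt => '>', quot => '"', apos => "'");

# Turn the raw XML of word/document.xml into plain text, one line per paragraph.
sub docx_xml_to_text {
    my ($xml) = @_;
    my @paragraphs;
    # Each non-empty <w:p> element is a paragraph; its <w:t> elements hold the text runs.
    while ($xml =~ m{<w:p(?:\s[^>]*)?(?<!/)>(.*?)</w:p>}gs) {
        my $body = $1;
        my $text = join '', $body =~ m{<w:t(?:\s[^>]*)?>(.*?)</w:t>}gs;
        # Decode the five predefined XML entities in a single pass.
        $text =~ s/&(amp|lt|gt|quot|apos);/$ENTITY{$1}/g;
        push @paragraphs, $text;
    }
    return join "\n", @paragraphs;
}

# Read word/document.xml out of the .docx (a zip archive) and convert it.
sub docx_to_text {
    my ($path) = @_;
    my $zip = IO::Uncompress::Unzip->new($path, Name => 'word/document.xml')
        or die "Cannot read word/document.xml from $path: $UnzipError\n";
    my $xml = do { local $/; <$zip> };   # slurp the whole member
    $zip->close();
    utf8::decode($xml);                  # document.xml is normally UTF-8
    return docx_xml_to_text($xml);
}

# When run as a script with a file argument, print the extracted text.
if (!caller && @ARGV) {
    binmode STDOUT, ':encoding(UTF-8)';
    print docx_to_text($ARGV[0]), "\n";
}
```

This only works for .docx; the older binary .doc format is not a zip archive and needs one of the other approaches in this thread.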
On Windows you are better off using the COM interfaces to access Word's functionality.
If you want to do it cross-platform, consider running "catdoc" or using libwv.
If you are not on Windows, I think the best route might be to convert it first.
If you are not using Windows and don't have access to Win32::OLE, you can use OpenOffice to convert the documents.
You could wrap the script in the link into your Perl program. Although the link starts with PDF, if you read on you'll see it can also convert to text. Also see this Stack Overflow post about converting doc and docx files.
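As an alternative to wrapping the linked script, a rough sketch of driving the office suite directly from Perl might look like the following. This is not the linked script: it assumes a LibreOffice-style soffice binary on the PATH that accepts --headless, --convert-to and --outdir (older OpenOffice builds may not), and the function names soffice_command and convert_to_text_file are invented for this example.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);
use File::Spec;

# Build the command line; kept separate so it can be inspected without running it.
# Assumption: "soffice" is a LibreOffice-style binary found on the PATH.
sub soffice_command {
    my ($input, $outdir) = @_;
    return ('soffice', '--headless', '--convert-to', 'txt:Text',
            '--outdir', $outdir, $input);
}

# Run the conversion and return the path of the .txt file it should have written.
sub convert_to_text_file {
    my ($input, $outdir) = @_;
    system(soffice_command($input, $outdir)) == 0
        or die "soffice failed for $input (status $?)\n";
    # The converter keeps the base name and swaps the extension for .txt.
    my ($base) = fileparse($input, qr/\.[^.]*/);
    return File::Spec->catfile($outdir, "$base.txt");
}
```

Passing the command to system as a list avoids the shell, so file names with spaces or quotes do not need escaping.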
You could use Win32::OLE if the script is to run on a Windows box with Word installed.
What platform are you using? Perhaps antiword could be invoked?
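If antiword (or catdoc) is installed, invoking it from Perl could look roughly like this. It is only a sketch: capture_text is a hypothetical helper name, and it assumes the tool prints the extracted text to standard output when given a file name, which is how both tools behave by default.

```perl
use strict;
use warnings;

# Run an external command (given as a list, so no shell is involved)
# and return everything it wrote to standard output.
sub capture_text {
    my (@cmd) = @_;
    open(my $pipe, '-|', @cmd)
        or die "Cannot run $cmd[0]: $!\n";
    my $text = do { local $/; <$pipe> };   # slurp all output
    close($pipe)
        or die "$cmd[0] failed (exit status " . ($? >> 8) . ")\n";
    return defined $text ? $text : '';
}

# When run as a script with a file argument, print the text of that .doc file.
if (!caller && @ARGV) {
    print capture_text('antiword', $ARGV[0]);
}
```

Swapping 'antiword' for 'catdoc' in the last call should work the same way, since the helper does not care which converter it runs.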
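use strict;
use warnings;
use Win32::OLE;
use Win32::OLE::Enum;

# Arguments: output text file first, then the Word document (an absolute path is safest).
my ($out_file, $doc_file) = @ARGV;
die "Usage: $0 output.txt input.doc\n" unless defined $doc_file;

my $document = Win32::OLE->GetObject($doc_file)
    or die "Cannot open $doc_file: " . Win32::OLE->LastError() . "\n";
open(my $fh, '>', $out_file) or die "Cannot write $out_file: $!\n";

print "Extracting Text ...\n";
my $paragraphs = $document->Paragraphs();
my $enumerate  = Win32::OLE::Enum->new($paragraphs);
while (defined(my $paragraph = $enumerate->Next())) {
    # Each paragraph becomes a "+style" line followed by an "=text" line.
    my $style = $paragraph->{Style}->{NameLocal};
    print $fh "+$style\n";
    my $text = $paragraph->{Range}->{Text};
    $text =~ s/[\n\r]//g;    # drop the paragraph terminator
    $text =~ s/\x0b/\n/g;    # vertical tab (manual line break) becomes a newline
    print $fh "=$text\n";
}
close($fh) or die "Cannot close $out_file: $!\n";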
stolen from here