I would like to know how can I read the contents of a doc or docx. I\'m using a Linux VPS and PHP, but if there is a simpler solution using other language, please let me kno
I wrote a library that parses the docx, odt and rtf documents based on answers here and elsewhere.
The major improvement I have made to the .docx and .odt parsing is the that the library processes the XML that describes the document and attempts to conform it to HTML tags, i.e. em and strong tags. This means that if you're using the library for a CMS, text formatting is not lost
You can get it here
Here i have added the solution to get the text from .doc,.docx word files
private function read_doc() {
$fileHandle = fopen($this->filename, "r");
$line = @fread($fileHandle, filesize($this->filename));
$lines = explode(chr(0x0D),$line);
$outtext = "";
foreach($lines as $thisline)
{
$pos = strpos($thisline, chr(0x00));
if (($pos !== FALSE)||(strlen($thisline)==0))
{
} else {
$outtext .= $thisline." ";
}
}
$outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/","",$outtext);
return $outtext;
}
private function read_docx(){
$striped_content = '';
$content = '';
$zip = zip_open($this->filename);
if (!$zip || is_numeric($zip)) return false;
while ($zip_entry = zip_read($zip)) {
if (zip_entry_open($zip, $zip_entry) == FALSE) continue;
if (zip_entry_name($zip_entry) != "word/document.xml") continue;
$content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
zip_entry_close($zip_entry);
}// end while
zip_close($zip);
$content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
$content = str_replace('</w:r></w:p>', "\r\n", $content);
$striped_content = strip_tags($content);
return $striped_content;
}
I insert little improvements in doc to txt converter function
private function read_doc() {
$line_array = array();
$fileHandle = fopen( $this->filename, "r" );
$line = @fread( $fileHandle, filesize( $this->filename ) );
$lines = explode( chr( 0x0D ), $line );
$outtext = "";
foreach ( $lines as $thisline ) {
$pos = strpos( $thisline, chr( 0x00 ) );
if ( $pos !== false ) {
} else {
$line_array[] = preg_replace( "/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $thisline );
}
}
return implode("\n",$line_array);
}
Now it saves empty rows and txt file looks row by row .
This is a .DOCX solution only. For .DOC or .PDF you'll need to use something else like pdf2text.php for PDF
function docx2text($filename) {
return readZippedXML($filename, "word/document.xml");
}
function readZippedXML($archiveFile, $dataFile) {
// Create new ZIP archive
$zip = new ZipArchive;
// Open received archive file
if (true === $zip->open($archiveFile)) {
// If done, search for the data file in the archive
if (($index = $zip->locateName($dataFile)) !== false) {
// If found, read it to the string
$data = $zip->getFromIndex($index);
// Close archive file
$zip->close();
// Load XML from a string
// Skip errors and warnings
$xml = new DOMDocument();
$xml->loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
// Return data without XML formatting tags
return strip_tags($xml->saveXML());
}
$zip->close();
}
// In case of failure return empty string
return "";
}
echo docx2text("test.docx"); // Save this contents to file
Try ApachePOI. It works well for Java. I suppose you won't have any difficulties installing Java on Linux.
I would suggest, Extract text using apache Tika, you can extract multiple type of file content like .doc/.docx and pdf and many other.