Extract text from doc and docx

死守一世寂寞 2020-11-27 16:24

I would like to know how I can read the contents of a doc or docx file. I'm using a Linux VPS and PHP, but if there is a simpler solution using another language, please let me know.

9 answers
  • 2020-11-27 16:41

    Parse .docx, .odt, .doc and .rtf documents

    I wrote a library that parses docx, odt and rtf documents, based on answers here and elsewhere.

    The major improvement I have made to the .docx and .odt parsing is that the library processes the XML that describes the document and attempts to map it to HTML tags such as em and strong. This means that if you're using the library for a CMS, text formatting is not lost.

    You can get it here
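
    The library itself is not reproduced in this thread, so the following is only a minimal sketch of the idea described above, not the library's actual code. It assumes the standard WordprocessingML markup, where a run (w:r) is bold or italic when its properties (w:rPr) contain w:b or w:i. The function name docx_xml_to_html is made up for illustration.

```php
<?php
// Namespace URI bound to the "w:" prefix in word/document.xml.
const W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

// Hypothetical helper: convert the XML of word/document.xml to simple HTML,
// keeping bold and italic runs as <strong> and <em>.
function docx_xml_to_html(string $documentXml): string
{
    if ($documentXml === '') {
        return '';
    }
    $dom = new DOMDocument();
    if (!@$dom->loadXML($documentXml)) {
        return '';
    }
    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', W_NS);

    $html = '';
    foreach ($xpath->query('//w:p') as $paragraph) {
        $html .= '<p>';
        foreach ($xpath->query('.//w:r', $paragraph) as $run) {
            // Collect the text of the run and escape it for HTML.
            $text = '';
            foreach ($xpath->query('w:t', $run) as $t) {
                $text .= $t->textContent;
            }
            $text = htmlspecialchars($text);
            // Simplification: any w:b or w:i element counts as "on";
            // an explicit w:val="false" is not checked here.
            if ($xpath->query('w:rPr/w:b', $run)->length > 0) {
                $text = "<strong>$text</strong>";
            }
            if ($xpath->query('w:rPr/w:i', $run)->length > 0) {
                $text = "<em>$text</em>";
            }
            $html .= $text;
        }
        $html .= '</p>';
    }
    return $html;
}

$sampleXml = '<w:document xmlns:w="' . W_NS . '"><w:body><w:p>'
    . '<w:r><w:t xml:space="preserve">Plain, </w:t></w:r>'
    . '<w:r><w:rPr><w:b/></w:rPr><w:t>bold</w:t></w:r>'
    . '<w:r><w:t xml:space="preserve"> and </w:t></w:r>'
    . '<w:r><w:rPr><w:i/></w:rPr><w:t>italic</w:t></w:r>'
    . '</w:p></w:body></w:document>';

echo docx_xml_to_html($sampleXml);
// prints: <p>Plain, <strong>bold</strong> and <em>italic</em></p>
```

    In a real document the XML string would come from the word/document.xml entry of the .docx archive. Tables, lists and headings would need additional handling.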

  • 2020-11-27 16:50

    Here is a solution for extracting the text from .doc and .docx Word files.

    How to extract text from word file .doc,docx php

    For .doc

    private function read_doc() {
        // Read the whole file as binary data; a .doc file is not plain text.
        $line = @file_get_contents($this->filename);
        if ($line === false) {
            return false;
        }
        // Split on carriage returns (0x0D).
        $lines = explode(chr(0x0D), $line);
        $outtext = "";
        foreach ($lines as $thisline) {
            // Skip empty chunks and chunks that contain NUL bytes.
            if ($thisline === "" || strpos($thisline, chr(0x00)) !== false) {
                continue;
            }
            $outtext .= $thisline . " ";
        }
        // Keep only basic ASCII characters; accented and non-Latin text is dropped.
        $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
        return $outtext;
    }
    

    For .docx

    private function read_docx() {
        // A .docx file is a ZIP archive; the body text is in word/document.xml.
        // ZipArchive is used because the procedural zip_* functions
        // are deprecated as of PHP 8.0.
        $zip = new ZipArchive();

        if ($zip->open($this->filename) !== true) return false;

        $content = $zip->getFromName("word/document.xml");
        $zip->close();

        // Entry not found: return an empty string, as before.
        if ($content === false) return '';

        // Put a space between table cells and a line break after each
        // paragraph before stripping the tags.
        $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
        $content = str_replace('</w:r></w:p>', "\r\n", $content);
        $stripped_content = strip_tags($content);

        return $stripped_content;
    }
    
  • 2020-11-27 16:53

    I made a few small improvements to the .doc to text converter function:

    private function read_doc() {
        $line_array = array();
        // Read the whole file as binary data.
        $line = @file_get_contents( $this->filename );
        if ( $line === false ) {
            return false;
        }
        $lines = explode( chr( 0x0D ), $line );
        foreach ( $lines as $thisline ) {
            // Skip chunks that contain NUL bytes, but keep empty lines.
            if ( strpos( $thisline, chr( 0x00 ) ) !== false ) {
                continue;
            }
            $line_array[] = preg_replace( "/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $thisline );
        }

        return implode( "\n", $line_array );
    }
    

    It now keeps empty lines, so the resulting text file preserves the line-by-line layout.

  • 2020-11-27 16:57

    This is a .DOCX-only solution. For .DOC or .PDF you'll need something else, such as pdf2text.php for PDF.

    function docx2text($filename) {
        return readZippedXML($filename, "word/document.xml");
    }

    function readZippedXML($archiveFile, $dataFile) {
        // Create new ZIP archive
        $zip = new ZipArchive;

        // Open received archive file
        if (true === $zip->open($archiveFile)) {
            // If done, search for the data file in the archive
            if (($index = $zip->locateName($dataFile)) !== false) {
                // If found, read it to the string
                $data = $zip->getFromIndex($index);
                // Close archive file
                $zip->close();
                // Load XML from a string, suppressing errors and warnings.
                // LIBXML_NOENT and LIBXML_XINCLUDE are left out on purpose: they
                // enable entity substitution and XInclude, which is unsafe for
                // files from untrusted sources.
                $xml = new DOMDocument();
                $xml->loadXML($data, LIBXML_NOERROR | LIBXML_NOWARNING);
                // Return data without XML formatting tags
                return strip_tags($xml->saveXML());
            }
            $zip->close();
        }

        // In case of failure return empty string
        return "";
    }

    echo docx2text("test.docx"); // Print the extracted text
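
    To try a reader like this without a Word file at hand, one option is to build a tiny archive that contains only word/document.xml and read it back. The sketch below assumes the zip extension is installed. The helper names make_sample_docx and read_sample_docx are made up for illustration, and the generated file is not a complete Word document, only enough for a ZIP-based reader.

```php
<?php
// Hypothetical helper: write an archive with a single word/document.xml entry,
// one w:p paragraph per array element.
function make_sample_docx(string $path, array $paragraphs): bool
{
    $ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';
    $body = '';
    foreach ($paragraphs as $p) {
        $body .= '<w:p><w:r><w:t>'
               . htmlspecialchars($p, ENT_XML1 | ENT_QUOTES)
               . '</w:t></w:r></w:p>';
    }
    $xml = '<w:document xmlns:w="' . $ns . '"><w:body>' . $body . '</w:body></w:document>';

    $zip = new ZipArchive();
    if ($zip->open($path, ZipArchive::CREATE | ZipArchive::OVERWRITE) !== true) {
        return false;
    }
    $zip->addFromString('word/document.xml', $xml);
    return $zip->close();
}

// Hypothetical helper: read word/document.xml back, one line per paragraph.
function read_sample_docx(string $path): string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return '';
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return '';
    }
    // Mark paragraph ends before stripping tags so paragraphs do not run together.
    return trim(strip_tags(str_replace('</w:p>', "\n", $xml)));
}

$path = tempnam(sys_get_temp_dir(), 'docx');
make_sample_docx($path, ['First paragraph', 'Second paragraph']);
echo read_sample_docx($path); // prints the two paragraphs on separate lines
unlink($path);
```

    Replacing each closing w:p tag with a newline is a design choice of this sketch: stripping the tags directly joins the text of adjacent paragraphs with no separator.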
    
  • 2020-11-27 17:01

    Try Apache POI. It works well with Java, and installing Java on Linux should not be difficult.

  • 2020-11-27 17:02

    I would suggest extracting the text with Apache Tika. It can extract content from many file types, including .doc, .docx and PDF.
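
    Since the question is about PHP, one way to use Tika from PHP is to call its command-line application. This is only a sketch under assumptions: Java is installed, a tika-app jar has been downloaded, and the jar path used here is hypothetical. The --text option asks the Tika app for plain text output. The function names are made up for illustration.

```php
<?php
// Build the shell command for the Tika command-line app.
// Both arguments are escaped so file names with spaces are safe.
function tika_command(string $jarPath, string $file): string
{
    return 'java -jar ' . escapeshellarg($jarPath)
         . ' --text ' . escapeshellarg($file);
}

// Run Tika and return the extracted text, or null if the command
// could not be run or produced no output.
function tika_extract_text(string $jarPath, string $file): ?string
{
    $output = shell_exec(tika_command($jarPath, $file));
    return is_string($output) ? $output : null;
}

// Hypothetical jar location; adjust to wherever the jar actually lives.
echo tika_command('/opt/tika/tika-app.jar', 'my report.docx'), "\n";
```

    Starting a JVM for every file is slow, so this approach suits occasional conversions better than high-volume processing.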
