Extract text from doc and docx

死守一世寂寞 2020-11-27 16:24

I would like to know how I can read the contents of a doc or docx file. I'm using a Linux VPS and PHP, but if there is a simpler solution using another language, please let me know.

9 answers
  • 2020-11-27 17:03

    My solution is Antiword for .doc and docx2txt for .docx.

    Assuming a Linux server that you control, download each one, extract it, then install it. I installed each one system-wide:

    Antiword: make global_install
    docx2txt: make install

    Then use these tools to extract the text into a string in PHP:

    // for .doc
    $text = shell_exec('/usr/local/bin/antiword -w 0 ' .
        escapeshellarg($docFilePath));

    // for .docx (the trailing "-" sends the output to stdout)
    $text = shell_exec('/usr/local/bin/docx2txt.pl ' .
        escapeshellarg($docxFilePath) . ' -');
    

    docx2txt requires Perl.
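
    A minimal sketch of how the two calls above might be combined into one helper that picks the tool by file extension. The function names `buildExtractCommand` and `extractText` are hypothetical, and the binary paths are the ones from the system-wide installs described here; adjust them if yours differ.

```php
<?php
// Hypothetical helper: build the shell command for a given file, or return
// null when the extension is not one of the two formats handled here.
function buildExtractCommand(string $path): ?string
{
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    switch ($ext) {
        case 'doc':
            // same antiword call as above
            return '/usr/local/bin/antiword -w 0 ' . escapeshellarg($path);
        case 'docx':
            // the trailing "-" makes docx2txt write to stdout
            return '/usr/local/bin/docx2txt.pl ' . escapeshellarg($path) . ' -';
        default:
            return null;
    }
}

// Hypothetical wrapper: returns the extracted text, or null on failure.
function extractText(string $path): ?string
{
    $command = buildExtractCommand($path);
    if ($command === null || !is_readable($path)) {
        return null;
    }
    $text = shell_exec($command);
    // shell_exec() does not return a string when the command fails or
    // produces no output, so treat anything else as a failure.
    return is_string($text) ? $text : null;
}
```

    Checking for a non-string result matters because a missing binary or an unreadable file would otherwise end up as an empty document rather than a visible error.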

    no_freedom's solution does extract text from docx files, but it can butcher whitespace. Most files I tested had places where words that should be separated had no space between them. That is a problem if you want to run full-text search on the documents you're processing.
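
    For context, a .docx file is a ZIP archive whose main text lives in `word/document.xml`. Stripping the XML tags naively joins adjacent paragraphs with nothing between them, which is one plausible cause of the missing spaces described above (an assumption, since no_freedom's code is not shown here). A minimal sketch that inserts whitespace at paragraph, line-break, and tab markers before removing the tags; the function names are hypothetical, and `extractDocxText` needs PHP's zip extension:

```php
<?php
// Hypothetical helper: turn the XML of word/document.xml into plain text
// while keeping paragraph, line-break, and tab boundaries as whitespace.
function docxXmlToText(string $xml): string
{
    // Insert whitespace where Word marks structure, before tags are removed.
    $xml = str_replace('</w:p>', "\n", $xml);              // end of paragraph
    $xml = preg_replace('~<w:br\b[^>]*/>~', "\n", $xml);   // manual line break
    $xml = str_replace('<w:tab/>', "\t", $xml);            // tab character in a run
    $text = strip_tags($xml);
    // Decode XML entities such as &amp; back into literal characters.
    return trim(html_entity_decode($text, ENT_QUOTES | ENT_XML1, 'UTF-8'));
}

// Hypothetical wrapper: read the main document part from the archive.
// Returns null when the file cannot be opened or the part is missing.
function extractDocxText(string $path): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null;
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    return $xml === false ? null : docxXmlToText($xml);
}
```

    This only covers the main body text; headers, footers, and footnotes live in other parts of the archive, which is one reason a dedicated tool such as docx2txt may still be the better choice.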

  • 2020-11-27 17:03

    I used docx2txt to extract the content of docx files. My code is as follows:

    if ($extention == "docx")
    {
        $docxFilePath = "/var/www/vhosts/abc.com/httpdocs/writers/filename.docx";
        // Keep the command on one line: a newline inside the string would split it into two shell commands.
        $content = shell_exec('/var/www/vhosts/abc.com/httpdocs/docx2txt/docx2txt.pl '
            . escapeshellarg($docxFilePath) . ' -');
    }
    
  • 2020-11-27 17:07

    You can use Apache Tika as a complete solution; it provides a REST API.
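
    A minimal sketch of calling a Tika server from PHP with cURL. It assumes a tika-server instance is already running on localhost at port 9998 and that a PUT to `/tika` with an `Accept: text/plain` header returns plain text. The function name `tikaExtractText` and the URL are assumptions, so adjust them to your setup.

```php
<?php
// Hypothetical helper: send a file to a running Tika server and return the
// extracted plain text, or null on any failure.
function tikaExtractText(string $path, string $url = 'http://localhost:9998/tika'): ?string
{
    if (!is_readable($path)) {
        return null;
    }
    $body = file_get_contents($path);
    if ($body === false) {
        return null;
    }
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_CUSTOMREQUEST  => 'PUT',
        CURLOPT_POSTFIELDS     => $body,
        CURLOPT_HTTPHEADER     => [
            'Accept: text/plain',
            // Avoid cURL's default form content type for string bodies.
            'Content-Type: application/octet-stream',
        ],
        CURLOPT_RETURNTRANSFER => true,
    ]);
    $text   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    // Treat transport errors and non-200 responses as failures.
    return ($text !== false && $status === 200) ? $text : null;
}
```

    Because the same call handles both .doc and .docx, this route avoids choosing a tool per extension, at the cost of running a separate Java service.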

    Another good library is RawText, which can run OCR on images and extract text from any document. It is not free, and it works over a REST API.

    Sample code for extracting text from your file with RawText:

    $result = $rawText->extract($your_file);
    