Parse a word document with PHPWord to a string

Submitted by 无人久伴 on 2019-12-24 09:20:26

Question


I've tried several ways to parse Word documents to a string in PHP, but each of them has trouble with certain documents. So I'm now trying PHPWord to parse the Word document to a string.

I'm looking at this sample file from PHPWord, which reads a Word document and writes it back out as other documents (docx, odt, and rtf):

include_once 'Sample_Header.php';

// Read contents
$name = basename(__FILE__, '.php');
$source = "resources/{$name}.doc";
echo date('H:i:s'), " Reading contents from `{$source}`", EOL;
$phpWord = \PhpOffice\PhpWord\IOFactory::load($source, 'MsDoc');

// (Re)write contents
$writers = array('Word2007' => 'docx', 'ODText' => 'odt', 'RTF' => 'rtf');
foreach ($writers as $writer => $extension) {
    echo date('H:i:s'), " Write to {$writer} format", EOL;
    $xmlWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, $writer);
    $xmlWriter->save("{$name}.{$extension}");
    rename("{$name}.{$extension}", "results/{$name}.{$extension}");
}

include_once 'Sample_Footer.php';

However, I don't want to output another entire Word document; I just want to parse the contents to a string in PHP. How can this be modified to output the content to a string?


Answer 1:


You have to use the object that the load() call returns:

$phpWord = \PhpOffice\PhpWord\IOFactory::load($source, 'MsDoc');

It is a nested structure of objects and arrays. You have to locate the [elements] property and, inside each element, the [text] property. The [text] property contains the text extracted from your Word file.

Please bear in mind that these two properties are protected by default, so to read them directly you would have to change their visibility in the PHPWord library files: AbstractContainer.php for [elements] and Text.php for [text]. Once both properties are public, you can read them from your $phpWord object.
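As an alternative to editing the library files, the same data can be reached through PHPWord's public getters: getSections() on the document, getElements() on containers, and getText() on text elements. The following is only a minimal sketch under a few assumptions: PHPWord is installed through Composer, the path resources/sample.doc is a made-up placeholder, and extractText() is a hypothetical helper written for this example, not part of PHPWord. It does not descend into tables.

```php
<?php
// Sketch only: assumes PHPWord was installed with Composer.
require_once 'vendor/autoload.php';

use PhpOffice\PhpWord\IOFactory;

/**
 * Recursively collect the text of an element and its children.
 * Hypothetical helper for this example; not part of PHPWord.
 */
function extractText($element): string
{
    $text = '';

    if (method_exists($element, 'getElements')) {
        // Containers (sections, text runs, ...) hold child elements.
        foreach ($element->getElements() as $child) {
            $text .= extractText($child);
        }
    } elseif (method_exists($element, 'getText')) {
        // Leaf elements such as Text expose their content via getText().
        $value = $element->getText();
        if (is_string($value)) {
            $text .= $value;
        }
    }

    return $text;
}

// Placeholder path (assumption); point this at your own .doc file.
$source = 'resources/sample.doc';
$phpWord = IOFactory::load($source, 'MsDoc');

$content = '';
foreach ($phpWord->getSections() as $section) {
    // Treat each top-level element as one line of output.
    foreach ($section->getElements() as $element) {
        $content .= extractText($element) . PHP_EOL;
    }
}

echo $content;
```

Using method_exists() keeps the walk generic, so elements with neither getter (line breaks, for example) simply contribute an empty string. This only changes how the text is read from the object; it does not change what the MsDoc reader manages to extract from the file.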

I can now extract text from .doc files, but I noticed that PHPWord extracts only some 50-60% of the text from any .doc file, sometimes cutting the last extracted word in half. So, if your file has 4,000 words, PHPWord gets only some 2,000 of them, somehow.

I am at a loss here, actually, as to why PHPWord does not want to get ALL the text. No notices, no exceptions, just an object without a good half of text from a .doc file.



Source: https://stackoverflow.com/questions/50629144/parse-a-word-document-with-phpword-to-a-string
