PHP DOMDocument / XPath: Get HTML-text and surrounded tags

Question

I am looking for the following functionality.

Given this HTML page:

<body>
 <h1>Hello,
  <b>world!</b>
 </h1>
</body>

I want to get an array that contains only the distinct text elements (no duplicates), each together with an array of the tags that surround it.

For the HTML above, the result would be an array that looks like this:

array => 
 "Hello," surrounded by => "h1" and "body"
 "world!" surrounded by => "b", "h1" and "body"

I already do this:

$res=$xpath->query("//body//*/text()");

which gives me the distinct text contents but omits the HTML tags.

When I just do this:

$res=$xpath->query("//body//*");

I get duplicate texts, one for each enclosing tag: for example, "world!" shows up three times, once each for "body", "h1" and "b", but I can't seem to work out which texts are actually duplicates. Just checking for duplicate text is not sufficient: a duplicate is sometimes only a substring of an earlier text, and a website could contain genuinely repeated text, which would then be wrongly discarded.

How could I solve this issue?

Thank you very much!!

Thomas

Answer 1:

You can iterate over the parentNodes of the DOMText nodes:

// Sample markup from the question.
$html = <<<HTML
<body>
 <h1>Hello,
  <b>world!</b>
 </h1>
</body>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$textNodes = array();
foreach ($xpath->query('/html/body//text()') as $i => $textNode) {
    $textNodes[$i] = array(
        'text' => $textNode->nodeValue,
        'parents' => array()
    );
    // Walk up the tree; stop at the DOMDocument node, which has no parent.
    for (
        $currentNode = $textNode->parentNode;
        $currentNode->parentNode;
        $currentNode = $currentNode->parentNode
    ) {
        $textNodes[$i]['parents'][] = $currentNode->nodeName;
    }
}
print_r($textNodes);


Note that loadHTML will add implied elements, e.g. it will add html and head elements, which you will have to take into account when using XPath. Also note that any whitespace used for formatting is considered a DOMText, so you will likely get more elements than you expect. If you only want to query for non-empty DOMText nodes, use:

/html/body//text()[normalize-space(.) != ""]

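To see how the whitespace filter fits together with the parent walk, here is a minimal sketch. The function name textWithAncestors is hypothetical (it is not part of the answer above), and trimming the text with trim() is an extra assumption about what "clean" output should look like.

```php
<?php
// Hypothetical helper: returns every non-blank text node below <body>
// together with the names of its ancestor elements, innermost first.
function textWithAncestors(string $html): array
{
    $dom = new DOMDocument();
    // loadHTML adds implied wrapper elements such as <html> if missing.
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $result = [];
    // Skip text nodes that consist only of formatting whitespace.
    $query = '/html/body//text()[normalize-space(.) != ""]';
    foreach ($xpath->query($query) as $textNode) {
        $parents = [];
        // Walk upwards through the element ancestors; the loop ends
        // once the DOMDocument node (not a DOMElement) is reached.
        for (
            $node = $textNode->parentNode;
            $node instanceof DOMElement;
            $node = $node->parentNode
        ) {
            $parents[] = $node->nodeName;
        }
        $result[] = [
            // Assumption: leading/trailing whitespace is not wanted.
            'text'    => trim($textNode->nodeValue),
            'parents' => $parents,
        ];
    }
    return $result;
}

$html = "<body>\n <h1>Hello,\n  <b>world!</b>\n </h1>\n</body>";
print_r(textWithAncestors($html));
```

For the sample page this should give two entries: "Hello," with parents h1, body, html and "world!" with parents b, h1, body, html. The html entry appears because loadHTML adds that element; drop the last item of each parents list if you do not want it.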

Answer 2:

In your sample code, $res=$xpath->query("//body//*/text()") is a DOMNodeList of DOMText nodes. For each DOMText, you can access the containing element via the parentNode property.
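As a rough illustration of the parentNode lookup, the sketch below pairs each text node from the question's query with the name of its directly enclosing element. The variable name $directParents and the one-line sample markup are assumptions made for the example.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<body><h1>Hello, <b>world!</b></h1></body>');
$xpath = new DOMXPath($dom);

// Hypothetical result array: one entry per text node, holding the
// text and the tag name of the element that directly contains it.
$directParents = [];
foreach ($xpath->query('//body//*/text()') as $textNode) {
    $directParents[] = [
        'text'   => $textNode->nodeValue,
        'parent' => $textNode->parentNode->nodeName,
    ];
}
print_r($directParents);
```

This only reports the immediate parent (h1 for "Hello, " and b for "world!"). To collect every surrounding tag, keep following parentNode upwards until the document node is reached.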

Source: https://stackoverflow.com/questions/7875106/php-domdocument-xpath-get-html-text-and-surrounded-tags

Tags

php

html

parsing