PHP DOMDocument / XPath: Get HTML-text and surrounded tags

Posted by 依然范特西╮ on 2020-01-04 07:47:18

Question


I am looking for this functionality:

Given this HTML page:

<body>
 <h1>Hello,
  <b>world!</b>
 </h1>
</body>

I want to get an array that contains each distinct text node exactly once (no duplicates), together with the list of tags that surround that text.

For the HTML above, the result would be an array that looks like this:

array => 
 "Hello," surrounded by => "h1" and "body"
 "world!" surrounded by => "b", "h1" and "body"

I already do this:

$res=$xpath->query("//body//*/text()");

which gives me the distinct text contents but omits the HTML tags.

When I just do this:

$res=$xpath->query("//body//*");

I get duplicate texts, one for each enclosing tag: for example, "world!" shows up three times, once for "body", once for "h1" and once for "b", and I can't find a way to tell which of these texts are actually duplicates. Just comparing the text is not sufficient: a duplicate is sometimes only a substring of an earlier text, and a page can contain genuinely repeated text, which would then be wrongly discarded.

How could I solve this issue?

Thank you very much!!

Thomas


Answer 1:


You can iterate over the parentNodes of the DOMText nodes:

// $html holds the markup to parse, e.g. the page from the question.
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$textNodes = array();
// Visit every text node below <body>, in document order.
foreach($xpath->query('/html/body//text()') as $i => $textNode) {
    $textNodes[$i] = array(
        'text' => $textNode->nodeValue,
        'parents' => array()
    );
    // Walk up the ancestors; stop at the document node, which has no parent.
    for (
        $currentNode = $textNode->parentNode;
        $currentNode->parentNode;
        $currentNode = $currentNode->parentNode
    ) {
        $textNodes[$i]['parents'][] = $currentNode->nodeName;
    }
}
print_r($textNodes);


Note that loadHTML will add implied elements, e.g. it will add html and head elements, which you will have to take into account when using XPath. Also note that any whitespace used for formatting is parsed as a DOMText node, so you will likely get more nodes than you expect. If you only want to query for non-empty DOMText nodes, use

/html/body//text()[normalize-space(.) != ""]
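
To make the effect of that filter concrete, here is a minimal sketch that combines it with the parent walk from above. The markup is the sample from the question; the trim() call and the instanceof DOMElement loop condition are additions for illustration, not part of the original answer.

```php
<?php
// Sample markup taken from the question.
$html = <<<HTML
<body>
 <h1>Hello,
  <b>world!</b>
 </h1>
</body>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$result = array();
// Select only text nodes that contain something other than whitespace.
foreach ($xpath->query('/html/body//text()[normalize-space(.) != ""]') as $textNode) {
    $parents = array();
    // Collect ancestor element names; the loop ends at the document node,
    // which is not a DOMElement.
    for ($node = $textNode->parentNode; $node instanceof DOMElement; $node = $node->parentNode) {
        $parents[] = $node->nodeName;
    }
    $result[] = array(
        // trim() is an assumption here: it drops the indentation that
        // still surrounds the text inside a non-empty node.
        'text'    => trim($textNode->nodeValue),
        'parents' => $parents,
    );
}
print_r($result);
```

With this input the list should hold two entries: "Hello," with parents h1, body, html and "world!" with parents b, h1, body, html. The html entry is the implied element added by loadHTML.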





Answer 2:


In your sample code, the result of $res=$xpath->query("//body//*/text()") is a DOMNodeList of DOMText nodes. For each DOMText, you can access the element that directly contains it via the parentNode property.
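
As a rough sketch of that idea, the snippet below keeps the query from the question and reads parentNode for each text node. The sample markup, with "world!" appearing twice, is a made-up input, and using DOMNode::getNodePath() to tell the two occurrences apart is one possible approach rather than something the answer prescribes.

```php
<?php
// Hypothetical input: the same text occurs in two different places.
$html = '<body><h1>Hello, <b>world!</b></h1><p>world!</p></body>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$res = $xpath->query("//body//*/text()");

$containers = array();
foreach ($res as $textNode) {
    $containers[] = array(
        'text' => $textNode->nodeValue,
        // parentNode is the element that directly contains this text node.
        'tag'  => $textNode->parentNode->nodeName,
        // The node path identifies the node by position, so equal texts
        // in different places remain distinguishable.
        'path' => $textNode->getNodePath(),
    );
}
print_r($containers);
```

Each text node appears once in the list, so no de-duplication by string comparison is needed, and the two "world!" entries stay separate because their paths differ.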



Source: https://stackoverflow.com/questions/7875106/php-domdocument-xpath-get-html-text-and-surrounded-tags
