DomDocument and html entities

Submitted by 本秂侑毒 on 2019-12-30 05:13:08

Question


I'm trying to parse some HTML that includes HTML entities, such as &#215; (×):

$str = '<a href="http://example.com/"> A &#215; B</a>';

$dom = new DOMDocument;
$dom->substituteEntities = false;
$dom->loadHTML($str);

$link = $dom->getElementsByTagName('a')->item(0);
$fullname = $link->nodeValue;
$href = $link->getAttribute('href');

echo "
fullname: $fullname \n
href: $href\n";

but DomDocument substitutes the entity, so the text comes out as A × B.

Is there some way to keep it from treating the & as the start of an HTML entity, so that it just leaves the text alone? I tried setting substituteEntities to false, but that doesn't do anything.


Answer 1:


From the docs:

The DOM extension uses UTF-8 encoding.
Use utf8_encode() and utf8_decode() to work with texts in ISO-8859-1 encoding or Iconv for other encodings.

Assuming you're using Latin-1, try:

<?php
header('Content-Type: text/html; charset=iso-8859-1');

// Convert the Latin-1 source to UTF-8, as the docs quoted above suggest.
$str = utf8_encode('<a href="http://example.com/"> A &#215; B</a>');

$dom = new DOMDocument;
$dom->substituteEntities = false;
$dom->loadHTML($str);

$link = $dom->getElementsByTagName('a')->item(0);
// nodeValue comes back as UTF-8; convert it to Latin-1 for output.
$fullname = utf8_decode($link->nodeValue);
$href = $link->getAttribute('href');

echo "
fullname: $fullname \n
href: $href\n";
?>
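
Note that utf8_encode() and utf8_decode() are deprecated as of PHP 8.2. A minimal sketch of the output step using iconv(), which the quoted docs mention for other encodings, might look like the following. It assumes the iconv extension is available and that the page is served as ISO-8859-1; the variable names are only illustrative.

```php
<?php
$dom = new DOMDocument;
$dom->loadHTML('<a href="http://example.com/"> A &#215; B</a>');

$link = $dom->getElementsByTagName('a')->item(0);

// nodeValue is UTF-8; convert it to Latin-1 for an ISO-8859-1 page.
// Characters outside Latin-1 (such as U+03A3) cannot be represented
// this way, so this only helps for text that fits in ISO-8859-1.
$fullnameLatin1 = iconv('UTF-8', 'ISO-8859-1', $link->nodeValue);

echo "fullname: $fullnameLatin1\n";
```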



Answer 2:


This is not a direct answer to the question, but you could use UTF-8 instead, which lets you store glyphs like ÷ or × directly. Using UTF-8 with PHP DOM, on the other hand, needs a little hack.

Also, if you are trying to display mathematical formulas (as A × B suggests) have a look at MathML.
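
The answer does not say which hack it has in mind. One commonly used workaround, shown here only as a sketch, is to prepend a charset declaration so that loadHTML() reads the input as UTF-8 rather than guessing. The sample string and variable names are assumptions for illustration, and the "\u{...}" escape needs PHP 7.0 or later.

```php
<?php
// UTF-8 input that contains the multiplication sign (U+00D7) directly.
$html = "<a href=\"http://example.com/\"> A \u{00D7} B</a>";

// Assumed workaround: declare the charset up front so the parser
// treats the bytes that follow as UTF-8.
$meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

$dom = new DOMDocument;
$dom->loadHTML($meta . $html);

$link = $dom->getElementsByTagName('a')->item(0);
$fullname = $link->nodeValue; // UTF-8 string

echo "fullname: $fullname\n";
```

For the output to display correctly, the page itself also has to be served as UTF-8.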




Answer 3:


Are you sure the & is being replaced with &amp;? If that were the case, you'd see the exact entity as text, not the garbled output you're getting.

My guess is that it is converted to the actual character, and you're viewing the page with a latin1 charset, which does not contain this character, hence the garbled response.

If I render your example, my output is:

fullname:  A × B 

href: http://example.com/

When viewing this in latin1/iso-8859-1, I see the output you're describing. But when I set the charset to UTF-8, the output is fine.
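
One way to check this diagnosis is to look at the bytes that nodeValue returns and to declare UTF-8 in the response. A small sketch, reusing the string from the question (the $hex variable is just for illustration):

```php
<?php
// Declare UTF-8 so the browser decodes the output correctly.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

$dom = new DOMDocument;
$dom->loadHTML('<a href="http://example.com/"> A &#215; B</a>');

$fullname = $dom->getElementsByTagName('a')->item(0)->nodeValue;

// The entity &#215; has been decoded to U+00D7, which is the two bytes
// C3 97 in UTF-8. Read as Latin-1, those two bytes look garbled.
$hex = bin2hex(trim($fullname));

echo "bytes: $hex\n";
echo "fullname: $fullname\n";
```

If the hex dump contains c397, the DOM did its job and only the charset declared for the page is wrong.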




Answer 4:


I am facing the same problem. utf8_encode and utf8_decode do the trick in some cases but not all of them: for example, &#x03A3; (Σ) cannot be rendered via utf8_decode, because that character does not exist in ISO-8859-1. What we really need is a way to keep the HTML entities as they are in the string.
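
None of the answers shows a way to do that. One possible approach, sketched here under the assumption that only the text content matters, is to escape the ampersand of each entity before parsing, so the parser sees it as literal text. The regular expression and variable names are illustrative, not taken from the thread.

```php
<?php
$str = '<a href="http://example.com/"> A &#215; B &#x03A3;</a>';

// Turn "&#215;" into "&amp;#215;" so the parser reads a literal
// ampersand followed by plain text. Caveat: this also rewrites
// entities inside attribute values, which may not be what you want.
$escaped = preg_replace('/&(#?[0-9a-zA-Z]+;)/', '&amp;$1', $str);

$dom = new DOMDocument;
$dom->loadHTML($escaped);

$link = $dom->getElementsByTagName('a')->item(0);
$fullname = $link->nodeValue;

echo "fullname: $fullname\n";
```

Here nodeValue holds the entity text itself. If the document is serialized again with saveHTML(), the ampersands are written out as &amp;, so the escaping has to be undone if the original markup is needed.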



Source: https://stackoverflow.com/questions/7220737/domdocument-and-html-entities
