Question
So I'm trying to parse HTML pages and looking for paragraphs (<p>) using get_elements_by_tag_name('p');
The problem is that when I use $element->nodeValue, it's returning weird characters. The document is first fetched into $html using curl and then loaded into a DomDocument.
I'm sure it has to do with charsets.
Here's an example of a response: "aujourd’hui".
Thanks in advance.
Answer 1:
I had the same issues and now noticed that loadHTML() no longer takes 2 parameters, so I had to find a different solution. Using the following function in my DOM library, I was able to remove the funky characters from my HTML content.
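private static function load_html($html)
{
    $doc = new DOMDocument;
    // The XML declaration acts as an encoding hint for the parser.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    // Remove the processing-instruction node that the hint leaves behind.
    foreach ($doc->childNodes as $node) {
        if ($node->nodeType == XML_PI_NODE) {
            $doc->removeChild($node);
        }
    }
    $doc->encoding = 'UTF-8';
    return $doc;
}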
Answer 2:
I fixed this by forcing conversion to UTF-8 even though the original text was UTF-8:
$text = iconv("UTF-8", "UTF-8", $text);
$dom = new SmartDOMDocument();
$dom->loadHTML($webpage, 'UTF-8');
.
.
echo $node->nodeValue;
PHP is weird :)
Answer 3:
This is an encoding issue. Try explicitly setting the encoding to UTF-8.
This should help: http://devzone.zend.com/article/8855
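A minimal sketch of one way to set the encoding explicitly, assuming the fetched markup is already UTF-8: prepend a meta charset declaration so the parser sees it before any text. The helper name load_html_with_charset_hint is hypothetical and not from the linked article.

```php
<?php
// Hypothetical helper, not from the original answer. Requires the dom extension.
function load_html_with_charset_hint(string $html): DOMDocument
{
    $doc = new DOMDocument();
    // Scraped pages are often malformed, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    // The leading meta tag declares the charset before any text is parsed.
    $doc->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

// Usage: the curly apostrophe should come back intact from nodeValue.
$doc = load_html_with_charset_hint("<p>aujourd\u{2019}hui</p>");
$paragraph = $doc->getElementsByTagName('p')->item(0)->nodeValue;
```

The injected meta element stays in the document's head, so strip it if you serialize the document again.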
Answer 4:
None of the above worked for me. I finally found the following:
// Create a DOMDocument instance
$doc = new DOMDocument();
// The fix: mb_convert_encoding conversion
$doc->loadHTML(mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8'));
Source and more info
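Note that the 'HTML-ENTITIES' target for mb_convert_encoding() is deprecated as of PHP 8.2. A rough equivalent, sketched here under the assumption that $content is valid UTF-8, is to encode every non-ASCII character as a numeric entity with mb_encode_numericentity(). The helper name load_utf8_html is hypothetical.

```php
<?php
// Hypothetical helper, not from the original answer.
// Requires the dom and mbstring extensions.
function load_utf8_html(string $content): DOMDocument
{
    // Map every code point from U+0080 upward to a numeric entity
    // (for example &#8217;), leaving plain ASCII markup untouched.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    $ascii = mb_encode_numericentity($content, $map, 'UTF-8');

    $doc = new DOMDocument();
    // Scraped pages are often malformed, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

// Usage: entities are decoded again when the text is read back.
$doc = load_utf8_html("<p>aujourd\u{2019}hui</p>");
$text = $doc->getElementsByTagName('p')->item(0)->nodeValue;
```

Because the parser only ever sees ASCII bytes, it no longer has to guess the charset of the input.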
Source: https://stackoverflow.com/questions/2024993/nodevalue-from-domdocument-returning-weird-characters-in-php