Question
I am using DOMDocument to extract some paragraphs.
Here is what the initial .htm file I am importing looks like:
<html>
<head>
<title>Toxins</title>
</head>
<body>
<p class=8reference><span>1.</span><span>Sivonen, K.; Jones, G. Cyanobacterial Toxins. In <i>Toxic Cyanobacteria in Water. A Guide to Their Public Health Consequences, Monitoring and Management</i>; Chorus, I., Bartram, J., Eds.; E. and F.N. Spon: London, UK, 1999; pp. 41–111.</span></p>
</body>
</html>
When I run:
$dom_input = new \DOMDocument("1.0", "UTF-8");
$dom_input->encoding = "UTF-8";
$dom_input->formatOutput = true;
$dom_input->loadHTMLFile($manuscript->getUploadRootDir().$manuscript->getFileName());
$paragraphs = $dom_input->getElementsByTagName('p');
foreach ($paragraphs as $paragraph) {
    if ($paragraph->getAttribute('class') == "8reference") {
        var_dump($paragraph->nodeValue);
    }
}
The dash in "pp. 41–111" is converted to
pp. 41â€“111
Any idea why this happens, and how I can fix it so that I get UTF-8 values?
Thank you in advance.
Answer 1:
It looks to me like the data is correct; you're just displaying it incorrectly.
Are you outputting in UTF-8?
The "â" (or "Ã") followed by stray characters is the classic sign of UTF-8 encoded data being displayed as if it were in some other encoding.
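To see how that pattern arises, here is a minimal sketch that deliberately misreads the UTF-8 bytes of an en dash as Windows-1252. It assumes the mbstring extension is available; picking Windows-1252 as the "wrong" encoding is an assumption on my part, based on the "â€" sequence mentioned in this thread.

```php
<?php
// En dash (U+2013); its UTF-8 encoding is the three bytes E2 80 93.
$dash = "\u{2013}";
echo bin2hex($dash), "\n"; // e28093

// Pretend those bytes are Windows-1252 and re-encode them as UTF-8.
// Each byte becomes its own character: E2 => â, 80 => €, 93 => “.
$misread = mb_convert_encoding("pp. 41{$dash}111", 'UTF-8', 'Windows-1252');
echo $misread, "\n"; // pp. 41â€“111
```

The underlying bytes never changed; only the interpretation did, which is why fixing the declared encoding is usually enough.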
If you're outputting to a web browser, for example, try setting the character set with a meta tag:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
If you need to output in something other than UTF-8, you'll need to convert the data into that character set first.
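If the output is already declared as UTF-8 and the problem persists, the input side may be worth checking as well. As far as I know, the libxml HTML parser behind DOMDocument falls back to a single-byte encoding when the HTML does not declare a charset, and the file in the question has no such declaration. The sketch below compares the two cases on a shortened copy of that markup; `firstParagraphText` is a hypothetical helper written for this illustration, not part of the original code.

```php
<?php
// Hypothetical helper: parse an HTML string and return the text of its first <p>.
function firstParagraphText(string $html): string
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom->getElementsByTagName('p')->item(0)->nodeValue;
}

$expected = "pp. 41\u{2013}111";
$body = "<body><p>{$expected}</p></body>";

// Same document, once without and once with a charset declaration.
$withoutCharset = "<html><head><title>Toxins</title></head>{$body}</html>";
$withCharset = '<html><head>'
    . '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
    . "<title>Toxins</title></head>{$body}</html>";

// Compare what DOMDocument hands back against the original text.
var_dump(firstParagraphText($withoutCharset) === $expected);
var_dump(firstParagraphText($withCharset) === $expected);
```

With the meta tag present, the second comparison should report true. The first is expected to be false wherever the parser falls back to a single-byte default. If that matches what you see, adding the meta tag to the .htm file before calling loadHTMLFile() is one possible fix.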
Answer 2:
When using PHP's fputcsv() to generate a CSV file, convert the data before passing it to fputcsv():
$data = mb_convert_encoding($data, 'cp1252', 'utf-8');
fputcsv($file, $data);
This should stop the dash from turning into â€" in the generated CSV, provided the program that opens the file expects Windows-1252 (cp1252).
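A self-contained sketch of that conversion follows. It writes to `php://memory` rather than a real file so it can be run anywhere. The sample row is made up for illustration, and the mbstring extension is assumed. The delimiter, enclosure and escape arguments are spelled out only to avoid relying on defaults.

```php
<?php
// Made-up sample row containing an en dash (U+2013), encoded as UTF-8.
$row = ['Sivonen, K.', "pp. 41\u{2013}111"];

// mb_convert_encoding() also accepts an array and converts each string in it.
$converted = mb_convert_encoding($row, 'Windows-1252', 'UTF-8');

// Write the row to an in-memory stream and read the CSV text back.
$file = fopen('php://memory', 'w+');
fputcsv($file, $converted, ',', '"', '\\');
rewind($file);
$csv = stream_get_contents($file);
fclose($file);

// In Windows-1252 the en dash is the single byte 0x96.
echo bin2hex($converted[1]), "\n";
```

Note that this trades one assumption for another: the file is now correct only for readers that expect Windows-1252, and characters that do not exist in that code page are replaced (with "?" under mbstring's default settings).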
Source: https://stackoverflow.com/questions/19959794/php-domdocument-why-is-en-dash-converted-to-%c3%a2%e2%82%ac