Question
I have a form with a textarea. Users enter a block of text which is stored in a database.
Occasionally a user will paste text from Word containing smart quotes or em dashes. Those characters appear in the database as: –, ’, “, â€
What function should I call on the input string to convert smart quotes to regular quotes and em dashes to regular dashes?
I am working in PHP.
Update: Thanks for all of the great responses so far. The page on Joel's site about encodings is very informative: http://www.joelonsoftware.com/articles/Unicode.html
Some notes on my environment:
The MySQL database is using UTF-8 encoding. Likewise, the HTML pages that display the content are using UTF-8 (Update:) by explicitly setting the meta content-type.
On those pages the smart quotes and emdashes appear as a diamond with question mark.
Solution:
Thanks again for the responses. The solution was twofold:
- Make sure the database and HTML files were explicitly set to use UTF-8 encoding.
- Use htmlspecialchars() instead of htmlentities().
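As a rough illustration of why the second point matters, here is a minimal sketch. The sample string and variable names are made up for the example, and it assumes the text is already UTF-8. htmlspecialchars() leaves the curly quotes alone, while htmlentities() run with an ISO-8859-1 charset (its default before PHP 5.4) reads each UTF-8 byte as a separate character:

```php
<?php
// UTF-8 bytes for: left curly quote, "Tom & Jerry", right curly quote,
// a space, an en dash, a space, and a literal <b> tag.
$text = "\xE2\x80\x9CTom & Jerry\xE2\x80\x9D \xE2\x80\x93 <b>";

// htmlspecialchars() escapes only &, <, > and quotes. The curly quotes
// and the en dash pass through as the same UTF-8 bytes.
$safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// htmlentities() told to treat the input as ISO-8859-1 sees the first byte
// of each curly quote (0xE2) as "a with circumflex" and emits &acirc;,
// which is where the garbled output comes from.
$mangled = htmlentities($text, ENT_QUOTES, 'ISO-8859-1');

echo $safe, "\n";
echo $mangled, "\n";
```

On current PHP versions both functions default to UTF-8, so passing the charset explicitly is mostly a way of making the intent visible.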
Answer 1:
This sounds like a Unicode issue. Joel Spolsky has a good jumping off point on the topic: http://www.joelonsoftware.com/articles/Unicode.html
Answer 2:
The mysql database is using UTF-8 encoding. Likewise, the html pages that display the content are using UTF-8.
The content of the HTML can be in UTF-8, yes, but are you explicitly setting the content type (encoding) of your HTML pages (generated via PHP?) to UTF-8 as well? Try returning a Content-Type header of "text/html;charset=utf-8" or add <meta> tags to your HTML pages:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
That way, the content type of the data submitted to PHP will also be the same.
I had a similar issue and adding the <meta> tag worked for me.
Answer 3:
It sounds like the real problem is that your database is not using the same character encoding as your page (which should probably be UTF-8). In that case, if any user submits a non-ASCII character you'll probably see weird characters in the database. Finding and fixing just a few of them (curly quotes and em dashes) isn't going to solve the real problem.
Here is some info on migrating your database to another character encoding, at least for a MySQL database.
Answer 4:
This is an unfortunately all-too-common problem, not helped by PHP's very poor handling of character sets.
What we do is force the text through iconv
// Convert input data to UTF8, ignore any odd (MS Word..) chars
// that don't translate
$input = iconv("ISO-8859-1","UTF-8//IGNORE",$input);
The //IGNORE flag means that anything that can't be translated will be thrown away: characters that cannot be represented in the target charset are silently discarded.
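One caveat worth checking before relying on this: text pasted from Word on Windows is usually Windows-1252 rather than ISO-8859-1, and the two differ exactly in the byte range where the smart quotes live. The sketch below is an illustration with made-up sample bytes, not the answerer's code. It assumes the input really is single-byte Windows-1252 and that the iconv build recognises the 'Windows-1252' name. //IGNORE is left out because every byte in the sample maps to something in both source charsets.

```php
<?php
// Bytes as a Windows-1252 source would send them:
// 0x93 and 0x94 are the curly double quotes, 0x97 is the em dash.
$fromWord = "\x93quoted\x94 \x97 done";

// Read as ISO-8859-1, those bytes are C1 control codes. The result is
// valid UTF-8 but holds invisible control characters, not quotes.
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $fromWord);

// Read as Windows-1252, the same bytes become U+201C, U+201D and U+2014.
$asCp1252 = iconv('Windows-1252', 'UTF-8', $fromWord);

echo bin2hex($asLatin1), "\n";
echo $asCp1252, "\n";
```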
Answer 5:
We would often use standard string replace functions for that. Even though the nature of ASCII/Unicode in that context is pretty murky, it works. Just make sure your PHP file is saved in the right encoding, since the search strings in the source must use the same bytes as the input.
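A minimal sketch of that approach, assuming the input is UTF-8. The function name asciiPunctuation and the choice of replacements (for example "--" for an em dash) are illustrative, not something prescribed here. Writing the search strings as byte escapes sidesteps the question of how the PHP file itself is saved:

```php
<?php
// Hypothetical helper: map common Word punctuation (as UTF-8) to ASCII.
function asciiPunctuation(string $text): string
{
    $map = [
        "\xE2\x80\x98" => "'",   // U+2018 left single quote
        "\xE2\x80\x99" => "'",   // U+2019 right single quote
        "\xE2\x80\x9C" => '"',   // U+201C left double quote
        "\xE2\x80\x9D" => '"',   // U+201D right double quote
        "\xE2\x80\x93" => '-',   // U+2013 en dash
        "\xE2\x80\x94" => '--',  // U+2014 em dash
        "\xE2\x80\xA6" => '...', // U+2026 horizontal ellipsis
    ];

    // str_replace() works on bytes, which is safe here because each
    // search string is a complete UTF-8 sequence.
    return str_replace(array_keys($map), array_values($map), $text);
}

echo asciiPunctuation("\xE2\x80\x9CIt\xE2\x80\x99s nice\xE2\x80\x9D \xE2\x80\x94 really\xE2\x80\xA6"), "\n";
```

If the input turns out to be Windows-1252 instead, the same idea applies with single-byte keys such as "\x93", but the two maps should not be mixed on the same string.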
Answer 6:
In my experience, it's easier to just accept the smart quotes and make sure you're using the same encoding everywhere. To start, add this to your form tag: accept-charset="utf-8"
Answer 7:
You could try mb_convert_encoding from ISO-8859-1 to UTF-8.
$str = mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
This assumes you want UTF-8 and that the conversion can find reasonable replacements. If not, str_replace or preg_replace them yourself (PHP has no built-in mb_str_replace).
Answer 8:
You have to be sure your database connection is configured to accept and provide UTF-8 from and to the client (otherwise it will convert to the "default", which is usually latin1).
In practice this means running a query SET NAMES 'utf8';
http://www.phpwact.org/php/i18n/utf-8/mysql
Also, smart quotes are part of the windows-1252 character set, not iso-8859-1 (latin-1). Not very relevant to your problem, but just FYI. The euro symbol is in there as well.
Answer 9:
The problem is with the MySQL charset. I fixed my issue with this line of code:
mysql_set_charset('utf8',$link);
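The mysql_* functions were removed in PHP 7.0, so on current PHP the same setting has to go through mysqli or PDO. A hedged sketch follows. The helper name mysqlUtf8Dsn and the host, database, and credential values are placeholders, and utf8mb4 is used here as an assumption because it is MySQL's full 4-byte UTF-8 (plain utf8 also covers smart quotes and dashes):

```php
<?php
// Hypothetical helper: build a PDO DSN that sets the connection charset,
// so the server does not fall back to its default (often latin1).
function mysqlUtf8Dsn(string $host, string $dbname): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbname);
}

echo mysqlUtf8Dsn('localhost', 'app'), "\n";

// Usage with placeholder credentials, left commented out because it
// needs a running server:
// $pdo = new PDO(mysqlUtf8Dsn('localhost', 'app'), 'user', 'secret');
//
// The mysqli equivalent of the line above would be:
// $link = mysqli_connect('localhost', 'user', 'secret', 'app');
// mysqli_set_charset($link, 'utf8mb4');
```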
Answer 10:
You have to manually change the collation of individual columns to UTF8; changing the database overall won't alter these.
Answer 11:
If you were looking to escape these characters for the web while preserving their appearance, so your strings will appear like this: “It’s nice!” rather than "It's boring"...
You can do this by using your own custom htmlEncode function in place of PHP's htmlentities():
$trans_tbl = false;

function htmlEncode($text) {
    global $trans_tbl;

    // create translation table once
    if (!$trans_tbl) {
        // Start with the default set of conversions and add more.
        // The chr() keys below are single Windows-1252 bytes, so the base
        // table is requested for a single-byte charset too. This function
        // therefore expects Windows-1252 input, not UTF-8.
        $trans_tbl = get_html_translation_table(HTML_ENTITIES, ENT_COMPAT, 'ISO-8859-1');
        $trans_tbl[chr(130)] = '&sbquo;';   // Single Low-9 Quotation Mark
        $trans_tbl[chr(131)] = '&fnof;';    // Latin Small Letter F With Hook
        $trans_tbl[chr(132)] = '&bdquo;';   // Double Low-9 Quotation Mark
        $trans_tbl[chr(133)] = '&hellip;';  // Horizontal Ellipsis
        $trans_tbl[chr(134)] = '&dagger;';  // Dagger
        $trans_tbl[chr(135)] = '&Dagger;';  // Double Dagger
        $trans_tbl[chr(136)] = '&circ;';    // Modifier Letter Circumflex Accent
        $trans_tbl[chr(137)] = '&permil;';  // Per Mille Sign
        $trans_tbl[chr(138)] = '&Scaron;';  // Latin Capital Letter S With Caron
        $trans_tbl[chr(139)] = '&lsaquo;';  // Single Left-Pointing Angle Quotation Mark
        $trans_tbl[chr(140)] = '&OElig;';   // Latin Capital Ligature OE
        // smart single/double quotes (from MS)
        $trans_tbl[chr(145)] = '&lsquo;';
        $trans_tbl[chr(146)] = '&rsquo;';
        $trans_tbl[chr(147)] = '&ldquo;';
        $trans_tbl[chr(148)] = '&rdquo;';
        $trans_tbl[chr(149)] = '&bull;';    // Bullet
        $trans_tbl[chr(150)] = '&ndash;';   // En Dash
        $trans_tbl[chr(151)] = '&mdash;';   // Em Dash
        $trans_tbl[chr(152)] = '&tilde;';   // Small Tilde
        $trans_tbl[chr(153)] = '&trade;';   // Trade Mark Sign
        $trans_tbl[chr(154)] = '&scaron;';  // Latin Small Letter S With Caron
        $trans_tbl[chr(155)] = '&rsaquo;';  // Single Right-Pointing Angle Quotation Mark
        $trans_tbl[chr(156)] = '&oelig;';   // Latin Small Ligature OE
        $trans_tbl[chr(159)] = '&Yuml;';    // Latin Capital Letter Y With Diaeresis
        ksort($trans_tbl);
    }

    // escape HTML
    return strtr($text, $trans_tbl);
}
Answer 12:
This may not be the best solution, but I'd try testing to find out what PHP sees. Let's say it sees "–" (there are a few other possibilities, like simple "“" or maybe "“"). Then do a str_replace to get rid of all of those and replace them with normal quotes, before stuffing the answer in a database.
The better solution would probably involve making the end-to-end data passing all UTF-8, as people are trying to help with in other answers.
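One way to find out what PHP actually receives is to look at the raw bytes rather than at how a browser or database client renders them. The helper below is a hypothetical diagnostic written for this illustration. It relies on preg_match() with the /u modifier failing on input that is not valid UTF-8:

```php
<?php
// Hypothetical diagnostic: say whether the input is valid UTF-8 and
// list its bytes in hex, so smart quotes can be identified reliably.
function describeBytes(string $input): string
{
    // With the /u modifier, preg_match() returns false (not 1) when the
    // subject is not valid UTF-8.
    $valid = preg_match('//u', $input) === 1 ? 'valid UTF-8' : 'not valid UTF-8';

    return $valid . ': ' . implode(' ', str_split(bin2hex($input), 2));
}

// Curly quotes as single Windows-1252 bytes.
echo describeBytes("\x93hi\x94"), "\n";

// The same quotes as three-byte UTF-8 sequences.
echo describeBytes("\xE2\x80\x9Chi\xE2\x80\x9D"), "\n";
```

Seeing 93 and 94 points to Windows-1252 input, while e2 80 9c and e2 80 9d mean the text arrived as UTF-8 and the damage happens later, at storage or display time.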
Answer 13:
Actually, the problem is not happening in PHP but in JavaScript, because of the copy/paste from Word. You need to solve it in JavaScript before you pass the text to PHP. Please see this answer: https://stackoverflow.com/a/6219023/1857295
Source: https://stackoverflow.com/questions/175785/how-do-i-convert-word-smart-quotes-and-em-dashes-in-a-string