I have a form with a textarea. Users enter a block of text which is stored in a database.
Occasionally a user will paste text from Word containing smart quotes or emdashes. Those characters appear in the database as: –, ’, “ ,â€
What function should I call on the input string to convert smart quotes to regular quotes and emdashes to regular dashes?
I am working in PHP.
Update: Thanks for all of the great responses so far. The page on Joel's site about encodings is very informative: http://www.joelonsoftware.com/articles/Unicode.html
Some notes on my environment:
The MySQL database is using UTF-8 encoding. Likewise, the HTML pages that display the content are using UTF-8 (Update:) by explicitly setting the meta content-type.
On those pages the smart quotes and emdashes appear as a diamond with question mark.
Solution:
Thanks again for the responses. The solution was twofold:
- Make sure the database and HTML files were explicitly set to use UTF-8 encoding.
- Use
htmlspecialchars()
instead ofhtmlentities()
.
This sounds like a Unicode issue. Joel Spolsky has a good jumping off point on the topic: http://www.joelonsoftware.com/articles/Unicode.html
The mysql database is using UTF-8 encoding. Likewise, the html pages that display the content are using UTF-8.
The content of the HTML can be in UTF-8, yes, but are you explicitly setting the content type (encoding) of your HTML pages (generated via PHP?) to UTF-8 as well? Try returning a Content-Type
header of "text/html;charset=utf-8"
or add <meta>
tags to your HTMLs:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
That way, the content type of the data submitted to PHP will also be the same.
I had a similar issue and adding the <meta>
tag worked for me.
It sounds like the real problem is that your database is not using the same character encoding as your page (which should probably be UTF-8). In that case, if any user submits a non-ASCII character you'll probably see weird characters in the database. Finding and fixing just a few of them (curly quotes and em dashes) isn't going to solve the real problem.
Here is some info on migrating your database to another character encoding, at least for a MySQL database.
This is an unfortunately all-too-common problem, not helped by PHP's very poor handling of character sets.
What we do is force the text through iconv
// Convert input data to UTF8, ignore any odd (MS Word..) chars
// that don't translate
$input = iconv("ISO-8859-1","UTF-8//IGNORE",$input);
The //IGNORE
flag means that anything that can't be translated will be thrown away.
If you append the string //IGNORE, characters that cannot be represented in the target charset are silently discarded.
We would often use standard string replace functions for that. Even though the nature of ASCII/Unicode in that context is pretty murky, it works. Just make sure your php file is saved in the right encoding format, etc.
In my experience, it's easier to just accept the smart quotes and make sure you're using the same encoding everywhere. To start, add this to your form tag: accept-charset="utf-8"
You could try mb_ convert_encoding from ISO-8859-1 to UTF-8.
$str = mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
This assumes you want UTF-8, and convert can find reasonable replacements... if not, mb_str_replace or preg_replace them yourself.
You have to be sure your database connection is configured to accept and provide UTF-8 from and to the client (otherwise it will convert to the "default", which is usually latin1).
In practice this means running a query SET NAMES 'utf8';
http://www.phpwact.org/php/i18n/utf-8/mysql
Also, smart quotes are part of the windows-1252 character set, not iso-8859-1 (latin-1). Not very relevant to your problem, but just FYI. The euro symbol is in there as well.
the problem is on the mysql charset, I fixed my issues with this line of code.
mysql_set_charset('utf8',$link);
You have to manually change the collation of individual columns to UTF8; changing the database overall won't alter these.
If you were looking to escape these characters for the web while preserving their appearance, so your strings will appear like this: “It’s nice!” rather than "It's boring"...
You can do this by using your own custom htmlEncode function in place of PHP's htmlentities():
$trans_tbl = false;
function htmlEncode($text) {
global $trans_tbl;
// create translation table once
if(!$trans_tbl) {
// start with the default set of conversions and add more.
$trans_tbl = get_html_translation_table(HTML_ENTITIES);
$trans_tbl[chr(130)] = '‚'; // Single Low-9 Quotation Mark
$trans_tbl[chr(131)] = 'ƒ'; // Latin Small Letter F With Hook
$trans_tbl[chr(132)] = '„'; // Double Low-9 Quotation Mark
$trans_tbl[chr(133)] = '…'; // Horizontal Ellipsis
$trans_tbl[chr(134)] = '†'; // Dagger
$trans_tbl[chr(135)] = '‡'; // Double Dagger
$trans_tbl[chr(136)] = 'ˆ'; // Modifier Letter Circumflex Accent
$trans_tbl[chr(137)] = '‰'; // Per Mille Sign
$trans_tbl[chr(138)] = 'Š'; // Latin Capital Letter S With Caron
$trans_tbl[chr(139)] = '‹'; // Single Left-Pointing Angle Quotation Mark
$trans_tbl[chr(140)] = 'Œ'; // Latin Capital Ligature OE
// smart single/ double quotes (from MS)
$trans_tbl[chr(145)] = '‘';
$trans_tbl[chr(146)] = '’';
$trans_tbl[chr(147)] = '“';
$trans_tbl[chr(148)] = '”';
$trans_tbl[chr(149)] = '•'; // Bullet
$trans_tbl[chr(150)] = '–'; // En Dash
$trans_tbl[chr(151)] = '—'; // Em Dash
$trans_tbl[chr(152)] = '˜'; // Small Tilde
$trans_tbl[chr(153)] = '™'; // Trade Mark Sign
$trans_tbl[chr(154)] = 'š'; // Latin Small Letter S With Caron
$trans_tbl[chr(155)] = '›'; // Single Right-Pointing Angle Quotation Mark
$trans_tbl[chr(156)] = 'œ'; // Latin Small Ligature OE
$trans_tbl[chr(159)] = 'Ÿ'; // Latin Capital Letter Y With Diaeresis
ksort($trans_tbl);
}
// escape HTML
return strtr($text, $trans_tbl);
}
This may not be the best solution, but I'd try testing to find out what PHP sees. Let's say it sees "–" (there are a few other possibilities, like simple "“" or maybe "“"). Then do a str_replace to get rid of all of those and replace them with normal quotes, before stuffing the answer in a database.
The better solution would probably involve making the end-to-end data passing all UTF-8, as people are trying to help with in other answers.
Actually the problem is not happening in PHP but it is happening in JavaScript, it is due to copy/paste from Word, so you need to solve your problem in JavaScript before you pass your text to PHP, Please see this answer https://stackoverflow.com/a/6219023/1857295.
来源:https://stackoverflow.com/questions/175785/how-do-i-convert-word-smart-quotes-and-em-dashes-in-a-string