I have a form with a textarea. Users enter a block of text which is stored in a database.
Occasionally a user will paste text from Word containing smart quotes or em dashes.
In my experience, it's easier to just accept the smart quotes and make sure you're using the same encoding everywhere. To start, add this to your form tag: accept-charset="utf-8"
Actually, the problem is not happening in PHP but in JavaScript, and it is caused by copy/paste from Word. You need to solve it in JavaScript before you pass your text to PHP. Please see this answer: https://stackoverflow.com/a/6219023/1857295.
If you were looking to escape these characters for the web while preserving their appearance, so your strings will appear like this: “It’s nice!” rather than "It's boring"...
You can do this by using your own custom htmlEncode function in place of PHP's htmlentities():
$trans_tbl = false;

function htmlEncode($text) {
    global $trans_tbl;

    // create translation table once
    if (!$trans_tbl) {
        // start with the default set of conversions and add more.
        // The chr() keys below are single Windows-1252 bytes, so the input is
        // assumed to be Windows-1252 and the single-byte table is requested
        // explicitly (since PHP 5.4 the default table is keyed by UTF-8).
        $trans_tbl = get_html_translation_table(HTML_ENTITIES, ENT_COMPAT, 'ISO-8859-1');
        $trans_tbl[chr(130)] = '&sbquo;';  // Single Low-9 Quotation Mark
        $trans_tbl[chr(131)] = '&fnof;';   // Latin Small Letter F With Hook
        $trans_tbl[chr(132)] = '&bdquo;';  // Double Low-9 Quotation Mark
        $trans_tbl[chr(133)] = '&hellip;'; // Horizontal Ellipsis
        $trans_tbl[chr(134)] = '&dagger;'; // Dagger
        $trans_tbl[chr(135)] = '&Dagger;'; // Double Dagger
        $trans_tbl[chr(136)] = '&circ;';   // Modifier Letter Circumflex Accent
        $trans_tbl[chr(137)] = '&permil;'; // Per Mille Sign
        $trans_tbl[chr(138)] = '&Scaron;'; // Latin Capital Letter S With Caron
        $trans_tbl[chr(139)] = '&lsaquo;'; // Single Left-Pointing Angle Quotation Mark
        $trans_tbl[chr(140)] = '&OElig;';  // Latin Capital Ligature OE
        // smart single/double quotes (from MS)
        $trans_tbl[chr(145)] = '&lsquo;';
        $trans_tbl[chr(146)] = '&rsquo;';
        $trans_tbl[chr(147)] = '&ldquo;';
        $trans_tbl[chr(148)] = '&rdquo;';
        $trans_tbl[chr(149)] = '&bull;';   // Bullet
        $trans_tbl[chr(150)] = '&ndash;';  // En Dash
        $trans_tbl[chr(151)] = '&mdash;';  // Em Dash
        $trans_tbl[chr(152)] = '&tilde;';  // Small Tilde
        $trans_tbl[chr(153)] = '&trade;';  // Trade Mark Sign
        $trans_tbl[chr(154)] = '&scaron;'; // Latin Small Letter S With Caron
        $trans_tbl[chr(155)] = '&rsaquo;'; // Single Right-Pointing Angle Quotation Mark
        $trans_tbl[chr(156)] = '&oelig;';  // Latin Small Ligature OE
        $trans_tbl[chr(159)] = '&Yuml;';   // Latin Capital Letter Y With Diaeresis
        ksort($trans_tbl);
    }

    // escape HTML
    return strtr($text, $trans_tbl);
}
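If the text is already UTF-8 by the time it reaches PHP, a custom table may not be needed at all. The following is a minimal sketch under that assumption: it passes the encoding to PHP's built-in htmlentities() explicitly, which turns curly quotes and dashes into their named entities. The helper name htmlEncodeUtf8 is hypothetical and not part of the answer above.

```php
<?php
// Hypothetical helper: escape UTF-8 text for HTML while keeping curly quotes
// and dashes visible in the browser as named entities.
function htmlEncodeUtf8(string $text): string
{
    // ENT_QUOTES also escapes straight single and double quotes;
    // ENT_HTML401 selects the HTML 4.01 entity names.
    return htmlentities($text, ENT_QUOTES | ENT_HTML401, 'UTF-8');
}

// Curly quotes written as Unicode escapes so the source file's own
// encoding does not matter.
echo htmlEncodeUtf8("\u{201C}It\u{2019}s nice!\u{201D}"), "\n";
```

This only behaves as described when the input really is UTF-8. Bytes in another encoding would need converting first.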
You could try mb_convert_encoding from ISO-8859-1 to UTF-8.
$str = mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
This assumes you want UTF-8 and that the conversion can find reasonable replacements. If not, replace the characters yourself with str_replace or preg_replace (PHP has no built-in mb_str_replace).
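One caveat worth sketching: Word's smart quotes occupy bytes 0x91 to 0x94 in Windows-1252, and those bytes have no printable equivalent in ISO-8859-1. The example below is a rough sketch, assuming the mbstring extension is loaded and that any input which is not valid UTF-8 is Windows-1252. The function names toUtf8 and straightenPunctuation are hypothetical.

```php
<?php
// Hypothetical helpers; assumes the mbstring extension is available.

// Return $text as UTF-8. Input that is not valid UTF-8 is ASSUMED to be
// Windows-1252; that is a guess about the source, not something detected.
function toUtf8(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8 (this includes plain ASCII)
    }
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}

// Optionally replace common "smart" punctuation in UTF-8 text with ASCII.
function straightenPunctuation(string $text): string
{
    return strtr($text, [
        "\u{2018}" => "'",   // left single quotation mark
        "\u{2019}" => "'",   // right single quotation mark
        "\u{201C}" => '"',   // left double quotation mark
        "\u{201D}" => '"',   // right double quotation mark
        "\u{2013}" => '-',   // en dash
        "\u{2014}" => '--',  // em dash
        "\u{2026}" => '...', // horizontal ellipsis
    ]);
}

// Windows-1252 bytes as they might arrive from a mis-declared form.
$fromWord = "\x93It\x92s nice\x94 \x97 really";
echo straightenPunctuation(toUtf8($fromWord)), "\n";
```

Converting first and replacing second keeps the replacement table in a single encoding. Skipping straightenPunctuation keeps the smart quotes intact as UTF-8.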
The MySQL database is using UTF-8 encoding. Likewise, the HTML pages that display the content are using UTF-8.
The content of the HTML can be in UTF-8, yes, but are you explicitly setting the content type (encoding) of your HTML pages (generated via PHP?) to UTF-8 as well? Try returning a Content-Type header of "text/html;charset=utf-8" or add a <meta> tag to your HTML:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
That way, the content type of the data submitted to PHP will also be the same.
I had a similar issue and adding the <meta> tag worked for me.
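Putting the header, the <meta> tag, and the accept-charset attribute together might look like the sketch below. It is only an illustration: the function name renderCommentForm, the action URL /save.php, and the field name body are all made up for the example.

```php
<?php
// Hypothetical page renderer: returns an HTML page whose <meta> tag and
// form both declare UTF-8.
function renderCommentForm(string $action): string
{
    // Escape the action URL before placing it in an attribute.
    $safeAction = htmlspecialchars($action, ENT_QUOTES, 'UTF-8');

    return <<<HTML
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
</head>
<body>
<form method="post" action="{$safeAction}" accept-charset="utf-8">
<textarea name="body"></textarea>
<input type="submit" value="Save"/>
</form>
</body>
</html>
HTML;
}

// Send the HTTP header before any output, then the page itself.
if (!headers_sent()) {
    header('Content-Type: text/html;charset=utf-8');
}
echo renderCommentForm('/save.php');
```

Declaring the encoding in all three places is redundant by design, so the browser has no reason to guess a different encoding when it submits the textarea.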
It sounds like the real problem is that your database is not using the same character encoding as your page (which should probably be UTF-8). In that case, if any user submits a non-ASCII character you'll probably see weird characters in the database. Finding and fixing just a few of them (curly quotes and em dashes) isn't going to solve the real problem.
Here is some info on migrating your database to another character encoding, at least for a MySQL database.