I have a form with a textarea. Users enter a block of text which is stored in a database.
Occasionally a user will paste text from Word containing smart quotes or em
This is an unfortunately all-too-common problem, not helped by PHP's very poor handling of character sets.
What we do is force the text through iconv:
// Convert input data to UTF8, ignore any odd (MS Word..) chars
// that don't translate
$input = iconv("ISO-8859-1","UTF-8//IGNORE",$input);
The //IGNORE flag means that anything that can't be translated will be thrown away.
If you append the string //IGNORE, characters that cannot be represented in the target charset are silently discarded.
This sounds like a Unicode issue. Joel Spolsky has a good jumping-off point on the topic: http://www.joelonsoftware.com/articles/Unicode.html
The problem is the MySQL connection charset. I fixed my issue with this line of code:
mysql_set_charset('utf8',$link);
This may not be the best solution, but I'd try testing to find out what PHP sees. Let's say it sees "–" (there are a few other possibilities, like simple "“" or maybe "“"). Then do a str_replace to get rid of all of those and replace them with normal quotes, before stuffing the answer in a database.
The better solution would probably involve making the end-to-end data passing all UTF-8, as people are trying to help with in other answers.
We would often use the standard string replace functions for that. Even though the nature of ASCII/Unicode in that context is pretty murky, it works. Just make sure your PHP file is saved in the same encoding as the text you are matching against.
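As a rough sketch of the str_replace approach described above: the helper name asciiPunctuation and the choice of replacements are assumptions for illustration, and it only works if the input is already UTF-8. Writing the search strings as byte escapes means the result does not depend on how the PHP file itself is saved.

```php
<?php
// Hypothetical helper: swaps common "smart" punctuation for plain ASCII.
// Assumes $text is already UTF-8 encoded.
function asciiPunctuation($text)
{
    // Keys are the UTF-8 byte sequences, written as escapes so the
    // encoding of this source file does not matter.
    $map = array(
        "\xE2\x80\x98" => "'",   // U+2018 left single quote
        "\xE2\x80\x99" => "'",   // U+2019 right single quote
        "\xE2\x80\x9C" => '"',   // U+201C left double quote
        "\xE2\x80\x9D" => '"',   // U+201D right double quote
        "\xE2\x80\x93" => '-',   // U+2013 en dash
        "\xE2\x80\x94" => '--',  // U+2014 em dash
        "\xE2\x80\xA6" => '...', // U+2026 ellipsis
    );

    return str_replace(array_keys($map), array_values($map), $text);
}
```

This throws away typographic detail, so it is probably only appropriate when the stored text really has to be plain ASCII.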
You have to be sure your database connection is configured to accept and provide UTF-8 from and to the client (otherwise it will convert to the "default", which is usually latin1).
In practice, this means running the query SET NAMES 'utf8'; right after connecting.
http://www.phpwact.org/php/i18n/utf-8/mysql
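For anyone on a newer PHP where the old mysql_* functions are gone, a minimal sketch of the same idea with mysqli might look like the following. The function name openUtf8Connection and its parameters are made up for illustration; only the mysqli constructor and set_charset() are real API. The default of 'utf8' mirrors the answers here; 'utf8mb4' may be the better choice on MySQL 5.5.3 or later, since it covers the full Unicode range.

```php
<?php
// Hypothetical helper: host, credentials and database name are placeholders
// supplied by the caller.
function openUtf8Connection($host, $user, $password, $database, $charset = 'utf8')
{
    $link = new mysqli($host, $user, $password, $database);
    if ($link->connect_error) {
        throw new RuntimeException('Connect failed: ' . $link->connect_error);
    }

    // Sets the connection charset much like SET NAMES does, and also tells
    // the client library, so string escaping uses the same charset.
    if (!$link->set_charset($charset)) {
        throw new RuntimeException('Could not set charset: ' . $link->error);
    }

    return $link;
}
```

The table and column charsets still need to be UTF-8 as well, otherwise the server converts on the way in and out.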
Also, smart quotes are part of the Windows-1252 character set, not ISO-8859-1 (Latin-1). Not very relevant to your problem, but just FYI. The euro symbol is in there as well.
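To illustrate the difference, here is a small sketch that converts the same bytes to UTF-8 under both assumptions. The sample bytes are made up for the demo, and the charset name CP1252 is assumed to be recognised by the iconv implementation PHP was built against (glibc's is).

```php
<?php
// Windows-1252 bytes for: left smart quote, "Hi", right smart quote,
// a space, and the euro sign.
$cp1252 = "\x93Hi\x94 \x80";

// Treating the bytes as Windows-1252 yields the real punctuation.
$fromCp1252 = iconv('CP1252', 'UTF-8', $cp1252);

// Treating them as ISO-8859-1 maps 0x80-0x9F to invisible C1 control
// characters, so the quotes and the euro sign are lost.
$fromLatin1 = iconv('ISO-8859-1', 'UTF-8', $cp1252);

echo bin2hex($fromCp1252), "\n"; // e2809c4869e2809d20e282ac
echo bin2hex($fromLatin1), "\n"; // c2934869c29420c280
```

Note that neither conversion fails, so //IGNORE has nothing to discard in this case; the wrong source charset simply produces the wrong characters.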