I'm looking for a general strategy/advice on how to handle invalid UTF-8 input from users.
Even though my webapp uses UTF-8, somehow some users enter invalid characters.
In every response your PHP code sends, set UTF-8 as the character set in the Content-Type header:
header('Content-Type: text/html; charset=utf-8');
Try doing what Rails does to force all browsers to always post UTF-8 data:
<form accept-charset="UTF-8" action="#{action}" method="post"><div
style="margin:0;padding:0;display:inline">
<input name="utf8" type="hidden" value="✓" />
</div>
<!-- form fields -->
</form>
See railssnowman.info or the initial patch for an explanation.
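On the PHP side, the hidden field from the form above can be checked when the request arrives. The following is only a minimal sketch: the helper name `posted_as_utf8` is hypothetical, and it assumes PHP 7 or later (for the `\u{...}` escape) and that the field is named `utf8` as in the form above.

```php
<?php
// Hypothetical helper, not part of PHP or Rails: returns true when the
// hidden "utf8" marker field arrived as the UTF-8 encoding of U+2713 (✓).
function posted_as_utf8(array $post): bool
{
    // "\u{2713}" yields the three UTF-8 bytes E2 9C 93.
    // If the browser submitted the form in another charset, the bytes differ.
    return isset($post['utf8']) && $post['utf8'] === "\u{2713}";
}
```

A handler could call `posted_as_utf8($_POST)` first and answer with an error (for example via `http_response_code(400)`) when it returns false, instead of trying to guess the encoding.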
Serve the page as UTF-8 in the Content-Type header (or with a meta http-equiv tag).
Use accept-charset="UTF-8" in the form.
Add a hidden input whose value is a character such as ✓, which can only be from the Unicode charset (and, in this example, not the Korean charset).
Receiving invalid characters from your web app might have to do with the character sets assumed for HTML forms. You can specify which character set to use for forms with the accept-charset attribute:
<form action="..." accept-charset="UTF-8">
You might also want to look at similar questions on Stack Overflow for pointers on how to handle invalid characters. However, I think that signaling an error to the user is better than trying to clean up the invalid characters, which can cause unexpected loss of significant data or unexpected changes to the user's input.
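As a rough illustration of rejecting rather than cleaning, the sketch below collects the names of submitted fields that are not well-formed UTF-8 so they can be reported back to the user. The function name `invalid_utf8_fields` is hypothetical, and the sketch only looks at flat string fields. It relies on `preg_match` with the `u` modifier, which fails on a subject that is not valid UTF-8; if the mbstring extension is available, `mb_check_encoding($value, 'UTF-8')` could be used for the same check.

```php
<?php
// Hypothetical helper: returns the names of fields whose values are not
// well-formed UTF-8. The values themselves are left untouched.
function invalid_utf8_fields(array $input): array
{
    $bad = [];
    foreach ($input as $name => $value) {
        // With the "u" modifier, preg_match() returns false for a subject
        // that is not valid UTF-8 and 1 for a valid one (empty pattern).
        if (is_string($value) && preg_match('//u', $value) !== 1) {
            $bad[] = $name;
        }
    }
    return $bad;
}
```

If `invalid_utf8_fields($_POST)` returns a non-empty array, the form can be shown again with a message naming those fields, so nothing the user typed is silently dropped or changed.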