How can I remove characters, like punctuation, commas, dashes etc from a string, in a multibyte safe manner?
I will be working with input from many different languag
Maybe this will be usefull?
$newstring = preg_replace('/[^0-9a-zA-Z\s]/', $oldstring);
I used this:
$clean = preg_replace( "/[^\p{L}|\p{N}]+/u", " ", $raw );
$clean = preg_replace( "/[\p{Z}]{2,}/u", " ", $clean );
There are the unicode character class thingys that you can use:
To match any non-letter symbols you can just use \PL+
, the negation of \p{L}
. To not remove spaces, use a charclass like [^\pL\s]+
. Or really just remove punctuation with \pP+
Well, and obviously don't forget the regex /u
modifier.
Similar post
Remove non-utf8 characters from string
I'm not sure if this covers all characters though.
According to this post on th dreamincode forum
http://www.dreamincode.net/forums/topic/78179-regular-expression-to-remove-non-ascii-characters/
this should work
/[^\x{21}-\x{7E}\s\t\n\r]/