How can I remove characters such as punctuation, commas, and dashes from a string in a multibyte-safe manner?
I will be working with input from many different languages.
There are the Unicode character properties that you can use:
To match any non-letter symbols you can just use \PL+, the negation of \p{L}. To not remove spaces, use a character class like [^\pL\s]+. Or really just remove punctuation with \pP+.
Well, and obviously don't forget the regex /u modifier.
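As a rough sketch of how the three patterns behave, here they are in TypeScript. That is an assumption on my part, since the question doesn't name a language. The function and constant names are made up for illustration. Note that JavaScript/TypeScript only accepts the braced form (\p{L}, not \pL), and the Unicode modifier is passed as the `u` flag, alongside `g` to replace every match.

```typescript
// Hypothetical helper names; the patterns mirror the ones described above.
// The RegExp constructor is used so the Unicode flag is explicit: "g" + "u".

// \P{L}+ : runs of anything that is not a letter (digits and spaces included).
const NON_LETTERS = new RegExp("\\P{L}+", "gu");

// [^\p{L}\s]+ : runs of anything that is neither a letter nor whitespace.
const NON_LETTERS_KEEP_SPACES = new RegExp("[^\\p{L}\\s]+", "gu");

// \p{P}+ : runs of punctuation only (commas, dashes, full stops, ...).
const PUNCTUATION = new RegExp("\\p{P}+", "gu");

function stripNonLetters(input: string): string {
  return input.replace(NON_LETTERS, "");
}

function stripNonLettersKeepSpaces(input: string): string {
  return input.replace(NON_LETTERS_KEEP_SPACES, "");
}

function stripPunctuation(input: string): string {
  return input.replace(PUNCTUATION, "");
}

// Sample input with accented Latin letters, CJK letters and CJK punctuation.
const sample = "H\u00e9llo, w\u00f6rld! \u4f60\u597d\u3002";

console.log(stripNonLetters(sample));
console.log(stripNonLettersKeepSpaces(sample));
console.log(stripPunctuation(sample));
```

The punctuation-only variant is the most conservative of the three: it leaves digits and whitespace alone, whereas the two letter-based patterns also strip numbers.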