Regex to remove non-alphanumeric characters from UTF-8 strings

傲寒 2021-01-11 11:36

How can I remove characters such as punctuation, commas, and dashes from a string in a multibyte-safe manner?

I will be working with input from many different languages.

4 Answers
  • 2021-01-11 11:58

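    Maybe this will be useful?

    // Keeps ASCII letters, digits and whitespace only; everything else (including non-ASCII letters) is removed.
    $newstring = preg_replace('/[^0-9a-zA-Z\s]/', '', $oldstring);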
    
  • 2021-01-11 12:18

    I used this:

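    // Replace every run of characters that are neither letters nor digits with a single space.
    $clean = preg_replace( "/[^\p{L}\p{N}]+/u", " ", $raw );
    // Collapse any remaining runs of separator characters into one space.
    $clean = preg_replace( "/[\p{Z}]{2,}/u", " ", $clean );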
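
As a rough sketch, the same idea can be wrapped in a small function. The function name `keep_letters_and_digits` is hypothetical and not part of the answer. The class is written as `[^\p{L}\p{N}]`, because inside a character class a `|` would be treated as a literal pipe rather than as "or". The null check is there because, with the `/u` modifier, `preg_replace()` returns `null` when the input is not valid UTF-8.

```php
<?php
// Hypothetical helper for illustration; the name is an assumption, not from the answer.
function keep_letters_and_digits(string $raw): string
{
    // Replace each run of characters that are neither letters nor digits with one space.
    $clean = preg_replace('/[^\p{L}\p{N}]+/u', ' ', $raw);

    if ($clean === null) {
        // With /u, preg_replace() returns null if the subject is not valid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    // Drop the leading or trailing space left by punctuation at the edges.
    return trim($clean);
}

echo keep_letters_and_digits("Héllo, wörld -- 你好!"), "\n";
```

Note that `\p{L}` does not include combining marks (`\p{M}`), so text stored in decomposed form could lose its accents with this pattern. Adding `\p{M}` to the class is one possible adjustment if that matters for your input.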
    
  • 2021-01-11 12:21

    There are Unicode character properties that you can use:

    • http://www.regular-expressions.info/unicode.html
    • http://php.net/manual/en/regexp.reference.unicode.php

    To match any non-letter symbols you can just use \PL+, the negation of \p{L}. To keep spaces, use a character class like [^\pL\s]+. Or really just remove punctuation with \pP+.

    Well, and obviously don't forget the regex /u modifier.
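
To make the difference between those three patterns concrete, here is a small sketch that runs each of them over the same sample string. The sample text and the variable names are made up for illustration only.

```php
<?php
// Sample input chosen for illustration: Cyrillic letters, digits, ASCII letters, punctuation.
$text = "Привет, мир! 2021 - ok?";

// \PL+ removes everything that is not a letter, so digits and spaces go too.
$lettersOnly = preg_replace('/\PL+/u', '', $text);

// [^\pL\s]+ removes non-letters but leaves whitespace alone.
$lettersAndSpaces = preg_replace('/[^\pL\s]+/u', '', $text);

// \pP+ removes punctuation only; letters, digits and spaces stay.
$noPunctuation = preg_replace('/\pP+/u', '', $text);

var_dump($lettersOnly, $lettersAndSpaces, $noPunctuation);
```

Removing characters this way leaves the surrounding spaces in place, so a follow-up pass that collapses repeated whitespace may be needed depending on the use case.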

  • 2021-01-11 12:21

    Similar post

    Remove non-utf8 characters from string

    I'm not sure if this covers all characters though.

    According to this post on the dreamincode forum

    http://www.dreamincode.net/forums/topic/78179-regular-expression-to-remove-non-ascii-characters/

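    this should work for stripping non-ASCII characters, but note that it also removes accented and non-Latin letters, so it does not suit multilingual input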

    /[^\x{21}-\x{7E}\s\t\n\r]/
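
For comparison, the sketch below applies this ASCII-range pattern and a Unicode-property pattern to the same string. The `/u` modifier is added to the ASCII-range pattern here as an assumption, so that whole characters rather than single bytes are matched. The sample text and variable names are illustrative only.

```php
<?php
// Sample input chosen for illustration: accented Latin letters and CJK characters.
$text = "naïve café, 你好!";

// ASCII-range pattern from the linked post, with /u added (an assumption) to match whole characters.
$asciiOnly = preg_replace('/[^\x{21}-\x{7E}\s\t\n\r]/u', '', $text);

// Unicode-aware alternative: keep letters, digits and whitespace in any script.
$unicodeAware = preg_replace('/[^\p{L}\p{N}\s]+/u', '', $text);

var_dump($asciiOnly, $unicodeAware);
```

The first result keeps ASCII punctuation and drops every non-ASCII letter, while the second drops punctuation and keeps letters from all scripts, which is closer to what the question asks for.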
    