Regex to strip out everything but words and numbers (and latin chars)

北城以北 submitted on 2019-12-25 10:47:10

Question


I'm trying to clean a POST string used in an Ajax request (sanitizing it before a DB query) so that it allows only alphanumeric characters, spaces (one between words, not several), "-", and Latin characters such as "ç" and "é", so far without success. Can anyone help or point me in the right direction?

This is the regex I'm using so far:

$string = preg_replace('/^[a-z0-9 àáâãäåçèéêëìíîïðñòóôõöøùúû-]+$/', '', mb_strtolower(utf8_encode($_POST['q'])));

Thank you.


Answer 1:


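// Lowercase the input first so that the a-z range is sufficient.
$string = mb_strtolower(utf8_encode($_POST['q']));
// Remove every character outside the whitelist; the u flag treats pattern and subject as UTF-8.
$string = preg_replace('/[^a-z0-9 àáâãäåçèéêëìíîïðñòóôõöøùúû-]+/u', '', $string);
// Collapse runs of spaces into a single space.
$string = preg_replace('/ +/', ' ', $string);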

Why not just use mysql_real_escape_string?




Answer 2:


$regEx = '/^[^\w\p{L}-]+$/iu';

\w - matches alphanumerics and the underscore

\p{L} - matches a single Unicode code point in the 'Letter' category.

- at the end of the character class matches a single hyphen.

^ as the first character of the character class negates it, so the regex matches the opposite of the class (anything you did not specify).

+ outside of the character class says to match 1 or more characters.

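^ and $ outside of the character class make the engine accept only matches that start at the beginning of a line and run to the end of the line. As a result, the pattern matches only when the entire line consists of disallowed characters; to strip such characters from within a string using preg_replace, drop the anchors.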

After the pattern, the i modifier makes the match case-insensitive, and the u modifier tells the regex engine that the pattern and subject are UTF-8. The g modifier that was originally present has been removed because PHP does not support it; whether matching is global depends on which matching function is called (for example, preg_match versus preg_match_all).
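A minimal sketch of how the anchors change the result when this pattern is passed to preg_replace. The sample input and the unanchored variant (with a literal space added to the class so that spaces survive) are assumptions for illustration, not part of the answer itself. The input is assumed to be valid UTF-8 already.

```php
<?php
// Hypothetical sample input, assumed to be valid UTF-8.
$input = 'Olá, mundo! ça-va? 123';

// Anchored pattern from the answer: it matches only when the WHOLE
// string consists of disallowed characters, so nothing is removed here.
$anchored = preg_replace('/^[^\w\p{L}-]+$/iu', '', $input);

// Unanchored variant (an assumption): each run of disallowed characters
// is removed wherever it occurs. A space is added to the class.
$stripped = preg_replace('/[^\w\p{L} -]+/u', '', $input);

var_dump($anchored === $input); // bool(true)
echo $stripped, "\n";           // Olá mundo ça-va 123
```

Note that \w also lets the underscore through, so a stricter whitelist would need to list the allowed characters explicitly.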




Answer 3:


$string = preg_replace('/[^a-z0-9 àáâãäåçèéêëìíîïðñòóôõöøùúû\-]/u', '', mb_strtolower(utf8_encode($_POST['q']), 'UTF-8'));
$string = preg_replace( '/ +/', ' ', $string );

This should do the trick. Note that:

  • the character class is negated by putting ^ as its first character
  • you need the u flag when dealing with Unicode strings, whether in the pattern or in the subject
  • it's better to specify the character set explicitly in mb_* functions, because otherwise they fall back on your system defaults, which may not be UTF-8
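  • the hyphen is escaped (\- instead of -); a hyphen at the very end of a character class is already treated literally, so the escape is optional there, but it guards against accidentally creating a range if more characters are appended later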
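The points above can be combined into one small helper. This is only a sketch: the function name clean_query is hypothetical, the input is assumed to be valid UTF-8 already (so utf8_encode is left out), and the final trim() is an addition that the answer does not include.

```php
<?php
// Hypothetical helper; assumes $raw is already valid UTF-8.
function clean_query(string $raw): string
{
    // Lowercase with an explicit encoding instead of the system default.
    $s = mb_strtolower($raw, 'UTF-8');

    // Remove everything outside the whitelist (negated class, u flag,
    // escaped hyphen). preg_replace returns null on error, for example
    // on invalid UTF-8, so fall back to an empty string.
    $s = preg_replace('/[^a-z0-9 àáâãäåçèéêëìíîïðñòóôõöøùúû\-]/u', '', $s) ?? '';

    // Collapse runs of spaces into one, then trim both ends.
    $s = preg_replace('/ +/', ' ', $s) ?? '';
    return trim($s);
}

echo clean_query("  Crème   BRÛLÉE -- n°1!  "), "\n"; // crème brûlée -- n1
```

Whitelisting like this only normalizes the search term; it does not replace proper escaping or parameter binding when the value reaches the database.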


Source: https://stackoverflow.com/questions/6982915/regex-to-strip-out-everything-but-words-and-numbers-and-latin-chars
