Question
I used http://translate.google.com/#en|hi|Bangalore to get the Hindi for Bangalore, which is बंगलौर.
But when I pasted it into vim, there was a break before the last character, र.
I am using preg_replace with the regex pattern /[^\p{L}\p{Nd}\p{Mn}_]/u for matching words. But this is treating the last character as a separate word.
My input string is मैनेजमेंट, बंगलौर and I expect the output to be मैनेजमेंट बंगलौर after the preg_replace:
$cleanedString = preg_replace('/[^\p{L}\p{Nd}\p{Mn}_]/u', ' ', $name);
But the output I am getting is मैनेजमेंट बंगल र . What am I doing wrong here? I guess the problem starts from how vim handled the text I pasted.
Answer 1:
Try this regex: "/[^\p{L}\p{Nd}\p{Mn}\p{Mc}_]/u"
The O vowel sign (ौ) in लौ takes extra horizontal space, as opposed to the ae vowel sign (ै) in मै. The Unicode class \p{Mn} matches only non-spacing marks. Use \p{Mc} to match spacing marks. You can use \p{M} to match all combining marks: "/[^\p{L}\p{Nd}\p{M}_]/u"
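As a rough illustration of the difference, the sketch below runs the question's input through both the original pattern and the widened \p{M} pattern. It assumes PHP 7 or later, PCRE built with Unicode property support, and a source file saved as UTF-8; the variable names $original and $fixed are only for this example.

```php
<?php
// Input string from the question (the file must be saved as UTF-8).
$name = "मैनेजमेंट, बंगलौर";

// Original pattern: keeps letters, decimal digits, NON-SPACING marks and "_".
// The vowel sign ौ (U+094C) is a spacing mark (Mc), so it is replaced by a space.
$original = preg_replace('/[^\p{L}\p{Nd}\p{Mn}_]/u', ' ', $name);

// Widened pattern: \p{M} covers Mn, Mc and Me, so ौ is kept.
$fixed = preg_replace('/[^\p{L}\p{Nd}\p{M}_]/u', ' ', $name);

echo $original, PHP_EOL; // the second word comes out split: बंगल र
echo $fixed, PHP_EOL;    // the second word stays whole: बंगलौर
```

Note that the comma and the space after it are each replaced by a space, so both results contain two spaces between the words.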
From regular-expressions.info/unicode:
\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
\p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
\p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
\p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
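To see which of these categories each code point of बंगलौर falls into, something like the following could be used. markCategory is a hypothetical helper written for this illustration, not part of PHP or of the original answer; the sketch assumes PHP 7 or later (for the "\u{...}" escapes) and PCRE with Unicode property support.

```php
<?php
// Hypothetical helper: report which mark category a single character
// belongs to, using PCRE Unicode properties. Returns 'other' for non-marks.
function markCategory(string $char): string
{
    foreach (['Mn', 'Mc', 'Me'] as $category) {
        if (preg_match('/^\p{' . $category . '}$/u', $char) === 1) {
            return $category;
        }
    }
    return 'other';
}

// The six code points of the word from the question, written as escapes
// so the example does not depend on the file's encoding.
$codePoints = [
    'U+092C' => "\u{092C}", // letter BA
    'U+0902' => "\u{0902}", // sign ANUSVARA
    'U+0917' => "\u{0917}", // letter GA
    'U+0932' => "\u{0932}", // letter LA
    'U+094C' => "\u{094C}", // vowel sign AU
    'U+0930' => "\u{0930}", // letter RA
];

foreach ($codePoints as $label => $char) {
    echo $label, ' ', markCategory($char), PHP_EOL;
}
```

Only U+094C is reported as Mc, which is why it was the one character the original pattern replaced.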
Source: https://stackoverflow.com/questions/3598212/php-vim-%e0%a4%ac%e0%a4%82%e0%a4%97%e0%a4%b2%e0%a5%8c%e0%a4%b0-bangalore-has-a-break-before-the-last-character-%e0%a4%b0