php + vim - बंगलौर (Bangalore) has a break before the last character र

Question

I used http://translate.google.com/#en|hi|Bangalore to get the Hindi for Bangalore, which is बंगलौर.

But when I pasted it into vim, there was a break before the last character र.
I am using preg_replace with the regex pattern /[^\p{L}\p{Nd}\p{Mn}_]/u to match words, but it treats the last character as a separate word.

My input string is मैनेजमेंट, बंगलौर and I expect the output to be मैनेजमेंट बंगलौर after the preg_replace:

$cleanedString = preg_replace('/[^\p{L}\p{Nd}\p{Mn}_]/u', ' ', $name);

But the output I am getting is मैनेजमेंट बंगल र. What am I doing wrong here? I guess the problem starts with how vim handled the text I pasted.

Answer 1:

Try this regex "/[^\p{L}\p{Nd}\p{Mn}\p{Mc}_]/u"

The vowel sign ौ (AU, U+094C) in लौ takes extra horizontal space, unlike the vowel sign ै (AI, U+0948) in मै, so it is a spacing mark. The Unicode class \p{Mn} matches only non-spacing marks. Use \p{Mc} to match spacing marks, or \p{M} to match all combining marks: "/[^\p{L}\p{Nd}\p{M}_]/u"
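To see the difference side by side, here is a minimal sketch that runs the pattern from the question and the \p{M} pattern on the same input. The variable names $original and $fixed are only illustrative, and the script assumes the source file is saved as UTF-8.

```php
<?php
// Input string from the question.
$name = 'मैनेजमेंट, बंगलौर';

// Pattern from the question: \p{Mn} covers only non-spacing marks,
// so the spacing vowel sign ौ is treated as a separator and replaced.
$original = preg_replace('/[^\p{L}\p{Nd}\p{Mn}_]/u', ' ', $name);

// Pattern from the answer: \p{M} covers all combining marks.
$fixed = preg_replace('/[^\p{L}\p{Nd}\p{M}_]/u', ' ', $name);

// Check whether the word बंगलौर survived intact in each result.
var_dump(strpos($original, 'बंगलौर') !== false); // word is broken apart
var_dump(strpos($fixed, 'बंगलौर') !== false);    // word is preserved
```

Note that each replaced character becomes its own space, so the comma and the space after it leave two spaces between the words. If a single space is wanted, one option is to put a + quantifier after the character class.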

From regular-expressions.info/unicode

\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).

\p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).

\p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).

\p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
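As a rough illustration of these categories, the following sketch classifies each code point of बंगलौर by testing it against \p{L}, \p{Mn}, and \p{Mc}. The word is written with \u{...} escapes (PHP 7.0 or later) so the code points are explicit, and the $categories array is a hypothetical helper for this example only.

```php
<?php
// बंगलौर spelled out code point by code point.
$word = "\u{092C}\u{0902}\u{0917}\u{0932}\u{094C}\u{0930}";

// Split the UTF-8 string into individual code points.
$chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);

$categories = [];
foreach ($chars as $ch) {
    if (preg_match('/^\p{Mn}$/u', $ch)) {
        $categories[] = 'Mn';    // non-spacing mark
    } elseif (preg_match('/^\p{Mc}$/u', $ch)) {
        $categories[] = 'Mc';    // spacing combining mark
    } elseif (preg_match('/^\p{L}$/u', $ch)) {
        $categories[] = 'L';     // letter
    } else {
        $categories[] = 'other';
    }
}

// Prints: L Mn L L Mc L
echo implode(' ', $categories), "\n";
```

The single Mc entry is the vowel sign ौ (U+094C), which is the character the pattern in the question replaces with a space.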

Source: https://stackoverflow.com/questions/3598212/php-vim-%e0%a4%ac%e0%a4%82%e0%a4%97%e0%a4%b2%e0%a5%8c%e0%a4%b0-bangalore-has-a-break-before-the-last-character-%e0%a4%b0

Tags

php

regex

vim

unicode

hindi