php + vim - बंगलौर (Bangalore) has a break before the last character र

◇◆丶佛笑我妖孽 提交于 2019-12-13 02:23:09

问题


I used http://translate.google.com/#en|hi|Bangalore to get the Hindi for Bangalore and बंगलौर.

But when I pasted it in vim there is a break before the last character र.
I am using preg_replace with the regex pattern /[^\p{L}\p{Nd}\p{Mn}_]/u for matching words. But this is treating the last character as a separate word.

This is my input string मैनेजमेंट, बंगलौर and I am expecting the output to be मैनेजमेंट बंगलौर after the preg_replace

$cleanedString = preg_replace('/[^\p{L}\p{Nd}\p{Mn}_]/u', ' ', $name);

But the output I am getting is मैनेजमेंट बंगल र . What am I doing wrong here? I guess the problem starts from how vim handled the text I pasted.


回答1:


Try this regex "/[^\p{L}\p{Nd}\p{Mn}\p{Mc}_]/u"

The O symbol in लौ takes extra horizontal space as opposed to the ae in मै. The unicode class \p{Mn} matches only non-spacing marks. Use \p{Mc} to match spacing-marks. You can use \p{M} to match all combining-marks: "/[^\p{L}\p{Nd}\p{M}_]/u"

From regular-expressions.info/unicode

\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).

  • \p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
  • \p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
  • \p{Me} or \p{Enclosing_Mark}: a character that encloses the character is is combined with (circle, square, keycap, etc.).


来源:https://stackoverflow.com/questions/3598212/php-vim-%e0%a4%ac%e0%a4%82%e0%a4%97%e0%a4%b2%e0%a5%8c%e0%a4%b0-bangalore-has-a-break-before-the-last-character-%e0%a4%b0

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!