regex in Vietnamese characters

淺唱寂寞╮ submitted on 2019-12-23 09:30:08

Question


I have a string and want to remove any character that matches all of the following:

  • not in this list : ÀÁÂÃÈÉÊÌÍÒÓÔÕÙÚĂĐĨŨƠàáâãèéêìíòóôõùúăđĩũơƯĂẠẢẤẦẨẪẬẮẰẲẴẶẸẺẼỀỀỂ ưăạảấầẩẫậắằẳẵặẹẻẽềềểỄỆỈỊỌỎỐỒỔỖỘỚỜỞỠỢỤỦỨỪễệỉịọỏốồổỗộớờởỡợụủứừỬỮỰỲỴÝỶỸửữựỳỵỷỹ

  • not in [a-z 0-9 A-Z]

  • is not _ or whitespace.

Can anyone help me with this regex in PHP?


Answer 1:


Try this regular expression:

/[^a-z0-9A-Z_\sÀÁÂÃÈÉÊÌÍÒÓÔÕÙÚĂĐĨŨƠàáâãèéêìíòóôõùúăđĩũơƯĂẠẢẤẦẨẪẬẮẰẲẴẶẸẺẼỀỀỂưăạảấầẩẫậắằẳẵặẹẻẽềềểỄỆỈỊỌỎỐỒỔỖỘỚỜỞỠỢỤỦỨỪễệỉịọỏốồổỗộớờởỡợụủứừỬỮỰỲỴÝỶỸửữựỳỵỷỹ]/u

The u modifier makes PHP interpret both the pattern and the subject string as UTF-8.

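If that doesn’t work, try Unicode character properties such as \p{L} for letters, or the escape sequence \x{1234} for single Unicode characters and custom character ranges. Ă, Đ, Ĩ, Ũ, Ơ and Ư fall outside the two ranges used here, so their code points are listed individually, and \s keeps whitespace:

/[^a-zA-Z0-9_\s\x{00C0}-\x{00FF}\x{0102}\x{0103}\x{0110}\x{0111}\x{0128}\x{0129}\x{0168}\x{0169}\x{01A0}\x{01A1}\x{01AF}\x{01B0}\x{1EA0}-\x{1EFF}]/u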
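
As a minimal sketch of how a range-based pattern might be applied with preg_replace: the helper name strip_non_vietnamese is hypothetical, and the code points for Ă, Đ, Ĩ, Ũ, Ơ and Ư are spelled out because they lie outside U+00C0–U+00FF and U+1EA0–U+1EFF. These ranges are broader than the list in the question (they also admit characters such as Ä or ×), so treat this as an approximation.

```php
<?php
// Sketch only: strip_non_vietnamese() is a hypothetical helper name.
function strip_non_vietnamese(string $text): string
{
    $pattern = '/[^a-zA-Z0-9_\s'
        . '\x{00C0}-\x{00FF}'                 // Latin-1 letters such as À, é, ô, ý (also × and ÷)
        . '\x{0102}\x{0103}\x{0110}\x{0111}'  // Ă ă Đ đ
        . '\x{0128}\x{0129}\x{0168}\x{0169}'  // Ĩ ĩ Ũ ũ
        . '\x{01A0}\x{01A1}\x{01AF}\x{01B0}'  // Ơ ơ Ư ư
        . '\x{1EA0}-\x{1EFF}'                 // Latin Extended Additional: Ạ through ỹ
        . ']/u';

    // preg_replace returns null on error (for example, invalid UTF-8 input).
    return preg_replace($pattern, '', $text) ?? '';
}

echo strip_non_vietnamese("Xin chào, Việt Nam! (quận 1)"), "\n";
```

The example assumes the input uses precomposed characters; punctuation such as the comma, exclamation mark and parentheses is removed, while letters, digits, underscores and whitespace are kept.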



Answer 2:


Be careful. Vietnamese Unicode characters may be "decomposed" into "combining characters", with one code point for the base character and one or more code points for additional diacritics, or they may be "precomposed" into single Unicode code points. Combining diacritics won't work as expected inside a regular expression character class [], since each one will match no matter which base character it combines with.

Older versions of Unicode did not contain the full set of Vietnamese precomposed characters so expect to find Vietnamese with combining characters in the wild. You can convert combining characters into precomposed characters using Unicode normalization form C, NFC.
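
The difference between the two forms can be sketched in PHP as follows. This assumes PHP 7 or later (for the "\u{...}" string escapes) and the intl extension, which provides the Normalizer class; the variable names are only illustrative.

```php
<?php
// Assumes PHP 7+ and the intl extension (Normalizer class).

$precomposed = "\u{1EBF}";            // ế as a single code point
$decomposed  = "e\u{0302}\u{0301}";   // e + combining circumflex + combining acute

// The two strings render the same but are different code point sequences.
var_dump($precomposed === $decomposed);          // bool(false)
var_dump(preg_match_all('/./u', $decomposed));   // int(3)

// Normalization form C composes the sequence into the single code point.
$nfc = Normalizer::normalize($decomposed, Normalizer::FORM_C);
var_dump($nfc === $precomposed);                 // bool(true)
```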




Answer 3:


The regexes above lack ế; also, ă and ề are duplicated.
List of correct Vietnamese characters: àáãạảăắằẳẵặâấầẩẫậèéẹẻẽêềếểễệđìíĩỉịòóõọỏôốồổỗộơớờởỡợùúũụủưứừửữựỳỵỷỹýÀÁÃẠẢĂẮẰẲẴẶÂẤẦẨẪẬÈÉẸẺẼÊỀẾỂỄỆĐÌÍĨỈỊÒÓÕỌỎÔỐỒỔỖỘƠỚỜỞỠỢÙÚŨỤỦƯỨỪỬỮỰỲỴỶỸÝ
Also, remember to normalize the string to NFC before testing it with the regex: string.normalize('NFC') in JavaScript, or Normalizer::normalize($string, Normalizer::FORM_C) from the intl extension in PHP.
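
Putting the two steps together in PHP might look like the sketch below. The function name clean_vietnamese is hypothetical, the intl extension is assumed to be installed, and the letter list is the one given in this answer.

```php
<?php
// Sketch only: clean_vietnamese() is a hypothetical helper; requires the intl extension.
function clean_vietnamese(string $text): string
{
    $letters = 'àáãạảăắằẳẵặâấầẩẫậèéẹẻẽêềếểễệđìíĩỉịòóõọỏôốồổỗộơớờởỡợùúũụủưứừửữựỳỵỷỹý'
             . 'ÀÁÃẠẢĂẮẰẲẴẶÂẤẦẨẪẬÈÉẸẺẼÊỀẾỂỄỆĐÌÍĨỈỊÒÓÕỌỎÔỐỒỔỖỘƠỚỜỞỠỢÙÚŨỤỦƯỨỪỬỮỰỲỴỶỸÝ';

    // Normalize the list as well, so it is precomposed however this file was saved.
    $letters = (string) Normalizer::normalize($letters, Normalizer::FORM_C);

    // Compose any decomposed input; normalize() returns false on failure.
    $text = Normalizer::normalize($text, Normalizer::FORM_C);
    if ($text === false) {
        return '';
    }

    // Remove everything except ASCII letters, digits, underscore, whitespace
    // and the Vietnamese letters listed above.
    return preg_replace('/[^a-zA-Z0-9_\s' . $letters . ']/u', '', $text) ?? '';
}

echo clean_vietnamese("Tiếng Việt: rất hay!"), "\n";
```

Normalizing before matching means that input typed with combining diacritics is composed first, so its accents are not stripped by a character class that only contains precomposed letters.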




Answer 4:


// Remove everything except ASCII letters, digits, underscore, whitespace and Vietnamese letters.
$newtext = preg_replace('/[^a-z0-9A-Z_[:space:]ÀÁÂÃÈÉÊÌÍÒÓÔÕÙÚĂĐĨŨƠàáâãèéêìíòóôõùúăđĩũơƯẠẢẤẦẨẪẬẮẰẲẴẶẸẺẼỀẾỂưạảấầẩẫậắằẳẵặẹẻẽềếểỄỆỈỊỌỎỐỒỔỖỘỚỜỞỠỢỤỦỨỪễệỉịọỏốồổỗộớờởỡợụủứừỬỮỰỲỴÝỶỸửữựỳỵýỷỹ]/u', '', $text);


Source: https://stackoverflow.com/questions/3819791/regex-in-vietnamese-characters