character-properties

Trim unicode whitespace in PHP 5.2

末鹿安然 submitted on 2019-11-27 07:50:54
How can I trim a string(6) " page", where the first whitespace character is a non-breaking space (bytes 0xC2 0xA0 in UTF-8)? I've tried trim() and preg_match('/^\s*(.*)\s*$/u', $key, $m);. Another question: how can I reliably copy these characters? They seem to be converted to "normal" spaces, which makes debugging hard.

preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u','',$str);

Anti Veeranna

PCRE Unicode properties can be used to achieve this. Here is the code that I played with, and it seems to do what you want: <?php function unicode_trim ($str) { return preg_replace('/^[\pZ\pC]+([\PZ\PC]*)[\pZ\pC]+$/u', '$1', $str); }
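For comparison, the same trimming idea can be sketched in Python. This is only an illustration under stated assumptions, not the PHP answer itself: Python's built-in re module has no \pZ or \pC classes, so the sketch checks each character's general category with unicodedata instead. The function name unicode_trim is reused from the snippet above purely for readability.

```python
import unicodedata


def unicode_trim(text: str) -> str:
    """Strip leading and trailing characters whose Unicode general
    category starts with Z (separators) or C (control, format, ...)."""

    def is_trimmable(ch: str) -> bool:
        return unicodedata.category(ch)[0] in ("Z", "C")

    start, end = 0, len(text)
    # Advance past trimmable characters at the front.
    while start < end and is_trimmable(text[start]):
        start += 1
    # Back up over trimmable characters at the end.
    while end > start and is_trimmable(text[end - 1]):
        end -= 1
    return text[start:end]


# U+00A0 NO-BREAK SPACE has category Zs, so it is removed.
print(repr(unicode_trim("\u00a0page")))
```

Unlike a single pattern that requires separators on both sides, this loop also handles strings that have trimmable characters on only one side, or inner spaces.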

Regex for names with special characters (Unicode)

青春壹個敷衍的年華 submitted on 2019-11-27 04:07:33
Okay, I have read about regex all day now and still don't understand it properly. What I'm trying to do is validate a name, but the functions I can find for this on the internet only use [a-zA-Z], leaving out characters that I need to accept too. I basically need a regex that checks that the name is at least two words and that it does not contain numbers or special characters like !"#¤%&/()=... ; however, the words can contain characters like æ, é, Â and so on. An example of an accepted name would be: "John Elkjærd" or "André Svenson". A non-accepted name would be: " Hans ", "H 4 nn 3
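One possible way to express the rule described above (at least two words, letters only, accented letters allowed) is sketched here in Python 3. The names NAME_RE and is_valid_name are hypothetical, and the sketch assumes words are separated by exactly one space and that hyphens and apostrophes are not allowed, which the question does not settle.

```python
import re
import unicodedata

# [^\W\d_] means "a word character that is neither a digit nor an
# underscore", which approximates "any Unicode letter" in Python 3.
NAME_RE = re.compile(r"[^\W\d_]+(?: [^\W\d_]+)+")


def is_valid_name(name: str) -> bool:
    """Return True if the name is two or more letter-only words."""
    # Compose characters such as "e" + combining acute into "é" first,
    # because combining marks on their own are not word characters.
    normalized = unicodedata.normalize("NFC", name)
    return NAME_RE.fullmatch(normalized) is not None


for candidate in ["John Elkjærd", "André Svenson", " Hans ", "J0hn Smith"]:
    print(repr(candidate), is_valid_name(candidate))
```

Using fullmatch rather than match avoids having to write the ^ and $ anchors and rejects leading or trailing spaces.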

Matching a Unicode “name” with a JavaScript Regular Expression

折月煮酒 submitted on 2019-11-27 02:59:01
Question: In JavaScript we can match individual Unicode code points or code point ranges by using Unicode escape sequences, e.g.: "A".match(/\u0041/) // => ["A"] "B".match(/[\u0041-\u007A]/) // => ["B"] But how could we create a regular expression to match a proper name which must include any Unicode "letter" using a JavaScript regular expression? Is there a range of letters? A special regex sequence or character class in JavaScript? Say my website must validate names that could be in latin based

Does \w match all alphanumeric characters defined in the Unicode standard?

眉间皱痕 submitted on 2019-11-27 02:47:46
Question: Does Perl's \w match all alphanumeric characters defined in the Unicode standard? For example, will \w match all (say) Chinese and Russian alphanumeric characters? I wrote a simple test script (see below) which suggests that \w does indeed match "as expected" for the non-ASCII alphanumeric characters I tested. But the testing is obviously far from exhaustive. #!/usr/bin/perl use utf8; binmode(STDOUT, ':utf8'); my @ok; $ok[0] = "abcdefghijklmnopqrstuvwxyz"; $ok[1] = "éèëáàåäöčśžłíżńęøáýąóæšćôı

Matching (e.g.) a Unicode letter with Java regexps

时光毁灭记忆、已成空白 submitted on 2019-11-27 02:41:27
Question: There are many questions and answers here on StackOverflow that assume a "letter" can be matched in a regexp by [a-zA-Z]. However, with Unicode there are many more characters that most people would regard as a letter (all the Greek letters, Cyrillic, and many more). Unicode defines many blocks, each of which may have "letters". The Java definition defines POSIX classes for things like alpha characters, but these are specified to work only with US-ASCII. The predefined character classes define

Regex - Unicode Properties Reference and Examples

ぐ巨炮叔叔 submitted on 2019-11-27 02:01:39
Question: I feel lost with the regex Unicode properties presented by RegexBuddy. I cannot distinguish between any of the Number properties, and the Math symbol property only seems to match + but not -, *, /, ^ for instance. Is there any documentation or reference with examples on regular-expression Unicode properties?

Answer 1: A list of Unicode properties can be found in http://www.unicode.org/Public/UNIDATA/PropList.txt. The properties for each character can be found in http://www.unicode.org/Public
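The behaviour described in the question can be inspected directly with a small Python sketch that prints each character's Unicode general category. The sample characters are my own choice for illustration; the helper name general_category is hypothetical.

```python
import unicodedata

# Three kinds of "number" plus the operators mentioned in the question.
SAMPLES = ["5", "\u2167", "\u00bd", "+", "-", "*", "/", "^"]


def general_category(ch: str) -> str:
    """Return the two-letter Unicode general category of one character."""
    return unicodedata.category(ch)


for ch in SAMPLES:
    # Nd = decimal digit, Nl = letter number, No = other number,
    # Sm = math symbol, Pd = dash punctuation, Po = other punctuation,
    # Sk = modifier symbol.
    print(ch, general_category(ch), unicodedata.name(ch))
```

Of the five operators, only "+" is in the Sm (math symbol) category; "-" is dash punctuation, "*" and "/" are other punctuation, and "^" is a modifier symbol, which is consistent with what the question observed.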

matching unicode characters in python regular expressions

最后都变了- submitted on 2019-11-27 01:25:28
I have read through the other questions on Stack Overflow, but I'm still no closer. Sorry if this has already been answered, but I didn't get anything proposed there to work. >>> import re >>> m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/xmas/xmas1.jpg') >>> print m.groupdict() {'tag': 'xmas', 'filename': 'xmas1.jpg'} All is well, then I try something with Norwegian characters in it (or something more Unicode-like): >>> m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/påske/øyfjell.jpg') >>> print m.groupdict() Traceback (most recent
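The session above is Python 2 (print statement), where \w on byte strings is ASCII-only unless a Unicode flag and Unicode strings are used; that is a likely cause here, though the truncated traceback does not confirm it. As a sketch under that assumption, the same pattern works on Python 3 str values, where \w is Unicode-aware by default. The names PATH_RE and parse_path are hypothetical, and the filename group is folded into a single character class for brevity.

```python
import re

# In Python 3, \w on a str pattern matches Unicode word characters.
PATH_RE = re.compile(r"^/by_tag/(?P<tag>\w+)/(?P<filename>[\w.,!#%{}()@]+)$")


def parse_path(path: str):
    """Return the named groups as a dict, or None if the path does not match."""
    match = PATH_RE.match(path)
    return match.groupdict() if match else None


print(parse_path("/by_tag/xmas/xmas1.jpg"))
print(parse_path("/by_tag/påske/øyfjell.jpg"))
```

Checking for None before calling groupdict() avoids an AttributeError when the pattern fails to match.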

Match any unicode letter?

自闭症网瘾萝莉.ら submitted on 2019-11-26 22:54:39
In .NET you can use \p{L} to match any letter; how can I do the same in Python? Namely, I want to match any uppercase, lowercase, and accented letters.

Python's re module doesn't support Unicode properties yet. But you can compile your regex using the re.UNICODE flag, and then the character class shorthand \w will match Unicode letters, too. Since \w will also match digits, you then need to subtract those from your character class, along with the underscore: [^\W\d_] will match any Unicode letter. >>> import re >>> r = re.compile(r'[^\W\d_]', re.U) >>> r.match('x') <_sre.SRE_Match object at
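A minimal sketch of the [^\W\d_] trick on Python 3, where str patterns are Unicode-aware without passing re.UNICODE. The names LETTER_RE and letters_only are hypothetical. Treat the class as an approximation of \p{L}: it is defined by subtraction from \w, so it is not guaranteed to coincide exactly with the Unicode Letter category.

```python
import re

# "Word character, but not a digit and not an underscore".
LETTER_RE = re.compile(r"[^\W\d_]")


def letters_only(text: str) -> str:
    """Keep only the characters matched by the letter approximation."""
    return "".join(LETTER_RE.findall(text))


print(letters_only("påske_2019 øy!"))
```

The double negation is what makes this work: [^\W] is equivalent to \w, and adding \d and _ inside the negated class removes digits and the underscore from it.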

How to determine if a character is a Chinese character

耗尽温柔 submitted on 2019-11-26 18:14:16
Question: How can I determine whether a character is a Chinese character using Ruby?

Answer 1: An interesting article on encodings in Ruby: http://blog.grayproductions.net/articles/bytes_and_characters_in_ruby_18 (it's part of a series; check the table of contents at the start of the article too). I haven't used Chinese characters before, but this seems to be the list supported by Unicode: http://en.wikipedia.org/wiki/List_of_CJK_Unified_Ideographs . Also note that it's a unified system including Japanese and
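The code-point approach hinted at in the answer can be sketched as a range check, shown here in Python rather than Ruby. The ranges below are an assumption on my part: they cover only the main CJK Unified Ideographs block and Extension A, and omit the later extension blocks. As the answer notes, these ideographs are unified, so a match does not prove the text is Chinese rather than Japanese.

```python
# Assumed ranges: CJK Unified Ideographs and CJK Extension A only.
CJK_RANGES = (
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
)


def is_cjk_ideograph(ch: str) -> bool:
    """Return True if the single character falls in one of CJK_RANGES."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in CJK_RANGES)


for ch in ["中", "a", "あ"]:
    print(ch, is_cjk_ideograph(ch))
```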

Regex and unicode

自闭症网瘾萝莉.ら submitted on 2019-11-26 16:27:16
Question: I have a script that parses the filenames of TV episodes (show.name.s01e02.avi, for example), grabs the episode name (from the www.thetvdb.com API) and automatically renames them into something nicer (Show Name - [01x02].avi). The script works fine until you try to use it on files that have Unicode show names (something I never really thought about, since all the files I have are English, so pretty much all of them fall within [a-zA-Z0-9'\-]). How can I allow the regular expressions to