character-properties

Trim unicode whitespace in PHP 5.2

一世执手 posted on 2019-12-17 09:33:21
Question: How can I trim a string(6) " page", where the first whitespace character is a non-breaking space (bytes 0xC2 0xA0 in UTF-8)? I've tried trim() and preg_match('/^\s*(.*)\s*$/u', $key, $m);. Another question: how can I reliably copy these characters? They seem to be converted to "normal" spaces, which makes debugging hard.

Answer 1: preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u','',$str);

Answer 2: PCRE Unicode properties can be used to achieve this. Here is the code that I played with, and it seems to do what you want: <?php

Python and regular expression with Unicode

◇◆丶佛笑我妖孽 posted on 2019-12-16 22:09:34
Question: I need to delete some Unicode symbols from the string 'بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ'. I know for sure that they exist here. I tried: re.sub('([\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+)', '', 'بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ') but it doesn't work. The string stays the same. What am I doing wrong?

Answer 1: Are you using Python 2.x or 3.0? If you're using 2.x, try making the regex string a unicode-escape string, with 'u'. Since it's regex it's good practice to make your regex string a
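To illustrate the point in the answer, here is a minimal sketch assuming Python 3, where every string literal is already Unicode and the `\uXXXX` escapes are interpreted without a `u` prefix. The character ranges are copied from the question; the names `DIACRITICS` and `strip_diacritics` are made up for this example.

```python
import re

# Character ranges copied from the question (Arabic diacritics and related marks).
# In Python 3 the \uXXXX escapes in a normal string literal are real code points.
DIACRITICS = re.compile("[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+")


def strip_diacritics(text: str) -> str:
    """Return text with every character in the ranges above removed."""
    return DIACRITICS.sub("", text)


# The first word of the question's string, spelled with escapes so that the
# combining marks are visible: beh, kasra, seen, sukun, meem, kasra.
sample = "\u0628\u0650\u0633\u0652\u0645\u0650"
print(strip_diacritics(sample))
```

In Python 2 the same pattern written as a plain `'...'` byte string leaves `\u064B` as a literal backslash sequence, which is why the answer suggests the `u` prefix there.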

Regular expression in Java that takes as input alphanumeric followed by forward slash and then again alphanumeric

痞子三分冷 posted on 2019-12-13 02:06:10
Question: I need a regular expression that accepts alphanumeric characters followed by a forward slash and then alphanumeric characters again. How do I write a regular expression for this in Java? An example input is: adc9/fer4. I tried the following: String s = "abc9/ferg5"; String pattern="^[a-zA-Z0-9_]+/[a-zA-z0-9_]*$"; if(s.matches(pattern)) { return true; } The problem is that it accepts all strings of the form abc9/ without checking what comes after the forward slash.

Answer 1: Reference: http:/
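The behaviour described in the question follows from the `*` after the second character class, which allows zero characters after the slash. The sketch below shows the same idea in Python rather than Java (an assumption made so it can be run quickly); the regex itself carries over. It uses `+` on both sides, writes the range as `A-Z` instead of `A-z`, and drops the underscore because the question asks for alphanumerics only, which is my reading and not something the asker confirmed. `is_valid` is a hypothetical helper name.

```python
import re

# "+" on both sides: at least one alphanumeric is required after the slash.
# The range is A-Z; the original A-z also admits "[", "\", "]", "^", "_" and "`".
PATTERN = re.compile(r"[A-Za-z0-9]+/[A-Za-z0-9]+")


def is_valid(s: str) -> bool:
    """True only if the whole string is alphanumerics, a slash, alphanumerics."""
    # fullmatch mirrors Java's String.matches, which also tests the entire string.
    return PATTERN.fullmatch(s) is not None


for candidate in ("adc9/fer4", "abc9/", "/fer4"):
    print(candidate, is_valid(candidate))
```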

How do I get a list of all Unicode characters that have a given property?

送分小仙女□ posted on 2019-12-07 02:57:39
Question: Without looping over the entire range of Unicode characters, how can I get a list of characters that have a given property? In particular, I want a list of all characters that are digits (i.e. those that match /\d/). I have looked at Unicode::UCD, and it is useful for determining the properties of a given character, but there doesn't seem to be a way to get a list of characters that have a property out of it.

Answer 1: The list of Unicode characters for each class is generated from the Unicode spec

Match unicode in ply's regexes

非 Y 不嫁゛ posted on 2019-12-06 04:08:08
Question: I'm matching identifiers, but now I have a problem: my identifiers are allowed to contain Unicode characters. Therefore the old way of doing things is not enough: t_IDENTIFIER = r"[A-Za-z](\\.|[A-Za-z_0-9])*" In my markup language parser I match Unicode characters by allowing all characters except those I explicitly use, because my markup language only has two or three characters I need to escape that way. How do I match all Unicode characters with Python regexes and ply? Also is this a

POSIX character equivalents in Java regular expressions

て烟熏妆下的殇ゞ posted on 2019-12-05 10:40:30
I would like to use a regular expression like this in Java: [[=a=][=e=][=i=]]. But Java doesn't support the POSIX classes [=a=], [=e=], etc. How can I do this? More precisely, is there a way to not use US-ASCII?

Java does support POSIX character classes. The syntax is just different, for instance: \p{Lower} \p{Upper} \p{ASCII} \p{Alpha} \p{Digit} \p{Alnum} \p{Punct} \p{Graph} \p{Print} \p{Blank} \p{Cntrl} \p{XDigit} \p{Space} Quoting from http://download.oracle.com/javase/1.6.0/docs/api/java/util/regex/Pattern.html POSIX character classes (US-ASCII only) \p{Lower} A lower-case alphabetic

How do I get a list of all Unicode characters that have a given property?

*爱你&永不变心* posted on 2019-12-05 08:30:54
Without looping over the entire range of Unicode characters, how can I get a list of characters that have a given property? In particular, I want a list of all characters that are digits (i.e. those that match /\d/). I have looked at Unicode::UCD, and it is useful for determining the properties of a given character, but there doesn't seem to be a way to get a list of characters that have a property out of it.

The list of Unicode characters for each class is generated from the Unicode spec when you compile Perl, and is typically stored in /usr/lib/perl-YOURPERLVERSION/unicore/lib/gc_sc/ For

Match unicode in ply's regexes

跟風遠走 posted on 2019-12-04 10:13:21
I'm matching identifiers, but now I have a problem: my identifiers are allowed to contain Unicode characters. Therefore the old way of doing things is not enough: t_IDENTIFIER = r"[A-Za-z](\\.|[A-Za-z_0-9])*" In my markup language parser I match Unicode characters by allowing all characters except those I explicitly use, because my markup language only has two or three characters I need to escape that way. How do I match all Unicode characters with Python regexes and ply? Also, is this a good idea at all? I'd want to let people use identifiers like Ω » « ° foo² väli π as an identifiers
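One possible direction, sketched with the standard `re` module only and assuming Python 3: for `str` patterns, `\w` already matches Unicode letters and digits, so an identifier can be described as "a word character that is not a decimal digit, followed by any word characters". The name `IDENTIFIER` is made up here, and whether the same pattern string behaves identically when assigned to `t_IDENTIFIER` in ply is an untested assumption. It also does not cover symbols such as » « °, which are not word characters and would need the "everything except the reserved characters" approach from the question.

```python
import re

# [^\W\d] = any word character that is not a decimal digit (first character);
# \w*     = any further word characters, Unicode-aware for str patterns in Python 3.
IDENTIFIER = re.compile(r"[^\W\d]\w*")


def is_identifier(text: str) -> bool:
    """True if the whole string is a single identifier under the pattern above."""
    return IDENTIFIER.fullmatch(text) is not None


for candidate in ("väli", "π", "Ω", "9abc", "foo bar"):
    print(candidate, is_identifier(candidate))
```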

Regular expression to match boundary between different Unicode scripts

我与影子孤独终老i posted on 2019-12-04 01:04:16
Question: Regular expression engines have a concept of "zero-width" matches, some of which are useful for finding the edges of words: \b, present in most engines, matches any boundary between word and non-word characters; \< and \>, present in Vim, match only the boundary at the beginning of a word and at the end of a word, respectively. A newer concept in some regular expression engines is Unicode classes. One such class is script, which can distinguish Latin, Greek, Cyrillic, etc. These examples are

How to validate both Chinese (unicode) and English name?

情到浓时终转凉″ posted on 2019-12-03 05:16:07
Question: I have a multilingual website (Chinese and English). I would like to validate a text field (the name field) in JavaScript. I have the following code so far: var chkName = /^[characters]{1,20}$/; if( chkName.test("[name value goes here]") ){ alert("validated"); } The problem is that /^[characters]{1,20}$/ only matches English characters. Is it possible to match ANY (including Unicode) characters? I used to use the following regex, but I don't want to allow spaces between the characters: /^(.+){1,20}$/

Answer 1
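As a rough illustration of the requirement (any characters, Chinese or English, 1 to 20 of them, no spaces), here is a sketch in Python rather than JavaScript; that switch is an assumption made for the sake of a runnable example. It relies on `\S`, which matches any non-whitespace character including CJK ideographs. `is_valid_name` is a hypothetical helper name.

```python
import re

# \S matches any character that is not whitespace, so Chinese and Latin
# letters are both accepted while spaces are rejected. {1,20} bounds the length.
NAME = re.compile(r"\S{1,20}")


def is_valid_name(name: str) -> bool:
    """True if name is 1 to 20 non-whitespace characters and nothing else."""
    return NAME.fullmatch(name) is not None


for candidate in ("张伟", "Alice", "A B", ""):
    print(repr(candidate), is_valid_name(candidate))
```

Note that this also lets digits and punctuation through; restricting the class to specific letter ranges would be a stricter variant.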