character-properties

How to know the preferred display width (in columns) of Unicode characters?

我与影子孤独终老i submitted on 2019-11-28 17:44:49
Question: In different encodings of Unicode, for example UTF-16LE or UTF-8, a character may occupy 2 or 3 bytes. Many Unicode applications ignore the display width of Unicode characters and treat them as if they were all Latin letters. For example, in 80-column text, one line should contain 40 Chinese characters or 80 Latin letters, but most applications (like Eclipse, Notepad++, and all well-known text editors; I doubt there is any good exception) just count each Chinese character as 1 width as
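
One way to estimate column width is to look up each character's East Asian Width property. The following is a minimal Python sketch, not the method from the original thread; the function name `display_width` is hypothetical, and it assumes wide (`W`) and fullwidth (`F`) characters take two columns, combining marks take none, and everything else takes one.

```python
import unicodedata


def display_width(text: str) -> int:
    """Estimate how many fixed-width columns `text` occupies."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            # Combining marks are drawn over the previous character.
            continue
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            # Wide and fullwidth characters (e.g. CJK ideographs) take 2 columns.
            width += 2
        else:
            width += 1
    return width


print(display_width("hello"))
print(display_width("中文"))
```

This sketch counts control characters and ambiguous-width characters as one column, which may not match every terminal or editor.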

Regex to match all unicode quotation marks

落花浮王杯 submitted on 2019-11-28 14:32:05
Question: Is there a simple regular expression to match all Unicode quotes? Or does one have to hand-code it like this: quotes = ur"[\"'\u2018\u2019\u201c\u201d]" Thank you for reading. Brian Answer 1: Python's re module doesn't support Unicode properties, therefore you can't use the Pi and Pf properties, so I guess your solution is as good as it gets. You might also want to consider the "false quotation marks" that are sadly being used, the acute and grave accents ( ´ and ` ): \u00B4 and \u0060. Then there are guillemets ( « » ‹ › ), do you want those, too? Use \u00BB\u203A\u00AB\u2039 for those. Also, your command has a
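
The hand-coded class can be extended with the guillemets and accent characters mentioned in the answer. This is a sketch assuming Python 3 (where the `ur""` prefix from the question no longer exists); the names `QUOTES` and `find_quotes` are hypothetical, and the set is not a complete list of Unicode quotation marks.

```python
import re

# Straight quotes, curly quotes, guillemets, and the grave/acute accents
# that are sometimes misused as quotation marks.
QUOTES = re.compile(
    "[\"'\u2018\u2019\u201c\u201d\u00ab\u00bb\u2039\u203a\u0060\u00b4]"
)


def find_quotes(text: str) -> list:
    """Return every quote-like character found in `text`, in order."""
    return QUOTES.findall(text)


print(find_quotes("\u201chi\u201d \u00abx\u00bb it's"))
```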

Matching only a unicode letter in Python re

我是研究僧i submitted on 2019-11-28 09:39:24
I have a string from which I want to extract 3 groups: '19 janvier 2012' -> '19', 'janvier', '2012'. The month name could contain non-ASCII characters, so [A-Za-z] does not work for me: >>> import re >>> re.search(ur'(\d{,2}) ([A-Za-z]+) (\d{4})', u'20 janvier 2012', re.UNICODE).groups() (u'20', u'janvier', u'2012') >>> re.search(ur'(\d{,2}) ([A-Za-z]+) (\d{4})', u'20 février 2012', re.UNICODE).groups() Traceback (most recent call last): File "<stdin>", line 1, in <module> AttributeError: 'NoneType' object has no attribute 'groups' >>> I could use \w but it matches digits and underscore: >>> re
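
A common workaround with the standard `re` module is the class `[^\W\d_]`, which reads as "word characters, minus digits and underscore". The sketch below assumes Python 3, where `str` patterns are Unicode-aware by default; the name `DATE` is hypothetical, and the rest of the pattern is taken from the question.

```python
import re

# [^\W\d_] = any word character that is neither a digit nor an underscore,
# i.e. a letter in the sense of \w.
DATE = re.compile(r"(\d{,2}) ([^\W\d_]+) (\d{4})")

print(DATE.search("20 janvier 2012").groups())
print(DATE.search("20 février 2012").groups())
```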

Matching (e.g.) a Unicode letter with Java regexps

你。 submitted on 2019-11-28 09:10:20
There are many questions and answers here on Stack Overflow that assume a "letter" can be matched in a regexp by [a-zA-Z]. However, with Unicode there are many more characters that most people would regard as letters (all the Greek letters, Cyrillic, and many more). Unicode defines many blocks, each of which may contain "letters". The Java documentation defines POSIX classes for things like alpha characters, but those are specified to work only with US-ASCII. The predefined character classes define words to consist of [a-zA-Z_0-9], which also excludes many letters. So how do you properly match against

Regex - Unicode Properties Reference and Examples

[亡魂溺海] submitted on 2019-11-28 07:53:20
I feel lost with the regex Unicode properties presented by RegexBuddy. I cannot distinguish between any of the Number properties, and the Math symbol property only seems to match + but not - , * , / , ^ for instance. Is there any documentation / reference with examples on regular expressions Unicode properties? A list of Unicode properties can be found in http://www.unicode.org/Public/UNIDATA/PropList.txt . The properties for each character can be found in http://www.unicode.org/Public/UNIDATA/UnicodeData.txt (1.2 MB). In your case, + (PLUS SIGN) is Sm , - (HYPHEN-MINUS) is Pd , * (ASTERISK) is
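
The general category of a character can also be checked programmatically instead of searching UnicodeData.txt by hand. This is a small Python sketch using the standard `unicodedata` module; the helper name `categories` is hypothetical.

```python
import unicodedata


def categories(text: str) -> dict:
    """Map each character in `text` to its two-letter general category."""
    return {ch: unicodedata.category(ch) for ch in text}


for ch, cat in categories("+-*/^").items():
    # Print the character, its category, and its official Unicode name.
    print(repr(ch), cat, unicodedata.name(ch))
```

Only characters whose category is Sm are matched by a Math symbol property, which is why + matches and - (category Pd) does not.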

Replace Unicode Control Characters

北城余情 submitted on 2019-11-28 07:51:24
I need to replace all special control characters in a string in Java. I want to query the Google Maps API v3, and Google doesn't seem to like these characters. Example: http://www.google.com/maps/api/geocode/json?sensor=false&address=NEW%20YORK%C2%8F This URL contains this character: http://www.fileformat.info/info/unicode/char/008f/index.htm So I receive some data, and I need to geocode this data. I know some characters would not pass the geocoding, but I don't know the exact list. I was not able to find any documentation about this issue, so I think the list of characters that Google doesn't
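
The question asks about Java; as an illustration of the idea only, here is a Python sketch that drops every character whose general category is Cc (control) before URL-encoding. The function name `strip_control_chars` is hypothetical, and it is an assumption that the Cc category covers the characters the geocoding service rejects, since the source says the exact list is unknown.

```python
import unicodedata
from urllib.parse import quote


def strip_control_chars(text: str, replacement: str = "") -> str:
    """Replace every control character (general category Cc) in `text`."""
    return "".join(
        replacement if unicodedata.category(ch) == "Cc" else ch for ch in text
    )


address = strip_control_chars("NEW YORK\u008f")  # U+008F is a C1 control
print(quote(address))
```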

Scanning for Unicode Numbers in a string with \d

时光毁灭记忆、已成空白 submitted on 2019-11-28 03:25:05
Question: According to the Oniguruma documentation, the \d character type matches: decimal digit char Unicode: General_Category -- Decimal_Number. However, scanning for \d in a string with all the Decimal_Number characters results in only Latin 0-9 digits being matched: #encoding: utf-8 require 'open-uri' html = open("http://www.fileformat.info/info/unicode/category/Nd/list.htm").read digits = html.scan(/U\+([\da-f]{4})/i).flatten.map{ |s| s.to_i(16) }.pack('U*') puts digits.encoding, digits #=> UTF-8 #
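
For comparison only (this says nothing about Ruby or Oniguruma), the same Unicode-versus-ASCII distinction for `\d` can be seen in Python's `re` module, where `\d` on `str` patterns matches any Decimal_Number character unless the `re.ASCII` flag is given. The sample string is an assumption chosen for illustration.

```python
import re

# ASCII 5, ARABIC-INDIC DIGIT FIVE (U+0665), DEVANAGARI DIGIT FIVE (U+096B)
sample = "5\u0665\u096b"

unicode_digits = re.findall(r"\d", sample)          # Unicode-aware matching
ascii_digits = re.findall(r"\d", sample, re.ASCII)  # restricted to [0-9]

print(len(unicode_digits), len(ascii_digits))
```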

List of Unicode alphabetic characters

落花浮王杯 submitted on 2019-11-28 01:15:44
Question: I need the list of ranges of Unicode characters with the property Alphabetic as defined in http://www.unicode.org/Public/5.1.0/ucd/UCD.html#Alphabetic. However, I cannot find them in the Unicode Character Database no matter how I search for them. Can somebody provide a list of them or just a search facility for characters with specified Unicode properties? Answer 1: The Unicode Character Database comprises all the text files in the distribution. It is not just a single file as it once was long ago
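
Property files in the Unicode Character Database share a simple line format: a code point or `start..end` range, a semicolon, the property name, and a `#` comment. The sketch below parses that format in Python. It rests on an assumption not stated in the excerpt above, namely that Alphabetic is listed in the UCD file DerivedCoreProperties.txt; the function name `parse_property_ranges` and the `SAMPLE` lines are illustrative only.

```python
def parse_property_ranges(lines, prop):
    """Collect (start, end) code point ranges for `prop` from UCD-style lines."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != prop:
            continue
        start, _, end = fields[0].partition("..")
        # A single code point is stored as a one-element range.
        ranges.append((int(start, 16), int(end or start, 16)))
    return ranges


# Illustrative sample in the UCD line format (not the full file).
SAMPLE = [
    "# comment line",
    "0041..005A    ; Alphabetic # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
    "00AA          ; Alphabetic # Lo       FEMININE ORDINAL INDICATOR",
    "0061..007A    ; Lowercase # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z",
]

print(parse_property_ranges(SAMPLE, "Alphabetic"))
```

To process the real data, the same function could be fed the lines of a downloaded copy of the file.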

How to determine if a character is a Chinese character

强颜欢笑 submitted on 2019-11-27 13:20:19
How can I determine whether a character is a Chinese character using Ruby? An interesting article on encodings in Ruby: http://blog.grayproductions.net/articles/bytes_and_characters_in_ruby_18 (it's part of a series; also check the table of contents at the start of the article). I haven't used Chinese characters before, but this seems to be the list supported by Unicode: http://en.wikipedia.org/wiki/List_of_CJK_Unified_Ideographs . Also note that it's a unified system including Japanese and Korean characters (some characters are shared between them) - not sure if you can distinguish which are
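
The question is about Ruby; the range check itself is language-independent, so here is a Python sketch of it. The function name `is_cjk_ideograph` is hypothetical, and the check assumes only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF); the extension blocks are not covered. Because the block is unified, it cannot tell a Chinese character from a Japanese kanji.

```python
def is_cjk_ideograph(ch: str) -> bool:
    """Return True if `ch` lies in the basic CJK Unified Ideographs block."""
    # ord() raises TypeError unless `ch` is exactly one character.
    return 0x4E00 <= ord(ch) <= 0x9FFF


print(is_cjk_ideograph("中"))
print(is_cjk_ideograph("a"))
```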

Regex to match all unicode quotation marks

我的梦境 submitted on 2019-11-27 08:36:56
Question: Is there a simple regular expression to match all Unicode quotes? Or does one have to hand-code it like this: quotes = ur"[\"'\u2018\u2019\u201c\u201d]" Thank you for reading. Brian Answer 1: Python's re module doesn't support Unicode properties, therefore you can't use the Pi and Pf properties, so I guess your solution is as good as it gets. You might also want to consider the "false quotation marks" that are sadly being used, the acute and grave accents ( ´ and ` ): \u00B4 and \u0060. Then there are