character-properties

Unicode block of a character in python

六眼飞鱼酱① submitted on 2019-12-03 04:35:19
Question: Is there a way to get the Unicode block of a character in Python? The unicodedata module doesn't seem to have what I need, and I couldn't find an external library for it. Basically, I need the same functionality as Character.UnicodeBlock.of() in Java. Answer 1: I couldn't find one either. Strange! Luckily, the number of Unicode blocks is quite manageably small. This implementation accepts a one-character Unicode string, just like the functions in unicodedata. If your inputs are mostly ASCII, this

How to validate both Chinese (unicode) and English name?

旧巷老猫 submitted on 2019-12-02 19:39:09
I have a multilingual website (Chinese and English). I'd like to validate a text field (a name field) in JavaScript. I have the following code so far: var chkName = /^[characters]{1,20}$/; if( chkName.test("[name value goes here]") ){ alert("validated"); } The problem is that /^[characters]{1,20}$/ only matches English characters. Is it possible to match ANY (including Unicode) characters? I used to use the following regex, but I don't want to allow spaces between characters: /^(.+){1,20}$/ Answer (searlea): You might check out Javascript + Unicode regexes and do some research to find exactly which ranges
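The check the question asks for (1 to 20 characters, letters from any script, no spaces) can be sketched without listing ranges by hand. This is a minimal sketch in Python rather than JavaScript; `is_valid_name` and its `max_len` parameter are hypothetical names, and it assumes "letter" means what `str.isalpha()` accepts.

```python
def is_valid_name(name: str, max_len: int = 20) -> bool:
    """Return True if name has 1..max_len characters and all are letters."""
    # str.isalpha() is True for letters of any script (general category L*),
    # so Chinese and English names both pass; spaces, digits and
    # punctuation fail.
    return 1 <= len(name) <= max_len and all(ch.isalpha() for ch in name)
```

For example, `is_valid_name("John")` passes and `is_valid_name("John Smith")` fails because of the space. Names containing hyphens or apostrophes would also be rejected, which may or may not be what a real form wants.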

Unicode block of a character in python

有些话、适合烂在心里 submitted on 2019-12-02 17:47:04
Is there a way to get the Unicode block of a character in Python? The unicodedata module doesn't seem to have what I need, and I couldn't find an external library for it. Basically, I need the same functionality as Character.UnicodeBlock.of() in Java. Answer (zaphod): I couldn't find one either. Strange! Luckily, the number of Unicode blocks is quite manageably small. This implementation accepts a one-character Unicode string, just like the functions in unicodedata. If your inputs are mostly ASCII, this linear search might even be faster than binary search using bisect or whatever. If I were submitting
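The answer's implementation is cut off in this excerpt, so the following is not zaphod's code. It is a minimal sketch of the binary-search variant the answer mentions, using `bisect` over a deliberately tiny, hand-picked table. The names `_BLOCKS`, `_STARTS` and `block` are hypothetical; a complete table would have to be generated from the Unicode Character Database file Blocks.txt.

```python
import bisect

# A small illustrative subset of Unicode blocks as (start, end, name),
# sorted by start. This is NOT the full block list.
_BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0180, 0x024F, "Latin Extended-B"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
]
_STARTS = [start for start, _end, _name in _BLOCKS]


def block(ch):
    """Return the block name for a one-character string, or None."""
    cp = ord(ch)
    # Index of the last block whose start is <= cp.
    i = bisect.bisect_right(_STARTS, cp) - 1
    if i >= 0:
        _start, end, name = _BLOCKS[i]
        if cp <= end:
            return name
    return None  # code point falls in a gap or outside the table
```

With this table, `block("a")` gives "Basic Latin" and a code point that falls between listed blocks (such as U+0300) gives `None`.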

Matching Unicode Dashes in Java Regular Expressions?

北战南征 submitted on 2019-12-01 04:45:03
I'm trying to craft a Java regular expression to split strings of the general format "foo - bar" into "foo" and "bar" using Pattern.split(). The "-" character may be one of several dashes: the ASCII '-', the em-dash, the en-dash, etc. I've constructed the following regular expression: private static final Pattern titleSegmentSeparator = Pattern.compile("\\s(\\x45|\\u8211|\\u8212|\\u8213|\\u8214)\\s"); which, if I'm reading the Pattern documentation correctly, should capture any of the unicode dashes or the ascii dash, when surrounded on both sides by whitespace. I'm using the pattern as
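One thing worth checking in a pattern like this is whether the escapes are hexadecimal. Both \x and \u escapes take hex digits, so \x45 is the letter E (the ASCII hyphen is decimal 45, hex 2D), and decimal 8211, 8212 and 8213 correspond to U+2013 (en dash), U+2014 (em dash) and U+2015 (horizontal bar). A minimal sketch of the same split in Python, with hypothetical names `_SEPARATOR` and `split_title`, and assuming only those four dash characters are wanted:

```python
import re

# Hyphen-minus (U+002D), en dash (U+2013), em dash (U+2014) and
# horizontal bar (U+2015), each with one whitespace character on both sides.
_SEPARATOR = re.compile(r"\s[\u002D\u2013\u2014\u2015]\s")


def split_title(text):
    """Split 'foo - bar' style strings on a whitespace-delimited dash."""
    return _SEPARATOR.split(text)
```

A dash without surrounding whitespace, as in "foo-bar", is left alone because the pattern requires whitespace on both sides.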

Regular expression to match boundary between different Unicode scripts

馋奶兔 submitted on 2019-12-01 04:18:53
Regular expression engines have a concept of "zero width" matches, some of which are useful for finding edges of words: \b - present in most engines to match any boundary between word and non-word characters \< and \> - present in Vim to match only the boundary at the beginning of a word, and at the end of a word, respectively. A newer concept in some regular expression engines is Unicode classes. One such class is script, which can distinguish Latin, Greek, Cyrillic, etc. These examples are all equivalent and match any character of the Greek writing system: \p{greek} \p{script=greek} \p
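Python's built-in `re` module has no `\p{script=...}` classes, so a script boundary cannot be written there as a zero-width pattern. As a rough illustration of the idea only, the sketch below approximates a character's script by the first word of its Unicode name and reports the positions where that value changes. The helper names are hypothetical, and the first-word heuristic is an assumption that is wrong for digits, punctuation, spaces and many other characters.

```python
import unicodedata


def rough_script(ch):
    """Approximate the script as the first word of the character name."""
    # e.g. "LATIN SMALL LETTER A" -> "LATIN", "GREEK SMALL LETTER ALPHA" -> "GREEK"
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def script_boundaries(text):
    """Return indexes i where text[i-1] and text[i] differ in rough script."""
    return [
        i
        for i in range(1, len(text))
        if rough_script(text[i - 1]) != rough_script(text[i])
    ]
```

For a string of three Latin letters followed by three Greek letters, the only boundary reported is index 3.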

Latin Characters check

大憨熊 submitted on 2019-11-30 18:27:28
Question: There are some similar questions out there, but none that are quite the same or that have an answer that works for me. I need a JavaScript function which validates whether a text field contains only valid Latin characters, so no Cyrillic or Chinese, just Latin; specifically: Basic Latin (excluding the C0 control characters), Latin-1 (excluding the C1 control characters), Latin Extended A, Latin Extended B and Latin Extended Additional. This set corresponds to Unicode code points U+0020 to U
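The range check described here can be sketched as a table of code point ranges. The version below is in Python rather than JavaScript, the names `_LATIN_RANGES` and `is_latin` are hypothetical, and the ranges are taken from the standard block boundaries for the five blocks named above, not from the truncated range in the question. Excluding DEL (U+007F) is an assumption.

```python
# Inclusive code point ranges for the blocks named in the question.
_LATIN_RANGES = [
    (0x0020, 0x007E),  # Basic Latin minus C0 controls (DEL also excluded: assumption)
    (0x00A0, 0x00FF),  # Latin-1 Supplement minus C1 controls
    (0x0100, 0x017F),  # Latin Extended-A
    (0x0180, 0x024F),  # Latin Extended-B
    (0x1E00, 0x1EFF),  # Latin Extended Additional
]


def is_latin(text):
    """Return True if every character falls in one of the Latin ranges."""
    return all(
        any(lo <= ord(ch) <= hi for lo, hi in _LATIN_RANGES) for ch in text
    )
```

Note that these blocks also contain digits, punctuation and the space character, so "Latin" here means "in these blocks", and an empty string passes trivially.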

How to know the preferred display width (in columns) of Unicode characters?

两盒软妹~` submitted on 2019-11-29 21:53:23
In different encodings of Unicode, for example UTF-16LE or UTF-8, a character may occupy 2 or 3 bytes. Many Unicode applications don't take the display width of Unicode characters into account and treat them all as if they were Latin letters. For example, an 80-column line should hold 40 Chinese characters or 80 Latin letters, but most applications (Eclipse, Notepad++, and every well-known text editor; I'd be surprised if there is a good exception) count each Chinese character as one column wide, the same as a Latin letter. This makes the resulting layout ugly and misaligned. For example, a tab-width of 8
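The column counting described above can be approximated from the East Asian Width property, which Python exposes through `unicodedata.east_asian_width`. This is a minimal sketch with a hypothetical function name `display_width`; it assumes wide ("W") and fullwidth ("F") characters take two columns, combining marks take none, and everything else takes one. Real terminals and fonts can differ, especially for ambiguous-width characters.

```python
import unicodedata


def display_width(text):
    """Estimate how many monospace columns text occupies."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks are drawn over the previous character
        # "W" (wide) and "F" (fullwidth) characters take two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width
```

Under these assumptions an 80-column line holds 80 Latin letters or 40 Chinese characters, matching the expectation in the question.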

Scanning for Unicode Numbers in a string with \\d

对着背影说爱祢 submitted on 2019-11-29 10:03:34
According to the Oniguruma documentation , the \d character type matches: decimal digit char Unicode: General_Category -- Decimal_Number However, scanning for \d in a string with all the Decimal_Number characters results in only latin 0-9 digits being matched: #encoding: utf-8 require 'open-uri' html = open("http://www.fileformat.info/info/unicode/category/Nd/list.htm").read digits = html.scan(/U\+([\da-f]{4})/i).flatten.map{ |s| s.to_i(16) }.pack('U*') puts digits.encoding, digits #=> UTF-8 #=> 0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨… p RUBY_DESCRIPTION, digits.scan(/
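For comparison with a different engine, the same distinction between "any Decimal_Number" and "ASCII 0-9 only" can be seen in Python 3's `re` module, where `\d` on a `str` pattern matches every character of general category Nd unless the `re.ASCII` flag is given. This is a sketch of that behaviour in Python, not an explanation of what Ruby or Oniguruma does; the sample string is an arbitrary choice.

```python
import re
import unicodedata

# ASCII 1 and 2, Arabic-Indic 3 and 4, Devanagari 5.
sample = "12\u0663\u0664\u096b"

# Default for str patterns: \d matches any Unicode decimal digit (Nd).
unicode_digits = re.findall(r"\d", sample)

# With re.ASCII, \d is restricted to [0-9].
ascii_digits = re.findall(r"\d", sample, flags=re.ASCII)

# Every character in the sample has general category Nd.
categories = {unicodedata.category(ch) for ch in sample}
```

Here `unicode_digits` holds all five characters while `ascii_digits` holds only "1" and "2".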

Is There a Way to Match Any Unicode non-Alphabetic Character?

[亡魂溺海] submitted on 2019-11-29 09:34:27
I have some documents that went through OCR conversion from PDF into HTML. Because of that, they wound up having lots of random Unicode punctuation where the converter messed up (e.g. ellipses). They also correctly have a bunch of non-English but still alphabetic characters, like é and Russian characters. Is there any way to make a regex that will match any Unicode alphabetic character (from alphabets of any language)? Or one that will only match non-alphabetic characters? Either one would be really helpful and awesome. I'm using Perl, if that changes anything. Thanks!
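The question is about Perl, but the idea of separating letters of any language from stray punctuation can be sketched in Python with `str.isalpha()`. The function name `non_alphabetic_chars` is hypothetical, and the sketch assumes whitespace should be left alone. Note that `isalpha()` tests the general category (letters, L*), which is close to but not identical to the Unicode Alphabetic property.

```python
def non_alphabetic_chars(text):
    """Return the characters that are neither letters nor whitespace."""
    # isalpha() is True for letters of any script, including accented
    # Latin letters and Cyrillic, so only punctuation, digits and symbols
    # are reported.
    return [ch for ch in text if not ch.isalpha() and not ch.isspace()]
```

Run on a string such as "café… Привет!", this reports only the ellipsis character and the exclamation mark.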

List of Unicode alphabetic characters

落爺英雄遲暮 submitted on 2019-11-29 07:34:08
I need the list of ranges of Unicode characters with the property Alphabetic as defined in http://www.unicode.org/Public/5.1.0/ucd/UCD.html#Alphabetic . However, I cannot find them in the Unicode Character Database no matter how I search for them. Can somebody provide a list of them, or just a search facility for characters with specified Unicode properties? Answer (tchrist): The Unicode Character Database comprises all the text files in the distribution. It is not just a single file as it once was long ago. The Alphabetic property is a derived property. You really do not want to use code point ranges
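Derived properties such as Alphabetic are published in the UCD text file DerivedCoreProperties.txt, one code point or range per line followed by the property name. If the ranges really are needed, they can be extracted with a few lines of parsing. The sketch below uses hypothetical names (`_LINE`, `property_ranges`) and runs on a handful of illustrative lines written in that layout, not on a verbatim copy of the file.

```python
import re

# "XXXX" or "XXXX..YYYY", then ";", then the property name.
_LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def property_ranges(lines, prop="Alphabetic"):
    """Collect inclusive (start, end) code point ranges for one property."""
    ranges = []
    for line in lines:
        m = _LINE.match(line)  # comment and blank lines do not match
        if m and m.group(3) == prop:
            start = int(m.group(1), 16)
            end = int(m.group(2) or m.group(1), 16)  # single code point
            ranges.append((start, end))
    return ranges


# Illustrative lines in the file's layout (assumption: simplified comments).
sample = [
    "# illustrative sample, not a verbatim copy of the file",
    "0041..005A    ; Alphabetic # LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
    "00B5          ; Alphabetic # MICRO SIGN",
    "0041..005A    ; Uppercase # LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
]
```

Calling `property_ranges(sample)` keeps the two Alphabetic entries and skips the comment line and the Uppercase entry. The same function could be fed the lines of a downloaded copy of the real file.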