CJK

N-gram name analysis in non-English languages (CJK, etc.)

不羁岁月 submitted on 2019-12-03 16:33:34
I'm working on deduping a database of people. For a first pass, I'm following a basic two-step process to avoid an O(n^2) operation over the whole database, as described in the literature. First, I "block": I iterate over the whole dataset and bin each record based on the n-grams AND initials present in the name. Second, all the records within a bin are compared using Jaro-Winkler to get a measure of the likelihood that they represent the same person. My problem: the names are Unicode. Some (though not many) of these names are in CJK (Chinese-Japanese-Korean) languages. I have no idea how to find word…
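For the blocking step, character n-grams sidestep the missing word boundaries in CJK names. Below is a minimal Python sketch of that idea; the function name and the bigram choice are illustrative assumptions, not something from the original question.

def blocking_keys(name, n=2):
    # Character n-grams as blocking keys: CJK names have no spaces,
    # so character bigrams stand in for word-level tokens.
    name = name.replace(" ", "")
    if len(name) < n:
        return {name}
    return {name[i:i + n] for i in range(len(name) - n + 1)}

print(blocking_keys("山田太郎"))   # {'山田', '田太', '太郎'}
print(blocking_keys("tanaka"))    # {'ta', 'an', 'na', 'ak', 'ka'}

The same keys then feed the Jaro-Winkler comparison within each bin, regardless of script.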

Detecting CJK characters in a string (C#)

最后都变了- submitted on 2019-12-03 14:12:41
I am using iTextSharp to generate a series of PDFs, using Open Sans as the default font. On occasion, names are inserted into the content of the PDFs. However, my issue is that some of the names I need to insert contain CJK characters (stored in nvarchar columns in SQL Server), and as far as I know Open Sans does not support CJK characters at present. I need to keep using Open Sans as my default font, so ideally I would like to detect CJK characters in the strings being grabbed from the database and switch to a CJK font when printing out those characters. Would a regex be the best bet…
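The usual approach is a regex (or a loop) over the relevant Unicode blocks. Here is a rough sketch of that range check in Python rather than C#; the block list is a common but non-exhaustive assumption (it omits, for example, CJK Extension B above U+FFFF), and a .NET character class would use the same ranges.

import re

CJK_RE = re.compile(
    r"[\u3040-\u30ff"    # Hiragana and Katakana
    r"\u3400-\u4dbf"     # CJK Extension A
    r"\u4e00-\u9fff"     # CJK Unified Ideographs
    r"\uac00-\ud7af"     # Hangul syllables
    r"\uf900-\ufaff]"    # CJK Compatibility Ideographs
)

def contains_cjk(text):
    return CJK_RE.search(text) is not None

print(contains_cjk("John Smith"))   # False
print(contains_cjk("John 田中"))     # True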

How does tokenization and pattern matching work in Chinese?

风流意气都作罢 submitted on 2019-12-03 12:11:20
This question involves computing as well as knowledge of Chinese. I have Chinese queries and a separate list of phrases in Chinese, and I need to be able to find which of these queries contain any of these phrases. In English, this is a very simple task. I don't understand Chinese at all, its semantics, grammar rules, etc., so I'd appreciate it if somebody in this forum who also understands Chinese could help me with some basic understanding of how pattern matching is done for Chinese. I have a basic perception that in Chinese one unit (without any space in between) can actually mean more than one word (Is this…
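For the exact-phrase case described here, no word segmentation is needed: Chinese text has no spaces, so a plain substring test matches a phrase at any position. A minimal sketch with made-up query and phrase strings (segmentation libraries such as jieba only become relevant for looser, word-level matching):

queries = ["我想买一台新电脑", "今天天气很好"]
phrases = ["新电脑", "天气"]

for q in queries:
    hits = [p for p in phrases if p in q]   # simple substring containment
    print(q, "->", hits)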

Programmatically determine number of strokes in a Chinese character?

拜拜、爱过 submitted on 2019-12-03 11:30:26
Does Unicode store stroke count information about Chinese, Japanese, or other stroke-based characters? (Tim)

A little googling came up with Unihan.zip, a file published by the Unicode Consortium which contains several text files, including Unihan_RadicalStrokeCounts.txt, which may be what you want. There is also an online Unihan Database Lookup based on this data.

In Python there is a library for that:
>>> from cjklib.characterlookup import CharacterLookup
>>> cjk = CharacterLookup('C')
>>> cjk.getStrokeCount(u'日')
4
Disclaimer: I wrote it.

You mean, is it encoded somehow in the actual code point?…
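The Unihan data itself is also easy to consume directly. Below is a rough Python sketch that scans the unpacked Unihan text files for the kTotalStrokes field; the "Unihan" directory name is an assumption, and the exact file that carries kTotalStrokes varies between Unihan releases, which is why the sketch simply scans all of them.

import glob

def load_stroke_counts(unihan_dir):
    # Unihan data lines are tab-separated: U+XXXX <tab> field <tab> value.
    strokes = {}
    for path in glob.glob(unihan_dir + "/Unihan_*.txt"):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.startswith("#") or not line.strip():
                    continue
                codepoint, field, value = line.rstrip("\n").split("\t")[:3]
                if field == "kTotalStrokes":
                    char = chr(int(codepoint[2:], 16))
                    strokes[char] = int(value.split()[0])
    return strokes

counts = load_stroke_counts("Unihan")   # path to the unzipped Unihan.zip
print(counts.get("日"))                  # 4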

What is the encoding of Chinese characters on Wikipedia?

安稳与你 submitted on 2019-12-03 09:44:16
I was looking at the encoding of Chinese characters on Wikipedia and I'm having trouble figuring out what they are using. For instance, "的" is encoded as "%E7%9A%84" (see here). That's three bytes; however, none of the encodings described on this page uses three bytes to represent Chinese characters. UTF-8, for instance, uses 2 bytes. I'm basically trying to match these three bytes to an actual character. Any suggestion on what encoding it could be?

>>> c = '\xe7\x9a\x84'.decode('utf8')
>>> c
u'\u7684'
>>> print c
的

Though Unicode assigns it a 16-bit code point (U+7684), UTF-8 encodes it as 3 bytes. The header…
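In other words, the three percent-encoded bytes are just the UTF-8 encoding of U+7684, percent-escaped for the URL. A quick check in Python 3, the equivalent of the Python 2 session above:

from urllib.parse import unquote

print(unquote("%E7%9A%84"))    # 的 (unquote decodes percent-escapes as UTF-8)
print("的".encode("utf-8"))     # b'\xe7\x9a\x84', i.e. three bytes
print(hex(ord("的")))           # 0x7684, the code point U+7684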

Italic font does not work for Chinese/Japanese/Korean on iOS 7

為{幸葍}努か submitted on 2019-12-03 09:15:28
I want to set an italic font style in a UITextView, but the italic font just does not work for Chinese/Japanese/Korean on iOS 7. Could anyone help?

Because there are no italic-styled Chinese fonts on iOS, you need to use an affine transformation to slant the normally styled Chinese font. The code below gives a 15° slant to Heiti SC Medium:

CGAffineTransform matrix = CGAffineTransformMake(1, 0, tanf(15 * (CGFloat)M_PI / 180), 1, 0, 0);
UIFontDescriptor *desc = [UIFontDescriptor fontDescriptorWithName:@"Heiti SC Medium" matrix:matrix];
textView.font = [UIFont fontWithDescriptor:desc size:17];

Real effect:

I'm not…

Convert numbered pinyin to pinyin with tone marks

ⅰ亾dé卋堺 submitted on 2019-12-03 08:46:33
Question: Are there any scripts, libraries, or programs using Python, or Bash tools (e.g. awk, perl, sed), which can correctly convert numbered pinyin (e.g. dian4 nao3) to UTF-8 pinyin with tone marks (e.g. diàn nǎo)? I have found the following examples, but they require PHP or C#:

PHP: Convert numbered to accentuated Pinyin?
C#: Any libraries to convert number Pinyin to Pinyin with tone markings?

I have also found various online tools, but they cannot handle a large number of conversions.

Answer 1: I've…
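Rolling your own converter in Python is not much code: append the combining diacritic for the tone to the right vowel and normalise to NFC. The sketch below is my own illustration (the function names and regex are assumptions), following the standard placement rule: mark 'a' or 'e' if present, otherwise the 'o' of 'ou', otherwise the last vowel.

import re
import unicodedata

# Combining diacritics for tones 1-4; tone 5 (neutral) gets no mark.
TONE_MARKS = {"1": "\u0304", "2": "\u0301", "3": "\u030c", "4": "\u0300", "5": ""}

def mark_syllable(match):
    syllable, tone = match.group(1), match.group(2)
    syllable = syllable.replace("v", "ü").replace("V", "Ü")
    lower = syllable.lower()
    if "a" in lower:
        idx = lower.index("a")
    elif "e" in lower:
        idx = lower.index("e")
    elif "ou" in lower:
        idx = lower.index("o")
    else:
        idx = max(lower.rfind(v) for v in "iouü")   # last remaining vowel
    marked = syllable[:idx + 1] + TONE_MARKS[tone] + syllable[idx + 1:]
    return unicodedata.normalize("NFC", marked)     # compose e.g. a + caron into ǎ

def numbered_to_marked(text):
    return re.sub(r"([a-zA-ZüÜ]+)([1-5])", mark_syllable, text)

print(numbered_to_marked("dian4 nao3"))   # diàn nǎo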

Convert Unicode into a character with Ruby

ⅰ亾dé卋堺 submitted on 2019-12-03 07:35:59
Question: I found a dictionary of Chinese characters in Unicode. I'm trying to build a database of characters out of this dictionary, but I don't know how to convert a Unicode code point back into a character.

p "国".unpack("U*").first  # this gives the code point, 22269

How can I convert 22269 back into the character value, i.e. the opposite of the line above?

Answer 1:

[22269].pack('U*')  #=> "国" or "\345\233\275"

Edit: Works in 1.8.6+ (verified in 1.8.6, 1.8.7, and 1.9.2). In 1.8.x you get a three-byte string representing…
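For comparison (my own aside, not part of the original answer), the same code point round trip in Python uses ord and chr:

print(ord("国"))    # 22269, the Unicode code point U+56FD
print(chr(22269))   # 国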

How to classify Japanese characters as either kanji or kana?

泪湿孤枕 submitted on 2019-12-03 06:49:40
Question: Given the text below, how can I classify each character as kana or kanji?

誰か確認上記これらのフ

To get something like this:

誰 - kanji
か - kana
確 - kanji
認 - kanji
上 - kanji
記 - kanji
こ - kana
れ - kana
ら - kana
の - kana
フ - kana

(Sorry if I did it incorrectly.)

Answer 1: This functionality is built into the Character.UnicodeBlock class. Some examples of the Unicode blocks related to the Japanese language:

Character.UnicodeBlock.of('誰') == CJK_UNIFIED_IDEOGRAPHS
Character.UnicodeBlock.of('か') == HIRAGANA…
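The answer above is Java; the same block-range idea in Python is just a couple of range checks. The ranges used here (hiragana U+3040-309F, katakana U+30A0-30FF, and the main kanji blocks) are a common but simplified assumption.

def classify(ch):
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F or 0x30A0 <= cp <= 0x30FF:
        return "kana"        # hiragana or katakana
    if 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF:
        return "kanji"       # CJK Unified Ideographs (+ Extension A)
    return "other"

for ch in "誰か確認上記これらのフ":
    print(ch, "-", classify(ch))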

sort() for Japanese

落爺英雄遲暮 submitted on 2019-12-03 06:11:52
If I have set my current locale to Japanese, how can I make it so that Japanese characters always have higher preference than non-Japanese characters? For example, right now English characters will always appear before the Katakana characters. How can I reverse this effect? Sorry for not being very clear. As you can see here, the final results have Java, NVIDIA, and Windows ファイアウォール ranked as the first three, ahead of the Japanese entries. Is it possible to have those at the end?

Use usort() instead of sort() so you can define the comparison criteria your own way.

Try this simple method…
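The comparator only needs to check which script a string starts with. Here is a minimal sketch of that idea in Python rather than PHP; the Latin entries come from the question, the Japanese ones are made up for illustration, and a production version would combine this with proper locale-aware collation (e.g. via ICU).

def japanese_first(s):
    cp = ord(s[0]) if s else 0
    is_japanese = 0x3040 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF   # kana or kanji
    return (0 if is_japanese else 1, s)   # Japanese-script strings sort first

items = ["Java", "NVIDIA", "Windows ファイアウォール", "インターネット", "画面"]
print(sorted(items, key=japanese_first))
# ['インターネット', '画面', 'Java', 'NVIDIA', 'Windows ファイアウォール']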