Detect Japanese character input and “Romajis” (ASCII)

半世苍凉 submitted on 2019-12-23 13:08:40

Question


I would like to be able to detect when the user:

  1. Inputs Japanese characters (Kanji or Kana)
  2. Inputs Roman characters (exclusively)

Currently I am using the ASCII range like this (C# syntax):

string searchKeyWord = Console.ReadLine();
var romajis = from c in searchKeyWord where c >= ' ' && c <= '~' select c;

if (romajis.Any())
{
    // Romajis
}
else
{
    // Japanese input
}

Is there a better, faster (stronger...) way to do this?

EDIT: the question can be generalized to any other language with a non-ASCII character set.


Answer 1:


Wikipedia helpfully lists the Unicode ranges for hiragana, katakana and kanji in the top right corner of each article. We can use these to refine your algorithm and also pick out the other character sets.

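// Requires: using System.Collections.Generic; using System.Linq;
// Returns the characters of text whose code unit lies in [min, max].
private static IEnumerable<char> GetCharsInRange(string text, int min, int max)
{
    return text.Where(e => e >= min && e <= max);
}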

Usage:

var romaji = GetCharsInRange(searchKeyword, 0x0020, 0x007E);
var hiragana = GetCharsInRange(searchKeyword, 0x3040, 0x309F);
var katakana = GetCharsInRange(searchKeyword, 0x30A0, 0x30FF);
var kanji = GetCharsInRange(searchKeyword, 0x4E00, 0x9FFF); // CJK Unified Ideographs block ends at U+9FFF

Note that this should be as fast as yours, just a little nicer/better IMO :)
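To make the usage concrete, here is a minimal sketch that runs the helper over a made-up mixed string and counts what each range picks up. The class name `RangeDemo`, the `Extract` wrapper and the sample text are assumptions for illustration, and the kanji range is taken to be the whole CJK Unified Ideographs block (U+4E00 to U+9FFF).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RangeDemo
{
    // Keeps the characters whose UTF-16 code unit lies in [min, max].
    public static IEnumerable<char> GetCharsInRange(string text, int min, int max)
    {
        return text.Where(c => c >= min && c <= max);
    }

    // Hypothetical convenience wrapper: joins the matching characters into a string.
    public static string Extract(string text, int min, int max)
    {
        return new string(GetCharsInRange(text, min, max).ToArray());
    }

    public static void Main()
    {
        // Made-up input mixing all four character sets (separated by ASCII spaces).
        string sample = "Tokyo 東京 とうきょう トウキョウ";

        string romaji   = Extract(sample, 0x0020, 0x007E); // "Tokyo" plus the three spaces
        string hiragana = Extract(sample, 0x3040, 0x309F); // "とうきょう"
        string katakana = Extract(sample, 0x30A0, 0x30FF); // "トウキョウ"
        string kanji    = Extract(sample, 0x4E00, 0x9FFF); // "東京"

        // Print lengths rather than the text itself, so the output does not
        // depend on the console's encoding.
        Console.WriteLine($"romaji={romaji.Length} hiragana={hiragana.Length} " +
                          $"katakana={katakana.Length} kanji={kanji.Length}");
        // prints: romaji=8 hiragana=5 katakana=5 kanji=2
    }
}
```

Note that the space character (0x0020) counts as "romaji" here, which is why the first count is 8 rather than 5.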

Determining general language sets

Yes, you can detect sets of characters like that, but not really languages: French, German, etc. share a lot of characters with English, and Japanese shares a lot of kanji with Chinese. For many characters you can't say that a single character belongs to a single language without a giant lookup chart.
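A small sketch of that distinction, with hypothetical helper names: a range check tells you which script a character belongs to, not which language the text is in.

```csharp
using System;

public static class ScriptVersusLanguage
{
    // Printable ASCII, the range used for "romaji" in this thread.
    public static bool IsAscii(char c) => c >= 0x0020 && c <= 0x007E;

    // CJK Unified Ideographs block (U+4E00..U+9FFF).
    public static bool IsCjkIdeograph(char c) => c >= 0x4E00 && c <= 0x9FFF;

    public static void Main()
    {
        // 'é' (U+00E9) is a Latin letter used in French, yet it is not ASCII.
        Console.WriteLine(IsAscii('é'));         // prints: False

        // '中' (U+4E2D) is written in both Chinese and Japanese, so a match
        // says "ideograph", not "Japanese".
        Console.WriteLine(IsCjkIdeograph('中'));  // prints: True
    }
}
```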

There is also the fact that Japanese text uses English words (and ASCII punctuation) quite a bit, so your method would consider anything that contains a romanised word or an emoticon to be romaji.
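One way around this, sketched below under my own assumptions, is to test for Japanese characters first with `Any` and to call the input Roman only when `All` characters are in the ASCII range. The `InputKind` enum and the `InputClassifier` class are hypothetical names, not part of the question, and the kanji range is taken to be the whole CJK Unified Ideographs block.

```csharp
using System;
using System.Linq;

public enum InputKind { Empty, RomanOnly, ContainsJapanese, Other }

public static class InputClassifier
{
    private static bool InRange(char c, int min, int max) => c >= min && c <= max;

    // Printable ASCII, as in the question.
    public static bool IsRoman(char c) => InRange(c, 0x0020, 0x007E);

    public static bool IsJapanese(char c) =>
        InRange(c, 0x3040, 0x309F) ||  // hiragana
        InRange(c, 0x30A0, 0x30FF) ||  // katakana
        InRange(c, 0x4E00, 0x9FFF);    // CJK Unified Ideographs (kanji)

    public static InputKind Classify(string text)
    {
        if (string.IsNullOrEmpty(text)) return InputKind.Empty;

        // A single kana or kanji is enough to treat the input as Japanese,
        // even if it also contains Roman words or emoticons.
        if (text.Any(IsJapanese)) return InputKind.ContainsJapanese;

        // "Exclusively Roman" means every character is in the ASCII range.
        if (text.All(IsRoman)) return InputKind.RomanOnly;

        // Anything else, e.g. accented Latin letters or other scripts.
        return InputKind.Other;
    }

    public static void Main()
    {
        Console.WriteLine(Classify("sushi"));        // prints: RomanOnly
        Console.WriteLine(Classify("寿司 (sushi)"));  // prints: ContainsJapanese
        Console.WriteLine(Classify("café"));         // prints: Other
    }
}
```

Checking for Japanese first is a design choice: it treats mixed input as Japanese, which matches the point above that Japanese text routinely embeds Roman words. Swap the order of the two checks if you need a separate "mixed" case.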



Source: https://stackoverflow.com/questions/15805859/detect-japanese-character-input-and-romajis-ascii
