Question
I would like to be able to detect when the user:
- Inputs Japanese characters (Kanji or Kana)
- Inputs Roman characters (exclusively)
Currently I am using the ASCII range like this (C# syntax):
string searchKeyWord = Console.ReadLine();
var romajis = from c in searchKeyWord where c >= ' ' && c <= '~' select c;
if (romajis.Any())
{
// Romajis
}
else
{
// Japanese input
}
Is there a better, faster (stronger...) way to do this?
EDIT: the question can be generalized to any other language with a non-ASCII character set.
Answer 1:
Wikipedia lists the Unicode ranges for hiragana, katakana and kanji (in the top right corner of each article). We can use these ranges to refine your algorithm and also detect the other character sets.
private static IEnumerable<char> GetCharsInRange(string text, int min, int max)
{
return text.Where(e => e >= min && e <= max);
}
Usage:
var romaji = GetCharsInRange(searchKeyword, 0x0020, 0x007E);
var hiragana = GetCharsInRange(searchKeyword, 0x3040, 0x309F);
var katakana = GetCharsInRange(searchKeyword, 0x30A0, 0x30FF);
var kanji = GetCharsInRange(searchKeyword, 0x4E00, 0x9FFF);
Note that this should be as fast as yours, just a little nicer, in my opinion :)
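To answer the original yes/no question, the same ranges can be wrapped in predicates instead of collecting the matching characters. The following is only a minimal sketch: the names ScriptRanges, IsJapanese, IsRomaji, ContainsJapanese and IsRomajiOnly are hypothetical, and the kanji range assumes the whole CJK Unified Ideographs block (U+4E00 to U+9FFF).

```csharp
using System;
using System.Linq;

// Hypothetical helper class, not part of any library.
public static class ScriptRanges
{
    // True when c lies in [min, max], compared as a UTF-16 code unit.
    private static bool InRange(char c, int min, int max) => c >= min && c <= max;

    // Hiragana, katakana, or kanji (assumed: full CJK Unified Ideographs block).
    public static bool IsJapanese(char c) =>
        InRange(c, 0x3040, 0x309F) ||   // hiragana
        InRange(c, 0x30A0, 0x30FF) ||   // katakana
        InRange(c, 0x4E00, 0x9FFF);     // kanji

    // Printable ASCII, the same range the question uses.
    public static bool IsRomaji(char c) => InRange(c, 0x0020, 0x007E);

    // True as soon as one Japanese character is found.
    public static bool ContainsJapanese(string text) => text.Any(IsJapanese);

    // "Exclusively Roman": non-empty and every character is printable ASCII.
    public static bool IsRomajiOnly(string text) =>
        text.Length > 0 && text.All(IsRomaji);
}
```

Any stops at the first match and All stops at the first mismatch, so neither check needs to scan the whole string in the common case.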
Determining general language sets
Yes, you can detect sets of characters like that, but not really languages: French, German, etc. share a lot of characters with English, and Japanese shares a lot of kanji with Chinese. For many characters, you can't say that a single character belongs to a single language without a giant lookup chart.
There is also the fact that Japanese text uses English (and punctuation) quite a bit, so your method would consider anything that contains a romanised word or an emoticon to be romaji.
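One way to cope with mixed input is to look at which scripts are present rather than testing for a single ASCII character. The sketch below is an illustration only: InputKind and InputClassifier are hypothetical names, ASCII punctuation and digits are deliberately ignored, and a kanji-only string is reported as Japanese even though it could just as well be Chinese.

```csharp
using System;
using System.Linq;

// Hypothetical result type for this sketch.
public enum InputKind { Empty, Romaji, Japanese, Mixed, Other }

public static class InputClassifier
{
    // Hiragana and katakana blocks are adjacent: U+3040 to U+30FF.
    private static bool IsKana(char c) => c >= 0x3040 && c <= 0x30FF;

    // Assumed kanji range: the CJK Unified Ideographs block.
    private static bool IsKanji(char c) => c >= 0x4E00 && c <= 0x9FFF;

    // Only letters count as Roman input; punctuation and digits are neutral.
    private static bool IsAsciiLetter(char c) =>
        (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');

    public static InputKind Classify(string text)
    {
        if (string.IsNullOrEmpty(text)) return InputKind.Empty;

        bool hasJapanese = text.Any(c => IsKana(c) || IsKanji(c));
        bool hasLatin = text.Any(IsAsciiLetter);

        if (hasJapanese && hasLatin) return InputKind.Mixed;
        if (hasJapanese) return InputKind.Japanese; // may also be Chinese if kanji only
        if (hasLatin) return InputKind.Romaji;
        return InputKind.Other; // digits, punctuation, emoticons, other scripts
    }
}
```

With this approach, a Japanese sentence followed by an ASCII emoticon is still classified as Japanese, and a string that contains both a romanised word and kana is reported as Mixed, so the caller can decide how to treat it.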
Source: https://stackoverflow.com/questions/15805859/detect-japanese-character-input-and-romajis-ascii