n-gram name analysis in non-english languages (CJK, etc)
I'm working on deduping a database of people. For a first pass, I'm following a basic 2-step process to avoid an O(n^2) operation over the whole database, as described in the literature . First, I "block"- iterate over the whole dataset, and bin each record based on n-grams AND initials present in the name. Second, all the records per bin are compared using Jaro-Winkler to get a measure of the likelihood of their representing the same person. My problem- the names are Unicode. Some (though not many) of these names are in CJK (Chinese-Japanese-Korean) languages. I have no idea how to find word