Question
I'm working on deduping a database of people. For a first pass, I'm following a basic 2-step process to avoid an O(n^2) operation over the whole database, as described in the literature. First, I "block": I iterate over the whole dataset and bin each record based on the n-grams AND initials present in the name. Second, all the records in each bin are compared using Jaro-Winkler to get a measure of the likelihood that they represent the same person.
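The two-step process described above can be sketched roughly like this (names and the 0.85 threshold are made up for illustration; `difflib.SequenceMatcher.ratio` stands in for Jaro-Winkler, which you would get from a library such as jellyfish):

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def bigrams(name):
    """Character bigrams of a lowercased, space-free name, used as blocking keys."""
    s = name.lower().replace(" ", "")
    return {s[i:i + 2] for i in range(len(s) - 1)}

def block(records):
    """Step 1: map each blocking key to the set of record indices containing it."""
    bins = defaultdict(set)
    for idx, name in enumerate(records):
        for key in bigrams(name):
            bins[key].add(idx)
    return bins

def candidate_pairs(records, threshold=0.85):
    """Step 2: compare records only within shared bins, keeping likely duplicates.
    SequenceMatcher.ratio is a stand-in for Jaro-Winkler similarity here."""
    pairs = set()
    for members in block(records).values():
        for a, b in combinations(sorted(members), 2):
            if SequenceMatcher(None, records[a], records[b]).ratio() >= threshold:
                pairs.add((a, b))
    return pairs

records = ["John Smith", "Jon Smith", "Mary Jones"]
print(candidate_pairs(records))  # {(0, 1)}
```

The blocking step keeps the comparison count far below n^2, since only records sharing at least one bigram are ever compared.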
My problem: the names are Unicode. Some (though not many) of these names are in CJK (Chinese-Japanese-Korean) languages. I have no idea how to find word boundaries for something like initials in these languages. I have no idea whether n-gram analysis is valid on names in languages where names can be 2 characters. I also don't know if string edit-distance or other similarity metrics are valid in this context.
Any ideas from linguist programmers or native speakers?
Answer 1:
Some more information regarding Japanese:
When it comes to splitting the names into family name and given name, morphological analyzers like mecab (mentioned in @Holden's answer) basically work, but the level of accuracy will not be very high, because they will only get those names right that are in their dictionary (the statistical 'guessing' capabilities of mecab mostly relate to POS tags and in dealing with ambiguous dictionary entries, but if a proper noun is not in the dictionary, mecab will most of the time split it into individual characters, which is almost always wrong). To test this, I used a random list of names on the web (this one, which contains 113 people's names), extracted the names, removed whitespace from them and tested mecab using the IPAdic. It got approx. 21% of the names wrong.
'Proper' Japanese names, i.e. names of Japanese people, consist of a family name (most of the time 2, but sometimes 1 or 3, Kanji) and a given name (most of the time 1 or 2, sometimes 3 Kanji, but sometimes 2-5 Hiragana instead). There are no middle names and there is no concept of initials. You could improve the mecab output by (1) using a comprehensive dictionary of family names, which you could build from web resources, and (2) assuming the output is wrong whenever there are more than 2 elements, then using your self-made family name dictionary to recognise the family name part, and if that fails falling back to default splitting rules based on the number of characters. The latter will not always be accurate.
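The dictionary-plus-fallback idea could be sketched as follows (the family-name set here is a tiny hypothetical sample; a real one would be built from web resources as suggested above, and the fallback rule is a heuristic, not a guarantee):

```python
# Hypothetical miniature family-name dictionary; in practice this would be
# a comprehensive list scraped from web resources.
FAMILY_NAMES = {"高沢", "村上", "伊藤", "佐藤"}

def split_japanese_name(name):
    """Split a whitespace-free Japanese name into (family, given).

    Try the family-name dictionary first, preferring longer matches,
    then fall back to the default rule: family names are most often
    2 Kanji when the full name has 4+ characters."""
    for cut in range(min(3, len(name) - 1), 0, -1):  # try 3-, 2-, 1-char family names
        if name[:cut] in FAMILY_NAMES:
            return name[:cut], name[cut:]
    cut = 2 if len(name) >= 4 else 1  # fallback heuristic; often, not always, right
    return name[:cut], name[cut:]

print(split_japanese_name("村上春樹"))  # ('村上', '春樹')
```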
Of course foreign names can be represented in Japanese, too. Firstly, there are Chinese and Korean names, which are typically represented using Kanji, i.e. whatever splitting rules for Chinese or Korean you use can be applied more or less directly. Western as well as Arabic or Indian names are either represented using Latin characters (possibly full-width, though), or Katakana characters, often (but not always) using white space or a middle dot ・ between family name and given name. While for names of Japanese, Chinese or Korean people the order in Japanese representation will always be family name, then given name, the order for Western names is hard to predict.
Do you even need to split names into family and given part? For the purposes of deduplication / data cleansing, this should only be required if some of the possible duplicates appear in different order or with optional middle initials. None of this is possible in Japanese names (nor Chinese, nor Korean names for that matter). The only thing to keep in mind is that if you are given a Katakana string with spaces or middle dots in it, you are likely dealing with a Western name, in which case splitting at the space / middle dot is useful.
While splitting is probably not really required, you must take care of a number of other issues not mentioned in the previous answers:
5.1 Transliteration of foreign names. Depending on how your database was constructed, there may be situations that involve a Western name, say 'Obama', in one entry, and the Japanese Katakana representation 'オバマ' in a duplicate entry. Unfortunately, the mapping from Latin to Katakana is not straightforward, as Katakana tries to reflect the pronunciation of the name, which may vary depending on the language of origin and the accent of whoever pronounces it. E.g. somebody who hears the name 'Obama' for the first time may be tempted to represent it as 'オバーマ' to emphasize the long vowel in the middle. Solving this is not trivial and will never work perfectly accurately, but if you think it is important for your cleansing problem, let's address it in a separate question.
5.2 Kanji variation. Japanese names (as well as Japanese representations of some Chinese or Korean names) use Kanji that are considered traditional versions of modern Kanji. For example, many common family names contain 澤, which is a traditional version of 沢; the family name Takazawa may be written as 高沢 or 高澤. Usually only one is the correct variant used by any particular person of that name, but it is not uncommon that the wrong variant is used in a database entry. You should therefore definitely normalise traditional variants to modern variants before comparing names. This web page provides a mapping that is certainly not comprehensive, but is probably good enough for your purposes.
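Such a normalisation could be as simple as a translation table (only a few sample pairs are shown here; a fuller mapping would come from a resource like the one mentioned above):

```python
# Small sample mapping of traditional Kanji variants to their modern forms,
# e.g. 澤→沢 (as in 高澤/高沢), 齋→斎, 邊→辺. A real table would be larger.
KANJI_VARIANTS = str.maketrans({"澤": "沢", "齋": "斎", "邊": "辺"})

def normalize_kanji(name):
    """Replace traditional Kanji variants with their modern counterparts."""
    return name.translate(KANJI_VARIANTS)

print(normalize_kanji("高澤"))  # '高沢'
```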
5.3 Full-width and half-width variants. Both Latin and Katakana characters exist in full-width as well as half-width variants. Katakana is commonly full-width and Latin half-width, but there is no guarantee. You should normalise all Katakana to full-width and all Latin to half-width before comparing names.
5.4 Whitespace. Perhaps needless to say, but there are various versions of white space characters, which you also must normalise before comparing names. Moreover, in a pure Kanji sequence, I recommend removing all whitespace before comparing.
5.5 Hiragana vs. Katakana. As said, some first names (especially female ones) are written in Hiragana, and it may happen that those same names are written in Katakana in some instances. A mapping between Hiragana and Katakana is trivially possible. You should consider normalising all Kana (i.e. Hiragana and Katakana) to a common representation (either Hiragana or Katakana) before making any comparisons.
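The width, whitespace and Kana normalisations described above can all be done with standard Unicode tooling. NFKC normalisation happens to fold full-width Latin to half-width and half-width Katakana to full-width, and Hiragana maps to Katakana by a fixed codepoint offset:

```python
import unicodedata

# Hiragana (U+3041-U+3096) maps to Katakana by a fixed offset of 0x60.
HIRA_TO_KATA = {code: code + 0x60 for code in range(0x3041, 0x3097)}

def normalize_name(name):
    """Fold width variants (NFKC), collapse whitespace variants to a single
    ASCII space, and convert all Hiragana to Katakana."""
    name = unicodedata.normalize("NFKC", name)
    name = " ".join(name.split())  # collapses all Unicode whitespace variants
    return name.translate(HIRA_TO_KATA)

print(normalize_name("まこと"))  # 'マコト'
```

For pure-Kanji sequences you would additionally strip the remaining spaces entirely, as recommended above.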
5.6 Kana representations of Kanji names. It may also happen that some Kanji names are represented using Kana. This is because whoever made the database entry might not have known the correct Kanji for the name (especially with first names, guessing the correct Kanji after hearing a name, e.g. on the phone, is very often impossible even for native speakers). Unfortunately, mapping between Kanji representations and Kana representations is very difficult and highly ambiguous: for example 真, 誠 and 実 are all possible Kanji for the first name 'Makoto'. Any individual of that name will consider only one of them correct for himself, but it is impossible to know which one if the only thing you know is that the name is 'Makoto'. Kana, however, is sound-based, so all three versions are the same マコト in Katakana. Dictionaries built into morphological analyzers like mecab provide mappings, but because there is more than one possible Kanji for any Kana sequence and vice versa, actually using this during data cleansing will complicate your algorithm quite a lot. Depending on how your database was created in the first place, this may or may not be a relevant problem.
Edit specifically about publication author names: Japanese translations of non-Japanese books usually have the author name transliterated to Katakana. E.g. the book recommendation list of the Asahi newspaper has 30 books today; 7 have a Western author name in Katakana. They even have abbreviated first names and middle initials, which they keep in Latin, e.g.
H・S・フリードマン and L・R・マーティン
which corresponds to
H.S. Friedman (or Friedmann, or Fridman, or Fridmann?)
and
L.R. Martin (or Matin, or Mahtin?)
I'd say this exemplifies the most common way to deal with non-Japanese author names of books:
- Initials are preserved as Latin
- Unabbreviated parts of the name are given in Katakana (but there is no uniquely defined one-to-one mapping between Latin and Katakana, as described in 5.1)
- The order is preserved: First, middle, surname. That is a very common convention for author names, but in something like a customer database that may be different.
- Either whitespace, or middle dot (as above), or the standard ASCII dot are used to separate the elements
So as long as your project is related to author names of books, I believe the following is accurate with regards to non-Japanese authors:
The same author may appear in a Latin representation (in a non-Japanese entry) as well as a Katakana representation (in a Japanese entry). To be able to determine that two such entries refer to the same author, you'll need to map between Katakana and Latin. That is a non-trivial problem, but not totally insurmountable either (although it will never work 100% correctly). I am unsure if a good solution is available for free; but let's address this in a separate question (perhaps with the japanese tag) if required.
Even if for some reason we can assume that there are no Latin duplicates of Katakana names, there is still a good chance that there are multiple variants in Katakana (due to 5.1). However, for author names (in particular of well-known authors), it may be safe to assume that the amount of variation is relatively limited. Hence, for a start, it may be sufficient to normalize dots and whitespace.
Splitting into first and last name is trivial (whitespace and dots), and the order of names will generally be the same across all variants.
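Splitting a transliterated author name at whitespace, the middle dot, or the ASCII dot (the separators listed above) is then a one-liner:

```python
import re

def split_transliterated(name):
    """Split a Katakana/Latin author name at whitespace, the middle dot ・
    (U+30FB), or an ASCII dot, dropping any empty parts."""
    return [part for part in re.split(r"[\s・.]+", name) if part]

print(split_transliterated("H・S・フリードマン"))  # ['H', 'S', 'フリードマン']
```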
Western authors will generally not be represented using Kanji. There are a few people who consider themselves so closely related to Japan that they choose Kanji for their own name (it's a matter of choice, not just transliteration, because the Kanji carry meaning), but that will be so rare that it is hardly worth worrying about.
Now regarding Japanese authors, those will be represented in Kanji as described in part 2 of the main answer. In Western translations of their books, their name will generally be given in Latin, and the order will be exchanged. For example,
村上春樹 (村上 = Murakami, the family name, 春樹 = Haruki, the given name)
will be represented as
Haruki Murakami
on translations of his books. This kind of mapping between Kanji and Latin requires a very comprehensive dictionary and quite a lot of work. Also, the spelling in Latin cannot always be uniquely determined, even if the reading of the Kanji can. E.g. one of the most frequent Japanese family names, 伊藤, may be spelled 'Ito' as well as 'Itoh' in English. Even 'Itou' and 'Itoo' are not impossible.
If Japanese-Latin cross matching is not required, the only kind of variation amongst the Kanji representations themselves you will see are Kanji variants (5.2). But to be clear, even where a traditional as well as a modern variant of a Kanji exists, only one of them is correct for any given individual. Typing the wrong Kanji variant may easily happen when a phone operator enters names into a database, but in a database of author names this will be relatively rare because the correct spelling of an author can be verified relatively easily.
Regarding the question about 5.6 (Kana vs. Kanji):
Some people's given name has no Kanji representation, only a Hiragana one. Since there is a one-to-one correspondence between Hiragana and Katakana, there is a fair chance that both variants appear in a database. I recommend converting all Hiragana to Katakana (or vice versa) before comparing.
However, most people's names are written in Kanji. On the cover of a book, those Kanji will be used, so most likely they will also be used in your database. The only reasons why somebody might input Kana instead of Kanji are: (a) when he/she does not know the correct Kanji (perhaps unlikely since you can easily search Amazon or whatever to find out), (b) when the database is made for search purposes. Search engines for book catalogues might include Katakana versions because that enables users to find authors even if they don't know the correct Kanji. Hence, whether or not you need Kanji-Kana conversion (which is a hard problem) depends on the original purpose of the data and how the database was created.
Regarding nicknames: There are nicknames used in daily conversation, but I doubt you would find them in an author database. I realize there are languages (e.g. Polish) that use nicknames or diminutives (e.g. 'Gosia' instead of 'Małgorzata') in an almost regular way, but I wouldn't say that is the case with Japanese.
Regarding Chinese: I am unable to give a comprehensive answer, but at least the whole Kanji-Kana variation problem does not exist, because Chinese uses Kanji (under the name of Hanzi) only. There is a major Kanji variation problem, however (especially between traditional variants (used in Taiwan) and simplified variants (used on the mainland)).
Regarding Korean: As far as I know, Koreans are generally able to write their name in Hanja (= Kanji), although they don't use Hanja for most of the rest of the language most of the time. But there is obviously a Hangul version of the name, too. I am unsure to what extent Hanja-Hangul conversion is required for a cleansing problem like yours. If it is, it will be a very hard problem.
Regarding regional variants: There are no regional variants of the Kanji characters themselves in Japanese (at least not in modern times). The Kanji of any given author will be written in the same way all over Japan. Of course there are certain family names that are more frequent in one region than another, though. If you are interested in the names themselves (rather than the people they refer to), regional variants (as well as variation between traditional and modern forms of the Kanji) will play a role.
Answer 2:
For Chinese, most names consist of 3 characters: the first character is the family name (!), and the other two characters are the personal name, like
Mao Zedong = family name Mao and personal name Zedong.
There are also some 2-character names; in that case the first character is the family name and the second character is the personal name.
4-character names are rare, but then the split is usually 2-2.
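These length-based rules make for a very small splitter (the 2-2 split for 4-character names assumes a two-character compound surname, which is the usual case described above, though not the only possible one):

```python
def split_chinese_name(name):
    """Heuristic split of a Chinese name by length:
    3 chars -> 1-char family + 2-char personal name,
    2 chars -> 1 + 1,
    4 chars -> usually a 2-char compound surname + 2-char personal name."""
    if len(name) == 4:
        return name[:2], name[2:]
    return name[:1], name[1:]

print(split_chinese_name("毛泽东"))  # ('毛', '泽东')
```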
Seeing this, it does not really make much sense to do n-gram analysis of Chinese names: you'd just be discovering the most common Chinese family and personal names.
Answer 3:
So doing bi-gram style matching is a common hack for doing search in Japanese, but there are better approaches you can use to determine word boundaries. In a project I've worked on in the past we had fairly good results with mecab for Japanese brand names and some other text. I imagine you could get better performance by training it on a list of Japanese names. Sadly, it's in C, but we ended up using it anyway in Java through JNI; you could do something similar in your Python code.
Source: https://stackoverflow.com/questions/10034881/n-gram-name-analysis-in-non-english-languages-cjk-etc