I have a set of documents in two languages: English and German. There is no usable meta information about these documents; a program can look at the content only. Based on the content alone, how can a program determine which language each document is written in?
English and German use the same set of letters except for ä, ö, ü and ß (eszett). You can look for those letters to determine the language.
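A minimal sketch of this letter check (the caveat being that absence of these letters does not prove English, since short German texts or "ae"/"oe"/"ue"/"ss" transliterations may contain none of them):

```python
# German-specific letters, upper and lower case.
GERMAN_LETTERS = set("äöüßÄÖÜ")

def looks_german(text):
    """Return True if the text contains a German-specific letter."""
    return any(ch in GERMAN_LETTERS for ch in text)

print(looks_german("Die Straße ist schön"))   # True
print(looks_german("The road is beautiful"))  # False
```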
You can also look at the paper "Comparing two language identification schemes" by Grefenstette. It looks at letter trigrams and short words. Common trigrams for German: en_, er_, _de. Common trigrams for English: _th, the, he_.
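A toy sketch of the trigram idea: build a trigram profile per language from sample text (using '_' to mark word boundaries), then score a document against each profile. The one-sentence training samples here are illustrative stand-ins; real profiles need substantial training text per language.

```python
from collections import Counter

def trigrams(text):
    """Count letter trigrams, with '_' marking word boundaries."""
    grams = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"
        for i in range(len(padded) - 2):
            grams[padded[i:i + 3]] += 1
    return grams

# Tiny illustrative training samples (real profiles need much more text).
en_profile = trigrams("the quick brown fox jumps over the lazy dog the end")
de_profile = trigrams("der schnelle braune fuchs springt über den faulen hund")

def score(text, profile):
    """Sum profile counts of the text's trigrams; higher means more similar."""
    return sum(profile[g] * n for g, n in trigrams(text).items())

def guess(text):
    return "en" if score(text, en_profile) >= score(text, de_profile) else "de"
```

A production version would use larger profiles and a proper similarity measure (Grefenstette compares rank-order statistics rather than raw counts), but the structure is the same.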
There’s also Bob Carpenter’s How does LingPipe Perform Language ID?
I believe the standard procedure is to measure the quality of a proposed algorithm with test data (i.e. with a corpus). Define the percentage of correct classifications that you would like the algorithm to achieve, then run it over a number of documents which you have manually classified.
As for the specific algorithm: using a list of stop words sounds fine. Another approach that has been reported to work is to use a Bayesian filter, e.g. SpamBayes. Rather than training it on ham and spam, train it on English and German: use a portion of your corpus for training, then test it on the complete data.
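The SpamBayes idea, with languages in place of ham/spam, can be sketched as a word-level naive Bayes classifier. The one-line training texts below are illustrative placeholders; a real classifier would be trained on a sizable portion of your corpus.

```python
import math
from collections import Counter

def train(texts):
    """Return a log-probability function with Laplace smoothing."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    vocab = len(counts)
    return lambda w: math.log((counts[w] + 1) / (total + vocab))

# Illustrative training data; substitute documents from your own corpus.
en_logprob = train(["the quick brown fox", "she is reading a book"])
de_logprob = train(["der schnelle braune fuchs", "sie liest ein buch"])

def classify(text):
    """Pick the language whose model gives the text higher likelihood."""
    words = text.lower().split()
    en = sum(en_logprob(w) for w in words)
    de = sum(de_logprob(w) for w in words)
    return "en" if en > de else "de"
```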
Isn't the problem several orders of magnitude easier if you've only got two languages (English and German) to choose from? In this case your approach of a list of stop words might be good enough.
Obviously you'd need to consider a rewrite if you added more languages to your list.
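For the two-language case, the stop-word approach can be as simple as counting hits from each language's list. These short word lists are illustrative; fuller lists per language work better.

```python
# Small illustrative stop-word lists; extend these for real use.
EN_STOP = {"the", "and", "of", "to", "is", "in", "that", "it"}
DE_STOP = {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"}

def detect_by_stopwords(text):
    """Count stop-word hits per language; ties are reported as unknown."""
    words = set(text.lower().split())
    en_hits = len(words & EN_STOP)
    de_hits = len(words & DE_STOP)
    if en_hits == de_hits:
        return "unknown"
    return "en" if en_hits > de_hits else "de"
```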
You can use the Google Language Detection API (part of the Google AJAX Language API, which has since been deprecated).
Here is a little program that uses it:
    import json
    import urllib.parse
    import urllib.request

    baseUrl = "http://ajax.googleapis.com/ajax/services/language/detect"

    def detect(text):
        """Return the language code of a natural language text."""
        # The API accepts limited input, so only send the first 3000 characters.
        params = urllib.parse.urlencode({'v': '1.0', 'q': text[:3000]})
        resp = json.load(urllib.request.urlopen(baseUrl + "?" + params))
        return resp['responseData']['language']

    def test():
        print("Type some text to detect its language:")
        while True:
            text = input('#> ')
            print(detect(text))

    if __name__ == '__main__':
        try:
            test()
        except KeyboardInterrupt:
            print()
Other useful references:
Google Announces APIs (and demo): http://googleblog.blogspot.com/2008/03/new-google-ajax-language-api-tools-for.html
Python wrapper: http://code.activestate.com/recipes/576890-python-wrapper-for-google-ajax-language-api/
Another python script: http://www.halotis.com/2009/09/15/google-translate-api-python-script/
RFC 1766 defines the language tags used for these codes
Get the current language codes from: http://www.iana.org/assignments/language-subtag-registry
Try measuring the occurrences of each letter in the text. Compute the letter frequencies (and perhaps their distributions) from known English and German texts. Having obtained these data, you can reason about which language the frequency distribution of your text belongs to.
You could use Bayesian inference to determine the closest language (with a certain error probability), or perhaps other statistical methods suited to such tasks.
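As a simpler statistical baseline than full Bayesian inference, the frequency comparison can be done with a squared-error distance to reference distributions. The reference frequencies below are rough illustrative values, not measured corpus statistics.

```python
from collections import Counter

# Rough, illustrative letter frequencies; replace with values measured
# from a real English and German corpus.
EN_FREQ = {"e": 0.127, "t": 0.091, "a": 0.082, "o": 0.075, "i": 0.070,
           "n": 0.067, "s": 0.063, "h": 0.061, "r": 0.060}
DE_FREQ = {"e": 0.174, "n": 0.098, "i": 0.076, "s": 0.073, "r": 0.070,
           "a": 0.065, "t": 0.062, "d": 0.051, "h": 0.048}

def letter_freqs(text):
    """Relative frequency of each letter in the text."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return {c: n / total for c, n in counts.items()} if total else {}

def distance(freqs, ref):
    """Sum of squared differences over the reference letters."""
    return sum((freqs.get(c, 0.0) - p) ** 2 for c, p in ref.items())

def guess_language(text):
    f = letter_freqs(text)
    return "en" if distance(f, EN_FREQ) < distance(f, DE_FREQ) else "de"
```

This works best on longer documents, where the observed frequencies converge toward the language's true distribution.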