Python - can I detect unicode string language code?

Asked 2020-11-27 16:29 · 7 answers · 2026 views

I'm faced with a situation where I'm reading a string of text and I need to detect the language code (en, de, fr, es, etc.).

Is there a simple way to do this in Python?

7 Answers
  • 2020-11-27 16:41

    In my case I only need to distinguish two languages, so I just check the script of the first character:

    import unicodedata

    def is_greek(term):
        # Unicode character names embed the script,
        # e.g. u'α' -> 'GREEK SMALL LETTER ALPHA'
        return 'GREEK' in unicodedata.name(term.strip()[0])

    def is_hebrew(term):
        return 'HEBREW' in unicodedata.name(term.strip()[0])
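
The same first-character trick can be written for Python 3, where every `str` is already Unicode. This is only a sketch: the `char_script` helper is my own name, and taking the first word of the Unicode character name as the script is a heuristic, not a guarantee.

```python
import unicodedata

def char_script(ch):
    # unicodedata.name raises ValueError for unnamed code points,
    # so fall back to 'UNKNOWN'
    try:
        name = unicodedata.name(ch)
    except ValueError:
        return 'UNKNOWN'
    # Names look like 'GREEK SMALL LETTER ALPHA' or 'HEBREW LETTER ALEF';
    # the first word is usually the script
    return name.split()[0]

def is_greek(term):
    return char_script(term.strip()[0]) == 'GREEK'

def is_hebrew(term):
    return char_script(term.strip()[0]) == 'HEBREW'
```

Note this identifies the script, not the language: Hebrew script could also be Yiddish, and Latin script could be almost anything.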
    
  • 2020-11-27 16:43

    If you need to detect the language in response to a user action, you could use the Google AJAX Language API:

    #!/usr/bin/env python
    import json
    import urllib, urllib2
    
    def detect_language(text,
        userip=None,
        referrer="http://stackoverflow.com/q/4545977/4279",
        api_key=None):        
    
        query = {'q': text.encode('utf-8') if isinstance(text, unicode) else text}
        if userip: query.update(userip=userip)
        if api_key: query.update(key=api_key)
    
        url = 'https://ajax.googleapis.com/ajax/services/language/detect?v=1.0&%s'%(
            urllib.urlencode(query))
    
        request = urllib2.Request(url, None, headers=dict(Referer=referrer))
        d = json.load(urllib2.urlopen(request))
    
        if d['responseStatus'] != 200 or u'error' in d['responseData']:
            raise IOError(d)
    
        return d['responseData']['language']
    
    print detect_language("Python - can I detect unicode string language code?")
    

    Output

    en
    

    Google Translate API v2

    The default limit is 100,000 characters/day (no more than 5,000 characters per request).

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import json
    import urllib, urllib2
    
    from operator import itemgetter
    
    def detect_language_v2(chunks, api_key):
        """
        chunks: either string or sequence of strings
    
        Return list of corresponding language codes
        """
        if isinstance(chunks, basestring):
            chunks = [chunks] 
    
        url = 'https://www.googleapis.com/language/translate/v2'
    
        data = urllib.urlencode(dict(
            q=[t.encode('utf-8') if isinstance(t, unicode) else t 
               for t in chunks],
            key=api_key,
            target="en"), doseq=1)
    
        # the request length MUST be < 5000
        if len(data) > 5000:
            raise ValueError("request is too long, see "
                "http://code.google.com/apis/language/translate/terms.html")
    
        #NOTE: use POST to allow more than 2K characters
        request = urllib2.Request(url, data,
            headers={'X-HTTP-Method-Override': 'GET'})
        d = json.load(urllib2.urlopen(request))
        if u'error' in d:
            raise IOError(d)
        return map(itemgetter('detectedSourceLanguage'), d['data']['translations'])
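
The `doseq=1` flag is what turns the list under `q` into repeated `q=` parameters, one per text chunk. A quick stdlib check of that behaviour in Python 3 (the `API_KEY` value is a placeholder):

```python
from urllib.parse import urlencode

# doseq=True expands the list value into repeated 'q=' parameters,
# one per text chunk; 'API_KEY' is a placeholder, not a real key
params = urlencode({'q': ['hello world', u'матрёшка'.encode('utf-8')],
                    'key': 'API_KEY'}, doseq=True)
print(params)
# q=hello+world&q=%D0%BC%D0%B0%D1%82%D1%80%D1%91%D1%88%D0%BA%D0%B0&key=API_KEY
```

Without `doseq`, the list would be serialized as a single Python-repr string, which the API would not understand.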
    

    Now you can request language detection explicitly (via the /detect endpoint), without asking for a translation:

    def detect_language_v2(chunks, api_key):
        """
        chunks: either string or sequence of strings
    
        Return list of corresponding language codes
        """
        if isinstance(chunks, basestring):
            chunks = [chunks] 
    
        url = 'https://www.googleapis.com/language/translate/v2/detect'
    
        data = urllib.urlencode(dict(
            q=[t.encode('utf-8') if isinstance(t, unicode) else t
               for t in chunks],
            key=api_key), doseq=True)
    
        # the request length MUST be < 5000
        if len(data) > 5000:
            raise ValueError("request is too long, see "
                "http://code.google.com/apis/language/translate/terms.html")
    
        #NOTE: use POST to allow more than 2K characters
        request = urllib2.Request(url, data,
            headers={'X-HTTP-Method-Override': 'GET'})
        d = json.load(urllib2.urlopen(request))
    
        return [sorted(L, key=itemgetter('confidence'))[-1]['language']
                for L in d['data']['detections']]
    

    Example:

    print detect_language_v2(
        ["Python - can I detect unicode string language code?",
         u"матрёшка",
         u"打水"], api_key=open('api_key.txt').read().strip())
    

    Output

    [u'en', u'ru', u'zh-CN']
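
The highest-confidence selection step in `detect_language_v2` can be tried in isolation. The response dict below is invented, mimicking only the shape of the v2 `/detect` JSON; `max(..., key=...)` is equivalent to the `sorted(...)[-1]` used above:

```python
from operator import itemgetter

# Invented sample with the same shape as a v2 /detect response;
# the confidence values are made up for illustration
d = {'data': {'detections': [
    [{'language': 'en', 'confidence': 0.92},
     {'language': 'de', 'confidence': 0.03}],
    [{'language': 'ru', 'confidence': 0.88}],
]}}

# For each input chunk, keep the detection with the highest confidence
languages = [max(L, key=itemgetter('confidence'))['language']
             for L in d['data']['detections']]
print(languages)  # ['en', 'ru']
```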
    
  • 2020-11-27 16:49

    A useful article suggests that an open-source library named CLD is the best bet for detecting language in Python.

    The article compares the speed and accuracy of three solutions:

    1. language-detection or its python port langdetect
    2. Tika
    3. Chromium Language Detection (CLD)

    I wasted my time with langdetect; now I am switching to CLD, which is 16x faster than langdetect and has 98.8% accuracy.

  • 2020-11-27 16:57

    Look at Natural Language Toolkit and Automatic Language Identification using Python for ideas.

    I would like to know whether a Bayesian filter could identify the language correctly, but I can't write a proof of concept right now.

  • 2020-11-27 17:01

    If you only have a limited number of possible languages, you could use a set of dictionaries (possibly only including the most common words) of each language and then check the words in your input against the dictionaries.
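
A minimal stdlib sketch of that approach; the tiny stopword sets below are my own and far too small for real use, where you would want the few hundred most frequent words per language:

```python
# Tiny, illustrative stopword sets -- a real implementation would load
# much larger frequency lists for each candidate language
STOPWORDS = {
    'en': {'the', 'and', 'is', 'of', 'to', 'in', 'that', 'it'},
    'de': {'der', 'die', 'und', 'ist', 'das', 'nicht', 'ein', 'zu'},
    'fr': {'le', 'la', 'et', 'est', 'les', 'des', 'une', 'que'},
    'es': {'el', 'la', 'y', 'es', 'los', 'las', 'una', 'que'},
}

def guess_language(text):
    words = set(text.lower().split())
    # Score each language by how many of its stopwords occur in the text
    scores = {lang: len(words & sw) for lang, sw in STOPWORDS.items()}
    return max(scores, key=scores.get)

print(guess_language("the cat is in the garden"))  # en
print(guess_language("der Hund ist nicht hier"))   # de
```

With such short lists, ties and misses are common on short inputs, so a real version should also return the score and fall back to "unknown" below some threshold.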

  • 2020-11-27 17:02

    Have a look at guess-language:

    Attempts to determine the natural language of a selection of Unicode (utf-8) text.

    But as the name says, it guesses the language. You can't expect 100% correct results.

    Edit:

    guess-language is unmaintained, but there is a fork that supports Python 3: guess_language-spirit.
