Am I passing the string correctly to the python library?

终归单人心 2021-01-29 07:17

I'm using a Python library called Guess Language: http://pypi.python.org/pypi/guess-language/0.1

\"justwords\" is a string with unicode text. I stick it in the package,

3 Answers
  •  日久生厌
    2021-01-29 08:03

    Looking at the main page, it says """Detects over 60 languages; Greek (el), Korean (ko), Japanese (ja), Chinese (zh) and all the languages listed in the trigrams directory. """

    It doesn't use trigrams for those 4 languages; it relies on what script blocks are present in the input text. Looking at the source code:

    if "Katakana" in scripts or "Hiragana" in scripts or "Katakana Phonetic Extensions" in scripts:
        return "ja"
    
    if "CJK Unified Ideographs" in scripts or "Bopomofo" in scripts \
            or "Bopomofo Extended" in scripts or "KangXi Radicals" in scripts:
        return "zh"
    

    For a script name like Katakana or Hiragana to appear in scripts, such characters must make up 40% or more of the input text (after normalisation, which removes non-alphabetic characters etc.). It may be that some Japanese text needs a threshold of less than 40%. However, if that were the problem with your text, I would expect it to have more than 40% kanji (CJK Unified Ideographs) and thus to return "zh" (Chinese).
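    To make the heuristic concrete, here is a minimal Python 3 sketch of the percentage-and-threshold idea described above. The function names (script_percentages, relevant_scripts) and the hardcoded block ranges are my own simplification for illustration; the real guess_language uses a full Unicode block table and is Python 2 code.

    ```python
    # Approximate Unicode block ranges for the scripts discussed
    # (simplified; guess_language uses a complete block table).
    BLOCKS = {
        "Hiragana": (0x3040, 0x309F),
        "Katakana": (0x30A0, 0x30FF),
        "CJK Unified Ideographs": (0x4E00, 0x9FFF),
        "Basic Latin": (0x0000, 0x007F),
    }

    def script_percentages(text):
        """Return {block_name: percentage} over the alphabetic characters of text."""
        counts = {}
        total = 0
        for ch in text:
            if not ch.isalpha():      # the "normalisation": drop punctuation, digits, spaces
                continue
            cp = ord(ch)
            for name, (lo, hi) in BLOCKS.items():
                if lo <= cp <= hi:
                    counts[name] = counts.get(name, 0) + 1
                    total += 1
                    break
        if not total:
            return {}
        return {name: n * 100.0 / total for name, n in counts.items()}

    def relevant_scripts(text, threshold=40.0):
        """Blocks at or above the threshold -- mirrors the >= 40% rule."""
        return [name for name, pct in script_percentages(text).items()
                if pct >= threshold]
    ```

    For example, "これはテストです" is 5 hiragana and 3 katakana characters, so Hiragana scores 62.5% and only Hiragana clears the 40% bar.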

    Update after some experimentation, including inserting a print statement to show what script blocks were detected with what percentages:

    A presumably typical news item from the Asahi newspaper website:

     49.3 Hiragana
      8.7 Katakana
     42.0 CJK Unified Ideographs
    result ja
    

    A presumably atypical ditto:

     35.9 Hiragana
     49.2 CJK Unified Ideographs
     13.3 Katakana
      1.6 Halfwidth and Fullwidth Forms
    result zh
    

    (Looks like it might be a good idea to base the test on the total (Hiragana + Katakana) content)

    Result of shoving the raw front page (XML, HTML, everything) through the machinery:

      2.4 Hiragana
      6.1 CJK Unified Ideographs
      0.1 Halfwidth and Fullwidth Forms
      3.7 Katakana
     87.7 Basic Latin
    result ca
    

    The high percentage of Basic Latin is of course due to the markup. I haven't investigated what made it choose "ca" (Catalan) over any other language which uses Basic Latin, including English. However the gobbledegook that you printed doesn't show any sign of including markup.

    End of update

    Update 2

    Here's an example (2 headlines and next 4 paragraphs from this link) where about 83% of the characters are East Asian and the rest are Basic Latin but the result is en (English).

     29.6 Hiragana
     18.5 Katakana
     34.9 CJK Unified Ideographs
     16.9 Basic Latin
    result en
    

    The Basic Latin characters are caused by the use of the English names of organisations etc. in the text. The Japanese rule fails because neither Katakana nor Hiragana scores 40% (together they score 48.1%). The Chinese rule fails because CJK Unified Ideographs scores less than 40%. So the 83.1% East Asian characters are ignored, and the result is decided by the 16.9% minority. These "rotten borough" rules need some reform. In general terms, it could be expressed as:

    If (total of script blocks used by only language X) >= X-specific threshold, then select language X.

    As suggested above, Hiragana + Katakana >= 40% will probably do the trick for Japanese. A similar rule may well be needed for Korean.
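    The suggested reform can be sketched as follows, assuming the per-block percentages have already been computed. This is a hypothetical illustration, not guess_language's actual code; returning None stands in for falling through to the trigram analysis.

    ```python
    def guess_east_asian(pcts, threshold=40.0):
        """Apply the 'reformed' rules to a {block_name: percentage} dict.

        Japanese wins if Hiragana + Katakana together reach the threshold;
        otherwise Chinese wins if CJK Unified Ideographs do; otherwise fall
        back to trigram analysis (represented here by None).
        """
        kana = (pcts.get("Hiragana", 0)
                + pcts.get("Katakana", 0)
                + pcts.get("Katakana Phonetic Extensions", 0))
        if kana >= threshold:
            return "ja"
        if pcts.get("CJK Unified Ideographs", 0) >= threshold:
            return "zh"
        return None

    # The Update-2 example: kana totals 48.1%, so the combined rule
    # yields "ja" where the original rules produced "en".
    print(guess_east_asian({"Hiragana": 29.6, "Katakana": 18.5,
                            "CJK Unified Ideographs": 34.9,
                            "Basic Latin": 16.9}))  # -> ja
    ```

    It also fixes the "atypical" Asahi example above (35.9 + 13.3 = 49.2% kana), which the original rules classified as "zh".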

    Your gobbledegook did actually contain a few characters of markup (I didn't scroll far enough to the right to see it) but certainly not enough to depress all the East Asian scores below 40%. So we're still waiting to see what your actual input is and how you got it from where.

    End of update2

    To aid with diagnosis of your problem, please don't print gobbledegook; use

    print repr(justwords)
    

    That way anyone who is interested in actually doing debugging has got something to work on. It would help if you gave the URL of the webpage, and showed the Python code that you used to get your unicode justwords. Please edit your answer to show those 3 pieces of information.

    Update 3 Thanks for the URL. Visual inspection indicates that the language is overwhelmingly Chinese. What gave you the impression that it is Japanese?

    Semithanks for supplying some of your code. To avoid your correspondents having to do your work for you, and to avoid misunderstandings due to guessing, you should always supply (without being asked) a self-contained script that will reproduce your problem. Note that you say you got "ASCII errors" (no exact error message! no traceback!) if you didn't do .encode('utf8') -- my code (see below) doesn't have this problem.

    No thanks for not supplying the result of print repr(justwords) (even after being asked). Inspecting what intermediate data has been created is a very elementary and very effective debugging technique. This is something you should always do before asking a question. Armed with this knowledge you can ask a better question.

    Using this code:

    # coding: ascii
    import sys
    sys.path.append(r"C:\junk\wotlang\guess-language\guess_language")
    import guess_language
    URL = "http://feeds.feedburner.com/nchild"
    from BeautifulSoup import BeautifulStoneSoup
    from pprint import pprint as pp
    import urllib2
    htmlSource = urllib2.urlopen(URL).read()
    soup = BeautifulStoneSoup(htmlSource)
    fall = soup.findAll(text=True)
    # pp(fall)
    justwords = ''.join(fall)
    # justwords = justwords.encode('utf-8')
    result = guess_language.guessLanguage(justwords)
    print "result", result
    

    I got these results:

     29.0 CJK Unified Ideographs
      0.0 Extended Latin
      0.1 Katakana
     70.9 Basic Latin
    result en
    

    Note that the URL content is not static; about an hour later I got:

     27.9 CJK Unified Ideographs
      0.0 Extended Latin
      0.1 Katakana
     72.0 Basic Latin
    

    The statistics were obtained by fiddling around line 361 of guess_language.py so that it reads:

    for key, value in run_types.items():
        pct = (value*100.0) / totalCount # line changed so that pct is a float
        print "%5.1f %s" % (pct, key) # line inserted
        if pct >=40:
            relevant_runs.append(key)
    

    The statistics are symptomatic of Chinese with lots of HTML/XML/Javascript stuff (see previous example); this is confirmed by looking at the output of the pretty-print obtained by un-commenting pp(fall) -- lots of stuff like:

    <img style="float:left; margin:0 10px 0px 10px;cursor:pointer; cursor:hand
    ;" width="60px" src="http://2.bp.blogspot.com/_LBJ4udkQZag/Rm6sTn1b7NI/AAAAAAAAA
    FA/bYkSJZ3i2bg/s400/hepinge169.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_507518
    3283203730642" alt="\u548c\u5e73\u6771\u8def\u4e00\u6bb5169\u865f" title="\u548c
    \u5e73\u6771\u8def\u4e00\u6bb5169\u865f"/>\u4eca\u5929\u4e2d\u5348\u8d70\u523
    0\u516c\u53f8\u5c0d\u9762\u76847-11\u8cb7\u98f2\u6599\uff0c\u7a81\u7136\u770b\u5
    230\u9019\u500b7-11\u602a\u7269\uff01\u770b\u8d77\u4f86\u6bd4\u6a19\u6e96\u62db\
    u724c\u6709\u4f5c\u7528\u7684\u53ea\u6709\u4e2d\u9593\u7684\u6307\u793a\u71c8\u8
    00c\u5df2\uff0c\u53ef\u537b\u6709\u8d85\u7d1a\u5927\u7684footprint\uff01<br /
    ><br /><a href="http://4.bp.blogspot.com/_LBJ4udkQZag/Rm6wHH1b7QI/AA
    

    You need to do something about the markup. Steps: Look at your raw "htmlSource" in an XML browser. Is the XML non-compliant? How can you avoid having untranslated &lt; etc.? What elements have text content that is "English" only by virtue of it being a URL or similar? Is there a problem in Beautiful[Stone]Soup? Should you be using some other functionality of Beautiful[Stone]Soup? Should you use lxml instead?
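    As a starting point, here is a minimal Python 3 sketch of markup stripping using only the standard library's html.parser (rather than the BeautifulSoup used in the question's code), skipping script and style bodies so that Javascript and CSS don't inflate the Basic Latin count. The class and function names are illustrative.

    ```python
    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collect visible text, skipping <script> and <style> bodies."""
        SKIP = {"script", "style"}

        def __init__(self):
            super().__init__()       # convert_charrefs=True: &lt; arrives as <
            self.chunks = []
            self._skip_depth = 0

        def handle_starttag(self, tag, attrs):
            if tag in self.SKIP:
                self._skip_depth += 1

        def handle_endtag(self, tag):
            if tag in self.SKIP and self._skip_depth:
                self._skip_depth -= 1

        def handle_data(self, data):
            if not self._skip_depth:
                self.chunks.append(data)

    def visible_text(html):
        parser = TextExtractor()
        parser.feed(html)
        return "".join(parser.chunks)
    ```

    Feeding the extracted text (instead of the raw page) into the guesser keeps the markup from swamping the East Asian script percentages.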

    I'd suggest some research followed by a new SO question.

    end of update 3
