Python: Split Unicode string on word boundaries

清酒与你 2020-12-31 13:16

I need to take a string, and shorten it to 140 characters.

Currently I am doing:

if len(tweet) > 140:
    tweet = re.sub(r"\s+", " ", tweet)


        
9 answers
  • 2020-12-31 13:51

    Save two characters and use an ellipsis (…, U+2026) instead of three dots!
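
A minimal sketch of that suggestion, assuming the limit is counted in characters rather than bytes (in UTF-8 the ellipsis takes three bytes, so the saving disappears if bytes are what count). The names `ELLIPSIS` and `truncate_with_ellipsis` are made up for this illustration:

```python
ELLIPSIS = "\u2026"  # the single-character ellipsis, U+2026


def truncate_with_ellipsis(text, limit=140):
    """Return text cut to at most `limit` characters, ending in one ellipsis."""
    if len(text) <= limit:
        return text
    # Reserve room for the marker, then append it.
    return text[: limit - len(ELLIPSIS)] + ELLIPSIS


long_text = "word " * 50  # 250 characters
short = truncate_with_ellipsis(long_text)
print(len(short))                    # never more than 140
print(len("...") - len(ELLIPSIS))    # characters saved versus three dots
```

This cuts at an arbitrary position; it only shows the character saving, not the word-boundary handling the question asks about.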

  • 2020-12-31 14:02

    This punts the word-breaking decision to the re module, but it may work well enough for you.

    import re
    
    def shorten(tweet, footer="", limit=140):
        """Truncate tweet at roughly the last word break before limit,
        then append footer.
        """
        # Below this length, assume the word break search did not work as
        # expected and fall back to a hard cut.
        lower_break_limit = limit // 2
    
        limit -= len(footer)
    
        tweet = re.sub(r"\s+", " ", tweet.strip())
        m = re.match(r"^(.{,%d})\b(?:\W|$)" % limit, tweet, re.UNICODE)
        if not m or m.end(1) < lower_break_limit:
            # Either no suitable word break was found, so cut at an
            # arbitrary position, or the tweet is shorter than
            # lower_break_limit, in which case the slice returns it whole,
            # which is still the desired result.
            return tweet[:limit] + footer
        return m.group(1) + footer
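
To see what "punting to the re module" means in practice: `\b` in `re` is the boundary between a `\w` character and a `\W` character (or the string edge), so the quality of the break depends on what `re` treats as a word character. The snippet below is only an illustration; the helper name `words` and the sample strings are assumptions, not part of the answer above.

```python
import re


def words(text):
    """Return the runs of word characters (\\w+) that re finds in text."""
    return re.findall(r"\w+", text)


english = "Split me on word boundaries"
japanese = "東京に行った"  # written without spaces, as is normal in Japanese

print(words(english))        # one entry per space-separated word
print(len(words(japanese)))  # the whole sentence counts as a single "word"
```

Because text written without spaces has no internal `\b` positions, a pattern built on `\b` finds no break inside it, and a function like `shorten` ends up in its hard-cut fallback for such input.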
    
  • 2020-12-31 14:03

    Basically, in CJK text (except Korean, which uses spaces), you need dictionary look-ups to segment words properly. Depending on your exact definition of "word", Japanese can be more difficult than that, since not all inflected variants of a word (e.g. "行こう" vs. "行った") will appear in the dictionary. Whether it's worth the effort depends on your application.
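
As a rough illustration of dictionary look-up segmentation, here is a toy greedy longest-match sketch. Everything in it is hypothetical: the three-entry `TOY_DICTIONARY`, the function names, and the fallback of emitting single characters for unknown text. A real application would rely on a proper morphological analyzer with a full dictionary.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation against a set of known words."""
    max_len = max(map(len, dictionary), default=1)
    result, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                result.append(candidate)
                i += size
                break
    return result


def shorten_on_words(text, dictionary, limit):
    """Keep whole segments while the result stays within limit characters."""
    out = ""
    for word in segment(text, dictionary):
        if len(out) + len(word) > limit:
            break
        out += word
    return out


TOY_DICTIONARY = {"東京", "行く", "に"}  # hypothetical, far too small for real use

print(segment("東京に行く", TOY_DICTIONARY))    # dictionary form is found
print(segment("東京に行った", TOY_DICTIONARY))  # inflected form falls apart
print(shorten_on_words("東京に行く", TOY_DICTIONARY, 4))
```

The second call shows the problem described above: "行った" is not in the dictionary, so it breaks into single characters instead of being recognized as one word.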
