I need to take a string, and shorten it to 140 characters.
Currently I am doing:
if len(tweet) > 140:
tweet = re.sub(r\"\\s+\", \" \", tweet)
Save two characters and use an elipsis (…
, 0x2026) instead of three dots!
This punts the word-breaking decision to the re module, but it may work well enough for you.
import re
def shorten(tweet, footer="", limit=140):
"""Break tweet into two pieces at roughly the last word break
before limit.
"""
lower_break_limit = limit / 2
# limit under which to assume breaking didn't work as expected
limit -= len(footer)
tweet = re.sub(r"\s+", " ", tweet.strip())
m = re.match(r"^(.{,%d})\b(?:\W|$)" % limit, tweet, re.UNICODE)
if not m or m.end(1) < lower_break_limit:
# no suitable word break found
# cutting at an arbitrary location,
# or if len(tweet) < lower_break_limit, this will be true and
# returning this still gives the desired result
return tweet[:limit] + footer
return m.group(1) + footer
Basically, in CJK (Except Korean with spaces), you need dictionary look-ups to segment words properly. Depending on your exact definition of "word", Japanese can be more difficult than that, since not all inflected variants of a word (i.e. "行こう" vs. "行った") will appear in the dictionary. Whether it's worth the effort depends upon your application.