I've got a series of texts that are mostly English but contain some phrases with Chinese characters. Here are two examples:
s1 = "You say: 你好. I say: 再見"
s
A possible solution is to capture everything, but in two different capture groups, so you can tell afterwards which pieces are Chinese and which are not.
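import re

# utf_line is the decoded input text; translate() is your own translation function.
# Group 1 captures runs of Chinese characters, group 2 captures everything else.
# On Python 2, write the pattern as ur'...' instead of r'...'.
ret = re.findall(r'([\u4e00-\u9fff]+)|([^\u4e00-\u9fff]+)', utf_line)
result = []
for match in ret:
    if match[0]:
        # Chinese run: translate it
        result.append(translate(match[0]))
    else:
        # Non-Chinese run: keep it unchanged
        result.append(match[1])
print(''.join(result))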
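To try the approach end to end, here is a minimal Python 3 sketch applied to the s1 example from the question. The `translate` function is a hypothetical placeholder that only wraps the text in brackets, and `translate_chinese` is a name made up for illustration; substitute your real translation call.

```python
import re

# Either a run of CJK Unified Ideographs (group 1) or a run of anything else (group 2).
CHUNKS = re.compile(r'([\u4e00-\u9fff]+)|([^\u4e00-\u9fff]+)')


def translate(text):
    # Hypothetical stand-in: a real implementation would call a translation service.
    return '[' + text + ']'


def translate_chinese(line):
    """Translate only the Chinese runs in line, leaving the rest untouched."""
    parts = []
    for chinese, other in CHUNKS.findall(line):
        # Exactly one of the two groups is non-empty for each match.
        parts.append(translate(chinese) if chinese else other)
    return ''.join(parts)


s1 = "You say: 你好. I say: 再見"
print(translate_chinese(s1))  # You say: [你好]. I say: [再見]
```

Note that the range \u4e00-\u9fff covers only the basic CJK Unified Ideographs block, so rarer characters from the extension blocks and CJK punctuation end up in the "everything else" group. If you only need the final string, `re.sub` with a function as the replacement argument should give the same result with a single-group pattern.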