Python: Getting rid of \u200b from a string using regular expressions

轻奢々 2021-02-08 11:44

I have a web scraper that takes forum questions, splits them into individual words, and writes them to a text file. The words are stored in a list of tuples. Each tuple contains

2 answers

不知归路 (OP)

2021-02-08 12:34
I tested this with Python 2.7. replace works as expected:
```
>>> u'used\u200b'.replace(u'\u200b', '*')
u'used*'
```
and so does strip, which removes the character only from the two ends of the string:
```
>>> u'used\u200b'.strip(u'\u200b')
u'used'
```
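Just remember that in Python 2 the arguments to those functions have to be Unicode literals: u'\u200b', not '\u200b'. Note the leading u. Without it, '\u200b' is a byte string in which \u is not an escape sequence, so it is the six literal characters backslash, u, 2, 0, 0, b and never matches the zero-width space.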
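
Since the question asks about regular expressions, here is a minimal sketch of the same removal done with re.sub. The helper name remove_zero_width_space and the sample string are made up for illustration and are not from the question. The u'' literals keep it valid on Python 2.7 and on Python 3.3 or later.

```python
import re

# Pattern that matches every zero-width space (U+200B).
ZERO_WIDTH_SPACE = re.compile(u'\u200b')


def remove_zero_width_space(text):
    """Return text with every U+200B removed, wherever it occurs.

    Hypothetical helper, named here only for illustration.
    """
    return ZERO_WIDTH_SPACE.sub(u'', text)


# Made-up sample with U+200B in the middle and at the end.
sample = u'us\u200bed\u200b'

cleaned = remove_zero_width_space(sample)

# For comparison: strip() trims only the ends, so the inner one survives.
stripped_only = sample.strip(u'\u200b')
```

For a single fixed character, replace does the same job without a regex. A pattern starts to pay off if you want to drop several invisible characters in one pass by using a character class.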

And actually, writing that character to a file works just fine, as long as the file is opened with an explicit encoding:
```
>>> import codecs
>>> f = codecs.open('a.txt', encoding='utf-8', mode='w')
>>> f.write(u'used\u200bZero')
>>> f.close()  # flush the buffer so the text actually reaches the file
```
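
To check that the character really survives the round trip, something like the following sketch could be used. It swaps codecs.open for io.open, which takes the same encoding argument and exists on both Python 2.7 and Python 3. The temporary directory is an assumption, chosen so the example does not overwrite a real a.txt.

```python
import io
import os
import tempfile

text = u'used\u200bZero'

# Hypothetical temporary location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'a.txt')

# Write the Unicode string, encoding it as UTF-8 on the way out.
with io.open(path, mode='w', encoding='utf-8') as f:
    f.write(text)

# Read it back as text: decoding restores the original string.
with io.open(path, mode='r', encoding='utf-8') as f:
    round_tripped = f.read()

# Read it back as raw bytes: U+200B is stored as the three bytes E2 80 8B.
with io.open(path, mode='rb') as f:
    raw_bytes = f.read()
```

The with blocks close the file automatically, so no explicit close call is needed.
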
See resources:
- The Python 2 Unicode HOWTO
- The Python 3 Unicode HOWTO
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)