Python: Getting rid of \u200b from a string using regular expressions


I have a web scraper that takes forum questions, splits them into individual words, and writes them to a text file. The words are stored in a list of tuples. Each tuple contains


2 Answers

北海茫月

2021-02-08 15:23

You can open the file C:\Users\SOMEN\AppData\Local\Programs\Python\Python37-32\lib\encodings\cp1252.py (that is the path in my case, but yours should be similar) and add the character to the decoding table:

```
decoding_table = (
    '\x00'     #  0x00 -> NULL
    '\x01'     #  0x01 -> START OF HEADING
    '\x02'     #  0x02 -> START OF TEXT
    '\x03'     #  0x03 -> END OF TEXT
    '\x04'     #  0x04 -> END OF TRANSMISSION
    '\x05'     #  0x05 -> ENQUIRY
    '\x06'     #  0x06 -> ACKNOWLEDGE
    '\x07'     #  0x07 -> BELL
    '\x08'     #  0x08 -> BACKSPACE
    '\t'       #  0x09 -> HORIZONTAL TABULATION
    '\n'       #  0x0A -> LINE FEED
    '\x0b'     #  0x0B -> VERTICAL TABULATION
    '\x0c'     #  0x0C -> FORM FEED
    '\r'       #  0x0D -> CARRIAGE RETURN
    '\x0e'     #  0x0E -> SHIFT OUT
    '\x0f'     #  0x0F -> SHIFT IN
    '\x10'     #  0x10 -> DATA LINK ESCAPE
    '\x11'     #  0x11 -> DE
    # add the character code here
    '\u200b'   # add this in the file and save it
```
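The question text is cut off, so the following is only a sketch under the assumption that the underlying problem is a `UnicodeEncodeError` raised because the cp1252 codec has no mapping for U+200B (ZERO WIDTH SPACE). It assumes Python 3, and the helper name `can_encode` is made up for illustration. It checks which codecs can represent the character without touching any file in the standard library:

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE


def can_encode(text: str, encoding: str) -> bool:
    """Return True if `text` can be encoded with `encoding` without errors."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


sample = "used" + ZWSP

print(can_encode(sample, "cp1252"))  # False: cp1252 has no slot for U+200B
print(can_encode(sample, "utf-8"))   # True: UTF-8 covers all of Unicode

# errors="ignore" silently drops characters the codec cannot represent.
print(sample.encode("cp1252", errors="ignore"))  # b'used'
```

If this assumption holds, choosing UTF-8 for the output file or removing the character before writing may be enough, and the codec module can stay as shipped.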


天命终不由人

2021-02-08 15:44
I tested this with Python 2.7. `replace` works as expected:
```
>>> u'used\u200b'.replace(u'\u200b', '*')
u'used*'
```
and so does `strip`:
```
>>> u'used\u200b'.strip(u'\u200b')
u'used'
```
Just remember that in Python 2 the arguments to those functions have to be Unicode literals: u'\u200b', not '\u200b'. Note the leading u.
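Since the title asks about regular expressions, here is a minimal Python 3 sketch of the same cleanup with `re.sub`. In Python 3 every `str` is already Unicode, so no `u` prefix is needed. The function name `strip_zero_width` and the extra characters in the pattern beyond U+200B are illustrative assumptions, not something taken from the question:

```python
import re

# Zero-width characters to remove. Only U+200B comes from the question;
# U+200C, U+200D and U+FEFF are added here as an assumption.
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\ufeff]")


def strip_zero_width(text: str) -> str:
    """Remove zero-width characters anywhere in `text`."""
    return ZERO_WIDTH.sub("", text)


words = ["used\u200b", "zero\u200bwidth", "plain"]
cleaned = [strip_zero_width(w) for w in words]
print(cleaned)  # ['used', 'zerowidth', 'plain']

# str.strip only trims the ends, so an interior U+200B survives it.
print("zero\u200bwidth\u200b".strip("\u200b") == "zero\u200bwidth")  # True
```

For a single character, `str.replace` does the same job; the regex mostly pays off when several invisible characters have to go at once.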

And actually, writing that character to a file works just fine.
```
>>> import codecs
>>> f = codecs.open('a.txt', encoding='utf-8', mode='w')
>>> f.write(u'used\u200bZero')
>>> f.close()
```
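A rough Python 3 equivalent of the session above, assuming the built-in `open` with an explicit `encoding` is acceptable in place of `codecs.open`. The temporary directory is only there to keep the sketch self-contained; the file name `a.txt` is taken from the session:

```python
import os
import tempfile

text = "used\u200bZero"

# Write into a throwaway directory so the example leaves no stray files
# in the working directory.
path = os.path.join(tempfile.mkdtemp(), "a.txt")

# An explicit encoding avoids falling back to the platform default,
# which may be cp1252 on Windows.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, encoding="utf-8") as f:
    round_tripped = f.read()

print(round_tripped == text)  # True
```

The `with` block closes the file automatically, so the data is flushed to disk before it is read back.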
See also:
- The Python 2 Unicode HOWTO
- The Python 3 Unicode HOWTO
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)