Removing escaped entities from a String in Python [duplicate]

泄露秘密 submitted on 2019-12-23 02:49:56

Question


I have a huge CSV file of tweets. I read the tweets in and stored them in two separate dictionaries, one for negative tweets and one for positive. I wanted to parse the text into words while removing any punctuation marks. I used this code:

import string

tweets = []
for (text, sentiment) in pos_tweets.items() + neg_tweets.items():
    # Lowercase each word, then delete punctuation; skip short words and links.
    shortenedText = [e.lower().translate(string.maketrans("",""), string.punctuation) for e in text.split() if len(e) >= 3 and not e.startswith('http')]
    print shortenedText

It all worked well, barring one minor problem. The huge CSV file I downloaded has unfortunately changed some of the punctuation. I'm not sure what this is called, so I can't really google it, but effectively some sentences might begin:

"ampampFightin"
"&quot;The truth is out there"
"&altThis is the way I feel"

Is there a way to get rid of all of these? I notice the latter two begin with an ampersand. Will a simple search for that get rid of them? (The only reason I'm asking rather than just trying it is that there are too many tweets for me to check manually.)


Answer 1:


First, unescape HTML entities, then remove punctuation chars:

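import HTMLParser
import string

parser = HTMLParser.HTMLParser()

tweets = []
for (text, sentiment) in pos_tweets.items() + neg_tweets.items():
    # Turn entities such as &quot; back into the characters they stand for.
    text = parser.unescape(text)
    # unescape() may return unicode, and unicode.translate() does not accept a
    # deletechars argument, so filter punctuation character by character.
    shortenedText = [''.join(c for c in e.lower() if c not in string.punctuation) for e in text.split() if len(e) >= 3 and not e.startswith('http')]
    print shortenedText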
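
For readers on Python 3, the same two steps might be sketched as follows. This is only a minimal sketch under a few assumptions: `html.unescape` (available since Python 3.4) stands in for `HTMLParser.HTMLParser().unescape`, `str.maketrans` builds the punctuation-deletion table, and the `clean_tweet` helper and the sample dictionaries are hypothetical names used purely for illustration.

```python
import html
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text):
    """Unescape HTML entities, then lowercase and strip punctuation per word."""
    text = html.unescape(text)  # e.g. "&quot;" becomes '"'
    words = []
    for word in text.split():
        # Same filters as the snippet above: skip short words and links.
        if len(word) < 3 or word.startswith("http"):
            continue
        cleaned = word.lower().translate(PUNCT_TABLE)
        if cleaned:  # assumption: drop words that were nothing but punctuation
            words.append(cleaned)
    return words


# Hypothetical sample data standing in for pos_tweets and neg_tweets.
pos_tweets = {"&quot;The truth is out there": "positive"}
neg_tweets = {"Traffic &amp; rain again http://example.com": "negative"}

tweets = []
for text, sentiment in list(pos_tweets.items()) + list(neg_tweets.items()):
    tweets.append((clean_tweet(text), sentiment))

print(tweets)
```

Unescaping before stripping punctuation matters here: once the `&` and `;` are removed, an entity such as `&quot;` is left behind as the plain word `quot` and can no longer be recognised.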

Here's an example of how unescape works:

>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape("&quot;The truth is out there")
u'"The truth is out there'

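UPD: to solve the UnicodeDecodeError problem, use text.decode('utf8') before calling unescape. Python 2 otherwise tries to decode the byte string implicitly as ASCII when unescape mixes it with unicode characters, which fails on tweets containing non-ASCII bytes.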
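
The decode-then-unescape order can be illustrated with a short Python 3 sketch. It assumes the tweets arrive as UTF-8 encoded bytes; `raw_tweet` is a hypothetical sample, and `html.unescape` is used as the Python 3 counterpart of `HTMLParser.HTMLParser().unescape`.

```python
import html

# Hypothetical tweet as raw UTF-8 bytes: escaped quotes plus a non-ASCII letter.
raw_tweet = b"&quot;caf\xc3\xa9 time&quot;"

# Decode the bytes to text first, then unescape the entities.
decoded = raw_tweet.decode("utf8")
unescaped = html.unescape(decoded)

print(unescaped)
```

In Python 3 the decode step is explicit because `html.unescape` only accepts text, so the implicit ASCII decoding that caused the original error cannot occur.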



Source: https://stackoverflow.com/questions/18146557/removing-escaped-entities-from-a-string-in-python
