Question
I am trying to take data that contains Unicode escape sequences and print it with the actual emojis. It is currently in a .txt file, but I will write it to an Excel file later. I am getting an error that I am not sure what to do with. This is the text I am reading:
"Thanks @UglyGod \ud83d\ude4f https:\\/\\/t.co\\/8zVVNtv1o6\"
"RT @Rosssen: Multiculti beatdown \ud83d\ude4f https:\\/\\/t.co\\/fhwVkjhFFC\"
And here is my code:
sampleFile= open('tweets.txt', 'r').read()
splitFile=sampleFile.split('\n')
for line in sampleFile:
    x=line.encode('utf-8')
    print(x.decode('unicode-escape'))
This is the error message:
UnicodeDecodeError: 'unicodeescape' codec can't decode byte 0x5c in position 0: \ at end of string
Any ideas? This is how the data was originally generated.
class listener(StreamListener):
    def on_data(self, data):
        # Check for a field unique to tweets (if missing, return immediately)
        if "in_reply_to_status_id" not in data:
            return
        with open("see_no_evil_monkey.csv", 'a') as saveFile:
            try:
                saveFile.write(json.dumps(data) + "\n")
            except (BaseException, e):
                print ("failed on data", str(e))
                time.sleep(5)
        return True

    def on_error(self, status):
        print (status)
Answer 1:
Your emoji 🙏 (U+1F64F) is represented in the file as a UTF-16 surrogate pair, \ud83d\ude4f. The unicode-escape codec decodes each escape into a separate lone surrogate rather than into one character, so you'll need to look at exactly how your tweets.txt file was generated, and try encoding the original tweets, along with the emoji, as UTF-8. This will make reading and processing the text file much easier.
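To make the surrogate-pair point concrete, here is a minimal sketch using only the two escapes from the question. The UTF-16 round trip with the 'surrogatepass' error handler is a workaround added for illustration, not something the original answer proposed.

```python
# The two escapes from the question, as raw bytes (backslashes are literal).
raw = b'\\ud83d\\ude4f'

# unicode-escape turns each \uXXXX into its own code point,
# leaving two lone surrogates instead of one emoji.
halves = raw.decode('unicode-escape')
print(len(halves))

# Round-tripping through UTF-16 with 'surrogatepass' joins the pair
# into the single code point U+1F64F.
emoji = halves.encode('utf-16', 'surrogatepass').decode('utf-16')
print(hex(ord(emoji)))
```

Printing the lone surrogates directly would normally fail, because they cannot be encoded as UTF-8.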
Answer 2:
This is how the data was originally generated...
saveFile.write(json.dumps(data) + "\n")
You should use json.loads() instead of .decode('unicode-escape') to read JSON text:
#!/usr/bin/env python3
import json
with open('tweets.txt', encoding='ascii') as file:
    for line in file:
        text = json.loads(line)
        print(text)
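As a rough illustration of why json.loads() works here, the sketch below rebuilds one line the way the listener wrote it and reads it back. It assumes that data is a str holding the tweet's JSON text (the substring check in on_data suggests this, but the question does not confirm it), and the payload is a made-up stand-in, not real stream output. Under that assumption each line is JSON wrapped in JSON, so a second json.loads() is needed to reach the tweet fields.

```python
import json

# Hypothetical payload standing in for the raw string passed to on_data.
data = '{"text": "Thanks @UglyGod \\ud83d\\ude4f"}'

# What the writer stored: the JSON text wrapped in one more layer of JSON.
line = json.dumps(data) + "\n"

# The first loads() undoes json.dumps() and returns the original string.
inner = json.loads(line)

# The second loads() parses the tweet itself; the JSON decoder combines
# the surrogate escapes into a single emoji character.
tweet = json.loads(inner)
print(tweet["text"])
```

If data was already a dict when it was written, the first json.loads() alone returns the tweet and the second call is unnecessary.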
Source: https://stackoverflow.com/questions/38106422/converting-to-emoji