apostrophe turning into \x92

Posted by 痞子三分冷 on 2019-12-12 08:23:14

Question


mycorpus.txt

Human where's machine interface for lab abc computer applications   
A where's survey of user opinion of computer system response time

stopwords.txt

let's
ain't
there's

The following code

corpus = set()
for line in open("path\\to\\mycorpus.txt"):
    corpus.update(set(line.lower().split()))
print corpus

stoplist = set()
for line in open("C:\\Users\\Pankaj\\Desktop\\BTP\\stopwords_new.txt"):
    stoplist.add(line.lower().strip())
print stoplist

gives the following output

set(['a', "where's", 'abc', 'for', 'of', 'system', 'lab', 'machine', 'applications', 'computer', 'survey', 'user', 'human', 'time', 'interface', 'opinion', 'response'])
set(['let\x92s', 'ain\x92t', 'there\x92s'])

Why is the apostrophe turning into \x92 in the second set?


Answer 1:


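Byte 92 (hex) in the Windows-1252 encoding corresponds to Unicode code point 2019 (hex), which is 'RIGHT SINGLE QUOTATION MARK'. It looks very much like an apostrophe and is likely the actual character you have in stopwords.txt. Judging from the way Python has interpreted it, I'd guess the file has been encoded in Windows-1252 or in another encoding that shares its ASCII range and this code point value. Python 2 reads the file as a byte string, and the repr of a byte string shows any non-ASCII byte as a \xNN escape, which is why the set prints \x92.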

' vs ’
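
To check this yourself, here is a minimal sketch. It assumes Python 3 (the question's code is Python 2) and assumes the stopword file really is Windows-1252; `load_stopwords` is a hypothetical helper, not something from the question. It decodes the bytes shown in the output, looks up the resulting character, and optionally replaces the curly quote with a plain apostrophe so the stopwords can match words like "where's" in the corpus.

```python
import unicodedata

# Bytes exactly as they appear in the question's second set
raw = b"let\x92s"

# Interpret the bytes as Windows-1252 (an assumption about the file's encoding)
text = raw.decode("cp1252")
quote = text[3]

print(hex(ord(quote)))           # code point of the decoded character
print(unicodedata.name(quote))   # official Unicode character name

# Replace the curly quote with a plain ASCII apostrophe
normalized = text.replace("\u2019", "'")
print(normalized)


def load_stopwords(path, encoding="cp1252"):
    """Hypothetical helper: read one stopword per line with an explicit
    encoding and normalize U+2019 to an ASCII apostrophe."""
    with open(path, encoding=encoding) as f:
        return {
            line.strip().lower().replace("\u2019", "'")
            for line in f
            if line.strip()
        }
```

If the file turns out to be in a different encoding, pass that name as `encoding` instead. Alternatively, re-save stopwords.txt with plain ASCII apostrophes so that no normalization is needed.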



Source: https://stackoverflow.com/questions/15564063/apostrophe-turning-into-x92
