Reading lines of a text file raises a charmap decode error

Posted by 笑着哭i on 2020-02-02 11:43:06

Question


I'm using Python 3.3 and a sqlite3 database. I have a large text file, around 270 MB, which I can open with WordPad on Windows 7.

Each line in that file looks as follows:

term \t number\n

I want to read every line and save the values in a database. My code looks as follows:

f = open('sorted.de.word.unigrams', "r")
for line in f:

    #code

I was able to read data into my database, but only up to a certain line; I would guess about half of all lines. Then I get the following error:

  File "C:\projects\databtest.py", line 18, in <module>
    for line in f:
  File "c:\python33\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 140: character maps to <undefined>

I tried opening the file with encoding='utf-8', but that did not work, and neither did other codecs. Then I tried to make a copy with WordPad by saving it as a UTF-8 text file, but WordPad crashed.

Where is the problem here? It looks like there is some character in that line that Python can't handle. What can I do to read my file completely? Or is it possible to ignore such errors and just go on with the next line?

You can download the packed file here:

http://wacky.sslmit.unibo.it/lib/exe/fetch.php?media=frequency_lists:sorted.de.word.unigrams.7z

Thanks a lot!


Answer 1:


I checked the file, and the root of the problem seems to be that it contains words in at least two encodings: probably cp1252 and cp850. The byte 0x81 is ü in cp850 but undefined in cp1252. You can handle that situation by catching the exception, but some other German characters stored in cp850 map to valid but wrong characters in cp1252, so those lines are decoded incorrectly without raising any error. If you are happy with such an imperfect solution, here's how you could do it:

with open('sorted.de.word.unigrams','rb') as f: #open in binary mode
    for line in f:
        for cp in ('cp1252', 'cp850'):
            try:
                s = line.decode(cp)
            except UnicodeDecodeError:
                pass
            else:
                store_to_db(s)
                break
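
The same fallback idea can be packaged as a small helper that is easy to try out on individual byte strings. This is only a sketch: the name decode_with_fallback and its return value (the decoded text plus the encoding that worked) are illustrative choices, not part of the original answer, and the encoding order is the one guessed above.

```python
def decode_with_fallback(raw, encodings=("cp1252", "cp850")):
    """Decode raw bytes with the first encoding that accepts them.

    Returns a (text, encoding) tuple. The helper name and the default
    encoding list are assumptions made for this illustration.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    # Only reached if every listed encoding rejected the bytes:
    # decode with the last one and mark the bad bytes with U+FFFD.
    return raw.decode(encodings[-1], errors="replace"), encodings[-1]


# Byte 0x81 is undefined in cp1252, so this line falls through to cp850.
text, used = decode_with_fallback(b"f\x81r\t12\n")
```

Returning the encoding alongside the text makes it possible to count how many lines needed the fallback, which gives a rough idea of how mixed the file really is.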



Answer 2:


Try

import codecs

data = []
with codecs.open('sorted.de.word.unigrams', 'r') as f:
    for line in f:
        data.append(line)

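If you want to ignore the error, you can do

try:
    # The read loop and the code that enters data into the database go here.
    # The error is raised while a line is being read, so the loop itself
    # must be inside the try block, and reading stops at the first bad line.
    pass
except UnicodeDecodeError:
    pass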
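
Another way to keep going past bad bytes, instead of catching the exception, is the errors parameter of the built-in open(). The sketch below assumes cp1252 as the base encoding (the codec named in the traceback); the function name read_lines_lenient is made up for this example.

```python
def read_lines_lenient(path, encoding="cp1252", errors="replace"):
    """Yield decoded lines without raising on undecodable bytes.

    With errors="replace" each bad byte becomes U+FFFD; with
    errors="ignore" it is dropped silently. Both lose information,
    so this only suits data where a few damaged terms are acceptable.
    """
    with open(path, "r", encoding=encoding, errors=errors) as f:
        for line in f:
            yield line
```

A call such as `for line in read_lines_lenient('sorted.de.word.unigrams'):` would then iterate over the whole file, with the affected terms containing replacement characters rather than the original umlauts.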



Answer 3:


This usually happens when there is an encoding mismatch.

0x81 is not assigned to any character in cp1252, the codec shown in the traceback. Try specifying the encoding explicitly:

file = open(filename, encoding="utf8")
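
Since the question says that utf-8 also failed, it may help to check which lines a candidate encoding rejects before choosing one. The following is a rough diagnostic sketch; the function name find_undecodable_lines and the limit parameter are assumptions for illustration.

```python
def find_undecodable_lines(path, encoding="cp1252", limit=10):
    """Return up to `limit` tuples (line number, byte offset, bad bytes)
    for lines that cannot be decoded with the given encoding."""
    problems = []
    with open(path, "rb") as f:  # binary mode: no decoding while reading
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as exc:
                # exc.start and exc.end delimit the offending bytes in raw
                problems.append((lineno, exc.start, raw[exc.start:exc.end]))
                if len(problems) >= limit:
                    break
    return problems
```

Running it once per candidate, for example with "utf-8", "cp1252" and "cp850", shows how many lines each encoding rejects and which bytes are involved.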


Source: https://stackoverflow.com/questions/18648154/read-lines-of-a-textfile-and-getting-charmap-decode-error
