Reading lines of a text file and getting a charmap decode error

Question

I'm using Python 3.3 and a sqlite3 database. I have a large text file, around 270 MB, which I can open with WordPad in Windows 7.

Each line in that file looks as follows:

term \t number\n

I want to read every line and save the values in a database. My code looks as follows:

f = open('sorted.de.word.unigrams', "r")
for line in f:

    #code

I was able to load the data into my database only up to a certain line, roughly half of the file I would guess. Then I get the following error:

  File "C:\projects\databtest.py", line 18, in <module>
    for line in f:
  File "c:\python33\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 140: character maps to <undefined>

I tried opening the file with encoding='utf-8', but that did not work, and neither did other codecs. Then I tried to make a copy with WordPad by saving it as a UTF-8 text file, but WordPad crashed.

Where is the problem here? It looks like there is some character in that line that Python can't handle. What can I do to read my file completely? Or is it possible to ignore such errors and just go on with the next line?

You can download the packed file here:

http://wacky.sslmit.unibo.it/lib/exe/fetch.php?media=frequency_lists:sorted.de.word.unigrams.7z

Thanks a lot!

Answer 1:

I checked the file, and the root of the problem seems to be that it contains words in at least two encodings, probably cp1252 and cp850. The byte 0x81 is ü in cp850 but undefined in cp1252. You can handle that by catching the exception, but some other German characters map to valid but wrong characters in cp1252 (for example, 0x84 is ä in cp850 but „ in cp1252), so those lines decode without an error and are stored incorrectly. If you are happy with such an imperfect solution, here's how you could do it:

with open('sorted.de.word.unigrams','rb') as f: #open in binary mode
    for line in f:
        for cp in ('cp1252', 'cp850'):
            try:
                s = line.decode(cp)
            except UnicodeDecodeError:
                pass
            else:
                store_to_db(s)
                break
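The same fallback idea can be tried on a couple of hand-made byte strings before running it on the full file. This is only a sketch: the helper name decode_line and the sample lines are made up for illustration and are not taken from the original file.

```python
def decode_line(raw, encodings=("cp1252", "cp850")):
    """Return raw decoded with the first encoding that accepts it."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Only reached if every listed codec rejects the bytes; with the default
    # list this does not happen, because cp850 assigns a character to every byte.
    return raw.decode(encodings[-1], errors="replace")


# 0x81 is undefined in cp1252, so this line falls through to cp850 (ü).
print(ascii(decode_line(b"f\x81r\t12\n")))
# 0x84 is valid in both codecs: cp1252 wins and yields a low quotation
# mark (U+201E) where cp850 would have produced ä.
print(ascii(decode_line(b"B\x84r\t7\n")))
```

The second call shows the limitation mentioned above: nothing is raised, yet the stored term is wrong.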

Answer 2:

Try

import codecs

data = []
# No encoding is passed here, so the platform default codec is still used.
with codecs.open('sorted.de.word.unigrams', 'r') as f:
    for line in f:
        data.append(line)

If you want to ignore the error, you can do

try:
    pass  # your code that enters data into the database
except UnicodeDecodeError:
    pass
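Another way to keep going past undecodable bytes is the errors argument of the built-in open(), which avoids the exception inside the for loop altogether. The sketch below assumes cp1252 as the base encoding and builds a tiny sample file; the file name and its contents are invented for the demonstration.

```python
import os
import tempfile

# Write a tiny sample file containing one byte (0x81) that cp1252 cannot decode.
path = os.path.join(tempfile.mkdtemp(), "sample.unigrams")
with open(path, "wb") as f:
    f.write(b"gut\t10\nf\x81r\t12\n")

rows = []
# errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
with open(path, encoding="cp1252", errors="replace") as f:
    for line in f:
        term, number = line.rstrip("\n").split("\t")
        rows.append((term, int(number)))

print(ascii(rows))
```

With errors="ignore" the offending byte would be dropped silently instead. Either way the affected terms are damaged, so this trades correctness for being able to read the whole file.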

Answer 3:

This usually happens when there is an encoding mismatch.

The byte 0x81 is not mapped to any character in cp1252, the default codec shown in the traceback. Try specifying the encoding explicitly:

file = open(filename, encoding="utf8")
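To see which codecs accept the byte from the traceback, it can be decoded in isolation. This is a small sketch; the helper try_decode and the list of candidate encodings are illustrative choices, not part of the original answers.

```python
def try_decode(raw, encoding):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# Probe the single byte 0x81 with several candidate codecs.
for enc in ("cp1252", "utf-8", "cp850", "latin-1"):
    print(enc, ascii(try_decode(b"\x81", enc)))
```

cp1252 and utf-8 both reject the lone byte, which matches the question's report that utf-8 did not help. latin-1 accepts every byte but maps 0x81 to a control character, so a successful decode does not by itself prove the encoding is the right one.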

Source: https://stackoverflow.com/questions/18648154/read-lines-of-a-textfile-and-getting-charmap-decode-error

Tags

python

file

python-3.x

text-files

decode