'utf-16-le' codec can't decode bytes while reading Excel files in Python

北海茫月 2021-01-27 18:17

I am trying to read a number of xls files in different languages (Arabic, Greek, Italian, Hebrew, etc.), and I get the error in the title when I call open_workbook.

1 answer
  •  小蘑菇 (original poster)
    2021-01-27 19:20

    It's unlikely that language is the issue. More likely, xlrd is having trouble detecting the encoding of the .xls file.

    As xlrd notes in the documentation on handling of unicode:

    This package presents all text strings as Python unicode objects. From Excel 97 onwards, text in Excel spreadsheets has been stored as UTF-16LE (a 16-bit Unicode Transformation Format). Older files (Excel 95 and earlier) don’t keep strings in Unicode; a CODEPAGE record provides a codepage number (for example, 1252) which is used by xlrd to derive the encoding (for same example: “cp1252”) which is used to translate to Unicode.

    My first step would be to determine the actual encoding. How old is the file, and how was it created (in Excel itself, or via a third-party tool)?

    You could look for the CODEPAGE record by opening the file in a hex editor, then try to force the corresponding encoding.
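If a hex editor is not handy, the same search can be sketched in Python. This is a rough heuristic, not a real parser: it assumes the file is a BIFF-based .xls in which the CODEPAGE record appears as record id 0x0042 with a 2-byte payload, and it simply scans the raw bytes for that pattern, so it may report false positives. The function name is made up for illustration.

```python
import struct

# BIFF CODEPAGE record header: record id 0x0042, payload length 2 (both little-endian).
CODEPAGE_HEADER = struct.pack("<HH", 0x0042, 2)


def find_codepage_candidates(data):
    """Return every 16-bit value that directly follows a CODEPAGE-like header.

    Heuristic only: the byte pattern can also occur by chance in other data.
    """
    candidates = []
    pos = data.find(CODEPAGE_HEADER)
    while pos != -1:
        value_at = pos + len(CODEPAGE_HEADER)
        if value_at + 2 <= len(data):
            (codepage,) = struct.unpack_from("<H", data, value_at)
            candidates.append(codepage)
        pos = data.find(CODEPAGE_HEADER, pos + 1)
    return candidates


# Example (the path is hypothetical):
#     with open("book.xls", "rb") as fh:
#         print(find_codepage_candidates(fh.read()))
```

A value such as 1252 would suggest trying "cp1252"; 1200 is the codepage number for UTF-16LE.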

    Based on the error, the text doesn't appear to be UTF-16LE (xlrd's default assumption), so you will have to determine the encoding somehow, or else start trying likely candidates one by one, e.g.:

    book = xlrd.open_workbook(..., encoding_override="cp1252")
    book = xlrd.open_workbook(..., encoding_override="utf-8")
    book = xlrd.open_workbook(..., encoding_override="latin-1")
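The trial-and-error approach can be wrapped in a small loop. This is only a sketch: the helper name and its `opener` parameter are invented here so the loop can be exercised without xlrd installed. By default it calls xlrd.open_workbook with the encoding_override argument shown above.

```python
def open_with_first_working_encoding(path, encodings, opener=None):
    """Try each candidate encoding in order.

    Returns (book, encoding) for the first candidate that opens without a
    decode error; raises ValueError if none of them do.
    """
    if opener is None:
        import xlrd  # imported lazily so the helper can be tested without xlrd
        opener = xlrd.open_workbook

    failures = {}
    for encoding in encodings:
        try:
            return opener(path, encoding_override=encoding), encoding
        except (UnicodeDecodeError, LookupError) as exc:
            # LookupError covers encoding names that Python does not recognise.
            failures[encoding] = exc
    raise ValueError(f"no candidate encoding worked for {path!r}: {failures}")


# Example (the path is hypothetical):
#     book, used = open_with_first_working_encoding(
#         "book.xls", ["cp1252", "utf-8", "latin-1"])
```

Keep in mind that a clean decode does not prove the encoding is correct. latin-1 in particular accepts any byte sequence, so check a few cell values by eye before trusting the result.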
    
