“UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80” while loading pickle file using pydrive on google colaboratory

I am new to using google colaboratory (colab) and pydrive along with it. I am trying to load data in 'CAS_num_strings' which was written in a pickle file in a specific directory on my google drive using colab as:

pickle.dump(CAS_num_strings,open('CAS_num_strings.p', 'wb'))
dump_meta = {'title': 'CAS.pkl', 'parents': [{'id':'1UEqIADV_tHic1Le0zlT25iYB7T6dBpBj'}]} 
pkl_dump = drive.CreateFile(dump_meta)
pkl_dump.SetContentFile('CAS_num_strings.p')
pkl_dump.Upload()
print(pkl_dump.get('id'))

Where 'id':'1UEqIADV_tHic1Le0zlT25iYB7T6dBpBj' makes sure that it has a specific parent folder with this given by this id. The last print command gives me the output:

'1ZgZfEaKgqGnuBD40CY8zg0MCiqKmi1vH'

Hence, I am able to create and dump the pickle file whose id is '1ZgZfEaKgqGnuBD40CY8zg0MCiqKmi1vH'. Now, I want to load this pickle file in another colab script for a different purpose. In order to load, I use the command set:

cas_strings = drive.CreateFile({'id':'1ZgZfEaKgqGnuBD40CY8zg0MCiqKmi1vH'})
print('title: %s, mimeType: %s' % (cas_strings['title'], cas_strings['mimeType']))
print('Downloaded content "{}"'.format(cas_strings.GetContentString()))

This gives me the output:

title: CAS.pkl, mimeType: text/x-pascal

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-9-a80d9de0fecf> in <module>()
     30 cas_strings = drive.CreateFile({'id':'1ZgZfEaKgqGnuBD40CY8zg0MCiqKmi1vH'})
     31 print('title: %s, mimeType: %s' % (cas_strings['title'], cas_strings['mimeType']))
---> 32 print('Downloaded content "{}"'.format(cas_strings.GetContentString()))
     33 
     34 

/usr/local/lib/python3.6/dist-packages/pydrive/files.py in GetContentString(self, mimetype, encoding, remove_bom)
    192                     self.has_bom == remove_bom:
    193       self.FetchContent(mimetype, remove_bom)
--> 194     return self.content.getvalue().decode(encoding)
    195 
    196   def GetContentFile(self, filename, mimetype=None, remove_bom=False):

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

As you can see, it finds the file CAS.pkl but cannot decode the data. I want to be able to resolve this error. I understand that the normal utf-8 encoding/decoding works smoothly during normal pickle dumping and loading with the 'wb' and 'rb' options. However in the present case, after dumping I can't seem to load it from the pickle file in google drive created in the previous step. The error exists somewhere in me not being able to specify how to decode the data at "return self.content.getvalue().decode(encoding)". I can't seem to find from here (https://developers.google.com/drive/v2/reference/files#resource-representations) which keywords/metadata tags to modify. Any help is appreciated. Thanks

The problem is that GetContentString only works if the contents are a valid UTF-8 string (docs), and your pickle is not.

Unfortunately, you'll have to do a little extra work, since there's no GetContentBytes -- you have to save the contents to a file and read them back out. Here's a working example: https://colab.research.google.com/drive/1gmh21OrJL0Dv49z28soYq_YcqKEnaQ1X

Actually, I found an elegant answer with a little help from my friends. Instead of GetContentStrings, I use GetContentFile, which is the counterpart of the SetContentFile. This loads the file in the current workspace from which I can read it like any pickle file. Finally, the data gets loaded into cas_nums all well.

cas_strings = drive.CreateFile({'id':'1ZgZfEaKgqGnuBD40CY8zg0MCiqKmi1vH'})
print('title: %s, mimeType: %s' % (cas_strings['title'], cas_strings['mimeType']))
cas_strings.GetContentFile(cas_strings['title'])
cas_nums = pickle.load(open(cas_strings['title'],'rb'))

More details about this can be found in the pydrive documentation in the section download file content - http://pythonhosted.org/PyDrive/filemanagement.html#download-file-content

来源：https://stackoverflow.com/questions/49145328/unicodedecodeerror-utf-8-codec-cant-decode-byte-0x80-while-loading-pickle

标签

python

utf-8

pickle

google-colaboratory

pydrive