How to fix: “UnicodeDecodeError: 'ascii' codec can't decode byte”

谎友^ 2020-11-22 01:21
as3:~/ngokevin-site# nano content/blog/20140114_test-chinese.mkd
as3:~/ngokevin-site# wok
Traceback (most recent call last):
File \"/usr/local/bin/wok\", line 4, in
         


        
19 Answers
  • 2020-11-22 02:03

    I got the same error, and this solved it. Python 2 and Python 3 differ in how they handle Unicode, which makes pickle files written by one version hard to load in the other. The fix is pickle's encoding argument. The link below helped me solve a similar problem when I was trying to open, from Python 3.7, pickled data that had originally been saved under Python 2.x: https://blog.modest-destiny.com/posts/python-2-and-3-compatible-pickle-save-and-load/ I copied the load_pickle function into my script and called load_pickle(pickle_file) while loading my input_data like this:

    input_data = load_pickle("my_dataset.pkl")
    

    The load_pickle function is here:

    import pickle

    def load_pickle(pickle_file):
        try:
            with open(pickle_file, 'rb') as f:
                pickle_data = pickle.load(f)
        except UnicodeDecodeError:
            # The file was pickled under Python 2: retry with latin1, which
            # can decode any byte value and so round-trips Python 2 str data.
            with open(pickle_file, 'rb') as f:
                pickle_data = pickle.load(f, encoding='latin1')
        except Exception as e:
            print('Unable to load data', pickle_file, ':', e)
            raise
        return pickle_data
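
    As an aside, pickle.load also accepts encoding='bytes', which leaves Python 2 str objects as bytes instead of decoding them; per the Python docs, encoding='latin1' is the one required for NumPy arrays and datetime objects pickled under Python 2. A minimal sketch:

    import pickle

    with open('my_dataset.pkl', 'rb') as f:
        # encoding='bytes' defers decoding: Python 2 str objects come back
        # as bytes, so keys/values may need an explicit .decode() afterwards.
        data = pickle.load(f, encoding='bytes')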
    
  • 2020-11-22 02:03

    This worked for me: open the file in binary mode ('rb'), so Python reads raw bytes and never tries to decode them with the default ASCII codec:

        file = open('docs/my_messy_doc.pdf', 'rb')
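
    Building on that, a minimal sketch using the same path: read the bytes first, then decode explicitly with a codec you choose, rather than letting an implicit ASCII decode happen later. Only decode if the bytes really are text.

        # 'rb' yields bytes, so no implicit ASCII decoding occurs on read.
        with open('docs/my_messy_doc.pdf', 'rb') as f:
            raw = f.read()
        # Decode explicitly, dropping bytes that are not valid UTF-8.
        text = raw.decode('utf-8', errors='ignore')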
    
  • 2020-11-22 02:04

    This error occurs when a string contains non-ASCII characters and we perform operations on it without decoding it properly. This is what solved my problem. I am reading a CSV file with the columns ID and Text, and decoding the characters in it as below:

    import pandas as pd

    train_df = pd.read_csv("Example.csv")
    train_data = train_df.values
    for i in train_data:
        print("ID :" + i[0])
        # Decode the raw bytes explicitly, ignoring undecodable characters.
        text = i[1].decode("utf-8", errors="ignore").strip().lower()
        print("Text: " + text)
    
  • 2020-11-22 02:08

    To resolve this at the operating-system level on an Ubuntu installation, check the following:

    $ locale charmap
    

    If you get

    locale: Cannot set LC_CTYPE to default locale: No such file or directory
    

    instead of

    UTF-8
    

    then set LC_CTYPE and LC_ALL like this:

    $ export LC_ALL="en_US.UTF-8"
    $ export LC_CTYPE="en_US.UTF-8"
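
    Note that these exports only last for the current shell session. On a standard Ubuntu setup you can generate the locale if it is missing and persist the settings (a sketch; adjust the locale name as needed):

    $ sudo locale-gen en_US.UTF-8
    $ echo 'export LC_ALL="en_US.UTF-8"' >> ~/.bashrc
    $ echo 'export LC_CTYPE="en_US.UTF-8"' >> ~/.bashrc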
    
  • 2020-11-22 02:08

    Here is my solution: just add the encoding when opening the file, with open(file, encoding='utf8') as f.

    And because reading the GloVe file takes a long time, I recommend converting the GloVe file to a NumPy file. The next time you read the embedding weights, it will save you time.

    import numpy as np
    from tqdm import tqdm
    
    
    def load_glove(file):
        """Loads GloVe vectors in numpy array.
        Args:
            file (str): a path to a glove file.
        Return:
            dict: a dict of numpy arrays.
        """
        embeddings_index = {}
        with open(file, encoding='utf8') as f:
            for i, line in tqdm(enumerate(f)):
                values = line.split()
                word = ''.join(values[:-300])
                coefs = np.asarray(values[-300:], dtype='float32')
                embeddings_index[word] = coefs
    
        return embeddings_index
    
    # EMBEDDING_PATH = '../embedding_weights/glove.840B.300d.txt'
    EMBEDDING_PATH = 'glove.840B.300d.txt'
    embeddings = load_glove(EMBEDDING_PATH)
    
    np.save('glove_embeddings.npy', embeddings) 
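
    To load the saved embeddings back (np.save pickles the dict, so np.load needs allow_pickle=True and .item() to recover it):

    import numpy as np

    embeddings = np.load('glove_embeddings.npy', allow_pickle=True).item()
    print(embeddings['the'].shape)  # (300,) if 'the' is in the vocabulary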
    

    Gist link: https://gist.github.com/BrambleXu/634a844cdd3cd04bb2e3ba3c83aef227

  • 2020-11-22 02:09

    In some cases, when you check your default encoding (print sys.getdefaultencoding()), it returns ASCII. Switching to UTF-8 doesn't always work, depending on the content of your variable. I found another way (note: this trick exists only in Python 2; the builtin reload and sys.setdefaultencoding are gone in Python 3):

    import sys
    # Python 2 only: reload(sys) restores setdefaultencoding, which site.py
    # deletes at startup, then the process-wide default codec is changed.
    reload(sys)
    sys.setdefaultencoding('Cp1252')
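
    In Python 3 there is no setdefaultencoding; a sketch of the equivalent fix (the file name here is hypothetical) is to name the codec explicitly at the I/O boundary:

    # Python 3: pass the encoding where the text enters the program.
    with open('legacy_export.txt', encoding='cp1252', errors='replace') as f:
        text = f.read()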
    