How to fix: “UnicodeDecodeError: 'ascii' codec can't decode byte”

前端未结

as3:~/ngokevin-site# nano content/blog/20140114_test-chinese.mkd
as3:~/ngokevin-site# wok
Traceback (most recent call last):
File "/usr/local/bin/wok", line 4, in

19 answers

迷失自我 (2020-11-22 02:03)

I got the same error, and this solved it. Thanks!
Python 2 and Python 3 handle Unicode differently, so files pickled under Python 2 often fail to load under Python 3. Use the encoding argument of pickle.load. The link below helped me with a similar problem: I was trying to open pickled data in Python 3.7, but the file had originally been saved with Python 2.x.
https://blog.modest-destiny.com/posts/python-2-and-3-compatible-pickle-save-and-load/
I copied the load_pickle function into my script and called load_pickle(pickle_file) when loading my input data, like this:

input_data = load_pickle("my_dataset.pkl")


The load_pickle function is here:

import pickle

def load_pickle(pickle_file):
    try:
        with open(pickle_file, 'rb') as f:
            pickle_data = pickle.load(f)
    except UnicodeDecodeError:
        # Python 2 pickle with non-ASCII str data: retry, decoding as latin1
        with open(pickle_file, 'rb') as f:
            pickle_data = pickle.load(f, encoding='latin1')
    except Exception as e:
        print('Unable to load data ', pickle_file, ':', e)
        raise
    return pickle_data
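
To see what the encoding argument changes, here is a small sketch. The byte string is a hand-written protocol 2 pickle of the kind Python 2 produces for a str containing a non-ASCII byte; it is an illustrative assumption, not data from the original post, and loads_py2 is a hypothetical helper name.

```python
import pickle

# Hand-written protocol 2 pickle of a Python 2 str holding the bytes
# 'caf\xe9' (an illustrative stand-in for a real Python 2 pickle file).
PY2_PICKLE = b'\x80\x02U\x04caf\xe9q\x00.'


def loads_py2(data):
    """Unpickle bytes, retrying with latin1 if ASCII decoding fails."""
    try:
        return pickle.loads(data)
    except UnicodeDecodeError:
        # latin1 maps every byte to a code point, so decoding cannot fail.
        return pickle.loads(data, encoding='latin1')


value = loads_py2(PY2_PICKLE)
print(repr(value))
```

If the pickled strings are really binary data rather than text, encoding='bytes' keeps them as bytes objects instead of decoding them.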

渐次进展 (2020-11-22 02:03)

This worked for me: open the file in binary mode ('rb'), so Python reads raw bytes and does not try to decode them:

    file = open('docs/my_messy_doc.pdf', 'rb')
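
A binary-mode handle returns bytes, so nothing is decoded until you ask for it. If the file does contain text, a minimal sketch of decoding it explicitly could look like this; the helper name read_text_lenient and the sample file are hypothetical.

```python
import os
import tempfile


def read_text_lenient(path, encoding='utf-8'):
    """Read raw bytes, then decode explicitly instead of relying on defaults."""
    with open(path, 'rb') as f:
        raw = f.read()  # bytes: no implicit ASCII decoding happens here
    # errors='replace' turns undecodable bytes into U+FFFD instead of raising.
    return raw.decode(encoding, errors='replace')


# Hypothetical sample file: valid UTF-8 text followed by one stray byte.
with tempfile.NamedTemporaryFile(delete=False, suffix='.txt') as tmp:
    tmp.write('naïve'.encode('utf-8') + b'\xff')
    sample_path = tmp.name

text = read_text_lenient(sample_path)
os.remove(sample_path)
print(text)
```

A genuinely binary format such as PDF should stay as bytes and be handed to a parser for that format rather than decoded as text.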

青春惊慌失措 (2020-11-22 02:04)

This error occurs when a string contains non-ASCII characters and we operate on it without decoding it properly.
This helped me solve my problem.
I am reading a CSV file with the columns ID and Text, and decoding the text as below:

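import pandas as pd

train_df = pd.read_csv("Example.csv")
train_data = train_df.values
for i in train_data:
    print("ID :" + str(i[0]))  # str() so a numeric ID does not break the concatenation
    # Python 2: i[1] is a byte string; decode it as UTF-8 and drop undecodable bytes
    text = i[1].decode("utf-8", errors="ignore").strip().lower()
    print("Text: " + text)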
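
The .decode call above applies to Python 2 byte strings. On Python 3, str has no decode method, so a sketch that accepts both bytes and str values might look like this; to_text is a hypothetical helper, and the rows are made-up sample data rather than the contents of Example.csv.

```python
def to_text(value, encoding='utf-8'):
    """Return value as cleaned-up text, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        # Drop undecodable bytes, mirroring errors="ignore" in the answer.
        value = value.decode(encoding, errors='ignore')
    return str(value).strip().lower()


# Made-up rows standing in for train_df.values: (ID, Text) pairs.
rows = [
    (1, b'  Caf\xc3\xa9 au lait \xff'),  # UTF-8 bytes plus one invalid byte
    (2, '  Already TEXT  '),             # already a str
]

for row_id, raw_text in rows:
    print('ID: ' + str(row_id))
    print('Text: ' + to_text(raw_text))
```

Note that errors='ignore' silently discards data; errors='replace' is a safer choice when you want to notice bad bytes.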

执念已碎 (2020-11-22 02:08)

To resolve this at the operating-system level on an Ubuntu installation, check the locale's character map:

$ locale charmap


If you get

locale: Cannot set LC_CTYPE to default locale: No such file or directory


instead of

UTF-8


then set LC_CTYPE and LC_ALL like this:

$ export LC_ALL="en_US.UTF-8"
$ export LC_CTYPE="en_US.UTF-8"
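
To confirm from inside Python which encodings the interpreter picked up from the environment, a quick check like the following can help. It only reports values and assumes nothing about your system.

```python
import locale
import sys

# Encoding Python derives from the locale settings (LC_ALL / LC_CTYPE / LANG).
preferred = locale.getpreferredencoding(False)

# Encoding used for printing to the terminal; may be None if stdout is replaced.
stdout_encoding = getattr(sys.stdout, 'encoding', None)

print('preferred encoding: ', preferred)
print('stdout encoding:    ', stdout_encoding)
print('filesystem encoding:', sys.getfilesystemencoding())
```

If these still report an ASCII-based encoding after the export lines above, the shell that launches Python probably did not inherit the new variables.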

灰色年华 (2020-11-22 02:08)

Here is my solution: just add the encoding.
with open(file, encoding='utf8') as f:

Because reading the GloVe file takes a long time, I recommend saving it as a NumPy file. The next time you need the embedding weights, loading them will be much faster.

import numpy as np
from tqdm import tqdm


def load_glove(file):
    """Loads GloVe vectors in numpy array.
    Args:
        file (str): a path to a glove file.
    Return:
        dict: a dict of numpy arrays.
    """
    embeddings_index = {}
    with open(file, encoding='utf8') as f:
        for i, line in tqdm(enumerate(f)):
            values = line.split()
            word = ''.join(values[:-300])
            coefs = np.asarray(values[-300:], dtype='float32')
            embeddings_index[word] = coefs

    return embeddings_index

# EMBEDDING_PATH = '../embedding_weights/glove.840B.300d.txt'
EMBEDDING_PATH = 'glove.840B.300d.txt'
embeddings = load_glove(EMBEDDING_PATH)

np.save('glove_embeddings.npy', embeddings) 


Gist link: https://gist.github.com/BrambleXu/634a844cdd3cd04bb2e3ba3c83aef227
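
Because np.save stores a dict as a pickled object array, loading it back needs allow_pickle=True and a call to .item(). A small sketch with a made-up two-word embedding dict (not real GloVe data) and a temporary path chosen for illustration:

```python
import os
import tempfile

import numpy as np

# Made-up stand-in for the dict returned by load_glove.
embeddings = {
    'hello': np.asarray([0.1, 0.2, 0.3], dtype='float32'),
    'world': np.asarray([0.4, 0.5, 0.6], dtype='float32'),
}

path = os.path.join(tempfile.mkdtemp(), 'glove_embeddings.npy')
np.save(path, embeddings)  # the dict is wrapped in a 0-d object array

# allow_pickle=True is required for object arrays; .item() unwraps the dict.
restored = np.load(path, allow_pickle=True).item()
print(sorted(restored))
```

Only use allow_pickle=True on files you created yourself or otherwise trust, since unpickling can execute arbitrary code.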

情书的邮戳 (2020-11-22 02:09)

In some cases, when you check your default encoding (print sys.getdefaultencoding()), it reports ASCII. Changing it to UTF-8 may still not work, depending on the content of your variable.
I found another way. Note that it works only on Python 2, because sys.setdefaultencoding was removed in Python 3:

import sys
reload(sys)  
sys.setdefaultencoding('Cp1252')
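
On Python 3 there is no global default to change, so the usual alternative is to decode each byte string with the codec it was actually written in. A minimal sketch, assuming the data is Windows-1252 (the sample bytes are made up):

```python
# Made-up bytes as they might come from a Windows-1252 encoded source.
raw = b'caf\xe9 \x80 5'

# Decoding as UTF-8 fails because 0xE9 and 0x80 are not valid UTF-8 here.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the codec the data was written in succeeds.
text = raw.decode('cp1252')
print(utf8_ok, text)
```

Decoding at the point where bytes enter the program keeps the rest of the code working with str only, which avoids implicit ASCII conversions altogether.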