UnicodeDecodeError when reading CSV file in Pandas with Python

后端未结

关注

 21  2213

I\'m running a program which is processing 30,000 similar files. A random number of them are stopping and producing this error...

File "C:\\Importer\\src


                      
              相关标签:


      
      
        
          21条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  梦毁少年i        
                
              
                            
                2020-11-22 05:04
              
            
            
                                                                       
Struggled with this a while and thought I'd post on this question as it's the first search result.  Adding the encoding="iso-8859-1" tag to pandas read_csv didn't work, nor did any other encoding, kept giving a UnicodeDecodeError. 

If you're passing a file handle to pd.read_csv(), you need to put the encoding attribute on the file open, not in read_csv. Obvious in hindsight, but a subtle error to track down.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  抹茶落季        
                
              
                            
                2020-11-22 05:04
              
            
            
                                                                       
Try this:

import pandas as pd
with open('filename.csv') as f:
    data = pd.read_csv(f)


Looks like it will take care of the encoding without explicitly expressing it through argument
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  难免孤独        
                
              
                            
                2020-11-22 05:05
              
            
            
                                                                       
This answer seems to be the catch-all for CSV encoding issues. If you are getting a strange encoding problem with your header like this:

>>> f = open(filename,"r")
>>> reader = DictReader(f)
>>> next(reader)
OrderedDict([('\ufeffid', '1'), ... ])


Then you have a byte order mark (BOM) character at the beginning of your CSV file. This answer addresses the issue:

Python read csv - BOM embedded into the first key

The solution is to load the CSV with encoding="utf-8-sig":

>>> f = open(filename,"r", encoding="utf-8-sig")
>>> reader = DictReader(f)
>>> next(reader)
OrderedDict([('id', '1'), ... ])


Hopefully this helps someone.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  慢半拍i        
                
              
                            
                2020-11-22 05:06
              
            
            
                                                                       
Another important issue that I faced which resulted in the same error was:
_values = pd.read_csv("C:\Users\Mujeeb\Desktop\file.xlxs")


^This line resulted in the same error because I am reading an excel file using read_csv() method. Use read_excel() for reading .xlxs

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  温柔的废话        
                
              
                            
                2020-11-22 05:08
              
            
            
                                                                       
Pandas allows to specify encoding, but does not allow to ignore errors not to automatically replace the offending bytes. So there is no one size fits all method but different ways depending on the actual use case.


You know the encoding, and there is no encoding error in the file. 
Great: you have just to specify the encoding:

file_encoding = 'cp1252'        # set file_encoding to the file encoding (utf8, latin1, etc.)
pd.read_csv(input_file_and_path, ..., encoding=file_encoding)

You do not want to be bothered with encoding questions, and only want that damn file to load, no matter if some text fields contain garbage. Ok, you only have to use Latin1 encoding because it accept any possible byte as input (and convert it to the unicode character of same code):

pd.read_csv(input_file_and_path, ..., encoding='latin1')

You know that most of the file is written with a specific encoding, but it also contains encoding errors. A real world example is an UTF8 file that has been edited with a non utf8 editor and which contains some lines with a different encoding. Pandas has no provision for a special error processing, but Python open function has (assuming Python3), and read_csv accepts a file like object. Typical errors parameter to use here are 'ignore' which just suppresses the offending bytes or (IMHO better) 'backslashreplace' which replaces the offending bytes by their Python’s backslashed escape sequence:

file_encoding = 'utf8'        # set file_encoding to the file encoding (utf8, latin1, etc.)
input_fd = open(input_file_and_path, encoding=file_encoding, errors = 'backslashreplace')
pd.read_csv(input_fd, ...)


                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  挽巷        
                
              
                            
                2020-11-22 05:09
              
            
            
                                                                       
I am posting an answer to provide an updated solution and explanation as to why this problem can occur. Say you are getting this data from a database or Excel workbook. If  you have special characters like La Cañada Flintridge city, well unless you are exporting the data using UTF-8 encoding, you're going to introduce errors. La Cañada Flintridge city will become La Ca\xf1ada Flintridge city. If you are using pandas.read_csv without any adjustments to the default parameters, you'll hit the following error 

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 5: invalid continuation byte


Fortunately, there are a few solutions. 

Option 1, fix the exporting. Be sure to use UTF-8 encoding. 

Option 2, if fixing the exporting problem is not available to you, and you need to use pandas.read_csv, be sure to include the following paramters, engine='python'. By default, pandas uses engine='C' which is great for reading large clean files, but will crash if anything unexpected comes up. In my experience, setting encoding='utf-8' has never fixed this UnicodeDecodeError. Also, you do not need to use errors_bad_lines, however, that is still an option if you REALLY need it.

pd.read_csv(<your file>, engine='python')


Option 3: solution is my preferred solution personally. Read the file using vanilla Python.

import pandas as pd

data = []

with open(<your file>, "rb") as myfile:
    # read the header seperately
    # decode it as 'utf-8', remove any special characters, and split it on the comma (or deliminator)
    header = myfile.readline().decode('utf-8').replace('\r\n', '').split(',')
    # read the rest of the data
    for line in myfile:
        row = line.decode('utf-8', errors='ignore').replace('\r\n', '').split(',')
        data.append(row)

# save the data as a dataframe
df = pd.DataFrame(data=data, columns = header)


Hope this helps people encountering this issue for the first time.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
   
          
     上一页
1
2
3
4
下一页
           
           
        
                                  
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复