_csv.Error: field larger than field limit (131072)


I have a script that reads a CSV file with very large fields:

# example from http://docs.python.org/3.3/library/csv.html?highlight=csv%20dictreader#examples
i


                      
8 answers

情话喂你, 2020-11-28 19:31

I just had this happen to me on a 'plain' CSV file. Some people might call it an invalidly formatted file: no escape characters, no double-quoted fields, and a semicolon as the delimiter.

A sample line from this file would look like this:


  First cell; Second " Cell with one double quote and leading
  space;'Partially quoted' cell;Last cell


The lone double quote in the second cell would throw the parser off the rails. What worked was:

csv.reader(inputfile, delimiter=';', doublequote=False, quotechar=None, quoting=csv.QUOTE_NONE)

                                                                        
                                                        
            
            
              
                
傲寒, 2020-11-28 19:33

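Sometimes a row contains a field with a double quote in it. When the csv reader tries to read such a row, it cannot tell where the field ends and raises this error.
The solution is below:

reader = csv.reader(cf, quoting=csv.QUOTE_MINIMAL)

Note that csv.QUOTE_MINIMAL is already the default for csv.reader; if the error persists, csv.QUOTE_NONE makes the reader treat quote characters as ordinary text.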

                                                                        
                                                        
            
            
              
                
夕颜, 2020-11-28 19:42

csv field sizes are controlled via [Python 3.Docs]: csv.field_size_limit([new_limit]):


  Returns the current maximum field size allowed by the parser. If new_limit is given, this becomes the new limit.


It is set by default to 128k or 0x20000 (131072), which should be enough for any decent .csv:




>>> import csv
>>>
>>> limit0 = csv.field_size_limit()
>>> limit0
131072
>>> "0x{0:016X}".format(limit0)
'0x0000000000020000'



However, when dealing with a .csv file (with the correct quoting and delimiter) having (at least) one field longer than this size, the error pops up. 
To get rid of the error, the size limit should be increased (to avoid any worries, the maximum possible value is attempted).

Behind the scenes (check [GitHub]: python/cpython - (master) cpython/Modules/_csv.c for implementation details), the variable that holds this value is a C long ([Wikipedia]: C data types), whose size varies with the CPU architecture and the OS data model (LP64 on Nix, LLP64 on Win). The classic difference: for a 64-bit OS (and Python build), the size of long (in bits) is:


Nix: 64
Win: 32


When the limit is set, the new value is checked against the long boundaries, which is why in some cases another exception pops up (this is common on Win):


>>> import sys
>>>
>>> sys.platform, sys.maxsize
('win32', 9223372036854775807)
>>>
>>> csv.field_size_limit(sys.maxsize)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: Python int too large to convert to C long



To avoid running into this problem, set the (maximum possible) limit (LONG_MAX) using an artifice (thanks to [Python 3.Docs]: ctypes - A foreign function library for Python). It should work on Python 3 and Python 2, on any CPU / OS.


>>> import ctypes as ct
>>>
>>> csv.field_size_limit(int(ct.c_ulong(-1).value // 2))
131072
>>> limit1 = csv.field_size_limit()
>>> limit1
2147483647
>>> "0x{0:016X}".format(limit1)
'0x000000007FFFFFFF'



64-bit Python on a Nix-like OS:


>>> import sys, csv, ctypes as ct
>>>
>>> sys.platform, sys.maxsize
('linux', 9223372036854775807)
>>>
>>> csv.field_size_limit()
131072
>>>
>>> csv.field_size_limit(int(ct.c_ulong(-1).value // 2))
131072
>>> limit1 = csv.field_size_limit()
>>> limit1
9223372036854775807
>>> "0x{0:016X}".format(limit1)
'0x7FFFFFFFFFFFFFFF'



For 32-bit Python, things are uniform: long is 32 bits on every OS, so the behavior matches the one encountered on Win.
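
The ctypes approach above can be wrapped in a small helper for use in scripts. This is only a sketch: the function name raise_field_size_limit is hypothetical, not part of the csv module.

```python
import csv
import ctypes


def raise_field_size_limit():
    """Set the csv field size limit to LONG_MAX; return the previous limit.

    Hypothetical helper name, shown for illustration only.
    """
    # c_ulong(-1).value is ULONG_MAX; halving it gives LONG_MAX
    # for whatever size C long has on this platform.
    long_max = ctypes.c_ulong(-1).value // 2
    return csv.field_size_limit(long_max)


previous = raise_field_size_limit()
current = csv.field_size_limit()
print(previous, current)
```

Because the value is derived from the platform's own unsigned long, no OverflowError should occur, whether long is 32 or 64 bits.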

Check the following resources for more details on:


Playing with C types boundaries from Python: [SO]: Maximum and minimum value of C types integers from Python (@CristiFati's answer)
Python 32bit vs 64bit differences: [SO]: How do I determine if my python shell is executing in 32bit or 64bit mode on OS X? (@CristiFati's answer)

                                                                        
                                                        
            
            
              
                
借酒劲吻你, 2020-11-28 19:46

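You can use read_csv from pandas to skip malformed lines. In pandas 1.3 and later, pass on_bad_lines='skip'; the older error_bad_lines=False was deprecated in 1.3 and removed in 2.0.

import pandas as pd

data_df = pd.read_csv('data.csv', on_bad_lines='skip')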

                                                                        
                                                        
            
            
              
                
傲寒, 2020-11-28 19:47

The CSV file might contain very large fields, so increase the field_size_limit:

import sys
import csv

csv.field_size_limit(sys.maxsize)


sys.maxsize works for Python 2.x and 3.x. sys.maxint would only work with Python 2.x (SO: what-is-sys-maxint-in-python-3)

Update

As Geoff pointed out, the code above might result in the following error: OverflowError: Python int too large to convert to C long. 
To circumvent this, you could use the following quick and dirty code (which should work on every system with Python 2 and Python 3):

import sys
import csv

maxInt = sys.maxsize

while True:
    # Decrease the maxInt value by a factor of 10
    # for as long as the OverflowError occurs.
    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        # Integer division keeps maxInt an exact int on Python 2 and 3.
        maxInt = maxInt // 10
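
For reference, a minimal sketch that reproduces the error and shows the effect of raising the limit. The data is made up for illustration, and the new limit of 1000000 is an arbitrary assumption chosen to sit above the largest field; it avoids the OverflowError issue altogether when the maximum field size is roughly known.

```python
import csv
import io

# Made-up data: the second field of the data row is 200000 characters long,
# which exceeds the default limit of 131072.
data = "id,payload\n1," + "x" * 200000 + "\n"

csv.field_size_limit(131072)  # make sure the default limit is in effect
try:
    list(csv.reader(io.StringIO(data)))
    failed = False
    message = ""
except csv.Error as exc:
    failed = True
    message = str(exc)

# Raise the limit above the largest expected field and read again.
csv.field_size_limit(1000000)
rows = list(csv.reader(io.StringIO(data)))

print(failed, message, len(rows[1][1]))
```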

                                                                        
                                                        
            
            
              
                
悲哀的现实, 2020-11-28 19:54

This could be because your CSV file has embedded single or double quotes. If your CSV file is tab-delimited, try opening it as:

c = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)

                                                                        
                                                        
            
            
              
                