How to open a huge excel file efficiently

前端未结

关注

 11  757

佛祖请我去吃肉 2021-01-30 21:29

I have a 150MB one-sheet excel file that takes about 7 minutes to open on a very powerful machine using the following:

# using python
import xlrd
wb = xlrd.open_


      
      
        
          11条回答        

        
                    
            
            
                         
                
              
              
                
                   借酒劲吻你
                                             
                
                
                (楼主)
            
              
              
                2021-01-30 21:53
              

            
            
                        
Python's Pandas library could be used to hold and process your data, but using it to directly load the .xlsx file will be quite slow, e.g. using read_excel().

One approach would be to use Python to automate the conversion of your file into CSV using Excel itself and to then use Pandas to load the resulting CSV file using read_csv(). This will give you a good speed up, but not under 30 seconds:

import win32com.client as win32        
import pandas as pd    
from datetime import datetime    

print ("Starting")
start = datetime.now()

# Use Excel to load the xlsx file and save it in csv format
excel = win32.gencache.EnsureDispatch('Excel.Application')
wb = excel.Workbooks.Open(r'c:\full path\BigSpreadsheet.xlsx')
excel.DisplayAlerts = False
wb.DoNotPromptForConvert = True
wb.CheckCompatibility = False

print('Saving')
wb.SaveAs(r'c:\full path\temp.csv', FileFormat=6, ConflictResolution=2) 
excel.Application.Quit()

# Use Pandas to load the resulting CSV file
print('Loading CSV')
df = pd.read_csv(r'c:\full path\temp.csv', dtype=str)

print(df.shape)
print("Done", datetime.now() - start)


Column types

The types for your columns can be specified by passing dtype and converters and parse_dates:

df = pd.read_csv(r'c:\full path\temp.csv', dtype=str, converters={10:int}, parse_dates=[8], infer_datetime_format=True)


You should also specify infer_datetime_format=True, as this will greatly speed up the date conversion.


  nfer_datetime_format : boolean, default False
  
  If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be
  inferred, switch to a faster method of parsing them. In some cases
  this can increase the parsing speed by 5-10x.


Also add dayfirst=True if dates are in the form DD/MM/YYYY.

Selective columns

If you actually only need to work on columns 1 9 11, then you could further reduce resources by specifying usecols=[0, 8, 10] as follows:

df = pd.read_csv(r'c:\full path\temp.csv', dtype=str, converters={10:int}, parse_dates=[1], dayfirst=True, infer_datetime_format=True, usecols=[0, 8, 10])


The resulting dataframe would then only contain those 3 columns of data.

RAM drive

Using a RAM drive to store the temporary CSV file to would further speed up the load time.

Note: This does assume you are using a Windows PC with Excel available.
    
             
                                                        
            
            
              
                
                0
              
                   
                
               讨论(0)
              
                                                  
              
              
                          
             
       
          
              
                                       
     查看其它11个回答


            
                         
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
                              			
        
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复