How to open a huge Excel file efficiently

佛祖请我去吃肉 2021-01-30 21:29

I have a 150 MB single-sheet Excel file that takes about 7 minutes to open on a very powerful machine using the following:

# using python
import xlrd
wb = xlrd.open_workbook('large_file.xlsx')  # filename is a placeholder

11 Answers
  • 2021-01-30 21:45

    It looks like this is hardly achievable in Python at all. If we unpack the sheet data file, it takes nearly all of the required 30 seconds just to pass it through the C-based iterative SAX parser (using lxml, a very fast wrapper over libxml2):

    from __future__ import print_function

    from lxml import etree
    import time


    start_ts = time.time()

    # Stream the sheet XML through lxml's SAX-style parser; open the file
    # in binary mode, since lxml expects bytes.
    for event, element in etree.iterparse(open('xl/worksheets/sheet1.xml', 'rb'),
                                          events=('start',),
                                          collect_ids=False, resolve_entities=False,
                                          huge_tree=True):
        pass

    print(time.time() - start_ts)

    

    The sample output: 27.2134890556

    By the way, Excel itself needs about 40 seconds to load the workbook.
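
    For reference, the sheet XML need not even be unpacked first: an .xlsx file is a ZIP archive, so the same streaming parse can read straight out of it. A minimal sketch, assuming the workbook is named large_file.xlsx (a placeholder):

    import zipfile
    from lxml import etree

    # Open the sheet XML as a file-like object inside the .xlsx (ZIP) archive
    # and stream-parse it, clearing each element to keep memory flat.
    with zipfile.ZipFile('large_file.xlsx') as zf:
        with zf.open('xl/worksheets/sheet1.xml') as sheet:
            for event, element in etree.iterparse(sheet, events=('end',)):
                element.clear()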

  • 2021-01-30 21:47

    I'm using a Dell Precision T1700 workstation, and with C# I was able to open the file and read its contents in about 24 seconds, just using standard interop-services code to open a workbook. With a reference to the Microsoft Excel 15.0 Object Library, here is my code.

    My using statements:

    using System.Runtime.InteropServices;
    using Excel = Microsoft.Office.Interop.Excel;
    

    Code to open and read workbook:

    public partial class MainWindow : Window {
        public MainWindow() {
            InitializeComponent();

            Excel.Application xlApp;
            Excel.Workbook wb;
            Excel.Worksheet ws;

            // Run Excel invisibly with screen updating off for speed.
            xlApp = new Excel.Application();
            xlApp.Visible = false;
            xlApp.ScreenUpdating = false;

            wb = xlApp.Workbooks.Open(@"Desired Path of workbook\Copy of BigSpreadsheet.xlsx");

            ws = (Excel.Worksheet)wb.Sheets["Sheet1"];

            //string rng = ws.get_Range("A1").Value;
            MessageBox.Show(ws.get_Range("A1").Value.ToString());

            // Release every COM object explicitly so the Excel process exits.
            Marshal.FinalReleaseComObject(ws);

            wb.Close();
            Marshal.FinalReleaseComObject(wb);

            xlApp.Quit();
            Marshal.FinalReleaseComObject(xlApp);

            GC.Collect();
            GC.WaitForPendingFinalizers();
        }
    }
    
  • 2021-01-30 21:47

    The C# and OLE solution still has some bottlenecks, so I tested it with C++ and ADO.

    // makeConnStr and sqlSelectSheet are the author's helpers for building the
    // ADO connection string and the SELECT statement for the given sheet.
    _RecordsetPtr pRec;
    _bstr_t connStr(makeConnStr(excelFile, header).c_str());

    TESTHR(pRec.CreateInstance(__uuidof(Recordset)));
    TESTHR(pRec->Open(sqlSelectSheet(connStr, sheetIndex).c_str(), connStr,
                      adOpenStatic, adLockOptimistic, adCmdText));

    // Walk the recordset row by row, reading each field as a variant.
    while(!pRec->adoEOF)
    {
        for(long i = 0; i < pRec->Fields->GetCount(); ++i)
        {
            _variant_t v = pRec->Fields->GetItem(i)->Value;
            if(v.vt == VT_R8)
                num[i] = v.dblVal;
            if(v.vt == VT_BSTR)
                str[i] = v.bstrVal;
            ++cellCount;
        }
        pRec->MoveNext();
    }

    

    On an i5-4460 machine with an HDD, I find that reading 500,000 cells from an .xls file takes 1.5 s, but the same data in .xlsx takes 2.829 s, so handling your data in under 30 s is possible.

    If you really need to stay under 30 s, use a RAM drive to reduce file I/O; it will significantly speed up the process. I cannot download your data to test it, so please tell me the result.

  • 2021-01-30 21:47

    Another way that should largely improve the load/operation time is a RAM drive:

    • create a RAM drive with enough space for your file plus 10%..20% extra...
    • copy the file to the RAM drive...
    • load the file from there (see the sketch below)... depending on your drive and filesystem, the speed improvement should be huge...

    My favourite is the IMDisk toolkit
    (https://sourceforge.net/projects/imdisk-toolkit/); here you have a powerful command line to script everything...

    I also recommend SoftPerfect ramdisk
    (http://www.majorgeeks.com/files/details/softperfect_ram_disk.html)

    but that also depends on your OS...
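
    A minimal sketch of the copy-then-load step in Python, assuming the RAM drive is mounted as R: and the workbook is named large_file.xlsx (both placeholders):

    import shutil
    import time
    import xlrd

    # Copy the workbook onto the RAM drive so subsequent reads hit memory
    # instead of the physical disk.
    shutil.copy('large_file.xlsx', r'R:\large_file.xlsx')

    start = time.time()
    wb = xlrd.open_workbook(r'R:\large_file.xlsx')
    print('loaded in %.1f s' % (time.time() - start))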

  • 2021-01-30 21:47

    Have you tried loading the worksheets on demand, which has been available since version 0.7.1 of xlrd?

    To do this you need to pass on_demand=True to open_workbook().

    xlrd.open_workbook(filename=None, logfile=<_io.TextIOWrapper name='' mode='w' encoding='UTF-8'>, verbosity=0, use_mmap=1, file_contents=None, encoding_override=None, formatting_info=False, on_demand=False, ragged_rows=False)
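
    A minimal sketch of on-demand loading, assuming the sheet is named 'Sheet1' (a placeholder):

    import xlrd

    # on_demand=True defers parsing each worksheet until it is requested.
    wb = xlrd.open_workbook('large_file.xlsx', on_demand=True)
    ws = wb.sheet_by_name('Sheet1')  # only this sheet is parsed now
    print(ws.cell_value(0, 0))
    wb.unload_sheet('Sheet1')        # release the parsed sheet when done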


    Other potential Python solutions I found for reading an xlsx file:

    • Read the raw XML in 'xl/sharedStrings.xml' and 'xl/worksheets/sheet1.xml' (see the sketch after this list)
    • Try the openpyxl library's read-only mode, which claims to be optimized in memory usage for large files.

      from openpyxl import load_workbook

      wb = load_workbook(filename='large_file.xlsx', read_only=True)
      ws = wb['big_data']
      
      for row in ws.rows:
          for cell in row:
              print(cell.value)
      
    • If you are running on Windows you could use PyWin32 and 'Excel.Application'

      import time
      import win32com.client as win32

      def excel():
          # Dispatch a COM-automated Excel instance.
          xl = win32.gencache.EnsureDispatch('Excel.Application')
          ss = xl.Workbooks.Add()
      ...
      
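
    A hedged sketch of the raw-XML route from the first bullet, resolving shared strings while streaming the sheet (the filename and single-run <si> entries are assumptions; rich-text runs are ignored):

    import zipfile
    from lxml import etree

    NS = '{http://schemas.openxmlformats.org/spreadsheetml/2006/main}'

    with zipfile.ZipFile('large_file.xlsx') as zf:
        # Load the shared-strings table; string cells store only an index.
        with zf.open('xl/sharedStrings.xml') as f:
            strings = [si.findtext(NS + 't') or ''
                       for si in etree.parse(f).getroot()]
        # Stream the sheet and resolve each cell marked t="s".
        with zf.open('xl/worksheets/sheet1.xml') as f:
            for _, c in etree.iterparse(f, tag=NS + 'c'):
                value = c.findtext(NS + 'v')
                if c.get('t') == 's' and value is not None:
                    value = strings[int(value)]
                c.clear()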
  • 2021-01-30 21:49

    I would like to have more info about the system where you are opening the file... anyway:

    look in your system for a Windows update called
    "Office File Validation Add-In for Office ..."

    if you have it... uninstall it...
    the file should load much more quickly,
    especially if it is loaded from a share
