How to open a huge Excel file efficiently

佛祖请我去吃肉 2021-01-30 21:29

I have a 150 MB, single-sheet Excel file that takes about 7 minutes to open on a very powerful machine using the following:

# using python
import xlrd
wb = xlrd.open_         


        
11 answers
  • 2021-01-30 21:45

    This looks hardly achievable in Python at all. If we unpack the sheet's data file, it takes nearly the whole required 30 seconds just to pass it through a C-based iterative, SAX-style parser (using lxml, a very fast wrapper over libxml2):

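    from __future__ import print_function
    
    import time
    
    from lxml import etree
    
    
    start_ts = time.time()
    
    # Open in binary mode so libxml2 handles the encoding declaration itself.
    with open('xl/worksheets/sheet1.xml', 'rb') as sheet:
        # Only walk the parse events; no cell data is extracted here.
        for data in etree.iterparse(sheet, events=('start',),
                                    collect_ids=False, resolve_entities=False,
                                    huge_tree=True):
            pass
    
    print(time.time() - start_ts)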
    

    The sample output: 27.2134890556

    By the way, Excel itself needs about 40 seconds to load the workbook.
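
    The manual unpacking step can also be skipped: an .xlsx file is a ZIP archive, so the sheet part can be streamed straight out of it. The following is a minimal sketch using only the standard library (zipfile and xml.etree.ElementTree, not the lxml setup timed above), so its timings will differ. count_rows and make_demo_xlsx are hypothetical helper names, and the in-memory workbook is made-up demo data standing in for the real file.

```python
import io
import time
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by SpreadsheetML worksheet parts.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def count_rows(xlsx, sheet_part="xl/worksheets/sheet1.xml"):
    """Stream one worksheet part out of the archive and count its <row> elements."""
    rows = 0
    with zipfile.ZipFile(xlsx) as archive, archive.open(sheet_part) as part:
        for _event, elem in ET.iterparse(part, events=("end",)):
            if elem.tag == NS + "row":
                rows += 1
                elem.clear()  # drop the row's cells so they are not kept in memory
    return rows


def make_demo_xlsx(n_rows):
    """Build a tiny stand-in archive in memory (hypothetical demo data)."""
    body = "".join(
        '<row r="%d"><c r="A%d"><v>%d</v></c></row>' % (i, i, i)
        for i in range(1, n_rows + 1)
    )
    xml = '<worksheet xmlns="%s"><sheetData>%s</sheetData></worksheet>' % (NS[1:-1], body)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("xl/worksheets/sheet1.xml", xml)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    start_ts = time.time()
    print(count_rows(make_demo_xlsx(1000)), "rows in", time.time() - start_ts, "s")
```

    For a real workbook, pass its path to count_rows instead of the demo buffer. Clearing each finished row is what keeps a large sheet from accumulating in memory while it is scanned.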

  • 2021-01-30 21:47

    I'm using a Dell Precision T1700 workstation, and with C# I was able to open the file and read its contents in about 24 seconds, using just standard code that opens a workbook through the interop services. Here is my code, which references the Microsoft Excel 15.0 Object Library.

    My using statements:

    using System.Runtime.InteropServices;
    using Excel = Microsoft.Office.Interop.Excel;
    

    Code to open and read workbook:

    public partial class MainWindow : Window {
        public MainWindow() {
            InitializeComponent();
    
            Excel.Application xlApp;
            Excel.Workbook wb;
            Excel.Worksheet ws;
    
            xlApp = new Excel.Application();
            xlApp.Visible = false;
            xlApp.ScreenUpdating = false;
    
            wb = xlApp.Workbooks.Open(@"Desired Path of workbook\Copy of BigSpreadsheet.xlsx");
    
            ws = wb.Sheets["Sheet1"];
    
            //string rng = ws.get_Range("A1").Value;
            MessageBox.Show(ws.get_Range("A1").Value);
    
            Marshal.FinalReleaseComObject(ws);
    
            wb.Close();
            Marshal.FinalReleaseComObject(wb);
    
            xlApp.Quit();
            Marshal.FinalReleaseComObject(xlApp);
    
            GC.Collect();
            GC.WaitForPendingFinalizers();
        }
    }
    
  • 2021-01-30 21:47

    The C# and OLE solution still has some bottlenecks, so I tested it with C++ and ADO.

    _bstr_t connStr(makeConnStr(excelFile, header).c_str());
    
    TESTHR(pRec.CreateInstance(__uuidof(Recordset)));       
    TESTHR(pRec->Open(sqlSelectSheet(connStr, sheetIndex).c_str(), connStr, adOpenStatic, adLockOptimistic, adCmdText));
    
    while(!pRec->adoEOF)
    {
        for(long i = 0; i < pRec->Fields->GetCount(); ++i)
        {   
            _variant_t v = pRec->Fields->GetItem(i)->Value;
            if(v.vt == VT_R8)
                num[i] = v.dblVal;
            if(v.vt == VT_BSTR)
                str[i] = v.bstrVal;          
            ++cellCount;
        }                                    
        pRec->MoveNext();
    }
    

    On a machine with an i5-4460 and an HDD, I found that reading 500 thousand cells takes 1.5 s from an .xls file, while the same data in .xlsx takes 2.829 s. So it is possible to process your data in under 30 s.

    If you really need to stay under 30 s, use a RAM drive to reduce file I/O; it will significantly speed up the process. I cannot download your data to test it, so please tell me the result.

  • 2021-01-30 21:47

    Another way that should greatly improve load and operation time is a RAM drive:

    Create a RAM drive with enough space for your file plus 10% to 20% extra.
    Copy the file to the RAM drive.
    Load the file from there. Depending on your drive and filesystem, the speed improvement should be huge.

    My favourite is ImDisk Toolkit
    (https://sourceforge.net/projects/imdisk-toolkit/); it gives you a powerful command line for scripting everything.

    I also recommend SoftPerfect RAM Disk
    (http://www.majorgeeks.com/files/details/softperfect_ram_disk.html),

    but that also depends on your OS.
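
    The copy-then-load steps can be scripted. This is only a rough sketch: it assumes the RAM drive is already mounted and that you pass its directory yourself. stage_on_ram_drive and timed_read are hypothetical helper names, and the demo uses ordinary temporary directories as stand-ins, so it shows the workflow rather than any real speed-up.

```python
import shutil
import tempfile
import time
from pathlib import Path


def stage_on_ram_drive(source, ram_dir):
    """Copy `source` into `ram_dir` (assumed to be RAM-backed) and return the new path."""
    source, ram_dir = Path(source), Path(ram_dir)
    size = source.stat().st_size
    free = shutil.disk_usage(ram_dir).free
    if free < size * 1.2:  # keep roughly 20% headroom on the target drive
        raise OSError("not enough free space on %s" % ram_dir)
    return Path(shutil.copy2(source, ram_dir / source.name))


def timed_read(path):
    """Read the whole file once and return (bytes_read, seconds)."""
    start = time.perf_counter()
    data = Path(path).read_bytes()
    return len(data), time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in directories; replace `ram` with the mount point of a real RAM drive.
    with tempfile.TemporaryDirectory() as disk, tempfile.TemporaryDirectory() as ram:
        workbook = Path(disk) / "demo.xlsx"
        workbook.write_bytes(b"\0" * 1024 * 1024)  # made-up 1 MiB payload
        staged = stage_on_ram_drive(workbook, ram)
        print(staged.name, timed_read(staged))
```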

  • 2021-01-30 21:47

    Have you tried loading the worksheets on demand, which has been available since version 0.7.1 of xlrd?

    To do this you need to pass on_demand=True to open_workbook().

    xlrd.open_workbook(filename=None, logfile=<_io.TextIOWrapper name='' mode='w' encoding='UTF-8'>, verbosity=0, use_mmap=1, file_contents=None, encoding_override=None, formatting_info=False, on_demand=False, ragged_rows=False)


    Other potential python solutions I found for reading an xlsx file:

    • Read the raw xml in 'xl/sharedStrings.xml' and 'xl/worksheets/sheet1.xml'
    • Try the openpyxl library's Read Only mode, which claims to be optimized in memory usage for large files.

      from openpyxl import load_workbook
      wb = load_workbook(filename='large_file.xlsx', read_only=True)
      ws = wb['big_data']
      
      for row in ws.rows:
          for cell in row:
              print(cell.value)
      
    • If you are running on Windows you could use PyWin32 and 'Excel.Application'

      import time
      import win32com.client as win32
      def excel():
         xl = win32.gencache.EnsureDispatch('Excel.Application')
         ss = xl.Workbooks.Add()
      ...
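
    The first option in the list, reading the raw XML, could look roughly like the sketch below. It uses only the standard library and makes several assumptions: the sheet lives at 'xl/worksheets/sheet1.xml', string cells use the shared-string table (inline strings are not handled), and empty cells are simply absent, so columns are not aligned. load_shared_strings and iter_rows are hypothetical helper names.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by SpreadsheetML parts.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def load_shared_strings(archive):
    """Return the shared-string table as a list, or [] if the part is absent."""
    try:
        part = archive.open("xl/sharedStrings.xml")
    except KeyError:
        return []
    with part:
        root = ET.parse(part).getroot()
    # Each <si> holds one string, possibly split across several <t> runs.
    return ["".join(t.text or "" for t in si.iter(NS + "t")) for si in root.iter(NS + "si")]


def iter_rows(xlsx, sheet_part="xl/worksheets/sheet1.xml"):
    """Yield each worksheet row as a list of raw cell values (strings or None)."""
    with zipfile.ZipFile(xlsx) as archive:
        strings = load_shared_strings(archive)
        with archive.open(sheet_part) as part:
            for _event, elem in ET.iterparse(part, events=("end",)):
                if elem.tag != NS + "row":
                    continue
                values = []
                for cell in elem.iter(NS + "c"):
                    raw = cell.findtext(NS + "v")
                    if raw is not None and cell.get("t") == "s":
                        raw = strings[int(raw)]  # t="s" marks an index into the table
                    values.append(raw)
                yield values
                elem.clear()  # release the cells of the row just yielded
```

    Numbers come back as the strings stored in the file, so callers would still need to convert them. Whether this beats the libraries above on a 150 MB sheet is something to measure rather than assume.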
      
  • 2021-01-30 21:49

    I would like more information about the system on which you are opening the file. Anyway:

    Look in your system for a Windows update called
    "Office File Validation Add-In for Office ..."

    If you have it, uninstall it.
    The file should then load much more quickly,
    especially if it is loaded from a share.
