How to open a huge Excel file efficiently

佛祖请我去吃肉 2021-01-30 21:29

I have a 150 MB single-sheet Excel file that takes about 7 minutes to open on a very powerful machine using the following:

# using python
import xlrd
wb = xlrd.open_         


        
11 answers
  •  醉酒成梦
    2021-01-30 21:53

    I managed to read the file in about 30 seconds using .NET Core and the Open XML SDK.

    The following example returns a list of rows, each holding its cells converted to the matching type; it supports date, numeric and text cells. The project is available here: https://github.com/xferaa/BigSpreadSheetExample/ (it should work on Windows, Linux and Mac OS, and does not require Excel or any Excel component to be installed).

    public List<List<object>> ParseSpreadSheet()
    {
        List<List<object>> rows = new List<List<object>>();
    
        using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(filePath, false))
        {
            WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
            WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
    
            OpenXmlReader reader = OpenXmlReader.Create(worksheetPart);
    
            Dictionary<int, string> sharedStringCache = new Dictionary<int, string>();
    
            int i = 0;
            foreach (var el in workbookPart.SharedStringTablePart.SharedStringTable.ChildElements)
            {
                sharedStringCache.Add(i++, el.InnerText);
            }
    
            while (reader.Read())
            {
                if(reader.ElementType == typeof(Row))
                {
                    reader.ReadFirstChild();
    
                    List<object> cells = new List<object>();
    
                    do
                    {
                        if (reader.ElementType == typeof(Cell))
                        {
                            Cell c = (Cell)reader.LoadCurrentElement();
    
                            if (c == null || c.DataType == null || !c.DataType.HasValue)
                                continue;
    
                            object value;
    
                            switch(c.DataType.Value)
                            {
                                case CellValues.Boolean:
                                    value = bool.Parse(c.CellValue.InnerText);
                                    break;
                                case CellValues.Date:
                                    value = DateTime.Parse(c.CellValue.InnerText);
                                    break;
                                case CellValues.Number:
                                    value = double.Parse(c.CellValue.InnerText);
                                    break;
                                case CellValues.InlineString:
                                case CellValues.String:
                                    value = c.CellValue.InnerText;
                                    break;
                                case CellValues.SharedString:
                                    value = sharedStringCache[int.Parse(c.CellValue.InnerText)];
                                    break;
                                default:
                                    continue;
                            }
    
                            if (value != null)
                                cells.Add(value);
                        }
    
                    } while (reader.ReadNextSibling());
    
                    if (cells.Any())
                        rows.Add(cells);
                }
            }
        }
    
        return rows;
    }
    
    
    

    I ran the program on a three-year-old laptop with an SSD, 8 GB of RAM and an Intel Core i7-4710 CPU @ 2.50GHz (two cores), running 64-bit Windows 10.

    Note that although opening and parsing the whole file as strings takes a bit less than 30 seconds, converting the cells to typed objects, as in the example from my last edit, pushes the time up to almost 50 seconds on my crappy laptop. You will probably get closer to 30 seconds on your Linux server.

    The trick was to use the SAX approach as explained here:

    https://msdn.microsoft.com/en-us/library/office/gg575571.aspx
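
    Since the question uses Python, here is a rough sketch of the same streaming idea using only the standard library. It is not the code from the project above and has not been timed against the 150 MB file. The names iter_rows and load_shared_strings are made up for illustration, and the default part name xl/worksheets/sheet1.xml is an assumption (a robust reader would look the sheet up through the workbook relationships). Dates are left as raw values and rich-text details are ignored.

```python
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace used by worksheet and shared-string parts.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def load_shared_strings(archive):
    """Return the shared string table as a list (index -> text)."""
    name = "xl/sharedStrings.xml"
    if name not in archive.namelist():
        return []
    strings = []
    with archive.open(name) as part:
        for _, elem in ET.iterparse(part, events=("end",)):
            if elem.tag == NS + "si":
                # An <si> entry may hold several <t> runs; join them.
                strings.append("".join(t.text or "" for t in elem.iter(NS + "t")))
                elem.clear()
    return strings


def iter_rows(xlsx, sheet_part="xl/worksheets/sheet1.xml"):
    """Yield each row as a list of cell values, one row at a time.

    sheet_part is an assumed default, not something read from the workbook.
    """
    with zipfile.ZipFile(xlsx) as archive:
        shared = load_shared_strings(archive)
        with archive.open(sheet_part) as part:
            # iterparse streams the XML instead of building the whole tree.
            for _, elem in ET.iterparse(part, events=("end",)):
                if elem.tag != NS + "row":
                    continue
                row = []
                for cell in elem.iter(NS + "c"):
                    kind = cell.get("t")  # cell type attribute, may be absent
                    v = cell.find(NS + "v")
                    if kind == "inlineStr":
                        row.append("".join(t.text or "" for t in cell.iter(NS + "t")))
                    elif v is None or v.text is None:
                        continue  # empty cell
                    elif kind == "s":
                        row.append(shared[int(v.text)])  # shared string index
                    elif kind == "b":
                        row.append(v.text == "1")
                    elif kind in ("str", "e", "d"):
                        row.append(v.text)  # formula string, error, ISO date: kept as text
                    else:
                        row.append(float(v.text))  # untyped cells are numeric
                yield row
                elem.clear()  # drop the row's cells once they have been consumed
```

    Rows can then be consumed lazily, for example with "for row in iter_rows(path): ...", so the whole sheet never has to sit in memory at once.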
