I have a 150MB one-sheet excel file that takes about 7 minutes to open on a very powerful machine using the following:
# using python
import xlrd
wb = xlrd.open_
I managed to read the file in about 30 seconds using .NET core and the Open XML SDK.
The following example returns a list of objects containing all rows and cells with the matching types, it supports date, numeric and text cells. The project is available here: https://github.com/xferaa/BigSpreadSheetExample/ (Should work on Windows, Linux and Mac OS and does not require Excel or any Excel component to be installed).
public List> ParseSpreadSheet()
{
List> rows = new List>();
using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(filePath, false))
{
WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
OpenXmlReader reader = OpenXmlReader.Create(worksheetPart);
Dictionary sharedStringCache = new Dictionary();
int i = 0;
foreach (var el in workbookPart.SharedStringTablePart.SharedStringTable.ChildElements)
{
sharedStringCache.Add(i++, el.InnerText);
}
while (reader.Read())
{
if(reader.ElementType == typeof(Row))
{
reader.ReadFirstChild();
List
I ran the program in a three year old Laptop with a SSD drive, 8GB of RAM and an Intel Core i7-4710 CPU @ 2.50GHz (two cores) on Windows 10 64 bits.
Note that although opening and parsing the whole file as strings takes a bit less than 30 seconds, when using objects as in the example of my last edit, the time goes up to almost 50 seconds with my crappy laptop. You will probably get closer to 30 seconds in your server with Linux.
The trick was to use the SAX approach as explained here:
https://msdn.microsoft.com/en-us/library/office/gg575571.aspx