I have a 150MB single-sheet Excel file that takes about 7 minutes to open on a very powerful machine using the following:
# using python
import xlrd
wb = xlrd.open_
Python's Pandas library could be used to hold and process your data, but using it to load the .xlsx file directly, e.g. with read_excel(), will be quite slow.
One approach would be to use Python to automate the conversion of your file into CSV using Excel itself, and then use Pandas to load the resulting CSV file with read_csv(). This gives a good speed-up, although the total time will still not be under 30 seconds:
import win32com.client as win32
import pandas as pd
from datetime import datetime
print("Starting")
start = datetime.now()
# Use Excel to load the xlsx file and save it in csv format
excel = win32.gencache.EnsureDispatch('Excel.Application')
wb = excel.Workbooks.Open(r'c:\full path\BigSpreadsheet.xlsx')
excel.DisplayAlerts = False
wb.DoNotPromptForConvert = True
wb.CheckCompatibility = False
print('Saving')
# FileFormat=6 is xlCSV; ConflictResolution=2 is xlLocalSessionChanges
wb.SaveAs(r'c:\full path\temp.csv', FileFormat=6, ConflictResolution=2)
excel.Application.Quit()
# Use Pandas to load the resulting CSV file
print('Loading CSV')
df = pd.read_csv(r'c:\full path\temp.csv', dtype=str)
print(df.shape)
print("Done", datetime.now() - start)
Column types
The types for your columns can be specified by passing dtype, converters and parse_dates:
df = pd.read_csv(r'c:\full path\temp.csv', dtype=str, converters={10:int}, parse_dates=[8], infer_datetime_format=True)
You should also specify infer_datetime_format=True, as this can greatly speed up the date conversion. From the pandas documentation:
infer_datetime_format: boolean, default False. If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x.
Note that in pandas 2.0 and later this argument is deprecated, because a stricter version of the same inference is applied by default.
Also add dayfirst=True if dates are in the form DD/MM/YYYY.
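As a small illustration of how these options interact, here is a minimal sketch that parses an in-memory CSV rather than the real temp.csv. The column names (id, order_date, qty) and the sample rows are made up for the example, and dtype is given per column instead of dtype=str for the whole file, since pandas may warn when a column has both a dtype and a converter.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for temp.csv
csv_text = (
    "id,order_date,qty\n"
    "0001,01/03/2020,5\n"
    "0002,25/12/2020,12\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"id": str},           # keep as text so leading zeros survive
    converters={"qty": int},     # run int() on every value in this column
    parse_dates=["order_date"],  # parse this column as datetimes
    dayfirst=True,               # read 01/03/2020 as 1 March, not 3 January
)
print(df.dtypes)
print(df)
```

Columns can be referred to either by name, as here, or by zero-based position, as in the read_csv() call above.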
Selective columns
If you actually only need to work on columns 1, 9 and 11, then you could further reduce resources by specifying usecols=[0, 8, 10] (the positions are zero-based) as follows:
df = pd.read_csv(r'c:\full path\temp.csv', dtype=str, converters={10:int}, parse_dates=[1], dayfirst=True, infer_datetime_format=True, usecols=[0, 8, 10])
The resulting dataframe would then only contain those 3 columns of data.
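To see the effect of usecols in isolation, the following sketch reads a tiny in-memory CSV and keeps only the first and third columns. The data and the column names a to d are invented for the illustration.

```python
import io
import pandas as pd

# Hypothetical four-column sample; only columns at positions 0 and 2 are wanted
csv_text = "a,b,c,d\n1,2,3,4\n5,6,7,8\n"

df = pd.read_csv(io.StringIO(csv_text), usecols=[0, 2])
print(df.columns.tolist())  # ['a', 'c']
print(df.shape)             # (2, 2)
```

The skipped columns are not loaded into the dataframe, which is where the memory saving comes from.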
RAM drive
Using a RAM drive to store the temporary CSV file would further speed up the load time.
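One way to wire this in is to choose the temporary file location at run time. The sketch below assumes a RAM drive mounted as R:\ (a hypothetical drive letter, substitute your own) and falls back to the normal system temp folder when it is not present. The helper name pick_temp_csv_path is made up for this example.

```python
import os
import tempfile

def pick_temp_csv_path(ram_drive="R:\\", filename="temp.csv"):
    """Return a path on the RAM drive if it exists, else in the system temp folder."""
    # ram_drive is an assumed mount point; os.path.isdir() is False if it is missing
    base = ram_drive if os.path.isdir(ram_drive) else tempfile.gettempdir()
    return os.path.join(base, filename)

csv_path = pick_temp_csv_path()
print(csv_path)
```

The resulting csv_path could then be passed to both wb.SaveAs() and pd.read_csv() in place of the hard-coded temp.csv path.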
Note: This does assume you are using a Windows PC with Excel available.