How to open a huge Excel file efficiently

佛祖请我去吃肉 2021-01-30 21:29

I have a 150 MB one-sheet Excel file that takes about 7 minutes to open on a very powerful machine using the following:

# using python
import xlrd
wb = xlrd.open_workbook('BigSpreadsheet.xlsx')  # filename is a placeholder

11 Answers
  •  借酒劲吻你
    2021-01-30 21:53

    Python's Pandas library could be used to hold and process your data, but loading the .xlsx file directly with read_excel() will be quite slow.

    One approach would be to use Python to automate the conversion of your file into CSV using Excel itself, and then load the resulting CSV file with Pandas' read_csv(). This gives a good speed-up, though it still won't get you under 30 seconds:

    import win32com.client as win32
    import pandas as pd
    from datetime import datetime

    print("Starting")
    start = datetime.now()

    # Use Excel to load the xlsx file and save it in csv format
    excel = win32.gencache.EnsureDispatch('Excel.Application')
    wb = excel.Workbooks.Open(r'c:\full path\BigSpreadsheet.xlsx')
    excel.DisplayAlerts = False
    wb.DoNotPromptForConvert = True
    wb.CheckCompatibility = False

    print('Saving')
    wb.SaveAs(r'c:\full path\temp.csv', FileFormat=6, ConflictResolution=2)
    excel.Application.Quit()

    # Use Pandas to load the resulting CSV file
    print('Loading CSV')
    df = pd.read_csv(r'c:\full path\temp.csv', dtype=str)

    print(df.shape)
    print("Done", datetime.now() - start)
    

    Column types
    The types for your columns can be specified by passing dtype, converters and parse_dates:

    df = pd.read_csv(r'c:\full path\temp.csv', dtype=str, converters={10:int}, parse_dates=[8], infer_datetime_format=True)
    
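    As a minimal, self-contained sketch of how dtype and converters interact (the column names and data here are made up, not the original file): dtype=str loads everything as strings, while a converter keyed by column position overrides that for one column.

```python
import io
import pandas as pd

# Toy data standing in for the real spreadsheet (names are made up)
csv_data = io.StringIO("name,qty\nwidget,7\ngadget,12\n")

# Everything loads as str, except column 1, which the converter turns into int
df = pd.read_csv(csv_data, dtype=str, converters={1: int})
print(df.dtypes.tolist())
```

    Note that pandas may emit a ParserWarning when both dtype and a converter target the same column; the converter takes precedence.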

    You should also specify infer_datetime_format=True, as this will greatly speed up the date conversion.

    infer_datetime_format : boolean, default False

    If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x.

    Also add dayfirst=True if dates are in the form DD/MM/YYYY.
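    A self-contained sketch of the dayfirst behaviour on toy DD/MM/YYYY data (not the original file). In pandas 2.x, infer_datetime_format is deprecated because format inference became the default, so the sketch omits it:

```python
import io
import pandas as pd

# Toy DD/MM/YYYY dates (not the original data)
csv_data = io.StringIO("id,when\n1,25/12/2020\n2,01/02/2021\n")

# dayfirst=True makes 01/02/2021 parse as 1 February, not 2 January
df = pd.read_csv(csv_data, parse_dates=["when"], dayfirst=True)
print(df["when"].dt.month.tolist())  # [12, 2]
```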

    Selective columns
    If you actually only need to work on columns 1, 9 and 11, then you could further reduce resources by specifying usecols=[0, 8, 10] as follows:

    df = pd.read_csv(r'c:\full path\temp.csv', dtype=str, converters={10:int}, parse_dates=[1], dayfirst=True, infer_datetime_format=True, usecols=[0, 8, 10])
    

    The resulting dataframe would then only contain those 3 columns of data.
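    A quick sketch of usecols on toy data (four made-up columns, not the original file), showing that only the columns at the given positions survive into the dataframe:

```python
import io
import pandas as pd

# Four toy columns; we only want the 1st and 3rd (positions 0 and 2)
csv_data = io.StringIO("a,b,c,d\n1,2,3,4\n5,6,7,8\n")

df = pd.read_csv(csv_data, usecols=[0, 2])
print(df.columns.tolist())  # ['a', 'c']
```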

    RAM drive
    Storing the temporary CSV file on a RAM drive would further speed up the load time.
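    Only the path changes: once a RAM-drive tool has mounted a drive letter on Windows, you point SaveAs and read_csv at it. The sketch below uses a tempfile directory as a stand-in for the (hypothetical) RAM-drive path:

```python
import os
import tempfile
import pandas as pd

# Stand-in for a RAM-drive path: on Windows this would be something like
# r'R:\temp.csv' once a RAM-drive tool has mounted the drive letter
temp_csv = os.path.join(tempfile.mkdtemp(), "temp.csv")

pd.DataFrame({"a": [1, 2]}).to_csv(temp_csv, index=False)
df = pd.read_csv(temp_csv, dtype=str)
print(df.shape)  # (2, 1)
```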

    Note: This does assume you are using a Windows PC with Excel available.
