Question
I have a large xlsx Excel file (56 MB, 550k rows) from which I tried to read just the first 10 rows. I tried xlrd, openpyxl, and pyexcel-xlsx, but each takes more than 35 minutes because it loads the whole file into memory.
I unzipped the Excel file and found that the XML containing the data I need is 800 MB unzipped.
When I load the same file in Excel it takes about 30 seconds. Why does it take so much longer in Python?
Answer 1:
I found a solution; here it is. It's the fastest way I've found to read an xlsx sheet:
a 56 MB file with over 500k rows and 4 sheets took about 6 s to process.
import zipfile
from bs4 import BeautifulSoup

paths = []
mySheet = 'Sheet Name'
filename = 'xlfile.xlsx'

file = zipfile.ZipFile(filename, "r")

# Read the workbook manifest to map each sheet name to its worksheet XML part
for name in file.namelist():
    if name == 'xl/workbook.xml':
        data = BeautifulSoup(file.read(name), 'html.parser')
        sheets = data.find_all('sheet')
        for sheet in sheets:
            paths.append([sheet.get('name'), 'xl/worksheets/sheet' + str(sheet.get('sheetid')) + '.xml'])

# Stream the worksheet XML from the zip instead of loading the whole file
for path in paths:
    if path[0] == mySheet:
        with file.open(path[1]) as reader:
            for row in reader:
                print(row)  # do whatever you want with your data
Enjoy and happy coding.
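Note that the lines printed above are raw worksheet XML. If you want actual cell values while still streaming, one option is to parse that XML incrementally with the standard library's xml.etree.ElementTree.iterparse. This is only a minimal sketch: the filename and sheet path are placeholders, and cells with t="s" are indices that would still need to be resolved against xl/sharedStrings.xml.

import zipfile
import xml.etree.ElementTree as ET

filename = 'xlfile.xlsx'                  # placeholder filename
sheet_path = 'xl/worksheets/sheet1.xml'   # placeholder path, resolved as above

with zipfile.ZipFile(filename) as zf:
    with zf.open(sheet_path) as xml_stream:
        # iterparse streams the XML rather than building the whole tree in memory
        for _, elem in ET.iterparse(xml_stream):
            if elem.tag.endswith('}row'):
                # each <c> child is a cell; its <v> child holds the raw value
                # (string cells store an index into xl/sharedStrings.xml)
                values = [v.text for v in elem.iter() if v.tag.endswith('}v')]
                print(values)
                elem.clear()  # free memory for rows already processed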
Answer 2:
Use openpyxl's read-only mode to do this.
You'll be able to work with the relevant worksheet instantly.
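For example, a minimal sketch of reading just the first 10 rows in read-only mode (the filename and sheet name are placeholders; values_only needs a reasonably recent openpyxl):

from openpyxl import load_workbook

# read_only=True streams the worksheet instead of loading it all into memory
wb = load_workbook('xlfile.xlsx', read_only=True)  # placeholder filename
ws = wb['Sheet Name']                              # placeholder sheet name

# read only the first 10 rows
for row in ws.iter_rows(min_row=1, max_row=10, values_only=True):
    print(row)

wb.close()  # read-only workbooks keep the file handle open until closed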
Answer 3:
The load time you're experiencing comes down to I/O and memory throughput. When pandas loads an Excel file, it makes several copies of the data in memory, since the xlsx format (zipped XML) has to be parsed and converted rather than read directly. For a deeper dive, check out this article I've written: Loading Ridiculously Large Excel Files in Python.
In terms of a solution, I'd suggest as a workaround:
- load your Excel file on a virtual machine with specialized hardware (here's what AWS has to offer),
- save the file to CSV format for local use, or
- for even better performance, convert it to an optimized columnar format such as Parquet (a conversion sketch follows below).
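A minimal one-off conversion sketch with pandas (the filename and sheet name are placeholders; reading xlsx needs openpyxl installed, and to_parquet needs pyarrow or fastparquet):

import pandas as pd

# read the sheet once, then persist it in faster formats for later use
df = pd.read_excel('xlfile.xlsx', sheet_name='Sheet Name')  # placeholder names

df.to_csv('xlfile.csv', index=False)          # plain text, widely compatible
df.to_parquet('xlfile.parquet', index=False)  # columnar, much faster to reload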
Source: https://stackoverflow.com/questions/38208137/processing-large-xlsx-file-in-python