Question
I have a large xlsx Excel file (56 MB, 550k rows) from which I tried to read just the first 10 rows. I tried xlrd, openpyxl, and pyexcel-xlsx, but each takes more than 35 minutes because it loads the whole file into memory.
I unzipped the Excel file and found that the XML containing the data I need is 800 MB unzipped.
When I load the same file in Excel it takes about 30 seconds. Why does it take so much longer in Python?
Answer 1:
I found a solution; here it is. It's the fastest way I've found to read an xlsx sheet:
a 56 MB file with over 500k rows and 4 sheets took about 6 s to process.
import zipfile
from bs4 import BeautifulSoup

paths = []
mySheet = 'Sheet Name'
filename = 'xlfile.xlsx'

file = zipfile.ZipFile(filename, "r")

# Read the workbook manifest to map each sheet name to its worksheet XML part
for name in file.namelist():
    if name == 'xl/workbook.xml':
        data = BeautifulSoup(file.read(name), 'html.parser')
        sheets = data.find_all('sheet')
        for sheet in sheets:
            paths.append([sheet.get('name'), 'xl/worksheets/sheet' + str(sheet.get('sheetid')) + '.xml'])

# Stream the worksheet XML from the zip instead of loading the whole file
for path in paths:
    if path[0] == mySheet:
        with file.open(path[1]) as reader:
            for row in reader:
                print(row)  # do whatever you want with your data
Enjoy and happy coding.
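Note that the lines printed above are raw worksheet XML. If you want actual cell values while still streaming, one option is to parse that XML incrementally with the standard library's xml.etree.ElementTree.iterparse. This is only a minimal sketch: the filename and sheet path are placeholders, and cells with t="s" are indices that would still need to be resolved against xl/sharedStrings.xml.

import zipfile
import xml.etree.ElementTree as ET

filename = 'xlfile.xlsx'                  # placeholder filename
sheet_path = 'xl/worksheets/sheet1.xml'   # placeholder path, resolved as above

with zipfile.ZipFile(filename) as zf:
    with zf.open(sheet_path) as xml_stream:
        # iterparse streams the XML rather than building the whole tree in memory
        for _, elem in ET.iterparse(xml_stream):
            if elem.tag.endswith('}row'):
                # each <c> child is a cell; its <v> child holds the raw value
                # (string cells store an index into xl/sharedStrings.xml)
                values = [v.text for v in elem.iter() if v.tag.endswith('}v')]
                print(values)
                elem.clear()  # free memory for rows already processed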
Answer 2:
Use openpyxl's read-only mode to do this.
You'll be able to work with the relevant worksheet instantly.
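For example, a minimal sketch of reading just the first 10 rows in read-only mode (the filename and sheet name are placeholders; values_only needs a reasonably recent openpyxl):

from openpyxl import load_workbook

# read_only=True streams the worksheet instead of loading it all into memory
wb = load_workbook('xlfile.xlsx', read_only=True)  # placeholder filename
ws = wb['Sheet Name']                              # placeholder sheet name

# read only the first 10 rows
for row in ws.iter_rows(min_row=1, max_row=10, values_only=True):
    print(row)

wb.close()  # read-only workbooks keep the file handle open until closed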
Answer 3:
The load time you're experiencing comes down to I/O and memory throughput. When pandas loads an Excel file, it makes several copies of the data in memory, since the xlsx format (zipped XML) has to be parsed and converted rather than read directly. For a deeper dive, check out this article I've written: Loading Ridiculously Large Excel Files in Python.
In terms of a solution, I'd suggest as a workaround:
- load your Excel file on a virtual machine with specialized hardware (here's what AWS has to offer),
- save the file to CSV format for local use, or
- for even better performance, convert it to an optimized columnar format such as Parquet (a conversion sketch follows below).
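A minimal one-off conversion sketch with pandas (the filename and sheet name are placeholders; reading xlsx needs openpyxl installed, and to_parquet needs pyarrow or fastparquet):

import pandas as pd

# read the sheet once, then persist it in faster formats for later use
df = pd.read_excel('xlfile.xlsx', sheet_name='Sheet Name')  # placeholder names

df.to_csv('xlfile.csv', index=False)          # plain text, widely compatible
df.to_parquet('xlfile.parquet', index=False)  # columnar, much faster to reload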
Source: https://stackoverflow.com/questions/38208137/processing-large-xlsx-file-in-python