Processing large XLSX file in python

Submitted by 别说谁变了你拦得住时间么 on 2020-12-10 08:57:37

Question


I have a large xlsx Excel file (56 MB, 550k rows) from which I tried to read just the first 10 rows. I tried xlrd, openpyxl, and pyexcel-xlsx, but each takes more than 35 minutes because it loads the whole file into memory.

I unzipped the Excel file and found that the XML containing the data I need is 800 MB uncompressed.

Excel itself opens the same file in about 30 seconds. Why does it take so much longer in Python?


Answer 1:


I found a solution: the fastest way I know to read an xlsx sheet.

A 56 MB file with over 500k rows and 4 sheets took about 6 seconds to process.

import zipfile
from bs4 import BeautifulSoup

mySheet = 'Sheet Name'
filename = 'xlfile.xlsx'

paths = []
archive = zipfile.ZipFile(filename, "r")

# Map each sheet name to its worksheet XML path inside the xlsx archive.
for name in archive.namelist():
    if name == 'xl/workbook.xml':
        data = BeautifulSoup(archive.read(name), 'html.parser')
        for sheet in data.find_all('sheet'):
            # html.parser lowercases attribute names, so sheetId becomes sheetid
            paths.append([sheet.get('name'), 'xl/worksheets/sheet' + str(sheet.get('sheetid')) + '.xml'])

# Stream the target sheet's XML line by line instead of loading it all.
for path in paths:
    if path[0] == mySheet:
        with archive.open(path[1]) as reader:
            for row in reader:
                print(row)  # each row is a raw line of XML bytes; parse as needed

archive.close()

Enjoy and happy coding.




Answer 2:


Use openpyxl's read-only mode to do this.

You'll be able to work with the relevant worksheet instantly.
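A minimal, self-contained sketch of openpyxl's read-only mode. It first builds a small sample workbook so the example is runnable end to end; with a real 56 MB file, the one-time write step disappears and the read-only open is near-instant because rows are parsed lazily. The file name `demo.xlsx` and sheet name `Data` are placeholders.

```python
from openpyxl import Workbook, load_workbook

# Build a small sample file so the sketch runs end to end.
wb = Workbook()
ws = wb.active
ws.title = "Data"
for i in range(100):
    ws.append([i, i * 2])
wb.save("demo.xlsx")

# Read-only mode streams rows on demand instead of loading the whole sheet.
ro = load_workbook("demo.xlsx", read_only=True)
sheet = ro["Data"]
first_rows = [row for row in sheet.iter_rows(max_row=10, values_only=True)]
ro.close()  # read-only workbooks keep the underlying file handle open

print(first_rows[0])    # (0, 0)
print(len(first_rows))  # 10
```

Because `iter_rows` stops at `max_row=10`, only those rows are ever parsed, which is why this stays fast regardless of how many rows the sheet contains.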




Answer 3:


The load time you're experiencing is directly related to the I/O speed of your hardware.

When pandas loads an Excel file, it makes several in-memory copies of the data, since the file's on-disk structure isn't a direct serialization of an in-memory table. For a deeper dive, check out this article I've written: Loading Ridiculously Large Excel Files in Python

As a workaround, I'd suggest:

  • Load your Excel file through a virtual machine with specialized hardware (here's what AWS has to offer).
  • Save your file to CSV format for local use.
  • For even better performance, use an optimized columnar format such as Parquet.


Source: https://stackoverflow.com/questions/38208137/processing-large-xlsx-file-in-python
