I have written a script which has to read a lot of Excel files from a folder (around 10,000). The script loads each Excel file (some of them have more than 2,000 rows) and reads the data from it.
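Presumably the read loop looks roughly like this (a sketch only; the folder path and the processing step are hypothetical, and load_workbook is called in its default mode, which keeps every accessed cell in memory):

import glob
import openpyxl

# hypothetical folder path; the real script walks ~10,000 files like this
for path in glob.glob("excel_folder/*.xlsx"):
    wb = openpyxl.load_workbook(path)  # default mode caches every accessed cell
    ws = wb.active
    for row in ws.iter_rows():  # some files have more than 2,000 rows
        values = [cell.value for cell in row]
        # ... process values ...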
This approach worked for me when copying data from a SQLite DB into a corresponding worksheet for each table. Some of the tables have > 250,000 rows, and I was running into a MemoryError from openpyxl. The trick is to incrementally save every 100K rows and then reopen the workbook; this seems to reduce memory usage. I do something very similar to what @sakiM is doing above. Here's the part of my code that does this:
row_num = 2  # row 1 previously populated with column names
session = self.CreateDBSession()  # SQLAlchemy connection to SQLite
for item in session.query(ormClass):
    col_num = 1
    for col_name in sorted(fieldsInDB):  # list of columns from the table being put into XL columns
        if col_name != "__mapper__":  # something SQLAlchemy apparently adds...
            val = getattr(item, col_name)
            sheet.cell(row=row_num, column=col_num).value = val
            col_num += 1
    row_num += 1
    if row_num % self.MAX_ROW_CHUNK == 0:  # MAX_ROW_CHUNK = 100000
        self.WriteChunk()
# Write this chunk and reload the workbook to work around OpenPyXL memory issues
def WriteChunk(self):
    print("Incremental save of %s" % self.XLSPath)
    self.SaveXLWorkbook()
    print("Reopening %s" % self.XLSPath)
    self.OpenXLWorkbook()

# Open the XL Workbook we are updating
def OpenXLWorkbook(self):
    if not self.workbook:
        self.workbook = openpyxl.load_workbook(self.XLSPath)
    return self.workbook

# Save the workbook
def SaveXLWorkbook(self):
    if self.workbook:
        self.workbook.save(self.XLSPath)
        self.workbook = None
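For reference, here is a condensed, self-contained sketch of the same save-every-N-rows-then-reopen pattern without the SQLAlchemy parts (the ChunkedWriter class, chunk size, and file path are hypothetical, not from the answer above). Note that the worksheet object is re-fetched from the freshly reloaded workbook after each chunk:

import openpyxl

class ChunkedWriter:  # hypothetical helper, not part of the answer above
    MAX_ROW_CHUNK = 100000

    def __init__(self, path):
        # the workbook at `path` is assumed to already exist, with a header row in row 1
        self.XLSPath = path
        self.workbook = openpyxl.load_workbook(self.XLSPath)
        self.sheet = self.workbook.active

    def WriteChunk(self):
        # save what has been written so far, then reopen to shed the in-memory state
        self.workbook.save(self.XLSPath)
        self.workbook = openpyxl.load_workbook(self.XLSPath)
        self.sheet = self.workbook.active  # re-fetch the sheet from the reloaded workbook

    def WriteRows(self, rows):  # rows: an iterable of value lists
        row_num = 2  # row 1 already holds the column names
        for values in rows:
            for col_num, val in enumerate(values, start=1):
                self.sheet.cell(row=row_num, column=col_num).value = val
            row_num += 1
            if row_num % self.MAX_ROW_CHUNK == 0:
                self.WriteChunk()
        self.workbook.save(self.XLSPath)  # final save for the last partial chunk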
By default, openpyxl will store all the accessed cells in memory. I suggest you use the optimized reader (https://openpyxl.readthedocs.org/en/latest/optimized.html) instead.
In code, pass use_iterators=True while loading the workbook:

from openpyxl import load_workbook
wb = load_workbook(file_path, use_iterators=True)

Then access the sheet and cells like this:
for row in sheet.iter_rows():
    for cell in row:
        cell_text = cell.value
This will reduce the memory footprint to 5-10% of what it would otherwise be.
UPDATE: In version 2.4.0 the use_iterators=True option was removed. In newer versions, openpyxl.writer.write_only.WriteOnlyWorksheet was introduced for dumping large amounts of data.
from openpyxl import Workbook
wb = Workbook(write_only=True)
ws = wb.create_sheet()

# now we'll fill it with 100 rows x 200 columns
for irow in range(100):
    ws.append(['%d' % i for i in range(200)])

# save the file
wb.save('new_big_file.xlsx')
I have not tested the code above; it is copied straight from the linked documentation. Thanks @SdaliM for the information.
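On the read side, the replacement for use_iterators=True in current versions is the read_only=True argument to load_workbook; a minimal sketch of the equivalent loop:

from openpyxl import load_workbook

wb = load_workbook(file_path, read_only=True)  # read-only mode streams cells instead of caching them
ws = wb.active
for row in ws.iter_rows():
    for cell in row:
        cell_text = cell.value
wb.close()  # read-only mode keeps the source file open, so close it explicitly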
With recent versions of openpyxl, one has to load and read a huge source workbook with the read_only=True argument, and create / write a huge destination workbook in write_only=True mode:
https://openpyxl.readthedocs.io/en/latest/optimized.html
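A minimal sketch of that combination, copying every sheet of a huge source workbook into a new destination workbook (the file names here are hypothetical):

from openpyxl import load_workbook, Workbook

src = load_workbook('huge_source.xlsx', read_only=True)  # stream the source instead of loading it fully
dst = Workbook(write_only=True)  # write-only workbook for the destination

for ws_src in src.worksheets:
    ws_dst = dst.create_sheet(title=ws_src.title)
    for row in ws_src.iter_rows():
        ws_dst.append([cell.value for cell in row])  # append plain values row by row

src.close()
dst.save('huge_destination.xlsx')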
As @anuragal said, openpyxl will store all the accessed cells in memory. Another way to handle this huge memory problem while looping over every cell is divide-and-conquer: after reading enough cells, save the workbook with wb.save() and reload it, so the values read so far are released from memory.
import openpyxl

checkPointLine = 100  # choose a better number in your case

# excelPath: path to the workbook being processed
excel = openpyxl.load_workbook(excelPath, data_only=True)
ws = excel.active
isDirty = False

for rowNum in range(1, ws.max_row + 1):
    row = ws[rowNum]
    first = row[0]
    currentRow = first.row
    # do the work on this row's content, then mark isDirty = True
    if currentRow % checkPointLine == 0:
        if isDirty:
            # write back only the changed content
            excel.save(excelPath)
            isDirty = False
            # reload the workbook so the cells read so far are dropped from memory
            excel = openpyxl.load_workbook(excelPath)
            ws = excel.active