Memory error using openpyxl with large Excel files

失恋的感觉 2021-01-04 23:32

I have written a script that has to read a lot of Excel files from a folder (around 10,000). The script loads each Excel file (some of them have more than 2,000 rows) and read

4 Answers
  • 2021-01-04 23:43

    This approach worked for me when copying data from a SQLite DB into a corresponding worksheet for each table. Some of the tables have more than 250,000 rows, and I was running into a MemoryError from openpyxl. The trick is to save incrementally every 100K rows and then reopen the workbook, which seems to reduce memory usage. I do something very similar to what @sakiM describes in another answer. Here's the part of my code that does this:

        row_num = 2   # row 1 previously populated with column names
        session = self.CreateDBSession()  # SQL Alchemy connection to SQLite
        for item in session.query(ormClass):
            col_num = 1
            for col_name in sorted(fieldsInDB):  # list of columns from the table being put into XL columns
                if col_name != "__mapper__":        # Something SQL Alchemy apparently adds...
                    val = getattr(item, col_name)
                    sheet.cell(row=row_num, column=col_num).value = val
                    col_num += 1
            row_num += 1
            if row_num % self.MAX_ROW_CHUNK == 0:   # MAX_ROW_CHUNK = 100000
                self.WriteChunk()
                sheet = self.workbook[sheet.title]  # re-fetch: the old sheet belongs to the discarded workbook
    
    # Write this chunk and reload the workbook to work around OpenPyXL memory issues
    def WriteChunk(self):
        print("Incremental save of %s" % self.XLSPath)
        self.SaveXLWorkbook()
        print("Reopening %s" % self.XLSPath)
        self.OpenXLWorkbook()
    
    # Open the XL Workbook we are updating
    def OpenXLWorkbook(self):
        if not self.workbook:
            self.workbook = openpyxl.load_workbook(self.XLSPath)
        return self.workbook
    
    # Save the workbook
    def SaveXLWorkbook(self):
        if self.workbook:
            self.workbook.save(self.XLSPath)
            self.workbook = None
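
    If the workbook is only ever written and never read back, a different technique from the save-and-reopen loop above may avoid the problem altogether: openpyxl's write-only mode, which streams rows out as they are appended. The following is a minimal sketch, not the code from this answer. It assumes the standard sqlite3 module instead of SQLAlchemy, and export_table and its parameters are hypothetical names chosen for illustration.

```python
import sqlite3

from openpyxl import Workbook


def export_table(db_path, table, xlsx_path):
    """Stream every row of one SQLite table into a write-only worksheet."""
    conn = sqlite3.connect(db_path)
    try:
        # Table names cannot be bound as parameters, so quote the identifier.
        cursor = conn.execute('SELECT * FROM "%s"' % table.replace('"', '""'))
        wb = Workbook(write_only=True)
        ws = wb.create_sheet(title=table)
        # Header row taken from the cursor's column metadata.
        ws.append([col[0] for col in cursor.description])
        for row in cursor:  # rows are fetched lazily from SQLite
            ws.append(list(row))
        wb.save(xlsx_path)
    finally:
        conn.close()
```

    Calling export_table("data.db", "items", "items.xlsx") would write one sheet named after the table; the file names here are placeholders.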
    
  • 2021-01-04 23:50

    The default implementation of openpyxl stores every accessed cell in memory. I suggest using the optimized reader instead (see https://openpyxl.readthedocs.org/en/latest/optimized.html).

    In code:

    from openpyxl import load_workbook
    wb = load_workbook(file_path, use_iterators=True)  # openpyxl < 2.4 only
    

    Pass use_iterators = True when loading the workbook, then access the sheet and its cells like this:

    for row in sheet.iter_rows():
        for cell in row:
            cell_text = cell.value
    

    This reduces the memory footprint to roughly 5-10% of what the default reader needs.

    UPDATE: The use_iterators = True option was removed in version 2.4.0; in newer versions, pass read_only=True to load_workbook for reading instead. For dumping large amounts of data, newer versions provide openpyxl.writer.write_only.WriteOnlyWorksheet:

    from openpyxl import Workbook
    wb = Workbook(write_only=True)
    ws = wb.create_sheet()
    
    # now we'll fill it with 100 rows x 200 columns
    for irow in range(100):
        ws.append(['%d' % i for i in range(200)])
    
    # save the file
    wb.save('new_big_file.xlsx') 
    

    I have not tested the code above; it is copied from the documentation linked earlier.

    Thanks @SdaliM for the information.
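
    For reading, read_only=True takes the place of use_iterators in current openpyxl versions. The following is a minimal sketch, assuming only the active sheet matters; iter_cell_values is a hypothetical helper name, not part of openpyxl.

```python
from openpyxl import load_workbook


def iter_cell_values(file_path):
    """Yield every cell value of the active sheet, one row at a time."""
    wb = load_workbook(file_path, read_only=True)
    try:
        sheet = wb.active
        for row in sheet.iter_rows():
            for cell in row:
                yield cell.value
    finally:
        # Read-only workbooks keep the file open until explicitly closed.
        wb.close()
```

    Because the function is a generator, a loop over many files can process one workbook at a time and close each before opening the next.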

  • 2021-01-04 23:51

    With recent versions of openpyxl, load and read a huge source workbook with the read_only=True argument, and create and write a huge destination workbook in write_only=True mode:

    https://openpyxl.readthedocs.io/en/latest/optimized.html
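
    A minimal sketch of that combination, assuming only cell values (not styles or charts) need to be carried over and that the installed openpyxl is recent enough to support values_only in iter_rows; copy_workbook_streaming is a hypothetical helper name.

```python
from openpyxl import Workbook, load_workbook


def copy_workbook_streaming(src_path, dst_path):
    """Copy cell values sheet by sheet, streaming rows instead of loading whole sheets."""
    src = load_workbook(src_path, read_only=True)
    dst = Workbook(write_only=True)
    try:
        for src_ws in src.worksheets:
            dst_ws = dst.create_sheet(title=src_ws.title)
            # values_only=True yields one plain tuple of values per row.
            for row in src_ws.iter_rows(values_only=True):
                dst_ws.append(row)
        dst.save(dst_path)
    finally:
        # Read-only workbooks keep the source file open until closed.
        src.close()
```

    Any per-row processing can be applied to each row tuple before it is appended, so neither workbook has to be held in memory as a whole.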

  • 2021-01-05 00:00

    As @anuragal said

    openpyxl will store all the accessed cells into memory

    Another way to handle this memory problem while looping over every cell is divide and conquer. The point is that after reading enough cells, you save the workbook with wb.save() and reload it, so the values read so far are removed from memory.

    import openpyxl
    
    checkPointLine = 100  # choose a better number in your case.
    
    excel = openpyxl.load_workbook(excelPath, data_only=True)
    ws = excel.active
    maxRow = ws.max_row  # total row count, read once before looping
    isDirty = False
    
    for rowNum in range(1, maxRow + 1):
        row = ws[rowNum]
        # do the things to this row's content, then mark `isDirty = True`
    
        if rowNum % checkPointLine == 0:
            if isDirty:
                # write back only when something changed
                excel.save(excelPath)
                isDirty = False
            # reload so the cells accessed so far can be released
            excel = openpyxl.load_workbook(excelPath, data_only=True)
            ws = excel.active
    
    if isDirty:
        # save changes made after the last checkpoint
        excel.save(excelPath)
    