How to obtain sheet names from XLS files without loading the whole file?

独厮守ぢ 2020-11-29 22:36

I'm currently using pandas to read an Excel file and present its sheet names to the user, so he can select which sheet he would like to use. The problem is that the files a

6 Answers
  • 2020-11-29 22:40

    A Python adaptation of Dhwanil shah's answer that takes the full filename as a pathlib path (e.g., Path(r'c:\xml\file.xlsx')) and does not rely on the Django setting used there to create the temp dir.

    import xmltodict
    import shutil
    import zipfile
    
    
    def get_sheet_details(filename):
        sheets = []
        # Make a temporary directory with the file name
        directory_to_extract_to = (filename.with_suffix(''))
        directory_to_extract_to.mkdir(parents=True, exist_ok=True)
        # Extract the xlsx file as it is just a zip file
        zip_ref = zipfile.ZipFile(filename, 'r')
        zip_ref.extractall(directory_to_extract_to)
        zip_ref.close()
        # Open the workbook.xml which is very light and only has meta data, get sheets from it
        path_to_workbook = directory_to_extract_to / 'xl' / 'workbook.xml'
        with open(path_to_workbook, 'r') as f:
            xml = f.read()
            dictionary = xmltodict.parse(xml)
            for sheet in dictionary['workbook']['sheets']['sheet']:
                sheet_details = {
                    'id': sheet['@sheetId'],  # can be sheetId for some versions
                    'name': sheet['@name']  # can be name
                }
                sheets.append(sheet_details)
        # Delete the extracted files directory
        shutil.rmtree(directory_to_extract_to)
        return sheets
    
  • 2020-11-29 22:53

    By combining @Dhwanil shah's answer with the answer here, I wrote code that is also compatible with xlsx files that have only one sheet:

    import zipfile

    import xmltodict


    def get_sheet_ids(file_path):
        sheet_names = []
        with zipfile.ZipFile(file_path, 'r') as zip_ref:
            # workbook.xml is small and lists every sheet in the workbook
            xml = zip_ref.open('xl/workbook.xml').read()
            dictionary = xmltodict.parse(xml)

        sheets = dictionary['workbook']['sheets']['sheet']
        if not isinstance(sheets, list):
            # A single sheet is parsed as one dict rather than a list of dicts
            sheet_names.append(sheets['@name'])
        else:
            for sheet in sheets:
                sheet_names.append(sheet['@name'])
        return sheet_names
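
The single-sheet special case above can also be folded into a small helper that always yields a list. This is only a sketch: `as_list` and `sheet_names_from_parsed` are hypothetical names, and the two dictionaries are hand-written stand-ins for what xmltodict would return, not real parser output.

```python
def as_list(value):
    """Return value unchanged if it is a list, otherwise wrap it in a one-item list."""
    return value if isinstance(value, list) else [value]


def sheet_names_from_parsed(dictionary):
    """Collect sheet names from a dict shaped like the parsed workbook.xml."""
    sheets = as_list(dictionary['workbook']['sheets']['sheet'])
    return [sheet['@name'] for sheet in sheets]


# Hand-written stand-ins for parsed workbook.xml content (assumed shape).
one_sheet = {'workbook': {'sheets': {'sheet': {'@name': 'Sheet1', '@sheetId': '1'}}}}
two_sheets = {'workbook': {'sheets': {'sheet': [
    {'@name': 'Sheet1', '@sheetId': '1'},
    {'@name': 'Data', '@sheetId': '2'},
]}}}

print(sheet_names_from_parsed(one_sheet))
print(sheet_names_from_parsed(two_sheets))
```

Normalizing once keeps the rest of the function free of branching on the number of sheets.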
    
  • 2020-11-29 22:55

    From my research, as of 2020 the standard / popular libraries do not implement a way to read only the sheet names for xlsx / xls, but you can do this for xlsb. Either way, the solutions below should give you a large performance improvement over pd.read_excel for xlsx and xlsb.

    The functions below were benchmarked on ~10 MB xlsx and xlsb files.

    xlsx (openpyxl reads xlsx / xlsm, not the legacy xls format)

    from openpyxl import load_workbook
    
    def get_sheetnames_xlsx(filepath):
        wb = load_workbook(filepath, read_only=True, keep_links=False)
        return wb.sheetnames
    

    Benchmarks: ~ 14x speed improvement

    # get_sheetnames_xlsx vs pd.read_excel
    225 ms ± 6.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    3.25 s ± 140 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    xlsb

    from pyxlsb import open_workbook
    
    def get_sheetnames_xlsb(filepath):
        with open_workbook(filepath) as wb:
            return wb.sheets
    

    Benchmarks: ~ 56x speed improvement

    # get_sheetnames_xlsb vs pd.read_excel
    96.4 ms ± 1.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    5.36 s ± 162 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
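
The timings above appear to be in the format printed by IPython's %timeit. Outside IPython, a rough equivalent can be sketched with the standard library; `time_call` is a hypothetical helper name, and `sorted` is only a cheap stand-in for the real sheet-name readers.

```python
import statistics
import timeit


def time_call(func, *args, repeat=7, number=1):
    """Return (mean, stdev) in seconds per call over `repeat` runs of `number` calls."""
    timer = timeit.Timer(lambda: func(*args))
    totals = timer.repeat(repeat=repeat, number=number)
    per_call = [total / number for total in totals]
    return statistics.mean(per_call), statistics.stdev(per_call)


# Stand-in workload; replace with e.g. the sheet-name function and a file path.
mean, stdev = time_call(sorted, [3, 1, 2])
print(f"{mean * 1e3:.3f} ms ± {stdev * 1e3:.3f} ms per loop")
```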
    

    Notes:

    • This is a good resource - http://www.python-excel.org/
    • xlrd is no longer maintained as of 2020
  • 2020-11-29 23:02

    You can use the xlrd library and open the workbook with the on_demand=True flag, so that the sheets won't be loaded automatically.

    Then you can retrieve the sheet names in a similar way to pandas:

    import xlrd
    xls = xlrd.open_workbook(r'<path_to_your_excel_file>', on_demand=True)
    print(xls.sheet_names())  # <- remember: xlrd sheet_names is a function, not a property
    
  • 2020-11-29 23:02

    I have tried xlrd, pandas, openpyxl and other such libraries, and all of them seem to take much longer as the file size increases, since they read the entire file. The other solutions mentioned above that use 'on_demand' did not work for me. The following function works for xlsx files.

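    import os
    import shutil
    import zipfile

    import xmltodict
    from django.conf import settings  # MEDIA_ROOT is used as the extraction location


    def get_sheet_details(file_path):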
        sheets = []
        file_name = os.path.splitext(os.path.split(file_path)[-1])[0]
        # Make a temporary directory with the file name
        directory_to_extract_to = os.path.join(settings.MEDIA_ROOT, file_name)
        os.mkdir(directory_to_extract_to)
    
        # Extract the xlsx file as it is just a zip file
        zip_ref = zipfile.ZipFile(file_path, 'r')
        zip_ref.extractall(directory_to_extract_to)
        zip_ref.close()
    
        # Open the workbook.xml which is very light and only has meta data, get sheets from it
        path_to_workbook = os.path.join(directory_to_extract_to, 'xl', 'workbook.xml')
        with open(path_to_workbook, 'r') as f:
            xml = f.read()
            dictionary = xmltodict.parse(xml)
            for sheet in dictionary['workbook']['sheets']['sheet']:
                sheet_details = {
                    'id': sheet['@sheetId'],  # xmltodict prefixes XML attributes with '@' by default
                    'name': sheet['@name']
                }
                sheets.append(sheet_details)
    
        # Delete the extracted files directory
        shutil.rmtree(directory_to_extract_to)
        return sheets
    

    Since all xlsx are basically zipped files, we extract the underlying xml data and read sheet names from the workbook directly which takes a fraction of a second as compared to the library functions.

    Benchmarking (on a 6 MB xlsx file with 4 sheets):
    Pandas, xlrd: 12 seconds
    openpyxl: 24 seconds
    Proposed method: 0.4 seconds
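
The same idea can be sketched with only the standard library, reading workbook.xml straight from the archive instead of extracting it to a temporary directory. This is a sketch under assumptions, not the method benchmarked above: `get_sheet_names_stdlib` is a hypothetical name, the workbook part is assumed to live at xl/workbook.xml, and the file is assumed to use the usual (transitional) SpreadsheetML namespace.

```python
import zipfile
import xml.etree.ElementTree as ET

# Assumed namespace of the main SpreadsheetML part (transitional flavour).
MAIN_NS = '{http://schemas.openxmlformats.org/spreadsheetml/2006/main}'


def get_sheet_names_stdlib(file_path):
    """Read sheet names from xl/workbook.xml without extracting the archive."""
    with zipfile.ZipFile(file_path) as archive:
        # Only this one small member is decompressed; sheet data is never read.
        with archive.open('xl/workbook.xml') as workbook_xml:
            root = ET.parse(workbook_xml).getroot()
    # iter() visits every <sheet> element, so one sheet or many is handled alike.
    return [sheet.get('name') for sheet in root.iter(MAIN_NS + 'sheet')]
```

Because nothing is written to disk, there is no temporary directory to clean up, and a single-sheet workbook needs no special handling.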

  • 2020-11-29 23:05

    You can also use:

    data=pd.read_excel('demanddata.xlsx',sheet_name='oil&gas')
    print(data)   
    

    Here demanddata is the name of your file and oil&gas is the name of one of its sheets. However many sheets the workbook contains, just pass the name of the sheet you want to fetch as sheet_name="Name of your required sheet".
