How to convert OpenDocument spreadsheets to a pandas DataFrame?

日久生厌 2020-12-23 19:40

The Python library pandas can read Excel spreadsheets and convert them to a pandas.DataFrame with the pandas.read_excel(file) command. Under the hood,

11 answers
  • 2020-12-23 19:43

    Edit: Happily, the answer below is now out of date if you can update to a recent pandas version. If you'd still like to work from a cached xlsx copy of your data and refresh it from the ODS file only when needed, read on.
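    For the recent-pandas route mentioned in the edit note, a minimal sketch might look like the following. It assumes pandas 0.25 or later with the odfpy package installed; engine_for() and read_spreadsheet() are hypothetical helper names for illustration, not part of pandas.

```python
from pathlib import Path

import pandas as pd

# File suffixes handled by pandas' "odf" engine
ODF_SUFFIXES = {".ods", ".odt", ".odf"}


def engine_for(path):
    """Return "odf" for OpenDocument files, else None (let pandas decide)."""
    return "odf" if Path(path).suffix.lower() in ODF_SUFFIXES else None


def read_spreadsheet(path, sheet_name=0):
    """Read one sheet of an .ods or Excel file into a DataFrame."""
    # engine="odf" needs the odfpy package (pip install odfpy)
    return pd.read_excel(path, sheet_name=sheet_name, engine=engine_for(path))
```

    With a file on disk this would be called as df = read_spreadsheet('myODSfile.ods').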


    It seems the answer is no, and I would characterize the tools for reading ODS as still ragged. If you're on a POSIX system, one option is to export to xlsx on the fly and then use pandas' very nice xlsx import tools:

    unoconv -f xlsx -o tmp.xlsx myODSfile.ods 
    

    Altogether, my code looks like:

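    import os
    import pandas as pd

    # fileOlderThan() is not part of pandas; see the note below the code
    # Re-export only when tmp.xlsx is missing or older than the .ods file
    if fileOlderThan('tmp.xlsx', 'myODSfile.ods'):
        os.system('unoconv -f xlsx -o tmp.xlsx myODSfile.ods')
    # Parse every sheet of the converted workbook into its own DataFrame
    xl_file = pd.ExcelFile('tmp.xlsx')
    dfs = {sheet_name: xl_file.parse(sheet_name)
           for sheet_name in xl_file.sheet_names}
    df = dfs['Sheet1']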
    

    Here fileOlderThan() is a function (see http://github.com/cpbl/cpblUtilities) that returns True if tmp.xlsx does not exist or is older than the .ods file.
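    If you would rather not pull in that repository, a stand-in with the same name and the behaviour described above could be sketched as follows. This is an assumption based on the description only, not the cpblUtilities implementation.

```python
import os


def fileOlderThan(target, source):
    """Return True if target is missing or was last modified before source."""
    # A missing target always needs to be (re)generated
    if not os.path.exists(target):
        return True
    # Compare last-modification times (seconds since the epoch)
    return os.path.getmtime(target) < os.path.getmtime(source)
```

    Comparing modification times is cheap, but it will miss a change if the .ods file is restored with an old timestamp.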

  • 2020-12-23 19:45

    Based heavily on the answer by davidovitch (thank you), I have put together a package that reads in a .ods file and returns a DataFrame. It's not a full implementation in pandas itself, such as his PR, but it provides a simple read_ods function that does the job.

    You can install it with pip install pandas_ods_reader. It's also possible to specify whether the file contains a header row or not, and to specify custom column names.
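    A usage sketch is shown below. load_ods() is a hypothetical wrapper, and the headers and columns keyword names are taken from the package README as I recall it, so treat them as assumptions and check them against the version you install.

```python
def load_ods(path, sheet=1, has_header=True, column_names=None):
    """Read one sheet of an .ods file via pandas_ods_reader."""
    # Imported lazily so this module loads even without the package;
    # requires: pip install pandas_ods_reader
    from pandas_ods_reader import read_ods

    # Assumption: `headers` and `columns` are the keyword names the
    # package uses; verify against the installed version.
    return read_ods(path, sheet, headers=has_header, columns=column_names)
```

    For a file without a header row, this would be called as load_ods('myODSfile.ods', has_header=False, column_names=['A', 'B', 'C']).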

  • 2020-12-23 19:46

    I've had good luck with pandas' read_clipboard. Select the cells in Excel or an OpenDocument spreadsheet application, copy them, and then run the following in Python:

    import pandas as pd
    data = pd.read_clipboard()
    

    Pandas will do a good job based on the cells copied.

  • 2020-12-23 19:52

    Another option: read-ods-with-odfpy. This module takes an OpenDocument Spreadsheet as input, and returns a list, out of which a DataFrame can be created.
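    The last step, going from a list to a DataFrame, might be sketched like this. The exact shape of the list that module returns is not shown here, so the sketch assumes a list of row lists with the header in the first row; rows_to_dataframe() and the sample data are hypothetical.

```python
import pandas as pd


def rows_to_dataframe(rows, header=True):
    """Build a DataFrame from a list of row lists."""
    if not rows:
        return pd.DataFrame()
    if header:
        # First row supplies the column names, the rest is data
        return pd.DataFrame(rows[1:], columns=rows[0])
    return pd.DataFrame(rows)


# Hypothetical rows, standing in for whatever the ODS reader returns
rows = [["name", "qty"], ["apple", 3], ["pear", 5]]
df = rows_to_dataframe(rows)
```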

  • 2020-12-23 19:55

    You can read ODF (Open Document Format .ods) documents in Python using the following modules:

    • odfpy / read-ods-with-odfpy
    • ezodf
    • pyexcel / pyexcel-ods
    • py-odftools
    • simpleodspy

    Using ezodf, a simple ODS-to-DataFrame converter could look like this:

    import pandas as pd
    import ezodf
    
    doc = ezodf.opendoc('some_odf_spreadsheet.ods')
    
    print("Spreadsheet contains %d sheet(s)." % len(doc.sheets))
    for sheet in doc.sheets:
        print("-"*40)
        print("   Sheet name : '%s'" % sheet.name)
        print("Size of Sheet : (rows=%d, cols=%d)" % (sheet.nrows(), sheet.ncols()) )
    
    # convert the first sheet to a pandas.DataFrame
    sheet = doc.sheets[0]
    df_dict = {}
    for i, row in enumerate(sheet.rows()):
        # row is a list of cells
        # assume the header is on the first row
        if i == 0:
            # columns as lists in a dictionary
            df_dict = {cell.value:[] for cell in row}
            # create index for the column headers
            col_index = {j:cell.value for j, cell in enumerate(row)}
            continue
        for j, cell in enumerate(row):
            # use header instead of column index
            df_dict[col_index[j]].append(cell.value)
    # and convert to a DataFrame
    df = pd.DataFrame(df_dict)
    

    P.S.

    • ODF spreadsheet (*.ods files) support has been requested on the pandas issue tracker: https://github.com/pydata/pandas/issues/2311, but at the time of writing it was not yet implemented.

    • ezodf was used in the unfinished PR9070 to implement ODF support in pandas. That PR is now closed (read the PR for a technical discussion), but it is still available as an experimental feature in this pandas fork.

    • There are also some brute-force methods to read directly from the XML code (here).
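
    One brute-force approach of that kind can be sketched with only the standard library, since an .ods file is a zip archive whose content.xml holds the cell data. This is a simplified illustration, not the linked method: read_ods_rows() is a hypothetical name, every value comes back as text, and repeated rows, row groups and merged (covered) cells are ignored.

```python
import zipfile
import xml.etree.ElementTree as ET

# OpenDocument XML namespaces used in content.xml
NS = {
    "table": "urn:oasis:names:tc:opendocument:xmlns:table:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}
REPEAT = "{%s}number-columns-repeated" % NS["table"]


def read_ods_rows(path, sheet_index=0):
    """Return the cell text of one sheet as a list of row lists."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    sheet = root.findall(".//table:table", NS)[sheet_index]
    rows = []
    for row in sheet.findall("table:table-row", NS):
        values = []
        for cell in row.findall("table:table-cell", NS):
            paragraphs = cell.findall("text:p", NS)
            text = "\n".join("".join(p.itertext()) for p in paragraphs)
            # A run of identical cells is stored once with a repeat count
            values.extend([text] * int(cell.get(REPEAT, "1")))
        # Drop the empty padding cells that spreadsheets append to rows
        while values and values[-1] == "":
            values.pop()
        if values:
            rows.append(values)
    return rows
```

    The result can be handed to pandas with pd.DataFrame(rows[1:], columns=rows[0]) when the first row is a header.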
  • 2020-12-23 19:55

    If possible, save as CSV from the spreadsheet application and then use pandas.read_csv(). IIRC, an .ods spreadsheet file is actually a zipped bundle of XML files that also contains quite a lot of formatting information. So, if it's about tabular data, first extract the raw data to an intermediate file (CSV, in this case), which you can then parse with other programs, such as Python/pandas.
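
    The pandas side of that route is short. In the sketch below the CSV text is hypothetical sample data held in memory so the snippet runs on its own; a real export would be read with pd.read_csv('exported.csv'), where the file name is whatever you chose when saving.

```python
import io

import pandas as pd

# Hypothetical CSV text, standing in for a file exported from the spreadsheet
csv_text = "name,qty\napple,3\npear,5\n"

# read_csv accepts a path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))
```

    If the spreadsheet application exports with a different delimiter or decimal mark, the sep and decimal parameters of read_csv can be set to match.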
