Create Pandas DataFrame from a string

灰色年华 2020-11-22 09:01

In order to test some functionality I would like to create a DataFrame from a string. Let's say my test data looks like:

TESTDATA="""col1;co


        
5 Answers
  • 2020-11-22 09:38

    A quick and easy solution for interactive work is to copy-and-paste the text by loading the data from the clipboard.

    Select the content of the string with your mouse and copy it to the clipboard.

    Then, in the Python shell, call read_clipboard():

    >>> pd.read_clipboard()
      col1;col2;col3
    0       1;4.4;99
    1      2;4.5;200
    2       3;4.7;65
    3      4;3.2;140
    

    Use the appropriate separator:

    >>> pd.read_clipboard(sep=';')
       col1  col2  col3
    0     1   4.4    99
    1     2   4.5   200
    2     3   4.7    65
    3     4   3.2   140
    
    >>> df = pd.read_clipboard(sep=';') # save to dataframe
    
  • 2020-11-22 09:40

    This answer applies when the string is typed in by hand, not when it is read from somewhere.

    A traditional variable-width CSV is hard to read when it is stored as a string variable. Especially inside a .py file, consider fixed-width pipe-separated data instead. Various IDEs and editors may have a plugin that formats pipe-separated text into a neat table.

    Using read_csv

    Store the following in a utility module, e.g. util/pandas.py. An example is included in the function's docstring.

    import io
    import re
    
    import pandas as pd
    
    
    def read_psv(str_input: str, **kwargs) -> pd.DataFrame:
        """Read a Pandas object from a pipe-separated table contained within a string.
    
        Input example:
            | int_score | ext_score | eligible |
            |           | 701       | True     |
            | 221.3     | 0         | False    |
            |           | 576       | True     |
            | 300       | 600       | True     |
    
        The leading and trailing pipes are optional, but if one is present,
        so must be the other.
    
        `kwargs` are passed to `read_csv`. They must not include `sep`.
    
        In PyCharm, the "Pipe Table Formatter" plugin has a "Format" feature that can 
        be used to neatly format a table.
    
        Ref: https://stackoverflow.com/a/46471952/
        """
    
        substitutions = [
            ('^ *', ''),  # Remove leading spaces
            (' *$', ''),  # Remove trailing spaces
            (r' *\| *', '|'),  # Remove spaces between columns
        ]
        if all(line.lstrip().startswith('|') and line.rstrip().endswith('|') for line in str_input.strip().split('\n')):
            substitutions.extend([
                (r'^\|', ''),  # Remove redundant leading delimiter
                (r'\|$', ''),  # Remove redundant trailing delimiter
            ])
        for pattern, replacement in substitutions:
            str_input = re.sub(pattern, replacement, str_input, flags=re.MULTILINE)
        return pd.read_csv(io.StringIO(str_input), sep='|', **kwargs)
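
    A short usage sketch for the helper above. The function body is repeated in condensed form only so that the snippet runs on its own; the variable name `table` is an illustrative choice, and the data is the example from the docstring.

```python
import io
import re

import pandas as pd


def read_psv(str_input: str, **kwargs) -> pd.DataFrame:
    """Condensed copy of the helper above, repeated so this runs standalone."""
    lines = str_input.strip().split('\n')
    substitutions = [('^ *', ''), (' *$', ''), (r' *\| *', '|')]
    if all(ln.lstrip().startswith('|') and ln.rstrip().endswith('|')
           for ln in lines):
        # Every row is wrapped in pipes, so strip the outer delimiters too.
        substitutions += [(r'^\|', ''), (r'\|$', '')]
    for pattern, replacement in substitutions:
        str_input = re.sub(pattern, replacement, str_input, flags=re.MULTILINE)
    return pd.read_csv(io.StringIO(str_input), sep='|', **kwargs)


# The table can be indented along with the surrounding code.
table = """
    | int_score | ext_score | eligible |
    |           | 701       | True     |
    | 221.3     | 0         | False    |
    |           | 576       | True     |
    | 300       | 600       | True     |
"""

df = read_psv(table)
print(df)
print(df.dtypes)
```

    Empty cells come back as missing values, so a column with blanks is read as floating point.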
    
    

    Non-working alternatives

    The code below doesn't work properly: the leading and trailing pipes are treated as separators too, so it adds an empty column on both the left and right sides.

    df = pd.read_csv(io.StringIO(df_str), sep=r'\s*\|\s*', engine='python')
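
    To see the problem with the regex separator, here is a small sketch. The sample `df_str` is made up for illustration, and the `dropna` cleanup is an assumption of this example rather than part of the answer.

```python
import io

import pandas as pd

# Hypothetical sample table with leading and trailing pipes.
df_str = "| a | b |\n| 1 | 2 |\n| 3 | 4 |"

# The regex also matches the outer pipes, so every line is split into an
# extra empty field at each end.
df = pd.read_csv(io.StringIO(df_str), sep=r'\s*\|\s*', engine='python')
print(df.columns.tolist())

# One possible cleanup: drop the columns that contain no data at all.
cleaned = df.dropna(axis=1, how='all')
print(cleaned)
```

    Note that this cleanup would also drop a real column that happens to be entirely empty, which is one reason to prefer stripping the outer pipes before parsing.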
    

    As for read_fwf, it doesn't actually use many of the optional kwargs that read_csv accepts and uses, so it shouldn't be used for pipe-separated data.

  • 2020-11-22 09:42

    A simple way to do this is to use StringIO.StringIO (Python 2) or io.StringIO (Python 3) and pass it to the pandas.read_csv function. For example:

    import sys
    if sys.version_info[0] < 3: 
        from StringIO import StringIO
    else:
        from io import StringIO
    
    import pandas as pd
    
    TESTDATA = StringIO("""col1;col2;col3
        1;4.4;99
        2;4.5;200
        3;4.7;65
        4;3.2;140
        """)
    
    df = pd.read_csv(TESTDATA, sep=";")
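
    If only Python 3 needs to be supported, the version check can be dropped. A minimal sketch of that variant follows; using `textwrap.dedent` is an addition of this example, not part of the answer above, and it lets the literal stay indented without the indentation reaching the parser.

```python
import io
import textwrap

import pandas as pd

# dedent() removes the common leading whitespace from every line.
# The backslash after the opening quotes avoids an empty first line.
TESTDATA = textwrap.dedent("""\
    col1;col2;col3
    1;4.4;99
    2;4.5;200
    3;4.7;65
    4;3.2;140
    """)

df = pd.read_csv(io.StringIO(TESTDATA), sep=";")
print(df)
print(df.dtypes)
```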
    
  • 2020-11-22 09:46

    Split Method

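    import pandas as pd

    # Split into rows of fields; the first row holds the column names.
    rows = [x.split(';') for x in input_string.strip().split('\n')]
    df = pd.DataFrame(rows[1:], columns=rows[0])  # all values stay strings
    print(df)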
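
    Splitting by hand leaves every value as a string, unlike read_csv, which infers types. A possible follow-up, sketched here with made-up sample data in `input_string`, is to convert the columns with `pd.to_numeric`:

```python
import pandas as pd

# Sample data for illustration only.
input_string = "col1;col2;col3\n1;4.4;99\n2;4.5;200\n3;4.7;65\n4;3.2;140"

rows = [line.split(';') for line in input_string.strip().split('\n')]
df = pd.DataFrame(rows[1:], columns=rows[0])  # first row becomes the header

# Convert each column to a numeric dtype; this raises an error if a value
# cannot be parsed as a number.
df = df.apply(pd.to_numeric)
print(df.dtypes)
```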
    
  • 2020-11-22 09:54

    The simplest way is to save it to a temporary file and then read it back:

    import pandas as pd
    
    CSV_FILE_NAME = 'temp_file.csv'  # Consider using a proper temp file; see the link below
    with open(CSV_FILE_NAME, 'w') as outfile:
        outfile.write(TESTDATA)
    df = pd.read_csv(CSV_FILE_NAME, sep=';')
    

    The right way to create a temp file: How can I create a tmp file in Python?
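
    As a rough sketch of the temp-file variant, the standard `tempfile` module can supply a directory that is cleaned up automatically. The file name `testdata.csv` and the sample `TESTDATA` are placeholders chosen for this example.

```python
import os
import tempfile

import pandas as pd

# Sample data for illustration only.
TESTDATA = "col1;col2;col3\n1;4.4;99\n2;4.5;200\n3;4.7;65\n4;3.2;140\n"

# The directory and everything in it are deleted when the block exits.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'testdata.csv')
    with open(csv_path, 'w') as outfile:
        outfile.write(TESTDATA)
    df = pd.read_csv(csv_path, sep=';')

print(df)
```

    Writing into a temporary directory instead of a fixed file name avoids leaving files behind and avoids clashes between test runs.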
