Pandas seems to ignore first column name when reading tab-delimited data, gives KeyError

[愿得一人] 2021-01-04 17:37

I am using pandas 0.12.0 in ipython3 on Ubuntu 13.10, in order to wrangle large tab-delimited datasets in txt files. Using read_table to create a DataFrame from the txt app

4 answers
  • 2021-01-04 18:05

    Sounds like you just need to conditionally remove the BOM from the start of your files. You can do this with a wrapper around the file like so:

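    import codecs
    import os
    
    import pandas as pd
    
    def remove_bom(filename):
        # open() does not expand '~', so expand it explicitly
        fp = open(os.path.expanduser(filename), 'rb')
        # skip the 3-byte UTF-8 BOM (b'\xef\xbb\xbf') if the file starts with it
        if fp.read(3) != codecs.BOM_UTF8:
            fp.seek(0, 0)
        return fp
    
    # read_table also accepts a file object, so we can skip the BOM first
    samples = pd.read_table(remove_bom('~/datafile.txt'))
    
    print(samples['RECORDING_SESSION_LABEL'])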
    
  • 2021-01-04 18:16

    I think the issue you're having is just that the "tabs" in datafile.txt aren't actually tabs. (When I read it in using your code, the dataframe has 1 column and 15 rows.) You could do a regex search-and-replace, or, alternately, just parse it as-is:

    import os
    import pandas as pd
    from numpy import transpose
    
    # open() does not expand '~', so expand it explicitly
    with open(os.path.expanduser('~/datafile.txt'), 'r') as datafile:
        data = datafile.read()
    while '  ' in data:
        data = data.replace('  ', ' ')
    data = transpose([row.split(' ') for row in data.strip().split('\n')])
    datadict = {}
    for col in data:
        datadict[col[0]] = col[1:]
    samples = pd.DataFrame(datadict)
    print(samples['RECORDING_SESSION_LABEL'])
    

    This works ok for me on your datafile.txt: the resulting dataframe has 15 rows x 7 columns.
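
The collapse-then-split step above can also be done with str.split() called with no argument, which splits on any run of whitespace (spaces or tabs alike). This is only a minimal standard-library sketch: the function name parse_whitespace_table and the sample text are made up for illustration and are not taken from the original datafile.txt.

```python
def parse_whitespace_table(text):
    """Return {column_name: [values, ...]} from whitespace-delimited text."""
    # split() with no argument treats any run of spaces/tabs as one separator
    rows = [line.split() for line in text.strip().splitlines()]
    header, body = rows[0], rows[1:]
    # zip(*body) transposes the list of rows into a list of columns
    columns = list(zip(*body)) if body else [() for _ in header]
    return {name: list(col) for name, col in zip(header, columns)}

# Hypothetical sample mixing tabs and multiple spaces
sample = "LABEL   X\tY\n73_1  1.5   2.5\n73_1\t3.0  4.0\n"
table = parse_whitespace_table(sample)
print(table["LABEL"])
```

The resulting dict can be passed to pd.DataFrame just like datadict above. All values stay strings, and the approach breaks if a field itself contains spaces.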

  • 2021-01-04 18:22

    This seems to be (related to) a known issue, see GH #4793. Using 'utf-8-sig' as the encoding seems to work. Without it, we have:

    >>> df = pd.read_table("datafile.txt")
    >>> df.columns
    Index([u'RECORDING_SESSION_LABEL', u'LEFT_GAZE_X', u'LEFT_GAZE_Y', u'RIGHT_GAZE_X', u'RIGHT_GAZE_Y', u'VIDEO_FRAME_INDEX', u'VIDEO_NAME'], dtype='object')
    >>> df.columns[0]
    '\xef\xbb\xbfRECORDING_SESSION_LABEL'
    

    but with it, we have

    >>> df = pd.read_table("datafile.txt", encoding="utf-8-sig")
    >>> df.columns
    Index([u'RECORDING_SESSION_LABEL', u'LEFT_GAZE_X', u'LEFT_GAZE_Y', u'RIGHT_GAZE_X', u'RIGHT_GAZE_Y', u'VIDEO_FRAME_INDEX', u'VIDEO_NAME'], dtype='object')
    >>> df.columns[0]
    u'RECORDING_SESSION_LABEL'
    >>> df["RECORDING_SESSION_LABEL"].max()
    u'73_1'
    

    (Used Python 2 for the above, but the same happens with Python 3.)
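
To see why 'utf-8-sig' helps, here is a small standard-library sketch in Python 3. The header bytes are made up to resemble the file in the question. Decoding BOM-prefixed bytes as plain 'utf-8' keeps the BOM as the character U+FEFF at the front of the first column name, while 'utf-8-sig' drops it.

```python
import codecs

# Hypothetical header line, prefixed with the UTF-8 BOM (b'\xef\xbb\xbf')
raw = codecs.BOM_UTF8 + b"RECORDING_SESSION_LABEL\tLEFT_GAZE_X\n"

# Plain utf-8 keeps the BOM as a leading U+FEFF character
plain = raw.decode("utf-8").rstrip("\n").split("\t")
# utf-8-sig strips a leading BOM if one is present
stripped = raw.decode("utf-8-sig").rstrip("\n").split("\t")

print(repr(plain[0]))     # starts with '\ufeff'
print(repr(stripped[0]))  # 'RECORDING_SESSION_LABEL'
```

A lookup by the name "RECORDING_SESSION_LABEL" fails in the first case because the stored name is one character longer than it looks.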

  • 2021-01-04 18:32

    I also stumbled upon a similar problem. When I read the file with df = pandas.read_csv(csvfile, sep), the first column name came out in this strange form:

    df.columns[0]
    

    returned this result:

    '\xef\xbb\xbfColName'
    

    When I tried selecting this column, I got an error:

    df.ColName
    

    returned

    AttributeError: 'DataFrame' object has no attribute 'ColName'
    

    After reading this, I used an external editor, Sublime, to change the encoding and save the file as a new file (Save with Encoding: UTF-8, without BOM).

    Afterwards pandas reads the first column name correctly, and I can select it with df.ColName and get the correct values. Such a small thing, and it took 45 minutes to solve.

    TL;DR: Save the file as UTF-8 without a BOM.
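
If re-saving in an editor is not practical, the same result can probably be had from Python. The following is a rough sketch, not the method used in this answer: save_without_bom and the file names are hypothetical, and it reads the whole file into memory, so very large files may call for a chunked copy instead.

```python
import codecs
import os
import tempfile

def save_without_bom(src_path, dst_path):
    """Copy src_path to dst_path, dropping a leading UTF-8 BOM if present."""
    with open(src_path, "rb") as src:
        data = src.read()
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    with open(dst_path, "wb") as dst:
        dst.write(data)

# Demo on a throwaway file (names are made up for illustration)
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "with_bom.txt")
dst = os.path.join(tmp_dir, "no_bom.txt")
with open(src, "wb") as f:
    f.write(codecs.BOM_UTF8 + b"ColName\tOther\n1\t2\n")

save_without_bom(src, dst)
with open(dst, "rb") as f:
    result = f.read()
print(result[:7])  # b'ColName'
```

Files that have no BOM are copied unchanged, so the function is safe to run over a mixed batch.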
