问题
I am using pandas 0.12.0 in ipython3 on Ubuntu 13.10, in order to wrangle large tab-delimited datasets in txt files. Using read_table to create a DataFrame from the txt appears to work, and the first row is read as a header, but attempting to access the first column using its name as an index throws a KeyError. I don't understand why this happens, given that the column names all appear to have been read correctly, and every other column can be indexed in this way.
The data looks like this:
RECORDING_SESSION_LABEL LEFT_GAZE_X LEFT_GAZE_Y RIGHT_GAZE_X RIGHT_GAZE_Y VIDEO_FRAME_INDEX VIDEO_NAME
73_1 . . 395.1 302 . .
73_1 . . 395 301.9 . .
73_1 . . 394.9 301.7 . .
73_1 . . 394.8 301.5 . .
73_1 . . 394.6 301.3 . .
73_1 . . 394.7 300.9 . .
73_1 . . 394.9 301.3 . .
73_1 . . 395.2 302 1 1_1_just_act.avi
73_1 . . 395.3 302.3 1 1_1_just_act.avi
73_1 . . 395.4 301.9 1 1_1_just_act.avi
73_1 . . 395.7 301.5 1 1_1_just_act.avi
73_1 . . 395.9 301.5 1 1_1_just_act.avi
73_1 . . 396 301.5 1 1_1_just_act.avi
73_1 . . 395.9 301.5 1 1_1_just_act.avi
15_1 395.4 301.7 . . . .
The delimiter is definitely tabs, and there is no trailing or leading whitespace.
The error occurs with this minimal program:
import pandas as pd
samples = pd.read_table('~/datafile.txt')
print(samples['RECORDING_SESSION_LABEL'])
which gives the error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-65-137d3c16b931> in <module>()
----> 1 print(samples['RECORDING_SESSION_LABEL'])
/usr/lib/python3/dist-packages/pandas/core/frame.py in __getitem__(self, key)
2001 # get column
2002 if self.columns.is_unique:
-> 2003 return self._get_item_cache(key)
2004
2005 # duplicate columns
/usr/lib/python3/dist-packages/pandas/core/generic.py in _get_item_cache(self, item)
665 return cache[item]
666 except Exception:
--> 667 values = self._data.get(item)
668 res = self._box_item_values(item, values)
669 cache[item] = res
/usr/lib/python3/dist-packages/pandas/core/internals.py in get(self, item)
1654 def get(self, item):
1655 if self.items.is_unique:
-> 1656 _, block = self._find_block(item)
1657 return block.get(item)
1658 else:
/usr/lib/python3/dist-packages/pandas/core/internals.py in _find_block(self, item)
1934
1935 def _find_block(self, item):
-> 1936 self._check_have(item)
1937 for i, block in enumerate(self.blocks):
1938 if item in block:
/usr/lib/python3/dist-packages/pandas/core/internals.py in _check_have(self, item)
1941 def _check_have(self, item):
1942 if item not in self.items:
-> 1943 raise KeyError('no item named %s' % com.pprint_thing(item))
1944
1945 def reindex_axis(self, new_axis, method=None, axis=0, copy=True):
KeyError: 'no item named RECORDING_SESSION_LABEL'
Simply doing print(samples)
gives the expected output of printing the whole table, complete with the first column and its header. Trying to print any other column (ie; the exact same code, but with 'RECORDING_SESSION_LABEL' replaced with 'LEFT_GAZE_X') works as it should. Furthermore, the header seems to have been read in correctly, and pandas recognizes 'RECORDING_SESSION_LABEL' as a column name. This is evidenced by using the .info() method and viewing the .columns attribute of samples, after it's been read in:
>samples.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 28 entries, 0 to 27
Data columns (total 7 columns):
RECORDING_SESSION_LABEL 28 non-null values
LEFT_GAZE_X 28 non-null values
LEFT_GAZE_Y 28 non-null values
RIGHT_GAZE_X 28 non-null values
RIGHT_GAZE_Y 28 non-null values
VIDEO_FRAME_INDEX 28 non-null values
VIDEO_NAME 28 non-null values
dtypes: object(7)
>print(samples.columns)
Index(['RECORDING_SESSION_LABEL', 'LEFT_GAZE_X', 'LEFT_GAZE_Y', 'RIGHT_GAZE_X', 'RIGHT_GAZE_Y', 'VIDEO_FRAME_INDEX', 'VIDEO_NAME'], dtype=object)
Another error behaviour that I feel is related occurs when using ipython's tab completion, which allows me to access the columns of samples as if they were attributes. It works for every column except the first. ie; hitting the tab key with >samples.R
only suggests samples.RIGHT_GAZE_X samples.RIGHT_GAZE_Y
.
So why is it behaving normally when looking at the whole dataframe, but failing when trying to access the first column by its name, even though it appears to have correctly read in that name?
回答1:
Sounds like you just need to conditionally remove the BOM from the start of your files. You can do this with a wrapper around the file like so:
def remove_bom(filename):
fp = open(filename, 'rbU')
if fp.read(2) != b'\xfe\xff':
fp.seek(0, 0)
return fp
# read_table also accepts a file pointer, so we can remove the bom first
samples = pd.read_table(remove_bom('~/datafile.txt'))
print(samples['RECORDING_SESSION_LABEL'])
回答2:
This seems to be (related to) a known issue, see GH #4793. Using 'utf-8-sig'
as the encoding seems to work. Without it, we have:
>>> df = pd.read_table("datafile.txt")
>>> df.columns
Index([u'RECORDING_SESSION_LABEL', u'LEFT_GAZE_X', u'LEFT_GAZE_Y', u'RIGHT_GAZE_X', u'RIGHT_GAZE_Y', u'VIDEO_FRAME_INDEX', u'VIDEO_NAME'], dtype='object')
>>> df.columns[0]
'\xef\xbb\xbfRECORDING_SESSION_LABEL'
but with it, we have
>>> df = pd.read_table("datafile.txt", encoding="utf-8-sig")
>>> df.columns
Index([u'RECORDING_SESSION_LABEL', u'LEFT_GAZE_X', u'LEFT_GAZE_Y', u'RIGHT_GAZE_X', u'RIGHT_GAZE_Y', u'VIDEO_FRAME_INDEX', u'VIDEO_NAME'], dtype='object')
>>> df.columns[0]
u'RECORDING_SESSION_LABEL'
>>> df["RECORDING_SESSION_LABEL"].max()
u'73_1'
(Used Python 2 for the above, but the same happens with Python 3.)
回答3:
I also stumbled upon similar problem. When I was reading as df = pandas.read_csv(csvfile, sep), the first column had this strange format in name:
df.columns[0]
returned this result:
'\xef\xbb\xbfColName'
When I tried selecting this column, I got an error:
df.ColName
returned
AttributeError: 'DataFrame' object has no attribute 'ColName'
After reading this I just used my external program Sublime to change the encoding and save the file as a new file (save with encoding UTF-8, but without BOM).
Afterwards pandas reads the first column name correctly and I am able to select it withdf.ColName
and it returns correct value. Such a small thing that took 45 minutes to solve.
TLDR: Save file with encoding without BOM.
回答4:
I think the issue you're having is just that the "tabs" in datafile.txt aren't actually tabs. (When I read it in using your code, the dataframe has 1 column and 15 rows.) You could do a regex search-and-replace, or, alternately, just parse it as-is:
import pandas as pd
from numpy import transpose
with open('~/datafile.txt', 'r') as datafile:
data = datafile.read()
while ' ' in data:
data = data.replace(' ', ' ')
data = transpose([row.split(' ') for row in data.strip().split('\n')])
datadict = {}
for col in data:
datadict[col[0]] = col[1:]
samples = pd.DataFrame(datadict)
print(samples['RECORDING_SESSION_LABEL'])
This works ok for me on your datafile.txt
: the resulting dataframe has 15 rows x 7 columns.
来源:https://stackoverflow.com/questions/21806496/pandas-seems-to-ignore-first-column-name-when-reading-tab-delimited-data-gives