I have an input file with known columns, let\'s say two columns Name
and Sex
. Sometimes it has the header line Name,Sex
, and sometimes it
I've come up with a way of detecting the header without prior knowledge of its names:
if any(df.iloc[0].apply(lambda x: isinstance(x, str))):
df = df[1:].reset_index(drop=True)
And by changing it slightly, it can update the current header with the detected one:
if any(df.iloc[0].apply(lambda x: isinstance(x, str))):
df = df[1:].reset_index(drop=True).rename(columns=df.iloc[0])
This would allow easily selecting the desired behavior:
update_header = True
if any(df.iloc[0].apply(lambda x: isinstance(x, str))):
new_header = df.iloc[0]
df = df[1:].reset_index(drop=True)
if update_header:
df.rename(columns=new_header, inplace=True)
Pros:
Cons:
if any()
to require all elements to be strings might help, unless data also contains entire rows of strings.using new feature - selection by callable:
cols = ['Name','Sex']
df = (pd.read_csv(filename, header=None, names=cols)
[lambda x: np.ones(len(x)).astype(bool)
if (x.iloc[0] != cols).all()
else np.concatenate([[False], np.ones(len(x)-1).astype(bool)])]
)
using .query() method:
df = (pd.read_csv(filename, header=None, names=cols)
.query('Name != "Name" and Sex != "Sex"'))
i'm not sure that this is the most elegant way, but this should work as well:
df = pd.read_csv(filename, header=None, names=cols)
if (df.iloc[0] == cols).all():
df = df[1:].reset_index(drop=True)